Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Sridhar N Bamandlapally
Just for information,

PostgreSQL's current behavior:

"\set AUTOCOMMIT OFF" implicitly opens a "BEGIN;" block.

So "\set AUTOCOMMIT ON" has no effect once "\set AUTOCOMMIT OFF" has been
issued, until "END;", "COMMIT;", or "ROLLBACK;" is issued.

However, I think that if exiting the session releases the open transaction,
then changing this session setting should also release it.

Thanks
Sridhar




On Mon, Aug 8, 2016 at 10:34 PM, Vik Fearing  wrote:

> On 08/08/16 17:02, Robert Haas wrote:
> > On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed  wrote:
> >> Thank you for inputs everyone.
> >>
> >> The opinions on this thread can be classified into following
> >> 1. Commit
> >> 2. Rollback
> >> 3. Error
> >> 4. Warning
> >>
> >> As per the opinions upthread, issuing an implicit commit immediately
> >> after switching autocommit to ON can be unsafe if it was not desired.
> >> While I agree that it's difficult to judge the user's intention here,
> >> if we were to base it on some assumption, the closest would be an
> >> implicit COMMIT, in my opinion.  A user is more likely to be happy
> >> with a commit being issued when setting autocommit ON than with the
> >> transaction being rolled back.  Also, quite a few interfaces provide
> >> this behavior.
> >>
> >> As mentioned upthread, issuing a warning on switching back to
> >> autocommit will not be effective inside a script.  It won't allow
> >> subsequent commands to be committed, because the "set autocommit to
> >> ON" itself is not committed.  Scripts will have to be changed and
> >> rerun, which will hurt user-friendliness.
> >>
> >> While I agree that issuing an ERROR and rolling back the transaction
> >> ranks higher in terms of safe behavior, it is not as common (according
> >> to the instances stated upthread) as immediately committing any open
> >> transaction when switching back to autocommit.
> >
> > I think I like the option of having psql issue an error.  On the
> > server side, the transaction would still be open, but the user would
> > receive a psql error message and the autocommit setting would not be
> > changed.  So the user could type COMMIT or ROLLBACK manually and then
> > retry changing the value of the setting.
>
> This is my preferred action.
>
> > Alternatively, I also think it would be sensible to issue an immediate
> > COMMIT when the autocommit setting is changed from off to on.  That
> > was my first reaction.
>
> I don't care for this very much.
>
> > Aborting the server-side transaction - with or without notice -
> > doesn't seem very reasonable.
>
> Agreed.
> --
> Vik Fearing  +33 6 46 75 15 36
> http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
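
The psql-side check Robert describes (raise an error, leave the server-side
transaction open, leave the setting unchanged) can be sketched as follows.
This is a toy Python model of the proposed behavior with hypothetical names;
psql itself is written in C.

```python
def set_autocommit(want_on, state):
    """Model of the proposed psql behavior for \\set AUTOCOMMIT.

    state is a dict with 'autocommit' and 'in_transaction' flags.
    Turning autocommit back ON while a transaction is open fails,
    leaving both the setting and the server-side transaction untouched,
    so the user can COMMIT or ROLLBACK manually and retry.
    """
    if want_on and not state['autocommit'] and state['in_transaction']:
        raise RuntimeError('cannot set AUTOCOMMIT on inside a transaction; '
                           'issue COMMIT or ROLLBACK first, then retry')
    state['autocommit'] = want_on

# Example: autocommit was switched off, which implicitly opened a transaction.
session = {'autocommit': False, 'in_transaction': True}
```

Under this model, `set_autocommit(True, session)` raises until the user ends
the transaction, after which the setting can be changed normally.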


Re: [HACKERS] dsm_unpin_segment

2016-08-08 Thread Thomas Munro
On Tue, Aug 9, 2016 at 12:53 PM, Tom Lane  wrote:
> The larger picture here is that Robert is exhibiting a touching but
> unfounded faith that extensions using this feature will contain zero bugs.
> IMO there needs to be some positive defense against mistakes in using the
> pin/unpin API.  As things stand, multiple pin requests don't have any
> fatal consequences (especially not on non-Windows), so I have little
> confidence that it's not happening in the field.  I have even less
> confidence that there wouldn't be too many unpin requests.

Ok, here is a version that defends against invalid sequences of
pin/unpin calls.  I had to move dsm_impl_pin_segment into the block
protected by DynamicSharedMemoryControlLock, so that it could come
after the already-pinned check, but before updating any state, since
it makes a Windows syscall that can fail.  That said, I've only tested
on Unix and will need to ask someone to test on Windows.

> What exactly
> is an extension going to be doing to ensure that it doesn't do too many of
> one or the other?

An extension that manages segment lifetimes like this needs a
carefully designed protocol to get it right, probably involving state
in another shared memory area and some interlocking, not to mention a
lot of thought about cleanup.

Here's one use case: I have a higher level object, a
multi-segment-backed shared memory allocator, which owns any number of
segments that together form a shared memory area.  The protocol is
that the allocator always pins segments when it needs to create new
ones, because they need to survive as long as the control segment,
even though no one backend is guaranteed to have all of the auxiliary
segments mapped in (since they're created and attached on demand).
But when the control segment is detached by all backends and is due to
be destroyed, then we need to unpin all the auxiliary segments so they
can also be destroyed, and that can be done from an on_dsm_detach
callback on the control segment.  So I'm riding on the coattails of
the existing cleanup mechanism for the control segment, while making
sure that the auxiliary segments get pinned and unpinned exactly once.
I'll have more to say about that when I post that patch...
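
As a toy illustration of that lifetime protocol (a Python model with
hypothetical names, not the real dsm API): each auxiliary segment is pinned
once at creation, and all of them are unpinned from the control segment's
detach callback.

```python
class Segment:
    """Toy model of a DSM segment: destroyed when its refcount hits zero."""
    def __init__(self, name):
        self.name, self.refcnt, self.destroyed = name, 0, False
        self.on_detach = []           # callbacks run at the last detach

    def attach(self):
        self.refcnt += 1

    def pin(self):
        self.refcnt += 1              # pinning = one extra reference

    def unpin(self):
        self.detach()                 # undo the extra reference

    def detach(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            for cb in self.on_detach:
                cb()
            self.destroyed = True

control = Segment('control')
control.attach()                       # one backend has it mapped
aux = [Segment('aux%d' % i) for i in range(3)]
for seg in aux:
    seg.pin()                          # survive even with no backend attached

# When the control segment is finally detached everywhere, unpin (and thus
# destroy) the auxiliary segments from its detach callback.
control.on_detach.append(lambda: [seg.unpin() for seg in aux])
control.detach()                       # last backend detaches
```

After the last detach of the control segment, every auxiliary segment's
refcount drops to zero and it is destroyed along with the control segment.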

-- 
Thomas Munro
http://www.enterprisedb.com


dsm-unpin-segment-v2.patch
Description: Binary data



Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> I used to think of that this kind of features should be enabled by default,
> because when I was working at the previous company, I had only few features
> to understand what is happening inside PostgreSQL by observing production
> databases. I needed those features enabled in the production databases when
> I was called.
> 
> However, now I have another opinion. When we release the next major release
> saying 10.0 with the wait monitoring, many people will start their benchmark
> test with a configuration with *the default values*, and if they see some
> performance decrease, for example around 10%, they will be talking about
> it as the performance decrease in PostgreSQL 10.0. It means PostgreSQL will
> be facing difficult reputation.
> 
> So, I agree with the features should be disabled by default for a while.

I understand your feeling well.  This is a difficult decision.  Let's hope for 
trivial overhead.

Regards
Takayuki Tsunakawa
 





Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think would
> be willing to take a 10% performance penalty for monitoring like this?  I
> would bet very few, but the argument above doesn't seem to address the fact
> it is a small percentage.
> 
> In fact, the argument above goes even farther, saying that we should enable
> it all the time because people will be unwilling to enable it on their own.
> I have to question the value of the information if users are not willing
> to enable it.  And the solution proposed is to force the 10% default overhead
> on everyone, whether they are currently doing debugging, whether they will
> ever do this level of debugging, because people will be too scared to enable
> it.  (Yes, I think Oracle took this
> approach.)
> 
> We can talk about this feature all we want, but if we are not willing to
> be realistic in how much performance penalty the _average_ user is willing
> to lose to have this monitoring, I fear we will make little progress on
> this feature.

OK, 10% was an overstatement.  Anyway, as Amit said, we can decide the 
default value based on performance evaluation before the release.

As another idea, we could take a middle ground.  Interestingly, MySQL also 
enables its event monitoring (Performance Schema) by default, but not all 
events are collected.  I guess the most frequently hit events are excluded 
by default to minimize the overhead.

http://dev.mysql.com/doc/refman/5.7/en/performance-schema-quick-start.html
--
Assuming that the Performance Schema is available, it is enabled by default.
...
[mysqld]
performance_schema=ON
...
Initially, not all instruments and consumers are enabled, so the performance 
schema does not collect all events. To turn all of these on and enable event 
timing, execute two statements (the row counts may differ depending on MySQL 
version): 
mysql> UPDATE setup_instruments SET ENABLED = 'YES', TIMED = 'YES';
Query OK, 560 rows affected (0.04 sec)
mysql> UPDATE setup_consumers SET ENABLED = 'YES';
Query OK, 10 rows affected (0.00 sec)
--


BTW, I remember EnterpriseDB has a wait event monitoring feature.  Is it 
disabled by default?  What was the overhead?

Regards
Takayuki Tsunakawa




Re: [HACKERS] Declarative partitioning

2016-08-08 Thread Amit Langote
On 2016/08/09 6:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
>  wrote:
> >>> +1, if we could do it.  It will need a change in the way Amit's patch
> >>> stores the partitioning scheme in PartitionDesc.
>>
>> Okay, I will try to implement this in the next version of the patch.
>>
> >> One thing that comes to mind is: what if a user wants to apply hash
> >> operator class equality to a list-partitioned key by specifying a
> >> hash operator class for the corresponding column?  In that case, we
> >> would not have an ordering procedure with a hash operator class,
> >> hence any ordering-based optimization becomes impossible to
> >> implement.  The current patch rejects a column as partition key if
> >> its type does not have a btree operator class, for both the range and
> >> list methods, so this issue doesn't exist; however, it could be seen
> >> as a limitation.
> 
> Yes, I think you should expect careful scrutiny of that issue.  It
> seems clear to me that range partitioning requires a btree opclass,
> that hash partitioning requires a hash opclass, and that list
> partitioning requires at least one of those things.  It would probably
> be reasonable to pick one or the other and insist that list
> partitioning always requires exactly that, but I can't see how it's
> reasonable to insist that you must have both types of opclass for any
> type of partitioning.

So, because we intend to implement optimizations for list partition
metadata that presuppose the existence of a corresponding btree operator
class, we should just always require the user to specify one (or error out
if the user doesn't specify one and no default exists).  That way, we
explicitly do not support specifying a hash equality operator for list
partitioning.

>> So, we have 3 choices for the internal representation of list partitions:
>>
>> Choice 1 (the current approach):  Load them in the same order as they are
>> found in the partition catalog:
>>
>> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>>
>> In this case, mismatch on the first list would make the two tables
>> incompatibly partitioned, whereas they really aren't incompatible.
> 
> Such a limitation seems clearly unacceptable.  We absolutely must be
> able to match up compatible partitioning schemes without getting
> confused by little details like the order of the partitions.

Agreed.  Will change my patch to adopt the below method.

>> Choice 2: Representation with 2 arrays:
>>
>> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>>
>> It still doesn't help the case of pairwise joins because it's hard to tell
>> which value belongs to which partition (the 2nd array carries the original
>> partition numbers).  Although it might still work for tuple-routing.
> 
> It's very good for tuple routing.  It can also be used to match up
> partitions for pairwise joins.  Compare the first arrays.  If they are
> unequal, stop.  Else, compare the second arrays, incrementally
> building a mapping between them and returning false if the mapping
> turns out to be non-bijective.  For example, in this case, we look at
> index 0 and decide that 3 -> 2.  We look at index 1 and decide 1 -> 3.
> We look at index 2 and decide 2 -> 1.  We look at index 4 and find
> that we already have a mapping for 2, but it's compatible because we
> need 2 -> 1 and that's what is already there.  Similarly for the
> remaining entries.  This is really a pretty easy loop to write and it
> should run very quickly.

I see; it does make sense to try to implement it this way.
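
For illustration, the matching loop Robert describes might look like this (a
Python sketch with hypothetical names, not code from the patch). It takes
each table's sorted array of list values and the parallel array of owning
partition indexes (the "choice 2" representation), and incrementally builds
a mapping between partition numbers, failing if it is not a bijection.

```python
def partitions_match(values1, parts1, values2, parts2):
    """Decide whether two list-partitioned tables are compatibly
    partitioned, ignoring the order in which partitions were created.

    values*: sorted array of all list values of a table
    parts*:  parallel array giving the partition index owning each value
    """
    # Step 1: the first arrays (the sets of list values) must be equal.
    if values1 != values2:
        return False
    # Step 2: compare the second arrays, incrementally building a mapping
    # and returning False if it turns out to be non-bijective.
    fwd, rev = {}, {}
    for p1, p2 in zip(parts1, parts2):
        if fwd.setdefault(p1, p2) != p2:
            return False   # p1 already maps to a different partition
        if rev.setdefault(p2, p1) != p1:
            return False   # two distinct p1s would map to the same p2
    return True

# The example from the thread:
#   Table 1: p1 {'b','f'}, p2 {'c','d'}, p3 {'a','e'}
#   Table 2: p1 {'c','d'}, p2 {'a','e'}, p3 {'b','f'}
t1_vals = ['a', 'b', 'c', 'd', 'e', 'f']; t1_parts = [3, 1, 2, 2, 3, 1]
t2_vals = ['a', 'b', 'c', 'd', 'e', 'f']; t2_parts = [2, 3, 1, 1, 2, 3]
```

Here the loop decides 3 -> 2, 1 -> 3, 2 -> 1, finds every later entry
consistent with that mapping, and reports the two tables as compatible.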

>> Choice 3: Order all lists' elements for each list individually and then
>> order the lists themselves on their first values:
>>
>> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>>
>> This representation makes pairing partitions for pairwise joining
>> convenient but for tuple-routing we still need to visit each partition in
>> the worst case.
> 
> I think this is clearly not good enough for tuple routing.  If the
> algorithm I proposed above turns out to be too slow for matching
> partitions, then we could keep both this representation and the
> previous one.  We are not limited to just one.  But I don't think
> that's likely to be the case.

I agree.  Let's see how option 2 turns out.

> Also, note that all of this presupposes we're doing range
> partitioning, or perhaps list partitioning with a btree opclass.  For
> partitioning based on a hash opclass, you'd organize the data based on
> the hash values rather than range comparisons.

Yes, the current patch does not implement hash partitioning, although I
have to think about how to support the hash case when designing the
internal data structures.


By the way, I am planning to start a new thread with the latest set of
patches which I will post in a day or two.  I have tried to implement all

Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Satoshi Nagayasu
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky
:
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to make us move forward.

> If you attended, feel free to point out anything I missed, and I will put
> it on the wiki too.
>
> Wait event monitoring looks once again stuck on its way through community
> approval, in spite of the huge progress made in that direction last year.
> The importance of the topic is beyond discussion now: if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the
> person does not get excited, you are probably talking to a full-time
> PostgreSQL developer ;-)  Obviously it needs a better design, both the
> user interface and the implementation, and perhaps this is why full-time
> developers are still sceptical.
>
> In order to move forward, IMHO we need at least the following steps,
> which can be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
>If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres - you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing needed and in different
> environments.
>
>According to the developers, the overhead is small, but many people
> suspect it can be much more significant for intensive workloads.
> Obviously, this is not an easy thing to test, because you need to put
> code of doubtful production-readiness into mission-critical production
> systems for such tests.
>As a result, it will become clear whether this design is acceptable, or
> whether it should be abandoned in favor of less invasive solutions.
>
> Any thoughts?

Seems like a good starting point.  I'm interested in both, and I would like
to contribute by running (or writing) several tests.

Regards,
-- 
Satoshi Nagayasu 




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Satoshi Nagayasu
2016-08-09 11:49 GMT+09:00 Joshua D. Drake :
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>>
>> On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>>>
>>> I hope wait event monitoring will be on by default even if the overhead
>>> is not
>>> almost zero, because the data needs to be readily available for faster
>>> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If
>>> you
>>> disable it by default because of overhead, how can we convince users to
>>> enable
>>> it in production systems to solve some performance problem?  I’m afraid
>>> severe
>>> users would say “we can’t change any setting that might cause more
>>> trouble, so
>>> investigate the cause with existing information.”
>>
>>
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason.  What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this?  I would bet very few, but the argument above doesn't seem to
>> address the fact it is a small percentage.
>
>
> I would argue it is zero. There are definitely users for this feature but to
> enable it by default is looking for trouble. *MOST* users do not need this.

I used to think that this kind of feature should be enabled by default,
because when I was working at my previous company, I had only a few ways
to understand what was happening inside PostgreSQL by observing production
databases, and I needed those features enabled in the production databases
when I was called.

However, now I have another opinion.  When we release the next major
version, say 10.0, with wait monitoring, many people will start their
benchmark tests with a configuration using *the default values*, and if
they see some performance decrease, for example around 10%, they will talk
about it as a performance decrease in PostgreSQL 10.0.  PostgreSQL would
then face a reputation problem.

So, I agree that the feature should be disabled by default for a while.

Regards,
-- 
Satoshi Nagayasu 




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Joshua D. Drake

On 08/08/2016 07:37 PM, Bruce Momjian wrote:

> On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>> I hope wait event monitoring will be on by default even if the overhead is not
>> almost zero, because the data needs to be readily available for faster
>> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If you
>> disable it by default because of overhead, how can we convince users to enable
>> it in production systems to solve some performance problem?  I’m afraid severe
>> users would say “we can’t change any setting that might cause more trouble, so
>> investigate the cause with existing information.”
>
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this?  I would bet very few, but the argument above doesn't seem to
> address the fact it is a small percentage.


I would argue it is zero. There are definitely users for this feature 
but to enable it by default is looking for trouble. *MOST* users do not 
need this.


Sincerely,

JD

--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Bruce Momjian
On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> I hope wait event monitoring will be on by default even if the overhead is not
> almost zero, because the data needs to be readily available for faster
> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If you
> disable it by default because of overhead, how can we convince users to enable
> it in production systems to solve some performance problem?  I’m afraid severe
> users would say “we can’t change any setting that might cause more trouble, so
> investigate the cause with existing information.”

If you want to know why people are against enabling this monitoring by
default, above is the reason.  What percentage of people do you think
would be willing to take a 10% performance penalty for monitoring like
this?  I would bet very few, but the argument above doesn't seem to
address the fact it is a small percentage.

In fact, the argument above goes even farther, saying that we should
enable it all the time because people will be unwilling to enable it on
their own.  I have to question the value of the information if users are
not willing to enable it.  And the proposed solution is to force the 10%
default overhead on everyone, whether or not they are currently debugging,
and whether or not they will ever need this level of debugging, just
because people would be too scared to enable it themselves.  (Yes, I think
Oracle took this approach.)

We can talk about this feature all we want, but if we are not willing to
be realistic in how much performance penalty the _average_ user is
willing to lose to have this monitoring, I fear we will make little
progress on this feature.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Ilya Kosmodemiansky
> I've summarized Wait events monitoring discussion at Developer unconference in
> Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring

I hope wait event monitoring will be on by default, even if the overhead is 
not nearly zero, because the data needs to be readily available for fast 
troubleshooting.  IMO, the benefit would be worth even a 10% overhead.  If 
we disable it by default because of the overhead, how can we convince users 
to enable it in production systems to solve a performance problem?  I’m 
afraid cautious users would say “we can’t change any setting that might 
cause more trouble, so investigate the cause with the existing information.”

We should consider the performance with wait event monitoring on to be the 
new normal.  Then we should develop more features that leverage the wait 
event data, so that the data becomes indispensable.  The manual can explain 
that wait event monitoring may be turned off for maximal performance, but 
that this is not recommended.

BTW, taking advantage of this chance, why don’t we enrich the content on 
performance tuning in the manual?  At the very least, it should explain how 
to analyze the wait event data and tune the system accordingly.

Performance Tips
https://www.postgresql.org/docs/devel/static/performance-tips.html

Regards
Takayuki Tsunakawa


Re: [HACKERS] dsm_unpin_segment

2016-08-08 Thread Tom Lane
Thomas Munro  writes:
> Yeah, I was considering unbalanced pin/unpin requests to be a
> programming error.  To be more defensive about that, how about I add a
> boolean 'pinned' to dsm_control_item, and elog(ERROR, ...) if it's not
> in the expected state when you try to pin or unpin?

Well, what you have there is a one-bit-wide pin request counter.
I do not see why that's better than an actual counter, but if that's
what you want to do, fine.

The larger picture here is that Robert is exhibiting a touching but
unfounded faith that extensions using this feature will contain zero bugs.
IMO there needs to be some positive defense against mistakes in using the
pin/unpin API.  As things stand, multiple pin requests don't have any
fatal consequences (especially not on non-Windows), so I have little
confidence that it's not happening in the field.  I have even less
confidence that there wouldn't be too many unpin requests.  What exactly
is an extension going to be doing to ensure that it doesn't do too many of
one or the other?

regards, tom lane




Re: [HACKERS] dsm_unpin_segment

2016-08-08 Thread Thomas Munro
On Tue, Aug 9, 2016 at 12:22 PM, Robert Haas  wrote:
> On Mon, Aug 8, 2016 at 7:33 PM, Tom Lane  wrote:
>> Thomas Munro  writes:
>>> Please find attached a patch to add a corresponding operation
>>> 'dsm_unpin_segment'.  This gives you a way to ask for the segment to
>>> survive only until you decide to unpin it, at which point the usual
>>> reference counting semantics apply again.  It decrements the reference
>>> count, undoing the effect of dsm_pin_segment and destroying the
>>> segment if appropriate.
>>
>> What happens if dsm_unpin_segment is called more times than
>> dsm_pin_segment?  Seems like you could try to destroy a segment that
>> still has processes attached.
>
> Calling dsm_pin_segment more than once is not supported and has never
> been supported.  As the comments explain:
>
>  * This function should not be called more than once per segment;
>  * on Windows, doing so will create unnecessary handles which will
>  * consume system resources to no benefit.
>
> Therefore, I don't see the problem.  You can pin a segment that is not
> pinned, and you can unpin a segment that is pinned.  You may not
> re-pin a segment that is already pinned, nor unpin a segment that is
> not pinned.  If you try to do so, you are using the API contrary to
> specification, and if it breaks (as it will) you get to keep both
> pieces.
>
> We could add the reference counting behavior for which you are asking,
> but that seems to be an entirely new feature for which I know of no
> demand.

Yeah, I was considering unbalanced pin/unpin requests to be a
programming error.  To be more defensive about that, how about I add a
boolean 'pinned' to dsm_control_item, and elog(ERROR, ...) if it's not
in the expected state when you try to pin or unpin?
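
In outline, the proposed guard could work like this (a Python model of the
logic only; the real check would live in dsm.c and report via elog(ERROR)):

```python
class DsmControlItem:
    """Toy model of a dsm_control_item with the proposed 'pinned' flag."""
    def __init__(self):
        self.refcnt = 1        # one reference for the creating backend
        self.pinned = False    # proposed boolean guard

def dsm_pin_segment(item):
    if item.pinned:
        # Defensive check: re-pinning an already-pinned segment is an error.
        raise RuntimeError("cannot pin a segment that is already pinned")
    item.pinned = True
    item.refcnt += 1           # pinning = one extra reference

def dsm_unpin_segment(item):
    if not item.pinned:
        # Defensive check: unpinning requires a prior, unmatched pin.
        raise RuntimeError("cannot unpin a segment that is not pinned")
    item.pinned = False
    item.refcnt -= 1           # undo the extra reference from the pin
```

The flag is effectively a one-bit pin counter: balanced pin/unpin pairs work
as before, while any unbalanced sequence fails loudly instead of silently
corrupting the reference count.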

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] dsm_unpin_segment

2016-08-08 Thread Robert Haas
On Mon, Aug 8, 2016 at 7:33 PM, Tom Lane  wrote:
> Thomas Munro  writes:
>> DSM segments have a concept of 'pinning'.  Normally, segments are
>> destroyed when they are no longer mapped by any backend, using a
>> reference counting scheme.  If you call dsm_pin_segment(segment), that
>> is disabled so that the segment won't be destroyed until the cluster
>> is shut down.  It works by incrementing the reference count an extra
>> time.
>
>> Please find attached a patch to add a corresponding operation
>> 'dsm_unpin_segment'.  This gives you a way to ask for the segment to
>> survive only until you decide to unpin it, at which point the usual
>> reference counting semantics apply again.  It decrements the reference
>> count, undoing the effect of dsm_pin_segment and destroying the
>> segment if appropriate.
>
> What happens if dsm_unpin_segment is called more times than
> dsm_pin_segment?  Seems like you could try to destroy a segment that
> still has processes attached.

Calling dsm_pin_segment more than once is not supported and has never
been supported.  As the comments explain:

 * This function should not be called more than once per segment;
 * on Windows, doing so will create unnecessary handles which will
 * consume system resources to no benefit.

Therefore, I don't see the problem.  You can pin a segment that is not
pinned, and you can unpin a segment that is pinned.  You may not
re-pin a segment that is already pinned, nor unpin a segment that is
not pinned.  If you try to do so, you are using the API contrary to
specification, and if it breaks (as it will) you get to keep both
pieces.

We could add the reference counting behavior for which you are asking,
but that seems to be an entirely new feature for which I know of no
demand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dsm_unpin_segment

2016-08-08 Thread Tom Lane
Thomas Munro  writes:
> DSM segments have a concept of 'pinning'.  Normally, segments are
> destroyed when they are no longer mapped by any backend, using a
> reference counting scheme.  If you call dsm_pin_segment(segment), that
> is disabled so that the segment won't be destroyed until the cluster
> is shut down.  It works by incrementing the reference count an extra
> time.

> Please find attached a patch to add a corresponding operation
> 'dsm_unpin_segment'.  This gives you a way to ask for the segment to
> survive only until you decide to unpin it, at which point the usual
> reference counting semantics apply again.  It decrements the reference
> count, undoing the effect of dsm_pin_segment and destroying the
> segment if appropriate.

What happens if dsm_unpin_segment is called more times than
dsm_pin_segment?  Seems like you could try to destroy a segment that
still has processes attached.

I don't object to the concept, but you need a less half-baked
implementation if you want to add this.  I'd suggest separate counters for
process attaches and pin requests, with code in dsm_unpin_segment to
disallow decrementing the pin request count below zero, and segment
destruction only when both counters go to zero.
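
A toy model of that two-counter scheme (Python, hypothetical names; the real
implementation would manipulate dsm_control_item under the DSM control lock):

```python
class SegmentCounters:
    """Separate counters for process attaches and pin requests; the
    segment is destroyed only when both reach zero, and the pin count
    is never allowed to go negative."""
    def __init__(self):
        self.attach_count = 0   # processes currently attached
        self.pin_count = 0      # outstanding pin requests
        self.destroyed = False

    def _maybe_destroy(self):
        if self.attach_count == 0 and self.pin_count == 0:
            self.destroyed = True

    def attach(self):
        self.attach_count += 1

    def detach(self):
        self.attach_count -= 1
        self._maybe_destroy()

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        if self.pin_count == 0:
            # Disallow decrementing the pin request count below zero.
            raise RuntimeError("unpin without a matching pin")
        self.pin_count -= 1
        self._maybe_destroy()
```

With this scheme, an extra unpin is caught immediately, and a pinned segment
survives even after every process has detached.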

regards, tom lane




Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tatsuo Ishii
>> On the other hand, usage of some well-defined statement name to trigger
>> the special case would be fine: all pgbouncer versions would pass those
>> parse/bind/exec message as if it were regular messages.
> 
> I do not accept this idea that retroactively defining special semantics
> for certain statement names is not a protocol break.  If it causes any
> change in what the server's response would be, then it is a protocol
> break.

+1. It definitely is a protocol break.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




[HACKERS] dsm_unpin_segment

2016-08-08 Thread Thomas Munro
Hi hackers,

DSM segments have a concept of 'pinning'.  Normally, segments are
destroyed when they are no longer mapped by any backend, using a
reference counting scheme.  If you call dsm_pin_segment(segment), that
is disabled so that the segment won't be destroyed until the cluster
is shut down.  It works by incrementing the reference count an extra
time.

Please find attached a patch to add a corresponding operation
'dsm_unpin_segment'.  This gives you a way to ask for the segment to
survive only until you decide to unpin it, at which point the usual
reference counting semantics apply again.  It decrements the reference
count, undoing the effect of dsm_pin_segment and destroying the
segment if appropriate.

I think this is very useful for any core or extension code that wants
to store data in dynamic shared memory that survives even when no
backends are running, without having to commit to keeping the segment
forever.  We have several projects in the 10.x pipeline which make use
of DSM segments, and we ran into the need for this finer grained
control of their lifetime.

Thanks to my colleague Amit Kapila for the Windows part, and also to
Robert Haas for internal feedback on an earlier version.  The Windows
implementation has an extra quirk:  Windows has its own reference
counting scheme that destroys mapped memory when there are no attached
processes, so pinning is implemented by sending a duplicate of the
handle into the Postmaster process to keep the segment alive.  This
patch needs to close that handle when unpinning.  Amazingly, that can
be done without any cooperation from the postmaster.

I'd be grateful for any feedback or thoughts, and will add this to the
commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


dsm-unpin-segment.patch
Description: Binary data



Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2016 at 11:47:11PM +0200, Ilya Kosmodemiansky wrote:
> On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian  wrote:
> > It seems asking users to run pg_test_timing before deploying to check
> > the overhead would be sufficient.
> 
> I'm not sure. Time measurement for waits is slightly more complicated
> than time measurement for EXPLAIN ANALYZE: a good workload plus
> using gettimeofday in a straightforward manner can cause huge
> overhead. That's why proper testing is important - if we see a
> significant performance drop with, for example, large
> shared_buffers at the same concurrency, that shows gettimeofday is
> too expensive to use. Am I correct that we do not have such accurate
> tests now?

Well, if we find that pg_test_timing is insufficient, we can perhaps add
a parallel test option to that utility.

> My other concern is that it is a bad idea to release a feature
> which allegedly has a huge performance impact, even if it is not turned
> on by default. I often meet people who do not use exceptions in
> plpgsql because of the tip "A block containing an EXCEPTION clause is
> significantly more expensive to enter ..." in the PostgreSQL documentation.

Well, if we document that it can be slow, it is up to the user to decide
if they want to use it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Ilya Kosmodemiansky
On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian  wrote:
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

I'm not sure. Time measurement for waits is slightly more complicated
than time measurement for EXPLAIN ANALYZE: a good workload plus
using gettimeofday in a straightforward manner can cause huge
overhead. That's why proper testing is important - if we see a
significant performance drop with, for example, large
shared_buffers at the same concurrency, that shows gettimeofday is
too expensive to use. Am I correct that we do not have such accurate
tests now?

My other concern is that it is a bad idea to release a feature
which allegedly has a huge performance impact, even if it is not turned
on by default. I often meet people who do not use exceptions in
plpgsql because of the tip "A block containing an EXCEPTION clause is
significantly more expensive to enter ..." in the PostgreSQL documentation.


-- 
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
i...@postgresql-consulting.com




Re: [HACKERS] Declarative partitioning

2016-08-08 Thread Robert Haas
On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
 wrote:
>> +1, if we could do it. It will need a change in the way Amit's patch stores
>> partitioning scheme in PartitionDesc.
>
> Okay, I will try to implement this in the next version of the patch.
>
> One thing that comes to mind is what if a user wants to apply hash
> operator class equality to a list partitioned key by specifying a hash
> operator class for the corresponding column.  In that case, we would not
> have the ordering procedure with a hash operator class, hence any
> ordering-based optimization becomes impossible to implement.  The current
> patch rejects a column for the partition key if its type does not have a btree
> operator class for both range and list methods, so this issue doesn't
> exist; however, it could be seen as a limitation.

Yes, I think you should expect careful scrutiny of that issue.  It
seems clear to me that range partitioning requires a btree opclass,
that hash partitioning requires a hash opclass, and that list
partitioning requires at least one of those things.  It would probably
be reasonable to pick one or the other and insist that list
partitioning always requires exactly that, but I can't see how it's
reasonable to insist that you must have both types of opclass for any
type of partitioning.

>> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b',
>> 'a')} and {('h', 'd','m'), ('e', 'i', 'f'), and ('l', 'b', 'a')}, which are
>> essentially the same, are represented in different ways. It makes matching
>> partitions for partition-wise join a bit tedious. We have to make sure that
>> the first array matches for both the joining relations and then make sure
>> that all the values belonging to the same partition for one table also
>> belong to the same partition in the other table. That requires some more
>> complex logic for matching subsets of lists for partition-wise join.
>
> So, we have 3 choices for the internal representation of list partitions:
>
> Choice 1 (the current approach):  Load them in the same order as they are
> found in the partition catalog:
>
> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>
> In this case, mismatch on the first list would make the two tables
> incompatibly partitioned, whereas they really aren't incompatible.

Such a limitation seems clearly unacceptable.  We absolutely must be
able to match up compatible partitioning schemes without getting
confused by little details like the order of the partitions.

> Choice 2: Representation with 2 arrays:
>
> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>
> It still doesn't help the case of pairwise joins because it's hard to tell
> which value belongs to which partition (the 2nd array carries the original
> partition numbers).  Although it might still work for tuple-routing.

It's very good for tuple routing.  It can also be used to match up
partitions for pairwise joins.  Compare the first arrays.  If they are
unequal, stop.  Else, compare the second arrays, incrementally
building a mapping between them and returning false if the mapping
turns out to be non-bijective.  For example, in this case, we look at
index 0 and decide that 3 -> 2.  We look at index 1 and decide 1 -> 3.
We look at index 2 and decide 2 -> 1.  We look at index 4 and find
that we already have a mapping for 2, but it's compatible because we
need 2 -> 1 and that's what is already there.  Similarly for the
remaining entries.  This is really a pretty easy loop to write and it
should run very quickly.
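That loop might look roughly like this in C (bounds and array contents are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PARTS 16            /* illustrative bound */

/* Given two tables partitioned over the same sorted value set, with
 * parallel arrays mapping each value to its partition number, decide
 * whether the schemes are compatible by incrementally building a
 * mapping and failing if it turns out to be non-bijective. */
static bool
partition_maps_match(const int *map1, const int *map2, int nvalues)
{
    int fwd[MAX_PARTS + 1];     /* map1 partition -> map2 partition */
    int rev[MAX_PARTS + 1];     /* map2 partition -> map1 partition */
    int i;

    memset(fwd, -1, sizeof(fwd));   /* 0xFF bytes == -1 for int */
    memset(rev, -1, sizeof(rev));

    for (i = 0; i < nvalues; i++)
    {
        int p1 = map1[i];
        int p2 = map2[i];

        assert(p1 >= 0 && p1 <= MAX_PARTS && p2 >= 0 && p2 <= MAX_PARTS);

        if (fwd[p1] == -1 && rev[p2] == -1)
        {
            fwd[p1] = p2;           /* record a new pairing */
            rev[p2] = p1;
        }
        else if (fwd[p1] != p2 || rev[p2] != p1)
            return false;           /* pairing conflict: not a bijection */
    }
    return true;
}
```

Running it on the second arrays from Choice 2 above ([3,1,2,2,3,1] vs [2,3,1,1,2,3]) builds the mapping 3->2, 1->3, 2->1 and returns true.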

> Choice 3: Order all lists' elements for each list individually and then
> order the lists themselves on their first values:
>
> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>
> This representation makes pairing partitions for pairwise joining
> convenient but for tuple-routing we still need to visit each partition in
> the worst case.

I think this is clearly not good enough for tuple routing.  If the
algorithm I proposed above turns out to be too slow for matching
partitions, then we could keep both this representation and the
previous one.  We are not limited to just one.  But I don't think
that's likely to be the case.

Also, note that all of this presupposes we're doing range
partitioning, or perhaps list partitioning with a btree opclass.  For
partitioning based on a hash opclass, you'd organize the data based on
the hash values rather than range comparisons.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Claudio Freire
On Mon, Aug 8, 2016 at 5:24 PM, Bruce Momjian  wrote:
> On Mon, Aug  8, 2016 at 03:36:12PM -0300, Claudio Freire wrote:
>> I think I prefer a more thorough approach.
>>
>> Increment/decrement may get very complicated with custom opclasses,
>> for instance. A column-change bitmap won't know how to handle
>> functional indexes, etc.
>
> How does HOT handle them?  If it does, it has to look at all the columns
> passed to the expression index, so it seems to be the same amount of
> work.

The way HOT handles it, IIRC, induces false positives (as in: HOT is
forbidden when it might be OK) because it may flag as changed
expressions that don't change. Think date_trunc('day', timestamp): if
the timestamp changes within a day, there's no change in reality, but
the column changed, so HOT is forbidden. But the logic won't create
false negatives (allow HOT when it's not allowed).

But for WARM that might be the case, because the guarantee that you
need is that no key already present in the WARM chain will be
re-added. An index over (a - b) could have both a and b in increasing
order yet repeat keys and violate the WARM chain invariant.

>> What I intend to try, is modify btree to allow efficient search of a
>> key-ctid pair, by adding the ctid to the sort order. Only inner pages
>> need to be changed, to include the ctid in the pointers, leaf pages
>> already have the ctid there, so they don't need any kind of change. A
>> bit in the metapage could indicate whether it's a new format or an old
>> one, and yes, only new indices will be able to use WARM.
>>
>> But with efficient key-ctid searches, you handle all cases, and not
>> just a few common ones.
>
> True.  I am worried about page spills caused by having to insert a row
> into an existing page and an index entry having to be pushed to an
> adjacent page to maintain ctid index order.

I don't think it would be a concern, any more than inserting into an index
on a serial column is.

Maybe some pages will split that wouldn't have, but the split will add
room to the new leaf pages, and some splits that would have happened
before won't happen afterwards.

Think of it as equivalent to adding the oid to the index - it's some
immutable attribute of the row being inserted, the fact that it is a
tid shouldn't make a difference to the performance of the btree, aside
from the extra comparisons perhaps.
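A sketch of such a (key, ctid) ordering, with a simplified stand-in for ItemPointerData (the real struct packs block and offset differently):

```c
/* Simplified stand-in for a heap tuple pointer. */
typedef struct
{
    unsigned block;     /* heap block number */
    unsigned offset;    /* line-pointer offset within the block */
} TidStub;

typedef struct
{
    int     key;        /* the indexed datum (an int for illustration) */
    TidStub tid;
} IndexEntry;

/* Three-way comparator: key first, heap TID as final tiebreaker.  With
 * the TID in the sort order, every entry is unique, so a (key, ctid)
 * pair can be located by ordinary binary search in inner pages. */
static int
entry_cmp(const IndexEntry *a, const IndexEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->tid.block != b->tid.block)
        return (a->tid.block < b->tid.block) ? -1 : 1;
    if (a->tid.offset != b->tid.offset)
        return (a->tid.offset < b->tid.offset) ? -1 : 1;
    return 0;
}
```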




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2016 at 03:36:12PM -0300, Claudio Freire wrote:
> I think I prefer a more thorough approach.
> 
> Increment/decrement may get very complicated with custom opclasses,
> for instance. A column-change bitmap won't know how to handle
> functional indexes, etc.

How does HOT handle them?  If it does, it has to look at all the columns
passed to the expression index, so it seems to be the same amount of
work.

> What I intend to try, is modify btree to allow efficient search of a
> key-ctid pair, by adding the ctid to the sort order. Only inner pages
> need to be changed, to include the ctid in the pointers, leaf pages
> already have the ctid there, so they don't need any kind of change. A
> bit in the metapage could indicate whether it's a new format or an old
> one, and yes, only new indices will be able to use WARM.
> 
> But with efficient key-ctid searches, you handle all cases, and not
> just a few common ones.

True.  I am worried about page spills caused by having to insert a row
into an existing page and an index entry having to be pushed to an
adjacent page to maintain ctid index order.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2016 at 11:22:49PM +0530, Pavan Deolasee wrote:
> What I am currently trying to do is to reuse at least the BlockNumber field
> in t_ctid. For HOT/WARM chains, that field is really unused (except in the
> last tuple, when a regular update needs to store the block number of the new
> block). My idea is to use one free bit in t_infomask2 to tell us that t_ctid
> is really not a CTID, but contains new information (for pg_upgrade's sake).
> For example, one bit in bi_hi can tell us that this is the last tuple in the
> chain (information today conveyed by t_ctid pointing to self). Another bit
> can tell us that this tuple was WARM updated. We will still have plenty of
> bits to store additional information about WARM chains.

Yes, that works too, though it probably only makes sense if we can make
use of more than one bit in bi_hi.

> My guess is we would need one bit to mark a WARM chain, and perhaps
> reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or
> decrement-only. 
> 
> 
> I am not convinced that the checking for increment/decrement adds a lot of
> value. Sure, we might be able to address some typical work load, but is that
> really a common use case? Instead, what I am looking at is storing a bitmap which

I have no idea, but it guarantees that the first WARM update works
because there is no direction set.  Then, if the direction changes, you
create a new chain and hope the changes stay in that direction for a
while.

> shows us which table columns have changed so far in the WARM chain. We only
> have limited bits, so we can track only limited columns. This will help the
> cases where different columns are updated, but not so much if the same column
> is updated repeatedly.

Well, I don't think in 15 bits we have enough space to store many column
numbers, let alone column numbers and _values_.  You would need four
bits (1-16) to exceed what you can store in a simple bitmap of the first
15 columns.  If you want to extend that range, you can use 8 bits (2^8)
to record one of the first 256 column numbers.  You could do 7 bits (2^7
= 128) and use another bit per column to record the increment/decrement
direction, meaning that repeated changes to the same column in the same
direction would be allowed in the same WARM chain.  I think it is more
likely that the same column is going to be changed in the same WARM
chain, than changes in different columns.
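A back-of-envelope sketch of that 7-bit-column-plus-direction-bit encoding (purely illustrative, not from any patch):

```c
#include <assert.h>
#include <stdint.h>

/* 7 bits for a column number (0..127) plus 1 bit for the change
 * direction, so two such entries fit in the 16-bit field discussed. */
#define COL_BITS       7
#define COL_MASK       ((1 << COL_BITS) - 1)
#define DIR_INCREMENT  (1 << COL_BITS)      /* high bit: 1 = increment */

static uint8_t
encode_col_dir(int colno, int increasing)
{
    assert(colno >= 0 && colno <= COL_MASK);
    return (uint8_t) (colno | (increasing ? DIR_INCREMENT : 0));
}

static int
decode_col(uint8_t v)
{
    return v & COL_MASK;
}

static int
decode_dir(uint8_t v)
{
    return (v & DIR_INCREMENT) != 0;
}
```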

Frankly, with only 16 bits, I can't see how recording specific columns
really buys us much because we have to limit the column number storage.
Plus, if a column changes twice, you need to create a new WARM chain
unless you record the increment/decrement direction.

What we could do is to record the first two changed columns in the
16-bit field, with direction, then record a bit for direction of all
columns not in the first two that change.  That allows you to record
three sets of directions in the same HOT chain.  It does not allow you
to change the direction of any column previously recorded in the WARM
chain.

You could say you are going to scan the WARM chain for changes, but that
limits pruning.  You could try storing just the changes for pruned rows,
but then you are going to have a lot of overhead scanning the WARM chain
looking for changes.

It would be interesting to store the change _direction_ for the first 15
columns in the bitmap, and then use the 16th bit for the rest of the
columns, but I can't figure out how to record which bits are set with a
direction and which are the default/unused.  You really need two bits
per column, so that only records the first seven or eight columns.

> What will help, and something I haven't yet applied any thoughts, is when we
> can turn WARM chains back to HOT by removing stale index entries.

I can't see how we can ever do that because we have multiple indexes
pointing to the chain, and keys that might be duplicated if we switched
to HOT.  Seems only VACUUM can fix that.

> Some heuristics and limits on amount of work done to detect duplicate index
> entries will help too.

Yeah, I have kind of given up on that.

> > We can't use the bits LP_REDIRECT lp_len because we need to create WARM
> > chains before pruning, and I don't think walking the pre-pruned chain is
> > worth it.  (As I understand HOT, LP_REDIRECT is only created during
> > pruning.)
> 
> That's correct. But lp_len provides us some place to stash information from
> heap tuples when they are pruned.

Right.  However, I see storing information at prune time as only useful
if you are willing to scan the chain, and frankly, I have given up on
chain scanning (with column comparisons) as being too expensive for 
its limited value.  What we could do is to use the LP_REDIRECT lp_len to
store information of two more changed column numbers, with directions. 
However, once you store one bit that records the direction of all other
columns, you can't go back and record the changes, unless you do a chain
scan at prune time.


Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Kevin Grittner
On Sun, Aug 7, 2016 at 11:53 PM, Alvaro Herrera
 wrote:

>> Building on the has-property approach Andrew suggested, I wonder if
>> we need something like pg_index_column_has_property(indexoid, colno,
>> propertyname) with properties like "sortable", "desc", "nulls first".
>
> This seems simple enough, on the surface.  Why not run with this design?

Andrew's patch, plus this, covers everything I can think of.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-08 Thread Peter Geoghegan
Over on the "Parallel tuplesort (for parallel B-Tree index creation)"
thread [1], there has been some discussion of merging vs.
partitioning. There is a concern about the fact the merge of the
tuplesort used to build a B-Tree is not itself parallelized. There is
a weak consensus that we'd be better off with partitioning rather than
merging anyway.

I've started this new thread to discuss where we want to be with
partitioning (and further parallel merging). In short: How can we
maximize the benefit of parallel tuplesort?

I've said or at least implied that I favor keeping the CREATE INDEX
merge serial, while pursuing partitioning (for other uses) at a later
date. Further parallelization is useful for parallel query much more
so than parallel CREATE INDEX.

To summarize, I favor not parallelizing merging because:

* The scalability of parallel CREATE INDEX as implemented is
apparently comparable to the scalability of similar features in other
systems [2] (systems with very mature implementations). ISTM that we
cannot really expect to do too much better than that.

* We already do some merging in parallel to get runs for the leader to
merge in serial, if and only if workers each produce more than one
initial run (not unlikely). That's still pretty effective, and ensures
that the final leader merge is cache efficient. I think that I'll
probably end up changing the patch to share maintenance_work_mem among
workers (not have it be per-worker), making it common for workers to
merge, while the leader merge ends up needing fairly few rounds of
preloading to merge the final output.

* Even if it helps a bit, parallel merging is not the future;
partitioning is (more on this later).

I don't think partitioning is urgent for CREATE INDEX, and may be
inappropriate for CREATE INDEX under any circumstances, because:

* Possible problems with parallel infrastructure and writes.

Can the parallel infrastructure even be made to write and WAL-log an
index in parallel, too? Partitioning more or less forces this, at
least if you expect any benefit at all -- reading the concatenated
output files in a serial pass where the index is written seems likely
to hurt more than it helps. This is not much of an advantage when
you're likely to be I/O bound anyway, and when the tuple startup cost
*per worker* is not of particular concern.

* Unbalanced B-Trees (or the risk thereof).

This is the really big one for me. We can only write leaf pages out in
parallel; we have to "merge together sub B-Trees" to produce internal
pages that make up a finished B-Tree. That risks creating a set of
internal pages that reflect a skew that any partition-to-sort approach
happened to have, due to whatever problem might emerge in the choice
of separators in tuplesort.c (this is a documented issue with SQL
Server Enterprise Edition, in fact). Can we tolerate the risk of
ending up with an unbalanced final B-Tree in the event of a
misestimation? Seems like that's quite a lot worse than poor query
performance (due to poor load balancing among parallel workers). DBAs
are more or less used to the latter variety of problems. They are not
used to the former, especially because the problem might go unnoticed
for a long time.

* What I've come up with is minimally divergent from the existing
approach to tuplesorting.

This is useful because the code complexity of having multiple
workers consuming final output (writing an index) in parallel looks to
be fairly high. Consider B-Tree related features that tuplesort must
care about today, which have further special considerations in a world
where this simple "consume all output at once in one process" model is
replaced for parallel CREATE INDEX. You'd get all kinds of subtle
problems at various boundaries in the final keyspace for things like
CREATE UNIQUE INDEX CONCURRENTLY. That seems like something that I'd
be quite happy to avoid worrying about (while still allowing parallel
CREATE UNIQUE INDEX CONCURRENTLY). Note that some other systems don't
allow the equivalent of parallel CREATE INDEX CONCURRENTLY at all,
presumably because of similar concerns. My patch supports CIC, and
does so almost automatically (although the second pass isn't performed
in parallel).

I bet that CREATE INDEX won't be the last case where that simplicity
and generality clearly matters.

Suggested partitioning algorithm


I think a hybrid partitioning + merging approach would work well for
us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
Probabilistic Splitting" [3] has influenced my thinking here (this was
written by prominent researchers from the influential UW-Madison
Wisconsin database group). Currently, I have in mind something that is
closer to what they call exact splitting than to what they call
probabilistic splitting, because I don't think it's going to be
generally possible to have good statistics on partition boundaries
immediately available (e.g., through something like their

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Vladimir Sitnikov
I'm sorry, we are discussing technical details with no real-life use case
to cover them.
I do not want to take up time for no reason. Please accept my sincere
apologies for not asking for the real-life case earlier.

Shay, can you come up with a real-life use case where this "I claim the
statement will be used only once" hint would indeed improve performance?
Or, to put it another way: "do you have a real-life case where the simple
protocol is faster than the extended protocol with statement reuse"?

I do have a couple of Java applications, and it turns out there's a huge win
from reusing server-prepared statements.
There's a problem that "the generic plan after the 5th execution might be much
worse than a specific one", however such statements are not frequent, and I
just put hints in the SQL (limit 0, +0, CTEs, that kind of thing).

Tom Lane :

> I do not accept this idea that retroactively defining special semantics
> for certain statement names is not a protocol break.


Sir, any new SQL keyword is what you would call "retroactively defining
special semantics".
It's obvious that very few current clients use named server-prepared
statements.
Statement names are not something provided by the end user in a web
page, so it is not rocket science to come up with a statement name that
is both short and "never ever used in the wild".

Tom Lane :

> If it causes any
> change in what the server's response would be, then it is a protocol
> break.
>

I see no changes except "the backend would report a protocol violation
when the special statement is used and the message sequence is wrong".


> > Note: it is quite easy to invent a name that is not yet used in the wild,
> > so it is safe.
>
> Sir, that is utter nonsense.



Tom Lane :

> And even if it were true, why is it that
> this way would safely pass through existing releases of pgbouncer when
> other ways would not?  Either pgbouncer needs to understand what it's
> passing through, or it doesn't.
>

Once again: existing pgbouncer versions know how to parse
Parse/Bind/Exec/Deallocate messages, so if we bless some well-defined
statement name with the semantics "it is forbidden to reuse that name
for multiple executions in a row", then that is completely transparent to
pgbouncer.  Pgbouncer would just think that "the application is dumb since
it reparses the same statement again and again", but it would work just
fine.

On the contrary, if a new message type is added, then pgbouncer would fail to
understand the new message.

Vladimir


Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Vladimir Sitnikov
Shay Rojansky :

> Ah, I understand the proposal better now - you're not proposing encoding a
> new message type in an old one, but rather a magic statement name in Parse
> which triggers an optimized processing path in PostgreSQL, that wouldn't go
> through the query cache etc.
>

Exactly.


> If so, isn't that what the empty statement is already supposed to do? I
> know there's already some optimizations in place around the scenario of
> empty statement name (and empty portal).
>

The problem with the "empty statement name" is that statements with an empty
name can be reused (for instance, for batch insert executions), so the server
side has to make a defensive copy (it cannot predict how many times this
unnamed statement will be used).

Shay Rojansky :

> Also, part of the point here is to reduce the number of protocol messages
> needed in order to send a parameterized query - not to have to do
> Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from
> that (although I'm not sure how much). Your proposal keeps the 5 messages.
>

As my benchmarks show, notable overhead is due to the "defensive copying of
the execution plan". So I would measure first, and only then claim where
the overhead is.

Some more profiling is required to tell which part is a main time consumer.
Technically speaking, I would prefer to have a more real-life looking
example instead of SELECT 1.
Do you have something in mind?
For instance, for more complex queries, "Parse/Plan" could cost much more
than we shave by adding "a special non-cached statement name" or by
reducing "5 messages into 1".

There's a side problem: the Describe message requires a full roundtrip, since
there are cases when the client needs to know how to encode parameters.
An additional roundtrip hurts much more than just an additional message that
is pipelined (e.g. sent in the same TCP packet).

Shay Rojansky :

> But people seem to be suggesting that a significant part of the overhead
> comes from the fact that there are 5 messages, meaning there's no way to
> optimize this without a new message type.
>

Of course 5 messages are slower than 1 message.
However, that does not mean "there's no way to optimize without a new
message type".
Profiling can easily reveal the time-consuming parts, and then we can decide
if there's a solution.
Note: if we improve "SELECT 1" by 10%, it does not mean we improved
statement execution by 10%. Real-life statements matter for proper
profiling/discussion.

 Shay Rojansky :

> Note: it is quite easy to invent a name that is not yet used in the wild,
>> so it is safe.
>>
>
> That's problematic, how do you know what's being used in the wild and what
> isn't? The protocol has a specification, it's very problematic to get up
> one day and to change it retroactively. But again, the empty statement
> seems to already be there for that.
>

The empty statement has different semantics, and it is widely used.
For instance, pgjdbc uses unnamed statements a lot.
On the other hand, a statement name like "!pq@#!@#42" is rather safe to use
as a special case.
Note: statement names are not typically created by humans (a statement name
is not SQL), and very few PG clients support named statements.

Vladimir


Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tom Lane
Shay Rojansky  writes:
> Ah, I understand the proposal better now - you're not proposing encoding a
> new message type in an old one, but rather a magic statement name in Parse
> which triggers an optimized processing path in PostgreSQL, that wouldn't go
> through the query cache etc.

> If so, isn't that what the empty statement is already supposed to do? I
> know there's already some optimizations in place around the scenario of
> empty statement name (and empty portal).

Right, but the problem is that per the current protocol spec, those
optimizations are not allowed to change any protocol-visible behavior.
We might be able to do a bit more optimization than we're doing now, but
we still have to be able to cache the statement in case it's executed more
than once, we still have to do planning in response to a separate 'bind'
message, etc.  AFAICT most of the added overhead comes from having to
allow for the possibility of re-execution: that forces at least one extra
copy of the parse tree to be made, as well as adding manipulations of the
plan cache.  We can't get out from under that without a protocol change.

> Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform
> close to Query, e.g. when specifying empty statement/portal, that's
> obviously the best thing here. But people seem to be suggesting that a
> significant part of the overhead comes from the fact that there are 5
> messages, meaning there's no way to optimize this without a new message
> type.

The rather crude measurements I've done put most of the blame on the
cacheing part, although the number of messages might matter more on
platforms with slow process-title-update support (I'm looking at you,
Windows).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Claudio Freire
On Mon, Aug 8, 2016 at 2:52 PM, Pavan Deolasee  wrote:
>>
>> My guess is we would need one bit to mark a WARM chain, and perhaps
>> reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or
>> decrement-only.
>
>
> I am not convinced that the checking for increment/decrement adds a lot of
> value. Sure, we might be able to address some typical work load, but is that
> really a common use case? Instead, what I am looking at storing a bitmap
> which shows us which table columns have changed so far in the WARM chain. We
> only have limited bits, so we can track only limited columns. This will help
> the cases where different columns are updated, but not so much if the same
> column is updated repeatedly.
>
> What will help, and something I haven't yet applied any thoughts, is when we
> can turn WARM chains back to HOT by removing stale index entries.
>
> Some heuristics and limits on amount of work done to detect duplicate index
> entries will help too.

I think I prefer a more thorough approach.

Increment/decrement may get very complicated with custom opclasses,
for instance. A column-change bitmap won't know how to handle
functional indexes, etc.

What I intend to try is modifying btree to allow an efficient search of a
key-ctid pair, by adding the ctid to the sort order. Only inner pages
need to be changed to include the ctid in the pointers; leaf pages
already have the ctid there, so they don't need any kind of change. A
bit in the metapage could indicate whether it's a new format or an old
one, and yes, only new indices will be able to use WARM.

But with efficient key-ctid searches, you handle all cases, and not
just a few common ones.

A lot of the work for the key-ctid search can be shared with the
update itself, so it's probably not such a big performance impact.
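A rough C sketch of the comparison Claudio describes — key first, ctid as a
tie-breaker. The types and names below are illustrative simplifications, not
PostgreSQL's actual ItemPointerData or btree comparator interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified tuple pointer; the real structure is
 * ItemPointerData (block number + offset number). */
typedef struct { uint32_t block; uint16_t offset; } Ctid;

static int ctid_cmp(Ctid a, Ctid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* Compare (key, ctid) pairs: key first, ctid as tiebreaker.  With the
 * ctid in the sort order, equal-key entries become totally ordered, so
 * one specific index entry can be located by descent/binary search
 * instead of scanning all duplicates of the key. */
static int key_ctid_cmp(int key_a, Ctid ctid_a, int key_b, Ctid ctid_b)
{
    if (key_a != key_b)
        return key_a < key_b ? -1 : 1;
    return ctid_cmp(ctid_a, ctid_b);
}
```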




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Tom Lane
I wrote:
> Having said all that, it is unfortunate that 9.6 is going to go out
> without any good solution to this need.  But as Robert already pointed
> out, trying to fix it now would force delaying 9.6rc1 by several weeks
> (and that's assuming that it doesn't take very long to get consensus
> on a solution).  There's not, AFAICT, desire on the part of the release
> team to do that.  We'd like to ship 9.6 on time for a change.

After some back-and-forth among the release team, there's been a decision
to change this position: today's wrap will be 9.6beta4, and we'll try to
get something done about the AM property access issue before issuing rc1.

We only have a week or two to get this done without starting to impact
v10 development as well as hurting our chances of shipping 9.6.0 in
September.  So let's start getting down to details.

regards, tom lane




Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Shay Rojansky
Vladimir wrote:


> On the other hand, usage of some well-defined statement name to trigger
>>> the special case would be fine: all pgbouncer versions would pass those
>>> parse/bind/exec message as if it were regular messages.
>>>
>>
>> Can you elaborate on what that means exactly? Are you proposing to
>> somehow piggyback on an existing message (e.g. Parse) but use some special
>> statement name that would make PostgreSQL interpret it as a different
>> message type?
>>
>
> Exactly.
> For instance: if client sends Parse(statementName=I_swear_
> the_statement_will_be_used_only_once), then the subsequent message must
> be BindMessage, and the subsequent must be ExecMessage for exactly the same
> statement id.
>

Ah, I understand the proposal better now - you're not proposing encoding a
new message type in an old one, but rather a magic statement name in Parse
which triggers an optimized processing path in PostgreSQL, that wouldn't go
through the query cache etc.

If so, isn't that what the empty statement is already supposed to do? I
know there's already some optimizations in place around the scenario of
empty statement name (and empty portal).

Also, part of the point here is to reduce the number of protocol messages
needed in order to send a parameterized query - not to have to do
Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from
that (although I'm not sure how much). Your proposal keeps the 5 messages.

Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform
close to Query, e.g. when specifying empty statement/portal, that's
obviously the best thing here. But people seem to be suggesting that a
significant part of the overhead comes from the fact that there are 5
messages, meaning there's no way to optimize this without a new message
type.

Note: it is quite easy to invent a name that is not yet used in the wild,
> so it is safe.
>

That's problematic: how do you know what's being used in the wild and what
isn't? The protocol has a specification; it's very problematic to get up
one day and change it retroactively. But again, the empty statement
seems to already be there for that.


Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tom Lane
Vladimir Sitnikov  writes:
> The point is "adding a message to current v3 protocol is not a backward
> compatible change".
> The problem with adding new message types is not only "client support", but
> deployment issues as well: new message would require simultaneous upgrade
> of both backend, client, and pgbouncer.

Right ...

> It could make sense to add some kind of transparent extensibility to the
> protocol, so clients can just ignore unknown message types.

I'm not really sure what use that particular behavior would be.  We
certainly want to try to have some incremental extensibility in there
when we do v4, but I think it would more likely take the form of a way
for client and server to agree on which extensions they support.

> On the other hand, usage of some well-defined statement name to trigger
> the special case would be fine: all pgbouncer versions would pass those
> parse/bind/exec message as if it were regular messages.

I do not accept this idea that retroactively defining special semantics
for certain statement names is not a protocol break.  If it causes any
change in what the server's response would be, then it is a protocol
break.

> Note: it is quite easy to invent a name that is not yet used in the wild,
> so it is safe.

Sir, that is utter nonsense.  And even if it were true, why is it that
this way would safely pass through existing releases of pgbouncer when
other ways would not?  Either pgbouncer needs to understand what it's
passing through, or it doesn't.

regards, tom lane




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Pavan Deolasee
On Mon, Aug 8, 2016 at 11:08 PM, Bruce Momjian  wrote:

> On Sun, Aug  7, 2016 at 12:55:01PM -0400, Bruce Momjian wrote:
> > On Sun, Aug  7, 2016 at 10:49:45AM -0400, Bruce Momjian wrote:
> > > OK, crazy idea time --- what if we only do WARM chain additions when
> all
> > > indexed values are increasing (with NULLs higher than all values)?  (If
> > > a key is always-increasing, it can't match a previous value in the
> > > chain.)  That avoids the problem of having to check the WARM chain,
> > > except for the previous tuple, and the problem of pruning removing
> > > changed rows.  It avoids having to check the index for matching
> key/ctid
> > > values, and it prevents CREATE INDEX from having to index WARM chain
> > > values.
> > >
> > > Any decreasing value would cause a normal tuple be created.
> >
> > Actually, when we add the first WARM tuple, we can mark the HOT/WARM
> > chain as either all-incrementing or all-decrementing.  We would need a
> > bit to indicate that.
>
> FYI, I see at least two available tuple header bits here, 0x0800 and
> 0x1000:
>
> /*
>  * information stored in t_infomask2:
>  */
> #define HEAP_NATTS_MASK 0x07FF  /* 11 bits for number of
> attributes */
> /* bits 0x1800 are available */
> #define HEAP_KEYS_UPDATED   0x2000  /* tuple was updated and
> key cols
>  * modified, or tuple
> deleted */
> #define HEAP_HOT_UPDATED0x4000  /* tuple was HOT-updated */
> #define HEAP_ONLY_TUPLE 0x8000  /* this is heap-only tuple
> */
>
> #define HEAP2_XACT_MASK 0xE000  /* visibility-related bits
> */
>

What I am currently trying to do is to reuse at least the BlockNumber field
in t_ctid. For HOT/WARM chains, that field is really unused (except in the
last tuple, when a regular update needs to store the block number of the new
block). My idea is to use one free bit in t_infomask2 to tell us that
t_ctid is really not a CTID, but contains new information (for pg_upgrade's
sake). For example, one bit in bi_hi can tell us that this is the last
tuple in the chain (information today conveyed by t_ctid pointing to itself).
Another bit can tell us that this tuple was WARM updated. We will still
have plenty of bits to store additional information about WARM chains.
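A toy illustration of the bit scheme Pavan sketches above. All flag names and
bit positions here are hypothetical; an actual patch would pick concrete free
bits in t_infomask2 and in the ip_blkid halves:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values only -- not real PostgreSQL definitions. */
#define HEAP_CTID_IS_FLAGS  0x0800   /* hypothetical t_infomask2 bit: t_ctid
                                      * carries flags, not a real CTID */
#define CHAIN_LAST_TUPLE    0x8000   /* hypothetical bi_hi bit */
#define CHAIN_WARM_UPDATED  0x4000   /* hypothetical bi_hi bit */

typedef struct
{
    uint16_t t_infomask2;
    uint16_t bi_hi;          /* high half of the block number in t_ctid */
} TupleHeaderSketch;

/* Mark a tuple as the WARM-updated end of its chain; this replaces the
 * old "t_ctid points to self" convention for chain termination. */
static void mark_warm_last(TupleHeaderSketch *tup)
{
    tup->t_infomask2 |= HEAP_CTID_IS_FLAGS;
    tup->bi_hi |= CHAIN_LAST_TUPLE | CHAIN_WARM_UPDATED;
}

static int is_chain_end(const TupleHeaderSketch *tup)
{
    return (tup->t_infomask2 & HEAP_CTID_IS_FLAGS) &&
           (tup->bi_hi & CHAIN_LAST_TUPLE);
}
```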


> My guess is we would need one bit to mark a WARM chain, and perhaps
> reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or
> decrement-only.


I am not convinced that checking for increment/decrement adds a lot of
value. Sure, we might be able to address some typical workload, but is
that really a common use case? Instead, what I am looking at is storing a
bitmap which shows us which table columns have changed so far in the WARM
chain. We only have limited bits, so we can track only limited columns.
This will help the cases where different columns are updated, but not so
much if the same column is updated repeatedly.

What will help, and something I haven't yet given any thought to, is
whether we can turn WARM chains back to HOT by removing stale index entries.

Some heuristics and limits on amount of work done to detect duplicate index
entries will help too.


>
> We can't use the bits LP_REDIRECT lp_len because we need to create WARM
> chains before pruning, and I don't think walking the pre-pruned chain is
> worth it.  (As I understand HOT, LP_REDIRECT is only created during
> pruning.)
>
>
That's correct. But lp_len provides us some place to stash information from
heap tuples when they are pruned.

Thanks,
Pavan

 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Vladimir Sitnikov
Shay Rojansky :

> That's definitely a valid point. But do you think it's a strong enough
> argument to avoid ever adding new messages?
>

The point is "adding a message to current v3 protocol is not a backward
compatible change".
The problem with adding new message types is not only "client support", but
deployment issues as well: a new message would require a simultaneous upgrade
of backend, client, and pgbouncer.

It could make sense to add some kind of transparent extensibility to the
protocol, so clients can just ignore unknown message types.

For instance, suppose a new message "ExtensibleMessage" is added: '#',
int16 (message id), int32 (message length), contents.
For the pgbouncer case above, pgbouncer could just proxy "unknown requests"
as is, and that would cover many of the cases out of the box.
The message has a well-defined length, so it is easy to skip.

If new "messages" are required, a new message id can be allocated and assigned
to the new message type. Old clients would be able to skip the message as
they know its length.

Of course, adding that '#' message does require support from pgbouncer,
etc, etc, however I'm sure it would simplify protocol evolution in the
future.

Of course, a need might arise for a message that cannot be ignored by the
client, but I think "even a bit of flexibility" is better than no
flexibility at all.

Technically speaking, there's a "NotificationResponse" message; however, it
is not a good fit since it cannot contain zero bytes in its contents.
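A minimal sketch of how a client or proxy could skip such a self-describing
frame. The layout (1-byte '#', int16 id, int32 length, contents) follows the
proposal above; byte-order handling is omitted for brevity and the function
name is made up:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Parse the hypothetical '#' frame.  A client or proxy that does not
 * recognize the message id can still skip the frame, because the length
 * field makes the frame self-describing.  Returns the message id, or -1
 * if the buffer does not hold a complete frame. */
static int skip_extensible(const uint8_t *buf, size_t buflen, size_t *consumed)
{
    uint16_t id;
    uint32_t len;

    if (buflen < 7 || buf[0] != '#')
        return -1;
    memcpy(&id, buf + 1, 2);
    memcpy(&len, buf + 3, 4);
    if (buflen < 7 + (size_t) len)
        return -1;                      /* incomplete frame */
    *consumed = 7 + (size_t) len;
    return id;  /* caller dispatches on known ids, ignores unknown ones */
}
```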

Shay Rojansky :

>
>> On the other hand, usage of some well-defined statement name to trigger
>> the special case would be fine: all pgbouncer versions would pass those
>> parse/bind/exec message as if it were regular messages.
>>
>
> Can you elaborate on what that means exactly? Are you proposing to somehow
> piggyback on an existing message (e.g. Parse) but use some special
> statement name that would make PostgreSQL interpret it as a different
> message type?
>

Exactly.
For instance: if the client sends
Parse(statementName=I_swear_the_statement_will_be_used_only_once), then the
subsequent message must be a Bind, and the one after that must be an
Execute for exactly the same statement id.

Then the backend could recognize the pattern and perform the optimization.
Note: it is quite easy to invent a name that is not yet used in the wild,
so it is safe.
From a security point of view (e.g. if a client wanted to exploit
use-after-free kinds of issues), the backend could detect deviations from
this parse-bind-exec sequence and just drop the mic.
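The parse-bind-exec enforcement Vladimir mentions could be a small state
machine on the backend side. This is a hypothetical sketch, not actual
PostgreSQL code; 'P', 'B', 'E' are the protocol's Parse, Bind and Execute
message tags, and the magic name is purely illustrative:

```c
#include <assert.h>
#include <string.h>

#define ONE_SHOT_NAME "I_swear_the_statement_will_be_used_only_once"

typedef enum { S_IDLE, S_PARSED, S_BOUND, S_ABORTED, S_DONE } OneShotState;

/* Advance the per-connection state: after a Parse with the magic name,
 * only Bind then Execute for the same statement are legal; any deviation
 * moves to S_ABORTED (where the backend would drop the connection). */
static OneShotState one_shot_step(OneShotState s, char msg, const char *stmt)
{
    int magic = (stmt != NULL && strcmp(stmt, ONE_SHOT_NAME) == 0);

    switch (s)
    {
        case S_IDLE:
            return (msg == 'P' && magic) ? S_PARSED : S_IDLE;
        case S_PARSED:
            return (msg == 'B' && magic) ? S_BOUND : S_ABORTED;
        case S_BOUND:
            return (msg == 'E' && magic) ? S_DONE : S_ABORTED;
        default:
            return s;   /* S_ABORTED and S_DONE are terminal */
    }
}
```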

Shay Rojansky :

> Apart from being a pretty horrible hack,
>

I would not call it a horrible hack. It is just a clever use of existing
bits, and it breaks neither backward nor forward compatibility.
Backward compatibility: new clients would be compatible with old PG
versions (even 8.4).
Forward compatibility: even if support for that special statement name
were dropped for some reason, there would be no application issues (it
would just result in a slight performance degradation).

Shay Rojansky :

> it would still break pgbouncer, which has to actually inspect and
> understand SQL being sent to the database (e.g. to know when transactions
> start and stop).
>

Note: I am not suggesting changing message formats. The message itself is
just fine, and existing pgbouncer versions can inspect the SQL. The
difference is a special statementName, and I see no reason for that kind
of change to break pgbouncer.

Vladimir


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Bruce Momjian
On Sun, Aug  7, 2016 at 12:55:01PM -0400, Bruce Momjian wrote:
> On Sun, Aug  7, 2016 at 10:49:45AM -0400, Bruce Momjian wrote:
> > OK, crazy idea time --- what if we only do WARM chain additions when all
> > indexed values are increasing (with NULLs higher than all values)?  (If
> > a key is always-increasing, it can't match a previous value in the
> > chain.)  That avoids the problem of having to check the WARM chain,
> > except for the previous tuple, and the problem of pruning removing
> > changed rows.  It avoids having to check the index for matching key/ctid
> > values, and it prevents CREATE INDEX from having to index WARM chain
> > values.
> > 
> > Any decreasing value would cause a normal tuple be created.
> 
> Actually, when we add the first WARM tuple, we can mark the HOT/WARM
> chain as either all-incrementing or all-decrementing.  We would need a
> bit to indicate that.

FYI, I see at least two available tuple header bits here, 0x0800 and
0x1000:

/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK 0x07FF  /* 11 bits for number of 
attributes */
/* bits 0x1800 are available */
#define HEAP_KEYS_UPDATED   0x2000  /* tuple was updated and key 
cols
 * modified, or tuple deleted */
#define HEAP_HOT_UPDATED0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000  /* this is heap-only tuple */

#define HEAP2_XACT_MASK 0xE000  /* visibility-related bits */

My guess is we would need one bit to mark a WARM chain, and perhaps
reuse obsolete pre-9.0 HEAP_MOVED_OFF to indicate increment-only or
decrement-only.  These flags have to be passed to all forward tuples in
the chain so an addition to the chain knows quickly if it is WARM chain,
and which direction.

We can't use the bits LP_REDIRECT lp_len because we need to create WARM
chains before pruning, and I don't think walking the pre-pruned chain is
worth it.  (As I understand HOT, LP_REDIRECT is only created during
pruning.)
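A sketch of how the two free bits could encode the proposal, assuming
(hypothetically) that 0x0800 marks a WARM chain and 0x1000 selects
increment-only versus decrement-only; the names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical assignments of the free t_infomask2 bits. */
#define HEAP_WARM_CHAIN     0x0800   /* tuple is part of a WARM chain */
#define HEAP_WARM_INCREMENT 0x1000   /* set: increment-only; clear: decrement-only */

/* A new chain member inherits the WARM flags from its predecessor, so
 * an addition to the chain knows quickly that it is a WARM chain and
 * which direction, without walking the chain. */
static uint16_t warm_propagate(uint16_t prev_infomask2)
{
    return prev_infomask2 & (HEAP_WARM_CHAIN | HEAP_WARM_INCREMENT);
}

/* Allow a WARM addition only if the new key keeps the chain monotonic;
 * otherwise a normal (non-WARM) tuple would be created. */
static int warm_addition_ok(uint16_t infomask2, int prev_key, int new_key)
{
    if (!(infomask2 & HEAP_WARM_CHAIN))
        return 1;                        /* not a WARM chain: no constraint */
    if (infomask2 & HEAP_WARM_INCREMENT)
        return new_key > prev_key;
    return new_key < prev_key;
}
```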

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-08-08 Thread Anastasia Lubennikova

06.08.2016 19:51, Andrew Borodin:

Anastasia, thank you for your attentive code examination.

2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova :

First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
Although, I'm quite sure that it was already aligned somewhere before.
I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
I'd rather add  the check: (offset+size_diff < pd_lower).

Actually, that's a tricky question. There may be different code expectations.
1. If we expect that a not-maxaligned tuple may be placed by any other
routine, we should remove the check (size_diff != MAXALIGN(size_diff)),
since it will fail for not-maxaligned tuples.
2. If we expect that no one should use PageIndexTupleOverwrite with
not-maxaligned tuples, then the current checks are OK: we will break
execution if we find non-maxaligned tuples, with an almost correct
message.
I suggest that this check could be a debug-only assertion: in production
code the routine will work with not-maxaligned tuples if they already
reside on the page, while in a dev build it will inform the developer that
something is going wrong. Is this behavior Postgres-style compliant?


I agree that pd_lower check makes sense.


In pointing out this comparison, I worried about a possible call of
MAXALIGN on a negative size_diff value, although I'm not sure whether it
can cause any problems.


Tuples on a page are always maxaligned.
But, as far as I recall, itemid->lp_len can contain a non-maxaligned value.

So, I'd suggest you to perform MAXALIGN(oldsize) before computing size_diff:
 size_diff = oldsize - alignednewsize;

Since both arguments are maxaligned the check of size_diff is not needed.
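For illustration, here is the suggested computation with both operands
maxaligned, using a mirror of MAXALIGN fixed at an 8-byte boundary (the real
macro depends on the platform's MAXIMUM_ALIGNOF):

```c
#include <assert.h>
#include <stddef.h>

/* Round a length up to the maximum alignment boundary, as PostgreSQL's
 * MAXALIGN does; 8 bytes is assumed here for illustration. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* With both operands maxaligned, size_diff is itself a multiple of the
 * alignment, so a (size_diff != MAXALIGN(size_diff)) check becomes
 * redundant.  Caller must ensure newsize <= oldsize after alignment. */
static size_t tuple_size_diff(size_t oldsize, size_t newsize)
{
    return MAXALIGN(oldsize) - MAXALIGN(newsize);
}
```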


BTW, I'm very surprised that it improves performance so much.
And also size is reduced significantly. 89MB against 289MB without patch.
Could you explain in details, why does it happen?

The size reduction is unexpected for me.
There are 4 plausible explanations. I'll list them in descending order of
probability:
1. Before this patch, every update threw the recently updated tuple to
the end of a page. Thus, in somehow-ordered data, the recent best fit
will be discovered last. This can increase MBB overlap in a spatial
index and make the tree slightly higher. But a 3x size decrease is
unlikely.
How did you obtain those results? Can I look at a test case?


Nothing special, I've just run the test you provided in this thread,
and checked the index size via select pg_size_pretty(pg_relation_size('idx'));


2. A bug in PageIndexTupleDelete causing unused space to emerge. I've
searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
bug. Maybe we are failing to place something large but not very important
on a page?
4. Magic.
I do not see what one could do to the R-tree to change its size by a
factor of 3. The first three explanations are not really better than the
fourth, actually.
Those 89 MB, they do not include WAL, right?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Jeff Janes
On Mon, Aug 8, 2016 at 10:03 AM, Bruce Momjian  wrote:
> On Mon, Aug  8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
>> >According to developers, overhead is small, but many people have doubts
>> > that it can be much more significant for intensive workloads. Obviously, it
>> > is not an easy task to test, because you need to put doubtfully
>> > non-production ready code into mission-critical production for such tests.
>> >As a result it will be clear if this design should be abandoned  and we
>> > need to think about less-invasive solutions or this design is acceptable.
>> >
>>
>> I think here main objection was that gettimeofday can cause
>> performance regression which can be taken care by using configurable
>> knob.  I am not aware if any other part of the design has been
>> discussed in detail to conclude whether it has any obvious problem.
>
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

They should also run it in parallel, as sometimes the real overhead is
in synchronization between multiple CPUs and doesn't show up when only
a single CPU is involved.

Cheers,

Jeff




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Tom Lane
Vik Fearing  writes:
> On 08/08/16 15:38, Tom Lane wrote:
>> Well, the next step is to fill in the blanks: what properties need to be
>> queryable?

> That's less urgent; adding missing properties does not require a
> catversion bump.

If the complaint is "I can do X before 9.6.0 and after 9.6.0, but not in
9.6.0", it doesn't really matter whether it would take a catversion bump
to resolve; the problem is lack of a time machine.  Once there's a
released 9.6.0 out there that is missing something important, applications
that care about version portability are going to have to deal with it.
So we need to get this right the first time.

regards, tom lane




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Vik Fearing
On 08/08/16 17:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed  wrote:
>> Thank you for inputs everyone.
>>
>> The opinions on this thread can be classified into following
>> 1. Commit
>> 2. Rollback
>> 3. Error
>> 4. Warning
>>
>> As per opinion upthread, issuing implicit commit immediately after switching
>> autocommit to ON, can be unsafe if it was not desired.  While I agree that
>> its difficult to judge users intention here, but if we were to base it on
>> some assumption, the closest would be implicit COMMIT in my opinion. There is
>> higher likelihood of a user being happy with issuing a commit when setting
>> autocommit ON than a transaction being rolled back.  Also there are quite
>> some interfaces which provide this.
>>
>> As mentioned upthread, issuing a warning on switching back to autocommit
>> will not be effective inside a script. It won't allow subsequent commands to
>> be committed as set autocommit to ON is not committed. Scripts will have to
>> be rerun with changes which will impact user friendliness.
>>
>> While I agree that issuing an ERROR and rolling back the transaction ranks
>> higher in safe behaviour, it is not as common (according to instances stated
>> upthread) as immediately committing any open transaction when switching back
>> to autocommit.
> 
> I think I like the option of having psql issue an error.  On the
> server side, the transaction would still be open, but the user would
> receive a psql error message and the autocommit setting would not be
> changed.  So the user could type COMMIT or ROLLBACK manually and then
> retry changing the value of the setting.

This is my preferred action.

> Alternatively, I also think it would be sensible to issue an immediate
> COMMIT when the autocommit setting is changed from off to on.  That
> was my first reaction.

I don't care for this very much.

> Aborting the server-side transaction - with or without notice -
> doesn't seem very reasonable.

Agreed.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
> >According to developers, overhead is small, but many people have doubts
> > that it can be much more significant for intensive workloads. Obviously, it
> > is not an easy task to test, because you need to put doubtfully
> > non-production ready code into mission-critical production for such tests.
> >As a result it will be clear if this design should be abandoned  and we
> > need to think about less-invasive solutions or this design is acceptable.
> >
> 
> I think here main objection was that gettimeofday can cause
> performance regression which can be taken care by using configurable
> knob.  I am not aware if any other part of the design has been
> discussed in detail to conclude whether it has any obvious problem.

It seems asking users to run pg_test_timing before deploying to check
the overhead would be sufficient.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com





Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Vik Fearing
On 08/08/16 15:38, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Tom Lane wrote:
>>> Building on the has-property approach Andrew suggested, I wonder if
>>> we need something like pg_index_column_has_property(indexoid, colno,
>>> propertyname) with properties like "sortable", "desc", "nulls first".
> 
>> This seems simple enough, on the surface.  Why not run with this design?
> 
> Well, the next step is to fill in the blanks: what properties need to be
> queryable?

That's less urgent; adding missing properties does not require a
catversion bump.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2016 at 06:34:46PM +0530, Amit Kapila wrote:
> I think here expensive part would be recheck for the cases where the
> index value is changed to a different value (value which doesn't exist
> in WARM chain).   You anyway have to add the new entry (key,TID) in
> index, but each time traversing the WARM chain would be an additional
> effort.  For cases where there are just two index entries and one of
> them is being updated, it might regress compared to what we do now.

Yes, I think the all-increment or all-decrement WARM chain is going to
be the right approach.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com





Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Robert Haas
On Mon, Aug 8, 2016 at 11:17 AM, David G. Johnston
 wrote:
> On Mon, Aug 8, 2016 at 11:02 AM, Robert Haas  wrote:
>>
>> On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed 
>> wrote:
>> > Thank you for inputs everyone.
>> >
>> > The opinions on this thread can be classified into following
>> > 1. Commit
>> > 2. Rollback
>> > 3. Error
>> > 4. Warning
>> >
>> > As per opinion upthread, issuing implicit commit immediately after
>> > switching
>> > autocommit to ON, can be unsafe if it was not desired.  While I agree
>> > that
>> > its difficult to judge users intention here, but if we were to base it
>> > on
>> > some assumption, the closest would be implicit COMMIT in my
>> > opinion. There is
>> > higher likelihood of a user being happy with issuing a commit when
>> > setting
>> > autocommit ON than a transaction being rolled back.  Also there are
>> > quite
>> > some interfaces which provide this.
>> >
>> > As mentioned upthread, issuing a warning on switching back to autocommit
>> > will not be effective inside a script. It won't allow subsequent
>> > commands to
>> > be committed as set autocommit to ON is not committed. Scripts will have
>> > to
>> > be rerun with changes which will impact user friendliness.
>> >
>> > While I agree that issuing an ERROR and rolling back the transaction
>> > ranks
>> > higher in safe behaviour, it is not as common (according to instances
>> > stated
>> > upthread) as immediately committing any open transaction when switching
>> > back
>> > to autocommit.
>>
>> I think I like the option of having psql issue an error.  On the
>> server side, the transaction would still be open, but the user would
>> receive a psql error message and the autocommit setting would not be
>> changed.  So the user could type COMMIT or ROLLBACK manually and then
>> retry changing the value of the setting.
>>
>> Alternatively, I also think it would be sensible to issue an immediate
>> COMMIT when the autocommit setting is changed from off to on.  That
>> was my first reaction.
>>
>> Aborting the server-side transaction - with or without notice -
>> doesn't seem very reasonable.
>>
>
> Best of both worlds?
>
> IF (ON_ERROR_STOP == 1)
> THEN (treat this like any other error)
> ELSE (send COMMIT; to server)

No, that's not a good idea.  The purpose of ON_ERROR_STOP is something
else entirely, and we shouldn't reuse it for an unrelated purpose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread David G. Johnston
On Mon, Aug 8, 2016 at 11:02 AM, Robert Haas  wrote:

> On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed 
> wrote:
> > Thank you for inputs everyone.
> >
> > The opinions on this thread can be classified into following
> > 1. Commit
> > 2. Rollback
> > 3. Error
> > 4. Warning
> >
> > As per opinion upthread, issuing implicit commit immediately after
> switching
> > autocommit to ON, can be unsafe if it was not desired.  While I agree
> that
> > its difficult to judge users intention here, but if we were to base it on
> > some assumption, the closest would be implicit COMMIT in my
> opinion.There is
> > higher likelihood of a user being happy with issuing a commit when
> setting
> > autocommit ON than a transaction being rolled back.  Also there are quite
> > some interfaces which provide this.
> >
> > As mentioned upthread, issuing a warning on switching back to autocommit
> > will not be effective inside a script. It won't allow subsequent
> commands to
> > be committed as set autocommit to ON is not committed. Scripts will have
> to
> > be rerun with changes which will impact user friendliness.
> >
> > While I agree that issuing an ERROR and rolling back the transaction
> ranks
> > higher in safe behaviour, it is not as common (according to instances
> stated
> > upthread) as immediately committing any open transaction when switching
> back
> > to autocommit.
>
> I think I like the option of having psql issue an error.  On the
> server side, the transaction would still be open, but the user would
> receive a psql error message and the autocommit setting would not be
> changed.  So the user could type COMMIT or ROLLBACK manually and then
> retry changing the value of the setting.
>
> Alternatively, I also think it would be sensible to issue an immediate
> COMMIT when the autocommit setting is changed from off to on.  That
> was my first reaction.
>
> Aborting the server-side transaction - with or without notice -
> doesn't seem very reasonable.
>
>
Best of both worlds?

IF (ON_ERROR_STOP == 1)
THEN (treat this like any other error)
ELSE (send COMMIT; to server)

David J.


Re: [HACKERS] Refactoring of heapam code.

2016-08-08 Thread Alvaro Herrera
Anastasia Lubennikova wrote:

> By the way, I thought about looking at the mentioned patch and
> probably reviewing it, but didn't find any of your patches on the
> current commitfest. Could you point out the thread?

Sorry, I haven't posted anything yet.

> >Agreed.  But changing its name while keeping it in heapam.c does not
> >really improve things enough.  I'd rather have it moved elsewhere that's
> >not specific to "heaps" (somewhere in access/common perhaps).  However,
> >renaming the functions would break third-party code for no good reason.
> >I propose that we only rename things if we also break it for other
> >reasons (say because the API changes in a fundamental way).
> 
> Yes, I agree that it should be moved to another file.
> Just to be more accurate: it's not in heapam.c now, it is in
> "src/backend/catalog/heap.c" which requires much more changes
> than I did.

Argh.  Clearly, the code organization in this area is not good at all.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Robert Haas
On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed  wrote:
> Thank you for inputs everyone.
>
> The opinions on this thread can be classified into following
> 1. Commit
> 2. Rollback
> 3. Error
> 4. Warning
>
> As per opinion upthread, issuing implicit commit immediately after switching
> autocommit to ON, can be unsafe if it was not desired.  While I agree that
> its difficult to judge users intention here, but if we were to base it on
> some assumption, the closest would be implicit COMMIT in my opinion.There is
> higher likelihood of a user being happy with issuing a commit when setting
> autocommit ON than a transaction being rolled back.  Also there are quite
> some interfaces which provide this.
>
> As mentioned upthread, issuing a warning on switching back to autocommit
> will not be effective inside a script. It won't allow subsequent commands to
> be committed as set autocommit to ON is not committed. Scripts will have to
> be rerun with changes which will impact user friendliness.
>
> While I agree that issuing an ERROR and rolling back the transaction ranks
> higher in safe behaviour, it is not as common (according to instances stated
> upthread) as immediately committing any open transaction when switching back
> to autocommit.

I think I like the option of having psql issue an error.  On the
server side, the transaction would still be open, but the user would
receive a psql error message and the autocommit setting would not be
changed.  So the user could type COMMIT or ROLLBACK manually and then
retry changing the value of the setting.

Alternatively, I also think it would be sensible to issue an immediate
COMMIT when the autocommit setting is changed from off to on.  That
was my first reaction.

Aborting the server-side transaction - with or without notice -
doesn't seem very reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Rahila Syed
Thank you for inputs everyone.

The opinions on this thread can be classified into the following:
1. Commit
2. Rollback
3. Error
4. Warning

As per the opinion upthread, issuing an implicit commit immediately after
switching autocommit to ON can be unsafe if it was not desired.  While I
agree that it's difficult to judge the user's intention here, if we were to
base it on some assumption, the closest would be an implicit COMMIT, in my
opinion. There is a higher likelihood of a user being happy with issuing a
commit when setting autocommit ON than with a transaction being rolled back.
Also, quite a few interfaces provide this.

As mentioned upthread, issuing a warning on switching back to autocommit
will not be effective inside a script. It won't allow subsequent commands
to be committed, because the switch of autocommit to ON is itself not
committed. Scripts will have to be modified and rerun, which will impact
user-friendliness.

While I agree that issuing an ERROR and rolling back the transaction ranks
higher in safe behaviour, it is not as common (according to the instances
cited upthread) as immediately committing any open transaction when
switching back to autocommit.
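To make the surprising behavior concrete, here is a sketch of a psql session (the table name is illustrative; the behavior is as described upthread):

```sql
\set AUTOCOMMIT off
INSERT INTO t VALUES (1);   -- implicitly opens a transaction block
\set AUTOCOMMIT on          -- the variable changes, but the block stays open
INSERT INTO t VALUES (2);   -- still runs inside the same open transaction
\q                          -- exiting here silently rolls back BOTH inserts
```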

Thank you,
Rahila Syed


On Sun, Aug 7, 2016 at 4:42 AM, Matt Kelly  wrote:

> Its worth noting that the JDBC's behavior when you switch back to
> autocommit is to immediately commit the open transaction.
>
> Personally, I think committing immediately or erroring are unsurprising
> behaviors.  The current behavior is surprising and obviously wrong.
> Rolling back without an error would be very surprising (no other database
> API I know of does that) and would take forever to debug issues around the
> behavior.  And committing after the next statement is definitely the most
> surprising behavior suggested.
>
> IMHO, I think committing immediately and erroring are both valid.  I think
> I'd prefer the error in principle, and per the law of bad code I'm sure,
> although no one has ever intended to use this behavior, there is probably
> some code out there that is relying on this behavior for "correctness".  I
> think a hard failure and making the dev add an explicit commit is least
> likely to cause people serious issues.  As for the other options, consider
> me opposed.
>
> - Matt K.
>


[HACKERS] extract text from XML

2016-08-08 Thread Chris Pacejo
Hi, I have found a basic use case which is supported by the xml2 module,
but is unsupported by the new XML API.

It is not possible to correctly extract text (either from text nodes or
attribute values) which contains the characters '<', '&', or '>'. 
xpath() (correctly) returns XML text nodes for queries targeting these
node types, and there is no inverse to xmlelement().  For example:

=> select (xpath('/a/text()', xmlelement(name a, '<&>')))[1]::text;
     xpath
----------------
 &lt;&amp;&gt;
(1 row)

Again, not a bug; but there is no way to specify my desired intent.  The
xml2 module does provide such a function, xpath_string:

=> select xpath_string(xmlelement(name a, '<&>')::text, '/a/text()');
 xpath_string 
--------------
 <&>
(1 row)

One workaround is to return the node's text value by serializing the XML
value, and textually replacing those three entities with the characters
they represent, but this relies on the xpath() function not generating
other entities.
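That workaround can be written directly in SQL. A sketch (assuming, as noted, that xpath() emits no entities beyond the three predefined ones):

```sql
-- Unescape the three predefined XML entities by hand.
-- '&amp;' must be handled last, otherwise text that originally contained
-- a literal '&lt;' (serialized as '&amp;lt;') would be unescaped twice.
SELECT replace(replace(replace(
         (xpath('/a/text()', xmlelement(name a, '<&>')))[1]::text,
         '&lt;',  '<'),
         '&gt;',  '>'),
         '&amp;', '&');    -- yields: <&>
```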

(My use case is importing data in XML format and processing it with
Postgres into a relational format.)

Perhaps a function xpath_value(text, xml) -> text[] would close the gap?
(I did search, and no such function seems to exist currently outside xml2.)

Thanks,
Chris




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Building on the has-property approach Andrew suggested, I wonder if
>> we need something like pg_index_column_has_property(indexoid, colno,
>> propertyname) with properties like "sortable", "desc", "nulls first".

> This seems simple enough, on the surface.  Why not run with this design?

Well, the next step is to fill in the blanks: what properties need to be
queryable?
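For concreteness, a call to the sketched function might look like this (the function does not exist yet; its name, signature, and the property names are exactly what is being designed here):

```sql
-- hypothetical usage of the proposed API; nothing below exists yet
SELECT pg_index_column_has_property('some_index'::regclass, 1, 'sortable'),
       pg_index_column_has_property('some_index'::regclass, 1, 'desc'),
       pg_index_column_has_property('some_index'::regclass, 1, 'nulls_first');
```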

regards, tom lane




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-08 Thread Amit Kapila
On Fri, Aug 5, 2016 at 9:57 AM, Pavan Deolasee  wrote:
>
>
> On Fri, Aug 5, 2016 at 8:23 AM, Claudio Freire 
> wrote:
>>
>> On Thu, Aug 4, 2016 at 11:15 PM, Bruce Momjian  wrote:
>>
>> >
>> > OK, that's a lot of text, and I am confused.  Please tell me the
>> > advantage of having an index point to a string of HOT chains, rather
>> > than a single one?  Is the advantage that the index points into the
>> > middle of the HOT chain rather than at the beginning?  I can see some
>> > value to that, but having more ctid HOT headers that can't be removed
>> > except by VACUUM seems bad, plus the need for index rechecks as you
>> > cross to the next HOT chain seems bad.
>> >
>> > The value of WARM is to avoid index bloat --- right now we traverse the
>> > HOT chain on a single page just fine with no complaints about speed so I
>> > don't see a value to optimizing that traversal, and I think the key
>> > rechecks and ctid bloat will make it all much slower.
>> >
>> > It also seems much more complicated.
>>
>> The point is avoiding duplicate rows in the output of index scans.
>>
>> I don't think you can avoid it simply by applying index condition
>> rechecks as the original proposal implies, in this case:
>>
>> CREATE TABLE t (id integer not null primary key, someid integer, dat
>> integer);
>> CREATE INDEX i1 ON t (someid);
>>
>> INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);
>> UPDATE t SET dat = dat + 1 where id = 1;
>> UPDATE t SET dat = dat + 1, id = 2 where id = 1;
>> UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
>> UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
>> SELECT * FROM t WHERE someid = 2;
>>
>> That, I believe, will cause the problematic chains where the condition
>> recheck passes both times the last visible tuple is visited during the
>> scan. It will return more than one tuple when in reality there is only
>> one.
>
>
> Hmm. That seems like a real problem. The only way to address that is to
> ensure that for a given WARM chain and given index key, there must exists
> only a single index entry. There can and will be multiple entries if the
> index key changes, but if the index key changes to the original value (or
> any other value already in the WARM chain) again, we must not add another
> index entry. Now that does not seem like a very common scenario and skipping
> a duplicate index entry does have its own benefit, so may be its not a
> terrible idea to try that. You're right, it may be expensive to search for
> an existing matching index key for the same root line pointer. But if we
> could somehow short-circuit the more common case, it might still be worth
> trying.
>

I think the expensive part here would be the recheck for the cases where
the index value is changed to a different value (one which doesn't exist
in the WARM chain).  You anyway have to add the new entry (key, TID) to
the index, but traversing the WARM chain each time would be additional
effort.  For cases where there are just two index entries and one of
them is being updated, it might regress compared to what we do now.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Dave Cramer
On 8 August 2016 at 03:49, Vladimir Sitnikov 
wrote:

>
>
> Tom Lane :
>
>> FWIW, this thread started on 25-July, less than two weeks ago.
>
>
> Technically speaking, there was a pgsql-jdbc thread started on May 14:
> https://www.postgresql.org/message-id/nh72v6%24582%241%40ger.gmane.org
>
> 9.6beta1 was released on May 12
>
> The fact that it wasn't raised
>> till more than 6 months after we committed the pg_am changes
>
>
> This means that nobody was testing compatibility of "postgresql's master
> branch with existing third-party clients".
> Testing against well-known clients makes sense to catch bugs early.
>
> I've added "build postgresql from master branch" test to the pgjdbc's
> regression suite a week ago, so I hope it would highlight issues early
> (even before the official postgresql beta is released).
>
> However, pgjdbc tests are executed only for pgjdbc commits, so if there's
> no pgjdbc changes, then there is no logic to trigger "try newer postgres
> with current pgjdbc".
>
> Ideally, postgresql's regression suite should validate well-known clients
> as well.
> I've no idea how long would it take to add something to postgresql's
> buildfarm, so I just went ahead and created Travis test configuration at
> https://github.com/vlsi/postgres
> Do you think those Travis changes can be merged to the upstream?
>
> I mean the following:
> 1) Activate TravisCI integration for https://github.com/postgres/postgres
>  mirror.
> 2) Add relevant Travis CI file so it checks postgresql's regression suite,
> and the other
> 3) Add build badge to the readme (link to travis ci build) for simplified
> navigation to test results.
>
> Here's the commit: https://github.com/vlsi/postgres/commit/
> 4841f8bc00b7c6717d91f51c98979ce84b4f7df3
> Here's how test results look like: https://travis-ci.org/vlsi/postgres
>
>
Nice work, +1!



Dave Cramer

da...@postgresintl.com
www.postgresintl.com


Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-08-08 Thread Dilip Kumar
On Mon, Aug 8, 2016 at 5:23 PM, Amit Kapila  wrote:

> >
> > Your other options and the option you choose are same.
> >
>

Sorry, typo: I meant ERRCODE_INVALID_OBJECT_DEFINITION.



> Did you consider to use ERRCODE_UNDEFINED_COLUMN with error messages
> like: "type %u does not exit" or "type id %u does not exit"? Errorcode
> ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure
> cases at certain places in code.


But I think OBJECT is more appropriate than COLUMN in this place.

However, I can change the error message to "type id %u does not exist" if
that seems better?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-08-08 Thread Amit Kapila
On Mon, Aug 8, 2016 at 5:11 PM, Amit Kapila  wrote:
> On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar  wrote:
>>
>> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas  wrote:
>>>
>>> I think that's just making life difficult.  If nothing else, sqlsmith
>>> hunts around for functions it can call that return internal errors,
>>> and if we refuse to fix all of them to return user-facing errors, then
>>> it's just crap for the people running sqlsmith to sift through and
>>> it's a judgment call whether to fix each particular case.  Even aside
>>> from that, I think it's much better to have a clear and unambiguous
>>> rule that elog is only for can't-happen things, not
>>> we-don't-recommend-it things.
>>
>>
>> I have changed for all these function to report more appropriate error with
>> ereport.
>>
>> I used ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors.
>> I think this is closest error code among all existing error codes, other
>> options can be (ERRCODE_WRONG_OBJECT_TYPE).
>>
>
> Your other options and the option you choose are same.
>

Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages
like "type %u does not exist" or "type id %u does not exist"?  The errcode
ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure
cases at certain places in the code.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-08-08 Thread Amit Kapila
On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar  wrote:
>
> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas  wrote:
>>
>> I think that's just making life difficult.  If nothing else, sqlsmith
>> hunts around for functions it can call that return internal errors,
>> and if we refuse to fix all of them to return user-facing errors, then
>> it's just crap for the people running sqlsmith to sift through and
>> it's a judgment call whether to fix each particular case.  Even aside
>> from that, I think it's much better to have a clear and unambiguous
>> rule that elog is only for can't-happen things, not
>> we-don't-recommend-it things.
>
>
> I have changed for all these function to report more appropriate error with
> ereport.
>
> I used ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors.
> I think this is closest error code among all existing error codes, other
> options can be (ERRCODE_WRONG_OBJECT_TYPE).
>

Your other options and the option you chose are the same.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-08-08 Thread Dilip Kumar
On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas  wrote:

> I think that's just making life difficult.  If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through and
> it's a judgment call whether to fix each particular case.  Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.
>

I have changed all these functions to report more appropriate errors with
ereport.

I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors.
I think this is the closest error code among all existing error codes; other
options could be (ERRCODE_WRONG_OBJECT_TYPE).

But I think ERRCODE_WRONG_OBJECT_TYPE is the better option.

Patch attached for the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


cache_lookup_failure_v1.patch
Description: Binary data



Re: [HACKERS] Wait events monitoring future development

2016-08-08 Thread Amit Kapila
On Sun, Aug 7, 2016 at 5:33 PM, Ilya Kosmodemiansky
 wrote:
> Hi,
>
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)
>
> If you attended, fill free to point me out if I missed something, I will put
> it on the wiki too.
>

Thanks for summarization.

> Wait event monitoring looks ones again stuck on the way through community
> approval in spite of huge progress done last year in that direction. The
> importance of the topic is beyond discussion now, if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and if the
> person does not get exited, probably you talk to a full-time PostgreSQL
> developer;-) Obviously it needs a better design, both the user interface and
> implementation, and perhaps this is why full-time developers are still
> sceptical.
>
> In order to move forward, imho we need at least some steps, whose steps can
> be done in parallel
>
> 1. Further requirements need to be collected from DBAs.
>
>If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres - you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing needed and in different
> environments.
>

I think it is better to first go with a knob whose default value will be
off.  We can do the performance testing as well, and if by the end of the
release cycle nobody has reported any visible regression, then we can
discuss changing the default to on.

>According to developers, overhead is small, but many people have doubts
> that it can be much more significant for intensive workloads. Obviously, it
> is not an easy task to test, because you need to put doubtfully
> non-production ready code into mission-critical production for such tests.
>As a result it will be clear if this design should be abandoned  and we
> need to think about less-invasive solutions or this design is acceptable.
>

I think the main objection here was that gettimeofday can cause a
performance regression, which can be taken care of by using a configurable
knob.  I am not aware that any other part of the design has been
discussed in enough detail to conclude whether it has any obvious problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Shay Rojansky
>
> We could call this "protocol 3.1" since it doesn't break backwards
>> compatibility (no incompatible server-initiated message changes, but it
>> does include a feature that won't be supported by servers which only
>> support 3.0. This could be a sort of "semantic versioning" for the protocol
>> - optional new client-initiated features are a minor version bump, others
>> are a major version bump...
>>
>
> Adding a new message is not backward compatible since it will fail in
> pgbouncer kind of deployments.
> Suppose there's "new backend", "old pgbouncer", "new client" deployment.
> If the client tries to send the new message, it will fail since pgbouncer
> would have no idea what to do with that new message.
>

That's definitely a valid point. But do you think it's a strong enough
argument to avoid ever adding new messages?


> On the other hand, usage of some well-defined statement name to trigger
> the special case would be fine: all pgbouncer versions would pass those
> parse/bind/exec message as if it were regular messages.
>

Can you elaborate on what that means exactly? Are you proposing to somehow
piggyback on an existing message (e.g. Parse) but use some special
statement name that would make PostgreSQL interpret it as a different
message type? Apart from being a pretty horrible hack, it would still break
pgbouncer, which has to actually inspect and understand SQL being sent to
the database (e.g. to know when transactions start and stop).


Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Victor Wagner
On Mon, 08 Aug 2016 18:28:57 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI  wrote:

> 
> I don't see charset compatibility to be easily detectable,

In the worst case we can hardcode an explicit compatibility table.
There is a limited set of languages which have translated error messages,
and a limited (albeit wide) set of encodings supported by PostgreSQL. So
it is possible to define a complete list of encodings compatible with
each translation, and fall back to untranslated messages if the client
encoding is not in that list.

> because locale (or character set) is not a matter of PostgreSQL
> (except for some encodings bound to one particular character
> set)... So the conversion-fallback might be a only available
> solution.

Conversion fallback may be a solution for data. For NLS messages, I think
it is better to fall back to English (untranslated) messages than to use
transliteration or something similar.

I think that for now we can assume that the best effort is already done
for the data, and think about how to improve the situation with messages.





Re: [HACKERS] Refactoring of heapam code.

2016-08-08 Thread Anastasia Lubennikova

08.08.2016 03:51, Michael Paquier:

On Sat, Aug 6, 2016 at 2:56 AM, Alvaro Herrera  wrote:

Anastasia Lubennikova wrote:

So there is a couple of patches. They do not cover all mentioned problems,
but I'd like to get a feedback before continuing.

I agree that we could improve things in this general area, but I do not
endorse any particular changes you propose in these patches; I haven't
reviewed your patches.

Well, I am interested in the topic, and I just had a look at the patches proposed.

+ /*
+ * Split PageGetItem into set of different macros
+ * in order to make code more readable.
+ */
+#define PageGetItemHeap(page, itemId) \
+( \
+   AssertMacro(PageIsValid(page)), \
+   AssertMacro(ItemIdHasStorage(itemId)), \
+   (HeapTupleHeader)(((char *)(page)) + ItemIdGetOffset(itemId)) \
+)
Placing macros that are specific to heaps and indexes into bufpage.h is
not such a good idea. I'd suggest having the heap ones in heapam.h, and
the index ones in a more generic place, such as src/access/common, as
already mentioned by Alvaro.

+   onpage_tup = PageGetItemHeapHeaderOnly(page, newitemid);
Just PageGetItemHeapHeader would be fine.

@@ -2324,7 +2087,6 @@ FreeBulkInsertState(BulkInsertState bistate)
 pfree(bistate);
  }

-
  /*
   * heap_insert - insert tuple into a heap
Those patches have some noise.

Patch 0002 is doing two things: reorganizing the order of the routines
in heapam.c and moving some routines from heapam.c to hio.c. Those two
could be separate patches.

If the final result is to be able to extend custom AMs for tables, we
should have a structure like src/access/common/index.c,
src/access/common/table.c and src/access/common/relation.c for
example, and have headers dedicated to them. But yes, there is
definitely a lot of room for refactoring as we are aiming, at least I
guess so, at having more various code paths for access methods.


Thank you for the review; I'll fix these problems in the final version.

In posting the first message I intended to raise a discussion; the patches
were attached mainly to illustrate the problem and to be more concrete.

Thank you for attention and feedback.
Since there are no objections to the idea in general, I'll write an
exhaustive README about the current state of the code and then propose
an API design to discuss the details.

Stay tuned.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Victor Wagner
On Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI  wrote:

> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI  wrote in
> <20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
> 
> Somewhat wrong. The core problem is the procedures offered by
> PrepareClientEncoding is choosed only by encoding->encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.

Yes, my idea was to check language/encoding compatibility: make sure
that NLS messages can be represented in the client-specified encoding
in a readable way. As far as I know, there is no platform-independent,
bulletproof way to do so.

On Unix you can try to initialize a locale with the given language and the
given encoding, but it can fail even if the encoding is compatible with the
language, simply because the corresponding locale has not been generated on
this system.

But this seems to be a system administration problem and can be left
to local sysadmins.

Once you have correctly initialized LC_MESSAGES, you don't need
encoding conversion routines for the NLS messages. You can use the
bind_textdomain_codeset function to provide messages in the
client-desired encoding. (But this can cause problems with server logs,
where messages from different sessions would come in different
encodings.)

On Windows things are more complicated. There is just one ANSI code
page associated with a given language, and locale initialization will
fail with any other codepage, including UTF-8.



 
> regards,
> 





Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Kyotaro HORIGUCHI
At Mon, 08 Aug 2016 18:11:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160808.181154.252052789.horiguchi.kyot...@lab.ntt.co.jp>
> At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
> > Looking at check_client_encoding(), the comment says as following.
> > 
> > | * If we are not within a transaction then PrepareClientEncoding will not
> > | * be able to look up the necessary conversion procs.  If we are still
> > | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> > | * will fix things once initialization is far enough along.  After
> > 
> > We shold overcome this to realize startup-time check for
> > conversion procs.
> 
> Somewhat wrong. The core problem is the procedures offered by
> PrepareClientEncoding is choosed only by encoding->encoding
> basis, not counting character set compatibility. So, currently
> this is not detectable before actually doing conversion of a
> character stream.
> 
> Conversely, providing a means to check character-set
> compatibility will naturally fixes this. Check at session-startup
> (out-of-transaction check?) is still another problem.

I don't see charset compatibility as easily detectable,
because locale (or character set) is not a matter for PostgreSQL
(except for some encodings bound to one particular character
set)... So the conversion fallback might be the only available
solution.

Thougts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Refactoring of heapam code.

2016-08-08 Thread Anastasia Lubennikova

05.08.2016 20:56, Alvaro Herrera:

Anastasia Lubennikova wrote:

Working on page compression and some other issues related to
access methods, I found out that the code related to heap
looks too complicated, much more complicated than it should be.
Since I got into this area anyway, I want to suggest a set of improvements.

Hm.  I'm hacking on heapam.c pretty heavily ATM.  Your first patch
causes no problem I think, or if it does it's probably pretty minor, so
that's okay.  I'm unsure about the second one -- I think it's okay too,
because it mostly seems to be about moving stuff from heapam.c to hio.c
and shuffling some code around that I don't think I'm modifying.


I'm sure these patches will not cause any problems. The second one is
mostly about moving unrelated stuff from heapam.c to hio.c and renaming
a couple of functions in heap.c.
Anyway, they are not a final solution, just a proof of concept.

By the way, I thought about looking at the mentioned patch and
probably reviewing it, but didn't find any of your patches in the
current commitfest. Could you point me to the thread?


Also I think that it's quite strange to have a function heap_create() that
is called from index_create(). It has absolutely nothing to do with the
heap access method.

Agreed.  But changing its name while keeping it in heapam.c does not
really improve things enough.  I'd rather have it moved elsewhere that's
not specific to "heaps" (somewhere in access/common perhaps).  However,
renaming the functions would break third-party code for no good reason.
I propose that we only rename things if we also break it for other
reasons (say because the API changes in a fundamental way).


Yes, I agree that it should be moved to another file.
Just to be more accurate: it's not in heapam.c now, it is in
src/backend/catalog/heap.c, which would require many more changes
than I made.

Concerning the third-party code: it's a good observation and we
definitely should keep it in mind. Nevertheless, I doubt that there are
many external calls to these functions. And I hope this thread will end
up with fundamental changes introducing a clear API for all the
mentioned problems.




One more thing that requires refactoring is "RELKIND_RELATION" name.
We have a type of relation called "relation"...

I don't see us fixing this bit, really.  Historical baggage and all
that.  For example, a lot of SQL queries use "WHERE relkind = 'r'".  If
we change the name, we should probably also change the relkind constant,
and that would break the queries.  If we change the name and do not
change the constant, then we have to have a comment "we call them
RELKIND_TABLE but the char is r because it was called RELATION
previously", which isn't any better.


Good point.
I'd rather change it to RELKIND_BASIC_RELATION or something like that,
which is more specific and still goes well with the 'r' constant. But
minor changes like that can wait until we clarify the API in general.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Victor Wagner
On Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI  wrote:
> 
> I'm not sure what messages may be raised before authentication,
> but this could be a more generic solution (adding a check during
> the session).

Definitely, there can be an authentication error message, which is sent if
authentication didn't happen. Also, as far as I understand, the message
"Database ... doesn't exist" is also sent before authentication.


Also, there are CancelRequests, where normal authentication is not
used, and a server key provided in another session is used instead.







Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Kyotaro HORIGUCHI
At Mon, 08 Aug 2016 17:18:21 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160808.171821.100221089.horiguchi.kyot...@lab.ntt.co.jp>
> Looking at check_client_encoding(), the comment says as following.
> 
> | * If we are not within a transaction then PrepareClientEncoding will not
> | * be able to look up the necessary conversion procs.  If we are still
> | * starting up, it will return "OK" anyway, and InitializeClientEncoding
> | * will fix things once initialization is far enough along.  After
> 
> We should overcome this to realize a startup-time check for
> conversion procs.

Somewhat wrong. The core problem is that the procedures offered by
PrepareClientEncoding are chosen on an encoding-to-encoding basis
only, without considering character set compatibility. So, currently
this is not detectable before actually doing the conversion of a
character stream.

Conversely, providing a means to check character-set
compatibility would naturally fix this. A check at session startup
(an out-of-transaction check?) is still another problem.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-08 Thread Anastasia Lubennikova

05.08.2016 19:41, Robert Haas:


2. This inserts additional code in a bunch of really low-level places
like heap_hot_search_buffer, heap_update, heap_delete, etc.  I think
what you've basically done here is create a new, in-memory heap AM and
then, because we don't have an API for adding new storage managers,
you've bolted it onto the existing heapam layer.  That's certainly a
reasonable approach for creating a PoC, but I think we actually need a
real API here.  Otherwise, when the next person comes along and wants
to add a third heap implementation, they've got to modify all of these
same places again.  I don't think this code is reasonably maintainable
in this form.


As I can see, you recommend cleaning up the API of the storage
management code. I strongly agree that it's high time to do it.

So, I started the discussion about refactoring and improving API
of heapam and heap relations.
You can find it on commitfest:
https://commitfest.postgresql.org/10/700/

I'll be glad to see your thoughts on the thread.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] per-statement-level INSTEAD OF triggers

2016-08-08 Thread Yugo Nagata
Hi hackers,

I'm asking out of curiosity: does anyone know why we don't have
per-statement-level INSTEAD OF triggers? I looked into the
standard SQL (20xx draft), but I can't find a restriction
such that INSTEAD OF triggers must be row-level. Are there
any technical difficulties, or other reasons for the current
implementation?

Best regards,

-- 
Yugo Nagata 




Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Kyotaro HORIGUCHI
At Mon, 8 Aug 2016 10:19:10 +0300, Victor Wagner  wrote in 
<20160808101910.49bee...@fafnir.local.vm>
> On Fri, 5 Aug 2016 11:21:44 -0400
> Peter Eisentraut  wrote:
> 
> > On 8/4/16 2:45 AM, Victor Wagner wrote:
> > > 4. At session startup, try to reinitialize the LC_MESSAGES locale
> > > category with the combination of the server (or better, client-sent)
> > > language and region and the client-supplied encoding, and if this
> > > fails, use untranslated error messages. Obviously, an attempt to set
> > > the locale to ru_RU.ISO8859-1 would fail. So, if a client with a
> > > ru_RU.UTF-8 default locale asked the server to use LATIN1 encoding,
> > > the server would fall back to untranslated messages.
> 
> > I think this is basically my solution (1), with the same problems.
> 
> I think there is a big difference from the server's point of view.
> 
> You propose that both the translated and the untranslated message be
> passed around inside the backend. That has some benefits, but requires
> considerable reworking of server internals.

Agreed.

> My solution doesn't require keeping both the original message and the
> translated one during all of the call-stack unwinding. It just checks
> whether the combination of language and encoding is supported by the NLS
> subsystem, and if not, falls back to untranslated messages for the entire
> session.

Looking at check_client_encoding(), the comment says as following.

| * If we are not within a transaction then PrepareClientEncoding will not
| * be able to look up the necessary conversion procs.  If we are still
| * starting up, it will return "OK" anyway, and InitializeClientEncoding
| * will fix things once initialization is far enough along.  After

We should overcome this to realize a startup-time check for
conversion procs.

> It is a much more local change, and is comparable in complexity with the
> one proposed by Tom Lane.

I'm not sure what messages may be raised before authentication,
but this could be a more generic solution (adding a check during
the session).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-08 Thread Vladimir Sitnikov
Tom Lane :

> FWIW, this thread started on 25-July, less than two weeks ago.


Technically speaking, there was a pgsql-jdbc thread started on May 14:
https://www.postgresql.org/message-id/nh72v6%24582%241%40ger.gmane.org

9.6beta1 was released on May 12

> The fact that it wasn't raised till more than 6 months after we committed
> the pg_am changes


This means that nobody was testing the compatibility of postgresql's
master branch with existing third-party clients.
Testing against well-known clients makes sense to catch bugs early.

I added a "build postgresql from master branch" test to pgjdbc's
regression suite a week ago, so I hope it will highlight issues early
(even before an official postgresql beta is released).

However, pgjdbc tests are executed only for pgjdbc commits, so if there
are no pgjdbc changes, nothing triggers a "try newer postgres with
current pgjdbc" run.

Ideally, postgresql's regression suite should validate well-known clients
as well.
I have no idea how long it would take to add something to postgresql's
buildfarm, so I just went ahead and created a Travis test configuration at
https://github.com/vlsi/postgres
Do you think those Travis changes can be merged upstream?

I mean the following:
1) Activate Travis CI integration for the https://github.com/postgres/postgres
mirror.
2) Add the relevant Travis CI file so it checks postgresql's regression suite
and the other tests.
3) Add a build badge (a link to the Travis CI build) to the readme for
simplified navigation to the test results.

Here's the commit:
https://github.com/vlsi/postgres/commit/4841f8bc00b7c6717d91f51c98979ce84b4f7df3
Here's what the test results look like: https://travis-ci.org/vlsi/postgres

Vladimir


Re: [HACKERS] Possible duplicate release of buffer lock.

2016-08-08 Thread Kyotaro HORIGUCHI
Hello,

At Sat, 06 Aug 2016 12:45:32 -0400, Tom Lane  wrote in 
<16748.1470501...@sss.pgh.pa.us>
> Amit Kapila  writes:
> > Isn't the problem here that, due to some reason (like a concurrent
> > split), the code in question (the loop --
> > while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases
> > the lock on the target buffer, and the caller again tries to release the
> > lock on the target buffer when false is returned?
> 
> Yeah.  I do not think there is anything wrong with the loop-to-find-
> current-leftsib logic.  The problem is lack of care about what
> resources are held on failure return.  The sole caller thinks it
> should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on
> what _bt_unlink_halfdead_page calls the leafbuf.  But that function
> already dropped the lock (line 1582 in 9.4), and won't have reacquired
> it unless target != leafblkno.  Another problem is that if the target
> is different from leafblkno, we'll hold a pin on the target page, which
> is leaked entirely in this code path.
> 
> The caller knows nothing of the "target" block so it can't reasonably
> deal with all these cases.  I'm inclined to change the function's API
> spec to say that on failure return, it's responsible for dropping
> lock and pin on the passed buffer. We could alternatively reacquire
> lock before returning, but that seems pointless.

Agreed.

> Another thing that looks fishy is at line 1611:
> 
>   leftsib = opaque->btpo_prev;
> 
> At this point we already released lock on leafbuf so it seems pretty
> unsafe to fetch its left-link.  And it's unnecessary cause we already
> did; the line should be just
> 
>   leftsib = leafleftsib;

Right. And I found that it has already been committed. Thanks.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Victor Wagner
On Fri, 5 Aug 2016 11:21:44 -0400
Peter Eisentraut  wrote:

> On 8/4/16 2:45 AM, Victor Wagner wrote:
> > 4. At session startup, try to reinitialize the LC_MESSAGES locale
> > category with the combination of the server (or better, client-sent)
> > language and region and the client-supplied encoding, and if this
> > fails, use untranslated error messages. Obviously, an attempt to set
> > the locale to ru_RU.ISO8859-1 would fail. So, if a client with a
> > ru_RU.UTF-8 default locale asked the server to use LATIN1 encoding,
> > the server would fall back to untranslated messages.
> 
> I think this is basically my solution (1), with the same problems.
> 



I think there is a big difference from the server's point of view.

You propose that both the translated and the untranslated message be
passed around inside the backend. That has some benefits, but requires
considerable reworking of server internals.

My solution doesn't require keeping both the original message and the
translated one during all of the call-stack unwinding. It just checks
whether the combination of language and encoding is supported by the NLS
subsystem, and if not, falls back to untranslated messages for the entire
session.

It is a much more local change, and is comparable in complexity with the
one proposed by Tom Lane.
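
As a rough illustration of this session-startup check (a Python sketch
with made-up names, not the actual NLS subsystem): probe once whether
sample translated strings survive the client encoding, and disable
translation for the whole session if they do not.

```python
def nls_usable(sample_translations, client_encoding):
    """Return True when every sample translated string can be represented
    in the client encoding, i.e. the language/encoding combination works."""
    try:
        for s in sample_translations:
            s.encode(client_encoding)
    except (UnicodeEncodeError, LookupError):
        # Incompatible or unknown encoding: disable NLS for this session.
        return False
    return True

# Decided once at session startup; if False, every message in the session
# uses the untranslated original.
session_use_nls = nls_usable(["ошибка синтаксиса"], "latin-1")
print(session_use_nls)  # False
```

The check is local to session startup, which is exactly what makes it a
smaller change than carrying both message variants through the call stack.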







Re: [HACKERS] handling unconvertible error messages

2016-08-08 Thread Victor Wagner
On Fri, 5 Aug 2016 11:23:37 -0400
Peter Eisentraut  wrote:

> On 8/4/16 9:42 AM, Tom Lane wrote:
> > I'm inclined to think that we should reset the message locale
> > to C as soon as we've forked away from the postmaster, and leave it
> > that way until we've absorbed settings from the startup packet.
> > Sending messages of this sort in English isn't great, but it's
> > better than sending completely-unreadable ones.  Or is that just my
> > English-centricity showing?  
> 
> Well, most of the time this all works, only if there are different
> client and server settings you might have problems.  We wouldn't want
> to partially disable the NLS feature for the normal case.
> 
There are cases where the client cannot tell the server which encoding it
wants to use, and the server cannot tell which encoding it uses, but can
still send error messages. For example, CancelRequest.

The only way to ensure that the message is readable in this case is to
fall back to some encoding definitely known by both client and server.
For now, that is US-ASCII.

That is, as far as I understand, what Tom is proposing:
fall back to the untranslated message at the beginning of the session, and
return to NLS only once an encoding has been successfully negotiated
between client and server.

Maybe there is another solution: prepare the client to accept UTF-8
messages from the server regardless of encoding, i.e. if a message starts
with a BOM marker (the 0xFEFF Unicode char, the EF BB BF byte sequence in
UTF-8), interpret it as UTF-8. It would require the client to support some
kind of encoding conversion, and in some 8-bit environments it would pose
problems with displaying these messages.
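
A minimal sketch of that client-side idea (illustrative Python, not libpq
code; the function name is made up): treat a leading EF BB BF as a signal
to decode the message as UTF-8, and otherwise use the negotiated session
encoding.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def decode_server_message(raw: bytes, session_encoding: str = "latin-1") -> str:
    """Decode as UTF-8 when the message carries the EF BB BF marker,
    otherwise fall back to the negotiated session encoding."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):].decode("utf-8")
    return raw.decode(session_encoding)

print(decode_server_message(UTF8_BOM + "ошибка".encode("utf-8")))  # ошибка
print(decode_server_message(b"syntax error"))                      # syntax error
```

This is where the 8-bit-environment caveat bites: a client without any
conversion support can detect the marker but still cannot display the
decoded text.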

-- 






Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Vladimir Sitnikov
Shay Rojansky :

>
> That sounds right to me. As you say, the server version is sent early in
> the startup phase, before any queries are sent to the backend, so frontends
> know which server they're communicating with.
>
> We could call this "protocol 3.1", since it doesn't break backwards
> compatibility (no incompatible server-initiated message changes), but it
> does include a feature that won't be supported by servers which only
> support 3.0. This could be a sort of "semantic versioning" for the
> protocol: optional new client-initiated features are a minor version
> bump, others are a major version bump...
>

Adding a new message is not backward compatible, since it will fail in
pgbouncer-style deployments.
Suppose there's a "new backend", "old pgbouncer", "new client" deployment.
If the client tries to send the new message, it will fail, since pgbouncer
would have no idea what to do with that new message.

On the other hand, usage of some well-defined statement name to trigger
the special case would be fine: all pgbouncer versions would pass those
parse/bind/exec messages through as if they were regular messages.

Vladimir

>


Re: [HACKERS] Declarative partitioning

2016-08-08 Thread Ashutosh Bapat
> IIUC, this seems like a combination of 2 and 3 above:
>
> So, we have the following list partitions (as read from the catalog)
>
> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>
> By applying the method of 3:
>
> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>
> Then applying 2:
>
> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2]
> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1]
>

We should arrange the OID arrays to follow the canonical order. Assuming
that the OIDs of p1, p2, and p3 of table 1 are t1o1, t1o2 and t1o3
respectively, and those of p1, p2 and p3 of table 2 are t2o1, t2o2 and
t2o3 respectively, the last arrays should be [t1o3, t1o2, t1o1] and
[t2o2, t2o1, t2o3]. Thus the last arrays from both representations give
the OIDs of the children that should be joined pair-wise. IOW, the OID
array should just follow the canonical order instead of the specification
order. AFAIU, your patch arranges the range partition OIDs in canonical
order, not specification order.
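
For concreteness, the canonical representation from the example upthread
can be sketched like this (an illustrative Python helper using partition
numbers instead of OIDs; not the patch's actual code):

```python
def canonicalize(partitions):
    """Given list partitions as {partition_number: [values]}, return
    (sorted_values, canonical_partition_per_value,
     partitions_in_canonical_order), as in the example upthread."""
    # Canonical partition order: sort partitions by their smallest value.
    ordered = sorted(partitions.items(), key=lambda kv: min(kv[1]))
    value_to_canon = {}
    for canon, (_num, values) in enumerate(ordered, start=1):
        for v in values:
            value_to_canon[v] = canon
    all_values = sorted(value_to_canon)
    return (all_values,
            [value_to_canon[v] for v in all_values],
            [num for num, _ in ordered])

# Table 1 from the example: p1 {'b','f'}, p2 {'c','d'}, p3 {'a','e'}
print(canonicalize({1: ["b", "f"], 2: ["c", "d"], 3: ["a", "e"]}))
# (['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2])
```

Comparing the first two arrays of two tables decides join compatibility;
pairing corresponding elements of the third array gives the child pairs.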


>
> This is user-specification independent representation wherein the
> partition numbers in the 2nd array are based on canonical representation
> (ordered lists).  To check pairwise join compatibility, simply compare the
> first two arrays.  As to which partitions (think OIDs, RTEs whatever) pair
> with each other, simply pair corresponding elements of the 3rd array which
> are original partitions numbers (or OIDs).  Also when routing a tuple, we
> find partition number in the array 2 and then look up the array 3 to get
> the actual partition to insert the tuple.
>
> Neither of these representations make the logic of checking pairwise-join
> compatibility and pairing a subset of partitions (on either side) any
> easier, although I haven't given it a good thought yet.
>
With the small change suggested above, it should be easy to check
partition-wise join compatibility for the simplest case. I agree that for
the generic case it will be difficult. E.g. Table 1#: p3 {'a', 'e'}, p4
{'a#', 'l'}, p2 {'b', 'f'}, p1 {'c', 'd'} being (INNER) joined with Table
2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}, assuming 'a' < 'a#' < 'b';
in this case, we should be able to match p3-p2, p2-p1, p1-p3 for a
partition-wise join, even though the canonical representations of the two
tables differ because of the extra partition in between.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company