[HACKERS] Support for N synchronous standby servers

2014-08-08 Thread Michael Paquier
Hi all,

Please find attached a patch adding support for synchronous replication
to multiple standby servers. This is controlled by the addition of a
new GUC parameter called synchronous_standby_num, which makes the server
wait for transaction commit on the first N standbys defined in
synchronous_standby_names. The implementation is really
straightforward, needing only a couple of modifications in
walsender.c (for pg_stat_get_wal_senders) and in syncrep.c.

When a commit is cancelled manually by the user, or when
ProcDiePending shows up, the message returned to the user does not show
the list of walsenders on which the commit has not been confirmed, as it
is only partially confirmed. I have not done anything for that, but let
me know if that would be useful. This would need a scan of the walsenders
to get their application_name.
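
For anyone wanting to try it out, a minimal postgresql.conf sketch could
look like this (the standby names are just examples); with the patch, the
first two of the three listed standbys would be synchronous:

    synchronous_standby_names = 'standby1, standby2, standby3'
    synchronous_standby_num = 2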

Thanks,
-- 
Michael
From 3dfff90032c38daba43e1e0c4d3221053d6386ac Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sat, 9 Aug 2014 14:49:24 +0900
Subject: [PATCH] Add parameter synchronous_standby_num

This makes possible support of synchronous replication on a number of standby
nodes equal to the new parameter. The synchronous standbys are chosen in the
order they are listed in synchronous_standby_names.
---
 doc/src/sgml/config.sgml| 32 ---
 doc/src/sgml/high-availability.sgml | 18 -
 src/backend/replication/syncrep.c   | 81 ++---
 src/backend/replication/walsender.c | 74 -
 src/backend/utils/misc/guc.c| 10 +
 src/include/replication/syncrep.h   |  1 +
 6 files changed, 175 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index be5c25b..c40de16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
 Specifies a comma-separated list of standby names that can support
 synchronous replication, as described in
 .
-At any one time there will be at most one active synchronous standby;
-transactions waiting for commit will be allowed to proceed after
-this standby server confirms receipt of their data.
-The synchronous standby will be the first standby named in this list
-that is both currently connected and streaming data in real-time
-(as shown by a state of streaming in the
+At any one time there will be at most a number of active synchronous standbys
+defined by synchronous_standby_num; transactions waiting
+for commit will be allowed to proceed after those standby servers
+confirm receipt of their data. The synchronous standbys will be
+the first entries named in this list that are both currently connected
+and streaming data in real-time (as shown by a state of
+streaming in the
 
 pg_stat_replication view).
 Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,25 @@ include_dir 'conf.d'
   
  
 
+ 
+  synchronous_standby_num (integer)
+  
+   synchronous_standby_num configuration parameter
+  
+  
+  
+   
+Specifies the number of standbys that support
+synchronous replication, as described in
+, and listed as the first
+elements of .
+   
+   
+Default value is 1.
+   
+  
+ 
+
  
   vacuum_defer_cleanup_age (integer)
   
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..085d51b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
 WAL record is then sent to the standby. The standby sends reply
 messages each time a new batch of WAL data is written to disk, unless
 wal_receiver_status_interval is set to zero on the standby.
-If the standby is the first matching standby, as specified in
-synchronous_standby_names on the primary, the reply
-messages from that standby will be used to wake users waiting for
-confirmation that the commit record has been received. These parameters
-allow the administrator to specify which standby servers should be
-synchronous standbys. Note that the configuration of synchronous
+If the standby is among the first synchronous_standby_num matching
+standbys, as specified in synchronous_standby_names on the
+primary, the reply messages from those standbys will be used to wake users
+waiting for confirmation that the commit record has been received. These
+parameters allow the administrator to specify which standby servers should
+be synchronous standbys. Note that the configuration of synchronous
 replication is mainly on the master. Named standbys must be directly
 connected to the master; the master knows nothing about downstream
 standby servers using c

Re: [HACKERS] 9.4 pg_restore --help changes

2014-08-08 Thread David G Johnston
Peter Eisentraut-2 wrote
> 9.3 pg_restore --help output:
> 
>   -I, --index=NAME             restore named index
>   -n, --schema=NAME            restore only objects in this schema
>   -P, --function=NAME(args)    restore named function
>   -t, --table=NAME             restore named table(s)
>   -T, --trigger=NAME           restore named trigger
>   --section=SECTION            restore named section (pre-data, data, or
>                                post-data)
> 
> 9.4 pg_restore --help output:
> 
>   -I, --index=NAME             restore named indexes
>   -n, --schema=NAME            restore only objects in these schemas
>   -P, --function=NAME(args)    restore named functions
>   -t, --table=NAME             restore named tables
>   -T, --trigger=NAME           restore named triggers
>   --section=SECTION            restore named sections (pre-data, data, or
>                                post-data)
> 
> This is a valuable feature change, but I think the help output is
> unhelpful.  The left side has a singular placeholder, but the right side
> talks about objects in plural.  How do I actually specify multiple
> indexes?  Is it --index=foo,bar?  It's --index=foo --index=bar, but
> that's unclear from the help.
> 
> I think it would be clearer to write something like
> 
>   -I, --index=NAME restore named indexes (repeat for multiple)
> 
> or perhaps make a note at the bottom
> 
> The options -I, -n, -P, -t, -T, --section can be combined and
> specified multiple times to select multiple objects.

"each* restores named "

For the schema version how about:

"each* restores named schema - including objects"

Then at the end: 

* You may specify a flag multiple times to restore multiple entries of the
given type.

Though, to be fair, is there some platform standard that we can attempt to
emulate here? It seems pointless to create our own design if we can borrow
something more universal.

David J.



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/9-4-pg-restore-help-changes-tp5814288p5814304.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread David G Johnston
akapila wrote
> On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane wrote:
>>
>> Stephen Frost writes:
>> > What about considering how large the object is when we are analyzing if
>> > it compresses well overall?
>>
>> Hmm, yeah, that's a possibility: we could redefine the limit at which
>> we bail out in terms of a fraction of the object size instead of a fixed
>> limit.  However, that risks expending a large amount of work before we
>> bail, if we have a very large incompressible object --- which is not
>> exactly an unlikely case.  Consider for example JPEG images stored as
>> bytea, which I believe I've heard of people doing.  Another issue is
>> that it's not real clear that that fixes the problem for any fractional
>> size we'd want to use.  In Larry's example of a jsonb value that fails
>> to compress, the header size is 940 bytes out of about 12K, so we'd be
>> needing to trial-compress about 10% of the object before we reach
>> compressible data --- and I doubt his example is worst-case.
>>
>> >> 1. The real problem here is that jsonb is emitting quite a bit of
>> >> fundamentally-nonrepetitive data, even when the user-visible input is
> very
>> >> repetitive.  That's a compression-unfriendly transformation by
>> anyone's
>> >> measure.
>>
>> > I disagree that another algorithm wouldn't be able to manage better on
>> > this data than pglz.  pglz, from my experience, is notoriously bad at
>> > certain data sets which other algorithms are not as poorly impacted by.
>>
>> Well, I used to be considered a compression expert, and I'm going to
>> disagree with you here.  It's surely possible that other algorithms would
>> be able to get some traction where pglz fails to get any,
> 
> During my previous work in this area, I had seen that some algorithms
> use skipping logic which can be useful for incompressible data followed
> by compressible data or in general as well. 

Random thought from the sideline...

This particular data type has the novel (within PostgreSQL) design of both a
(feature-oriented, and sizeable) header and a payload.  Is there some way
to add that model into the storage system so that, at a higher level,
separate attempts are made to compress each section, and the compressed
(or not) results are then written out adjacently, with a small header
indicating the length of the stored header and other metadata, like whether
each part is compressed and even the type that the data represents?  For
reading back into memory, the generic header-payload type is populated, then
the header and payload are decompressed - as needed - and the two parts are
fed into the appropriate type constructor that understands and accepts the
two pieces.
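
A rough sketch of the kind of two-part envelope I'm picturing (purely
illustrative; none of these names exist in the code):

    #include <stdint.h>

    typedef struct TwoPartEnvelope
    {
        uint32_t    header_len;          /* stored length of the (possibly compressed) header */
        uint32_t    payload_len;         /* stored length of the (possibly compressed) payload */
        uint8_t     header_compressed;   /* was the header section compressed? */
        uint8_t     payload_compressed;  /* was the payload section compressed? */
        /* ... followed by the header bytes, then the payload bytes ... */
    } TwoPartEnvelope;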

Just hoping to spark an idea here - I don't know enough about the internals
to even guess how close I am to something feasible.

David J.





--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/jsonb-format-is-pessimal-for-toast-compression-tp5814162p5814299.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread David Rowley
On Sat, Aug 9, 2014 at 3:34 PM, Tom Lane  wrote:

> David Rowley  writes:
> > On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane  wrote:
> >> So I'm thinking you're right: we should rewrite this code so that only
> >> indexes having all columns distinct can match, thereby ensuring that
> there
> >> is only one possible interpretation of the equality semantics for the
> >> foreign key.
>
> > I've attached a patch which disallows these, though I was not quite sure
> > about the new error message since likely this is something that should be
> > backpacked? I wasn't sure on the policy for new translations in a minor
> > release.
>
> There's no need for a new error message I think, because we should just
> ignore such indexes.  After all, there might be a valid matching index
> later on.
>
>
Hmm, but if the user attempts to define a foreign key that contains a
duplicate column in the REFERENCES part, then we'll never "find" any
indexes, so there's no point in looking at all. That's why I did the
duplicate checks as a pre-check before the loop over the indexes. Though
that's not to say that we couldn't throw the "there is no unique constraint
matching given keys for referenced table ..." error there.

I've attached a version of the patch that's a little smarter when it comes
to doing the duplicate checks in the attnums array... For what it's worth.

Regards

David Rowley


transformFkeyCheckAttrs_fix_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-08 Thread Ramirez, Danilo
Thanks to all for the great info.  We are new to postgresql and this discussion 
has both instructed us and increased our respect for the database and the 
community.

I am seeing a behavior that I don’t understand and hopefully you guys can clear 
it up.

I am using AWS postgresql db.m3.2xlarge and using pgadmin III 1.18 comparing 
against AWS oracle on db.m3.2xlarge using sql developer and TOAD.

I am running a query with 30 tables in the from clause, getting 137 columns
back (this is our most basic query; they get a lot more complex).   It
returns 4800 rows.

In oracle the 1st run takes 3.92 seconds, the 2nd .38 seconds.  Scrolling to
the end takes an extra 1.5 seconds for a total of 5.5.

Using pgadmin, I run the query.  Looking at the lower right hand corner I can
see the time going up.  It stops at 8200 ms or close to it every time, then it
takes an extra 6 seconds before it displays the rows on the screen.  2nd, 3rd,
etc. runs all take about the same amount of time: 8 sec plus 6 sec.

I then changed it to return only 1 column.   In oracle/sqldeveloper the
behavior is identical to before, same time.  In postgresql it now goes down to
1.8 seconds for the 1st, 2nd, etc. runs.

I then change it so that I am asking for the sum of 1 column.  In oracle time 
goes down to .2 seconds and postgresql now goes down to .2 seconds also.

I then change it back to get the full result set and the behavior goes back to
the original: oracle .38 seconds since it's cached, postgresql 8 seconds.

Of the 30 tables, 6 are 10-50 gigs in size.  Our settings are:

"shared_buffers";"7639832kB"
"effective_cache_size";"15279664kB"
"allow_system_table_mods";"off"
"application_name";"pgAdmin III - Query Tool"
"archive_command";"/etc/rds/dbbin/pgscripts/rds_wal_archive %p"
"archive_mode";"on"
"archive_timeout";"5min"
"array_nulls";"on"
"authentication_timeout";"1min"
"autovacuum";"on"
"autovacuum_analyze_scale_factor";"0.1"
"autovacuum_analyze_threshold";"50"
"autovacuum_freeze_max_age";"2"
"autovacuum_max_workers";"3"
"autovacuum_multixact_freeze_max_age";"4"
"autovacuum_naptime";"1min"
"autovacuum_vacuum_cost_delay";"20ms"
"autovacuum_vacuum_cost_limit";"-1"
"autovacuum_vacuum_scale_factor";"0.2"
"autovacuum_vacuum_threshold";"50"
"backslash_quote";"safe_encoding"
"bgwriter_delay";"200ms"
"bgwriter_lru_maxpages";"100"
"bgwriter_lru_multiplier";"2"
"block_size";"8192"
"bonjour";"off"
"bonjour_name";""
"bytea_output";"escape"
"check_function_bodies";"on"
"checkpoint_completion_target";"0.9"
"checkpoint_segments";"16"
"checkpoint_timeout";"5min"
"checkpoint_warning";"30s"
"client_encoding";"UNICODE"
"client_min_messages";"notice"
"commit_delay";"0"
"commit_siblings";"5"
"constraint_exclusion";"partition"
"cpu_index_tuple_cost";"0.005"
"cpu_operator_cost";"0.0025"
"cpu_tuple_cost";"0.01"
"cursor_tuple_fraction";"0.1"
"DateStyle";"ISO, MDY"
"db_user_namespace";"off"
"deadlock_timeout";"1s"
"debug_assertions";"off"
"debug_pretty_print";"on"
"debug_print_parse";"off"
"debug_print_plan";"off"
"debug_print_rewritten";"off"
"default_statistics_target";"100"
"default_tablespace";""
"default_text_search_config";"pg_catalog.simple"
"default_transaction_deferrable";"off"
"default_transaction_isolation";"read committed"
"default_transaction_read_only";"off"
"default_with_oids";"off"
"effective_cache_size";"15279664kB"
"effective_io_concurrency";"1"
"enable_bitmapscan";"on"
"enable_hashagg";"on"
"enable_hashjoin";"on"
"enable_indexonlyscan";"on"
"enable_indexscan";"on"
"enable_material";"on"
"enable_mergejoin";"on"
"enable_nestloop";"on"
"enable_seqscan";"on"
"enable_sort";"on"
"enable_tidscan";"on"
"escape_string_warning";"on"
"event_source";"PostgreSQL"
"exit_on_error";"off"
"extra_float_digits";"0"
"from_collapse_limit";"8"
"fsync";"on"
"full_page_writes";"on"
"geqo";"on"
"geqo_effort";"5"
"geqo_generations";"0"
"geqo_pool_size";"0"
"geqo_seed";"0"
"geqo_selection_bias";"2"
"geqo_threshold";"12"
"gin_fuzzy_search_limit";"0"
"hot_standby";"off"
"hot_standby_feedback";"off"
"ignore_checksum_failure";"off"
"ignore_system_indexes";"off"
"integer_datetimes";"on"
"IntervalStyle";"postgres"
"join_collapse_limit";"8"
"krb_caseins_users";"off"
"krb_srvname";"postgres"
"lc_collate";"en_US.UTF-8"
"lc_ctype";"en_US.UTF-8"
"lc_messages";""
"lc_monetary";"C"
"lc_numeric";"C"
"lc_time";"C"
"listen_addresses";"*"
"lo_compat_privileges";"off"
"local_preload_libraries";""
"lock_timeout";"0"
"log_autovacuum_min_duration";"-1"
"log_checkpoints";"on"
"log_connections";"off"
"log_destination";"stderr"
"log_disconnections";"off"
"log_duration";"off"
"log_error_verbosity";"default"
"log_executor_stats";"off"
"log_file_mode";"0644"
"log_hostname";"on"
"log_line_prefix";"%t:%r:%u@%d:[%p]:"
"log_lock_waits";"off"
"log_min_duration_statement";"-1"
"log_min_error_statement";"error"
"log_min_messages";"warning"
"log_parser_stats";"off"
"log_planner_stats";"off"
"log_rotation_age";"1h"
"log_rotation_size";"10MB"
"log_statement";"none"
"log_statement_stats";"off"
"log_temp_files";"-1"
"lo

Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread Tom Lane
David Rowley  writes:
> On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane  wrote:
>> So I'm thinking you're right: we should rewrite this code so that only
>> indexes having all columns distinct can match, thereby ensuring that there
>> is only one possible interpretation of the equality semantics for the
>> foreign key.

> I've attached a patch which disallows these, though I was not quite sure
> about the new error message since likely this is something that should be
> backpatched? I wasn't sure on the policy for new translations in a minor
> release.

There's no need for a new error message I think, because we should just
ignore such indexes.  After all, there might be a valid matching index
later on.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread David Rowley
On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane  wrote:

>
> So I'm thinking you're right: we should rewrite this code so that only
> indexes having all columns distinct can match, thereby ensuring that there
> is only one possible interpretation of the equality semantics for the
> foreign key.
>
>
I've attached a patch which disallows these, though I was not quite sure
about the new error message since likely this is something that should be
backpatched? I wasn't sure on the policy for new translations in a minor
release.

The wording of the comment above the code which disallows the duplicate
column references could likely do with some meaningful explanation of
the reasons why we disallow these; I thought your words would be a bit
better at doing this, so I didn't try very hard on that part.

Is this roughly what you had in mind?

Regards

David Rowley


transformFkeyCheckAttrs_fix_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Bruce Momjian
On Fri, Aug  8, 2014 at 08:25:04PM -0400, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Fri, Aug  8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
> > > 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> > > jsonb is still within the bounds of reason.
> > 
> > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
> > betas and report the problem JSONB columns.
> 
> That is *not* a good solution..

If you change the JSONB binary format, and we can't read the old format,
that is the only option.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Amit Kapila
On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane  wrote:
>
> Stephen Frost  writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit.  However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case.  Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing.  Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use.  In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.
>
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is
very
> >> repetitive.  That's a compression-unfriendly transformation by anyone's
> >> measure.
>
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz.  pglz, from my experience, is notoriously bad at
> > certain data sets which other algorithms are not as poorly impacted by.
>
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here.  It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any,

During my previous work in this area, I had seen that some algorithms
use skipping logic, which can be useful for incompressible data followed
by compressible data, or in general as well.  One such technique could
be: if we don't find any match in the first 4 bytes, then skip 4 bytes;
if we don't find a match again in the next 8 bytes, then skip 8 bytes;
and keep doing the same until we find the first match, in which case we
go back to the beginning of the data.  We could follow this logic until
we have actually compared a total of first_success_by bytes.  There can
be caveats in this particular skipping scheme, but I just wanted to
mention the general idea of skipping to reduce the number of situations
where we bail out even though there is a lot of compressible data.
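
A rough standalone sketch of that skipping idea (this is not pglz code, and
find_match() below is only a placeholder for the history lookup a real
compressor would do):

    #include <stdbool.h>
    #include <stddef.h>

    /* Placeholder for the compressor's real match search (hypothetical). */
    static bool
    find_match(const char *data, size_t pos, size_t len)
    {
        (void) data;
        (void) pos;
        (void) len;
        return false;
    }

    /*
     * Probe the input at exponentially growing intervals instead of byte by
     * byte, giving up once roughly first_success_by bytes have been covered
     * without finding anything compressible.
     */
    static bool
    probe_compressibility(const char *data, size_t len, size_t first_success_by)
    {
        size_t      pos = 0;
        size_t      skip = 4;

        while (pos < len && pos < first_success_by)
        {
            if (find_match(data, pos, len))
                return true;    /* found a match; a real implementation would
                                 * now go back and compress from the start */
            pos += skip;        /* no match here: skip ahead ... */
            skip *= 2;          /* ... and double the skip next time */
        }
        return false;           /* looks incompressible, bail out */
    }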

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] 9.4 pg_restore --help changes

2014-08-08 Thread Michael Paquier
On Sat, Aug 9, 2014 at 10:44 AM, Peter Eisentraut  wrote:
>   -I, --index=NAME restore named indexes (repeat for multiple)
A single --index entry restores only one index, so what about
something like this:
-I, --index=NAME restore named index (repeat for multiple entries)

> or perhaps make a note at the bottom
>
> The options -I, -n, -P, -t, -T, --section can be combined and
> specified multiple times to select multiple objects.
>
> Other ideas?
Specifying both would make it clearer to the user. The note at the bottom
is a good thing as well.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.4 pg_restore --help changes

2014-08-08 Thread Peter Eisentraut
9.3 pg_restore --help output:

  -I, --index=NAME             restore named index
  -n, --schema=NAME            restore only objects in this schema
  -P, --function=NAME(args)    restore named function
  -t, --table=NAME             restore named table(s)
  -T, --trigger=NAME           restore named trigger
  --section=SECTION            restore named section (pre-data, data, or
                               post-data)

9.4 pg_restore --help output:

  -I, --index=NAME             restore named indexes
  -n, --schema=NAME            restore only objects in these schemas
  -P, --function=NAME(args)    restore named functions
  -t, --table=NAME             restore named tables
  -T, --trigger=NAME           restore named triggers
  --section=SECTION            restore named sections (pre-data, data, or
                               post-data)

This is a valuable feature change, but I think the help output is
unhelpful.  The left side has a singular placeholder, but the right side
talks about objects in plural.  How do I actually specify multiple
indexes?  Is it --index=foo,bar?  It's --index=foo --index=bar, but
that's unclear from the help.
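
For the record, the working invocation repeats the switch, e.g. (database,
index, and file names made up):

    pg_restore -d mydb -I index_one -I index_two backup.dump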

I think it would be clearer to write something like

  -I, --index=NAME restore named indexes (repeat for multiple)

or perhaps make a note at the bottom

The options -I, -n, -P, -t, -T, --section can be combined and
specified multiple times to select multiple objects.

Other ideas?



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I agree that we need to avoid changing jsonb's on-disk representation.
> 
> ... post-release, I assume you mean.

Yes.

> > Have I missed where a good suggestion has been made about how to do that
> > which preserves the binary-search capabilities and doesn't make the code
> > much more difficult?
> 
> We don't have one yet, but we've only been thinking about this for a few
> hours.

Fair enough.

> > Trying to move the header to the end just for the
> > sake of this doesn't strike me as a good solution as it'll make things
> > quite a bit more complicated.  Is there a way we could interleave the
> > likely-compressible user data in with the header instead?
> 
> Yeah, I was wondering about that too, but I don't immediately see how to
> do it without some sort of preprocessing step when we read the object
> (which'd be morally equivalent to converting a series of lengths into a
> pointer array).  Binary search isn't going to work if the items it's
> searching in aren't all the same size.
> 
> Having said that, I am not sure that a preprocessing step is a
> deal-breaker.  It'd be O(N), but with a pretty darn small constant factor,
> and for plausible sizes of objects I think the binary search might still
> dominate.  Worth investigation perhaps.

For my part, I'm less concerned about a preprocessing step which happens
when we store the data and more concerned about ensuring that we're able
to extract data quickly.  Perhaps that's simply because I'm used to
writes being more expensive than reads, but I'm not alone in that
regard either.  I doubt I'll have time in the next couple of weeks to
look into this and if we're going to want this change for 9.4, we really
need someone working on it sooner than later.  (to the crowd)- do we
have any takers for this investigation?

Thanks,

Stephen


signature.asc
Description: Digital signature


[HACKERS] psql output change in 9.4

2014-08-08 Thread Peter Eisentraut
This is 9.3:

peter=# \a
Output format is unaligned.
peter=# \a
Output format is aligned.
peter=# \x
Expanded display is on.
peter=# \x
Expanded display is off.

This is new in 9.4:

peter=# \a
Output format (format) is unaligned.
peter=# \a
Output format (format) is aligned.
peter=# \x
Expanded display (expanded) is on.
peter=# \x
Expanded display (expanded) is off.

What is the point of that change?

I suppose it is so that you can use \pset without arguments to show all
settings:

peter=# \pset
Border style (border) is 1.
Target width (columns) unset.
Expanded display (expanded) is off.
...

But those are unrelated features, and the changed output doesn't make
any sense in the contexts I show above.

I think this should be reverted, and the \pset output should be
implemented separately.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org

2014-08-08 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> We only bump the SO version when we make a low-level ABI break; but even
> without doing that we could potentially have changed library behavior in
> ways that could negatively impact some applications.  

We should definitely be paying attention for such cases as I'd generally
feel that they're bug fixes we need to address..

> So I think there's
> some merit in Josh's position: he doesn't want to have to re-QA his
> applications against some new version of libpq, but if you force him to
> move to 9.3 libpq, he's going to need to do that if he wants to be strict
> about it.  

"Force" is a bit strong here, in my opinion.  There's a (really, rather
trivial) way to get the libpq version which is shipped with a specific
PG major version- simple add that major version to the end of the 'deb'
line in your sources.list file.

> If the Debian guidelines think that only SO major version need
> be considered, they're wrong, at least for the way we've been treating
> that.

The SO major version should be sufficient for applications to operate
normally.  If that isn't the case then I agree that we need to review
the changes we are making to see if the SO should be bumped.  Note that
Debian's viewpoint on this is more along the lines of "why build against
an old version if the latest one, whose major SO is the same, exists?"
This is largely to avoid the hell of versioned Build-Depends and having
to support multiple sets of -dev packages concurrently (consider that
builds generally only look for the unversioned '.so' file also..).

> At the same time, there would be more merit in Josh's position if he
> could point to any *actual* libpq changes that might pose application
> compatibility problems.  I don't recall that we've made many such,
> so the above argument might just be hypothetical.

I don't believe it's hypothetical from Josh's perspective, but I didn't
follow the threads completely to confirm that there was a real issue.
If there is a real issue here, I'd most likely vote to fix it and
backpatch it as a bug, though it's not clear if that would be considered
'good enough' for this case.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
Stephen Frost  writes:
> I agree that we need to avoid changing jsonb's on-disk representation.

... post-release, I assume you mean.

> Have I missed where a good suggestion has been made about how to do that
> which preserves the binary-search capabilities and doesn't make the code
> much more difficult?

We don't have one yet, but we've only been thinking about this for a few
hours.

> Trying to move the header to the end just for the
> sake of this doesn't strike me as a good solution as it'll make things
> quite a bit more complicated.  Is there a way we could interleave the
> likely-compressible user data in with the header instead?

Yeah, I was wondering about that too, but I don't immediately see how to
do it without some sort of preprocessing step when we read the object
(which'd be morally equivalent to converting a series of lengths into a
pointer array).  Binary search isn't going to work if the items it's
searching in aren't all the same size.

Having said that, I am not sure that a preprocessing step is a
deal-breaker.  It'd be O(N), but with a pretty darn small constant factor,
and for plausible sizes of objects I think the binary search might still
dominate.  Worth investigation perhaps.
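
Just to illustrate the shape of that preprocessing step (a sketch only;
names made up, not actual jsonb internals): converting a stored series of
entry lengths into the cumulative offsets that binary search wants is a
single O(N) pass:

    #include <stdint.h>

    /* Turn a stored series of entry lengths into cumulative offsets. */
    static void
    lengths_to_offsets(const uint32_t *lengths, uint32_t *offsets, int nentries)
    {
        uint32_t    pos = 0;
        int         i;

        for (i = 0; i < nentries; i++)
        {
            offsets[i] = pos;   /* offset of entry i within the payload */
            pos += lengths[i];
        }
    }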

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org

2014-08-08 Thread Tom Lane
Stephen Frost  writes:
> * Joshua D. Drake (j...@commandprompt.com) wrote:
>> On 08/07/2014 10:12 PM, Stephen Frost wrote:
>>> If you want the specific version, update your deb line.  Don't complain
>>> because you used the Debian repo that follows the Debian guidelines and
>>> didn't like what you got- that's not going to change.

>> May I have a link? Because I would argue that the "latest" version
>> of the library for 9.2, is the library that ships with 9.2.9, not
>> the one that ships with 9.3.5.

> Really?  The latest version of libpq SO version 5 that we ship is 9.2.9?
> I certainly don't feel that way.

> Next you'll be asking us to bump the major SO for every major release of
> PG.

> Note that the last time we changed the SO major version was back in 2006
> and that was primarily because we were removing symbols that people
> weren't supposed to be using anyway.  Prior to that, it was 2005, or the
> 8.0/8.1 timeframe...

We only bump the SO version when we make a low-level ABI break; but even
without doing that we could potentially have changed library behavior in
ways that could negatively impact some applications.  So I think there's
some merit in Josh's position: he doesn't want to have to re-QA his
applications against some new version of libpq, but if you force him to
move to 9.3 libpq, he's going to need to do that if he wants to be strict
about it.  If the Debian guidelines think that only SO major version need
be considered, they're wrong, at least for the way we've been treating
that.

At the same time, there would be more merit in Josh's position if he
could point to any *actual* libpq changes that might pose application
compatibility problems.  I don't recall that we've made many such,
so the above argument might just be hypothetical.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
> 
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit.  However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case.  Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing.  Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use.  In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.

Agreed- I tried to allude to that in my prior mail, there's clearly a
concern that we'd make things worse in certain situations.  Then again,
at least for that case, we could recommend changing the storage type to
EXTERNAL.

> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is very
> >> repetitive.  That's a compression-unfriendly transformation by anyone's
> >> measure.
> 
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz.  pglz, from my experience, is notoriously bad at
> > certain data sets which other algorithms are not as poorly impacted by.
> 
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here.  It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any, but that doesn't
> mean that presenting them with hard-to-compress data in the first place is
> a good design decision.  There is no scenario in which data like this is
> going to be friendly to a general-purpose compression algorithm.  It'd
> be necessary to have explicit knowledge that the data consists of an
> increasing series of four-byte integers to be able to do much with it.
> And then such an assumption would break down once you got past the
> header ...

I've wondered previously as to if we, perhaps, missed the boat pretty
badly by not providing an explicitly versioned per-type compression
capability, such that we wouldn't be stuck with one compression
algorithm for all types, and would be able to version compression types
in a way that would allow us to change them over time, provided the
newer code always understands how to decode X-4 (or whatever) versions
back.

I do agree that it'd be great to represent every type in a highly
compressible way for the sake of the compression algorithm, but
I've not seen any good suggestions for how to make that happen and I've
got a hard time seeing how we could completely change the jsonb storage
format, retain the capabilities it has today, make it highly
compressible, and get 9.4 out this calendar year.

I expect we could trivially add padding into the jsonb header to make it
compress better, for the sake of this particular check, but then we're
going to always be compressing jsonb, even when the user data isn't
actually terribly good for compression, spending a good bit of CPU time
while we're at it.

> > Perhaps another option would be a new storage type which basically says
> > "just compress it, no matter what"?  We'd be able to make that the
> > default for jsonb columns too, no?
> 
> Meh.  We could do that, but it would still require adding arguments to
> toast_compress_datum() that aren't there now.  In any case, this is a
> band-aid solution; and as Josh notes, once we ship 9.4 we are going to
> be stuck with jsonb's on-disk representation pretty much forever.

I agree that we need to avoid changing jsonb's on-disk representation.
Have I missed where a good suggestion has been made about how to do that
which preserves the binary-search capabilities and doesn't make the code
much more difficult?  Trying to move the header to the end just for the
sake of this doesn't strike me as a good solution as it'll make things
quite a bit more complicated.  Is there a way we could interleave the
likely-compressible user data in with the header instead?  I've not
looked, but it seems like that's the only reasonable approach to address
this issue in this manner.  If that's simply done, then great, but it
strikes me as unlikely to be..

I'll just throw out a bit of a counter-point to all this also though- we
don't try to focus on making our on-disk representation of data,
generally, very compressible even though there are filesystems, such as
ZFS, which might benefit from certain rearrangements of our on-disk
formats (no, I don't have any specific recommendations in this vein, but
I certainly don't see anyone else asking after it or as

Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Andrew Dunstan


On 08/08/2014 08:45 PM, Tom Lane wrote:

Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"?  We'd be able to make that the
default for jsonb columns too, no?

Meh.  We could do that, but it would still require adding arguments to
toast_compress_datum() that aren't there now.  In any case, this is a
band-aid solution; and as Josh notes, once we ship 9.4 we are going to
be stuck with jsonb's on-disk representation pretty much forever.




Yeah, and almost any other solution is likely to mean non-jsonb users 
potentially paying a penalty for fixing this for jsonb. So if we can 
adjust the jsonb layout to fix this problem I think we should do so.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
Stephen Frost  writes:
> What about considering how large the object is when we are analyzing if
> it compresses well overall?

Hmm, yeah, that's a possibility: we could redefine the limit at which
we bail out in terms of a fraction of the object size instead of a fixed
limit.  However, that risks expending a large amount of work before we
bail, if we have a very large incompressible object --- which is not
exactly an unlikely case.  Consider for example JPEG images stored as
bytea, which I believe I've heard of people doing.  Another issue is
that it's not real clear that that fixes the problem for any fractional
size we'd want to use.  In Larry's example of a jsonb value that fails
to compress, the header size is 940 bytes out of about 12K, so we'd be
needing to trial-compress about 10% of the object before we reach
compressible data --- and I doubt his example is worst-case.

>> 1. The real problem here is that jsonb is emitting quite a bit of
>> fundamentally-nonrepetitive data, even when the user-visible input is very
>> repetitive.  That's a compression-unfriendly transformation by anyone's
>> measure.

> I disagree that another algorithm wouldn't be able to manage better on
> this data than pglz.  pglz, from my experience, is notoriously bad at
> certain data sets which other algorithms are not as poorly impacted by.

Well, I used to be considered a compression expert, and I'm going to
disagree with you here.  It's surely possible that other algorithms would
be able to get some traction where pglz fails to get any, but that doesn't
mean that presenting them with hard-to-compress data in the first place is
a good design decision.  There is no scenario in which data like this is
going to be friendly to a general-purpose compression algorithm.  It'd
be necessary to have explicit knowledge that the data consists of an
increasing series of four-byte integers to be able to do much with it.
And then such an assumption would break down once you got past the
header ...

> Perhaps another option would be a new storage type which basically says
> "just compress it, no matter what"?  We'd be able to make that the
> default for jsonb columns too, no?

Meh.  We could do that, but it would still require adding arguments to
toast_compress_datum() that aren't there now.  In any case, this is a
band-aid solution; and as Josh notes, once we ship 9.4 we are going to
be stuck with jsonb's on-disk representation pretty much forever.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Stephen Frost
* Josh Berkus (j...@agliodbs.com) wrote:
> On 08/08/2014 08:02 AM, Tom Lane wrote:
> > 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> > jsonb is still within the bounds of reason.
> > 
> > Considering all the hype that's built up around jsonb, shipping a design
> > with a fundamental performance handicap doesn't seem like a good plan
> > to me.  We could perhaps band-aid around it by using different compression
> > parameters for jsonb, although that would require some painful API changes
> > since toast_compress_datum() doesn't know what datatype it's operating on.
> 
> I would rather ship late than ship a noncompressable JSONB.
> 
> Once we ship 9.4, many users are going to load 100's of GB into JSONB
> fields.  Even if we fix the compressibility issue in 9.5, those users
> won't be able to fix the compression without rewriting all their data,
> which could be prohibitive.  And we'll be in a position where we have
> to support the 9.4 JSONB format/compression technique for years so that
> users aren't blocked from upgrading.

Would you accept removing the binary-search capability from jsonb just
to make it compressable?  I certainly wouldn't.  I'd hate to ship late
also, but I'd be willing to support that if we can find a good solution
to keep both compressability and binary-search (and provided it doesn't
delay us many months..).

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Stephen Frost
* Bruce Momjian (br...@momjian.us) wrote:
> On Fri, Aug  8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
> > 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> > jsonb is still within the bounds of reason.
> 
> FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
> betas and report the problem JSONB columns.

That is *not* a good solution..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org

2014-08-08 Thread Stephen Frost
* Joshua D. Drake (j...@commandprompt.com) wrote:
> On 08/07/2014 10:12 PM, Stephen Frost wrote:
> >If you want the specific version, update your deb line.  Don't complain
> >because you used the Debian repo that follows the Debian guidelines and
> >didn't like what you got- that's not going to change.
> 
> May I have a link? Because I would argue that the "latest" version
> of the library for 9.2, is the library that ships with 9.2.9, not
> the one that ships with 9.3.5.

Really?  The latest version of libpq SO version 5 that we ship is 9.2.9?
I certainly don't feel that way.

Next you'll be asking us to bump the major SO for every major release of
PG.

Note that the last time we changed the SO major version was back in 2006
and that was primarily because we were removing symbols that people
weren't supposed to be using anyway.  Prior to that, it was 2005, or the
8.0/8.1 timeframe...

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> I looked into the issue reported in bug #11109.  The problem appears to be
> >> that jsonb's on-disk format is designed in such a way that the leading
> >> portion of any JSON array or object will be fairly incompressible, because
> >> it consists mostly of a strictly-increasing series of integer offsets.
> >> This interacts poorly with the code in pglz_compress() that gives up if
> >> it's found nothing compressible in the first first_success_by bytes of a
> >> value-to-be-compressed.  (first_success_by is 1024 in the default set of
> >> compression parameters.)
> 
> > I haven't looked at this in any detail, so take this with a grain of
> > salt, but what about teaching pglz_compress about using an offset
> > farther into the data, if the incoming data is quite a bit larger than
> > 1k?  This is just a test to see if it's worthwhile to keep going, no?
> 
> Well, the point of the existing approach is that it's a *nearly free*
> test to see if it's worthwhile to keep going; there's just one if-test
> added in the outer loop of the compression code.  (cf commit ad434473ebd2,
> which added that along with some other changes.)  AFAICS, what we'd have
> to do to do it as you suggest would to execute compression on some subset
> of the data and then throw away that work entirely.  I do not find that
> attractive, especially when for most datatypes there's no particular
> reason to look at one subset instead of another.

Ah, I see- we were using the first block as it means we can reuse the
work done on it if we decide to continue with the compression.  Makes
sense.  We could possibly arrange to have the amount attempted depend on
the data type, but you point out that we can't do that without teaching
lower components about types, which is less than ideal.

What about considering how large the object is when we are analyzing if
it compresses well overall?  That is- for a larger object, make a larger
effort to compress it.  There's clearly a pessimistic case which could
arise from that, but it may be better than the current situation.
There's a clear risk that such an algorithm may well be very type
specific, meaning that we make things worse for some types (e.g. byteas
which end up never compressing well, where we'd likely spend more CPU time
trying than we do today).
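
As a strawman, the bail-out limit could be derived from the input size
rather than hard-coded; the fraction below is arbitrary and only meant to
show the shape of the idea:

    /* Hypothetical replacement for the fixed first_success_by = 1024. */
    static int
    early_exit_threshold(int slen)
    {
        int     threshold = slen / 8;    /* arbitrary fraction of the input */

        return (threshold < 1024) ? 1024 : threshold;   /* keep the old floor */
    }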

> 1. The real problem here is that jsonb is emitting quite a bit of
> fundamentally-nonrepetitive data, even when the user-visible input is very
> repetitive.  That's a compression-unfriendly transformation by anyone's
> measure.  Assuming that some future replacement for pg_lzcompress() will
> nonetheless be able to compress the data strikes me as mostly wishful
> thinking.  Besides, we'd more than likely have a similar early-exit rule
> in any substitute implementation, so that we'd still be at risk even if
> it usually worked.

I agree that jsonb ends up being nonrepetitive in part, which is why
I've been trying to push the discussion in the direction of making it
more likely for the highly-compressible data to be considered rather
than the start of the jsonb object.  I don't care for our compression
algorithm having to be catered to in this regard in general though as
the exact same problem could, and likely does, exist in some real life
bytea-using PG implementations.

I disagree that another algorithm wouldn't be able to manage better on
this data than pglz.  pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.

> 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

I'd really hate to ship 9.4 without a fix for this, but I have a similar
hard time with shipping 9.4 without the binary search component..

> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me.  We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I don't like the idea of shipping with this handicap either.

Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"?  We'd be able to make that the
default for jsonb columns too, no?  Again- I'll admit this is shooting
from the hip, but I wanted to get these thoughts out and I won't have
much more time tonight.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> The attached seems to fix the problem, but the whole thing makes me wonder
>> if this is even meant to be allowed? I was thinking that this might be a
>> good time to disallow this altogether, since it's already broken and looks
>> like it has been for about 11 years

> We've gone out of our way in the past to allow duplicate index columns
> (eg commit cfc5008a51f4), so I'm not sure why we'd not allow such indexes
> to be used as foreign key references.

I poked at this some more and changed my mind.  Per SQL standard, the
referent of a foreign key has to match some unique or pkey constraint,
and *we do not allow duplicate columns in constraint syntax*:

regression=# create temp table pp (f1 int, unique(f1,f1));
ERROR:  column "f1" appears twice in unique constraint
LINE 1: create temp table pp (f1 int, unique(f1,f1));
  ^

which restriction I believe is also per spec.  It's true that using
CREATE UNIQUE INDEX syntax, you can make an index with duplicate
columns, but there seems no a-priori reason why we have to allow such
an index to be the referent of a foreign key.  The main reason *not* to
allow it is that such an index might conceivably have different opclasses
for the index columns containing the same attnum, and in that case it
becomes ambiguous which opclass (ie, which definition of equality)
applies to each column of the foreign key.  We could make some arbitrary
definition of which one to select but I think that would be stretching
generality well past the point of usefulness.  Worse, if the different
opclasses do have different definitions of equality and we pick the wrong
one, it's not real clear that it mightn't be possible for a referencing
row to be considered equal to multiple rows in the referenced table (that
are distinct according to the rules used by the index).
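
(For contrast, the CREATE UNIQUE INDEX syntax mentioned above does accept a
duplicate column; a quick illustration, reusing the table and column names
from the example:

regression=# create temp table pp (f1 int);
CREATE TABLE
regression=# create unique index pp_f1_f1_key on pp (f1, f1);
CREATE INDEX
)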

So I'm thinking you're right: we should rewrite this code so that only
indexes having all columns distinct can match, thereby ensuring that there
is only one possible interpretation of the equality semantics for the
foreign key.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-08-08 Thread Peter Geoghegan
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan  wrote:
> Robert pointed out a case where varying character case of an English
> word did not alter the primary level bytes (and thus the poor man's
> normalized key was the same). He acknowledged that most of the entropy
> of the first 8 bytes of the string went into the first 8 bytes of the
> blob/key. This can actually be very useful to the optimization in some
> cases. In particular, with most Latin alphabets you'll see the same
> pattern when diacritics are used. This means that even though the
> original string has (say) an accented character that would take 2
> bytes to store in UTF-8, the weight in the primary level is the same
> as an equivalent unaccented character (and so only takes one byte to
> store at that level, with differences only in subsequent levels).
> Whole strxfrm() blobs are the same length regardless of how many
> accents appear in otherwise equivalent original Latin string, and you
> get exactly the same high concentrations of entropy in the first 8
> bytes in pretty much all Latin alphabets (the *absence* of diacritics
> is stored in later weight levels too, even with the "en_US.UTF-8"
> collation).

There are many other interesting cases where en_US.UTF-8, and
presumably all other collations concentrate much more entropy into
leading bytes than might be expected. Consider:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc"
"abc" -> 0c0d0e0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc"
"# abc" -> 0c0d0e01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "* abc"
"* abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes)

and, to show you what this looks like when the primary
weights/original codepoints appear backwards:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba"
"cba" -> 0e0d0c0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba"
"# cba" -> 0e0d0c01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "* cba"
"* cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes)

Evidently, the implementation always places primary weights first,
corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and
later "\0e\0d\0c") - no matter how many "extraneous" characters are
placed in front. They're handled later. Spaces don't appear in the
primary weight level at all:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c"
"a b c" -> 0c0d0e01090909010909090102350235 (16 bytes)

Lots of punctuation-type characters will not affect the primary weight level:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%@!()\/#-+,:^~? a b c"
"%@!()\/#-+,:^~? a b c" ->
0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235
(48 bytes)

Some non-alphabetic ASCII characters will affect the primary level,
though. For example:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c"
"1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes)

There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e",
a primary weight "\03" corresponding to '1'.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Larry White
I was not complaining; I think JSONB is awesome.

But I am one of those people who would like to put 100's of GB (or more)
of JSON files into Postgres, and I am concerned about file size and
possible future changes to the format.


On Fri, Aug 8, 2014 at 7:10 PM, Peter Geoghegan  wrote:

> On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus  wrote:
> > Once we ship 9.4, many users are going to load 100's of GB into JSONB
> > fields.  Even if we fix the compressibility issue in 9.5, those users
> > won't be able to fix the compression without rewriting all their data,
> > which could be prohibitive.  And we'll be in a position where we have
> > to support the 9.4 JSONB format/compression technique for years so that
> > users aren't blocked from upgrading.
>
> FWIW, if we take the delicious JSON data as representative, a table
> storing that data as jsonb is 1374 MB in size. Whereas in an equivalent
> table with the data typed using the original json datatype (but with
> white space differences more or less ignored, because it was created
> using a jsonb -> json cast), the same data is 1352 MB.
>
> Larry's complaint is valid; this is a real problem, and I'd like to
> fix it before 9.4 is out. However, let us not lose sight of the fact
> that JSON data is usually a poor target for TOAST compression. With
> idiomatic usage, redundancy is very much more likely to appear across
> rows, and not within individual Datums. Frankly, we aren't doing a
> very good job there, and doing better requires an alternative
> strategy.
>
> --
> Peter Geoghegan
>


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Peter Geoghegan
On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus  wrote:
> Once we ship 9.4, many users are going to load 100's of GB into JSONB
> fields.  Even if we fix the compressibility issue in 9.5, those users
> won't be able to fix the compression without rewriting all their data,
> which could be prohibitive.  And we'll be in a position where we have
> to support the 9.4 JSONB format/compression technique for years so that
> users aren't blocked from upgrading.

FWIW, if we take the delicious JSON data as representative, a table
storing that data as jsonb is 1374 MB in size. Whereas in an equivalent
table with the data typed using the original json datatype (but with
white space differences more or less ignored, because it was created
using a jsonb -> json cast), the same data is 1352 MB.

Larry's complaint is valid; this is a real problem, and I'd like to
fix it before 9.4 is out. However, let us not lose sight of the fact
that JSON data is usually a poor target for TOAST compression. With
idiomatic usage, redundancy is very much more likely to appear across
rows, and not within individual Datums. Frankly, we aren't doing a
very good job there, and doing better requires an alternative
strategy.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Peter Geoghegan
On Fri, Aug 8, 2014 at 12:41 PM, Ants Aasma  wrote:
> I don't think binary search is the main problem here. Objects are
> usually reasonably sized, while arrays are more likely to be huge. To
> make matters worse, jsonb -> int goes from O(1) to O(n).

I don't think it's true that arrays are more likely to be huge. That
regression would be bad, but jsonb -> int is not the most compelling
operator by far. The indexable operators (in particular, @>) don't
support subscripting arrays like that, and with good reason.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Hannu Krosing
On 08/08/2014 06:17 AM, Tom Lane wrote:
> I looked into the issue reported in bug #11109.  The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
How hard and how expensive would it be to teach pg_lzcompress to
apply a delta filter on suitable data?

So that, instead of the integers themselves, their deltas would be fed
to the "real" compressor.
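
Something like this minimal sketch (purely illustrative, not
pg_lzcompress code): a strictly-increasing offset array is rewritten as
differences before compression, which turns it into highly repetitive
data, and a running sum reverses the transform after decompression.

#include <stdint.h>
#include <stdio.h>

#define N 8

int
main(void)
{
	/* end offsets of eight identical 5-byte elements: strictly increasing */
	uint32_t	offsets[N] = {5, 10, 15, 20, 25, 30, 35, 40};
	uint32_t	prev = 0;
	int			i;

	/* delta-encode in place: the result is eight copies of "5" */
	for (i = 0; i < N; i++)
	{
		uint32_t	cur = offsets[i];

		offsets[i] = cur - prev;
		prev = cur;
	}
	for (i = 0; i < N; i++)
		printf("%u ", offsets[i]);	/* prints: 5 5 5 5 5 5 5 5 */
	printf("\n");

	/* a running sum reverses the transform after decompression */
	prev = 0;
	for (i = 0; i < N; i++)
	{
		offsets[i] += prev;
		prev = offsets[i];
	}
	return 0;
}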


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Ants Aasma
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan  wrote:
>> I took a quick look and saw that this wouldn't be that easy to get around.
>> As I'd suspected upthread, there are places that do random access into a
>> JEntry array, such as the binary search in findJsonbValueFromContainer().
>> If we have to add up all the preceding lengths to locate the corresponding
>> value part, we lose the performance advantages of binary search.  AFAICS
>> that's applied directly to the on-disk representation.  I'd thought
>> perhaps there was always a transformation step to build a pointer list,
>> but nope.
>
> It would be interesting to know what the performance hit would be if we
> calculated the offsets/pointers on the fly, especially if we could cache it
> somehow. The main benefit of binary search is in saving on comparisons,
> especially of strings, ISTM, and that could still be available - this would
> just be a bit of extra arithmetic.

I don't think binary search is the main problem here. Objects are
usually reasonably sized, while arrays are more likely to be huge. To
make matters worse, jsonb -> int goes from O(1) to O(n).

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Alexander Korotkov
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev  wrote:

> Curious idea: we could swap the JEntry array and the values: values at the
>> beginning of the type will be caught by pg_lzcompress. But we will need to
>> know the offset of the JEntry array, so the header will grow to 8 bytes
>> (actually, it will be a varlena header!)
>>
>
> Maybe I wasn't clear: the jsonb type will start with the string collection
> instead of the JEntry array; the JEntry array will be placed at the end of
> the object/array. So pg_lzcompress will find repeatable 4-byte pieces in the
> first 1024 bytes of the jsonb.


Another idea I have is that storing an offset in each JEntry is not necessary
to get the benefit of binary search. Namely, what if we store offsets in every
8th JEntry and lengths in the others? The speed of binary search will be about
the same: the only overhead is calculating offsets within the 8-entry chunk.
But lengths will probably repeat.
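
A rough sketch of the lookup arithmetic under that scheme (illustrative
only, not actual jsonb code; I assume here that entries at indexes
divisible by 8 store an end offset and all other entries store lengths):

#include <stdint.h>
#include <stdio.h>

#define STRIDE 8

/* entries[i] is an end offset when i % STRIDE == 0, a length otherwise */
uint32_t
entry_end_offset(const uint32_t *entries, size_t i)
{
	uint32_t	off = 0;

	while (i % STRIDE != 0)
		off += entries[i--];	/* sum lengths back to the stored offset */
	return off + entries[i];
}

uint32_t
entry_start_offset(const uint32_t *entries, size_t i)
{
	return (i == 0) ? 0 : entry_end_offset(entries, i - 1);
}

int
main(void)
{
	/*
	 * Ten elements, each 4 bytes long.  Indexes 0 and 8 store end offsets
	 * (4 and 36); the rest store the repeating length 4, which is what
	 * gives pg_lzcompress something to work with.
	 */
	uint32_t	entries[10] = {4, 4, 4, 4, 4, 4, 4, 4, 36, 4};
	size_t		i;

	for (i = 0; i < 10; i++)
		printf("elem %zu: start %u, end %u\n",
			   i,
			   (unsigned) entry_start_offset(entries, i),
			   (unsigned) entry_end_offset(entries, i));
	return 0;
}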

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] psql: show only failed queries

2014-08-08 Thread Pavel Stehule
Hi


2014-08-08 13:58 GMT+02:00 Fujii Masao :

> On Fri, Aug 8, 2014 at 4:31 AM, Pavel Stehule 
> wrote:
> >
> >
> >
> > 2014-08-07 7:10 GMT+02:00 Fujii Masao :
> >
> >> On Thu, Aug 7, 2014 at 6:26 AM, Pavel Stehule 
> >> wrote:
> >> > Hello
> >> >
> >> > updated version patch in attachment
> >>
> >> Thanks! But ISTM you forgot to attached the patch.
> >
> >
> > grr .. I am sorry
>
> No problem. Thanks for the patch! Attached is the revised version of the
> patch.
>
> >> >> +/* all psql known variables are included in list by default */
> >> >> +for (known_varname = known_varnames; *known_varname;
> >> >> known_varname++)
> >> >> +varnames[nvars++] = pg_strdup(*known_varname);
> >> >>
> >> >> Don't we need to append both prefix and suffix to even known
> variables?
> >> >
> >> >
> >> > ??? I am not sure if I understand well - probably system "read only"
> >> > variables as DBNAME, USER, VERSION should not be there
> >>
> >> I had that question because complete_from_variables() is also called by
> >> the
> >> tab-completion of "\echo :" and it shows half-baked variables list. So
> >> I thought probably we need to change complete_from_variables() more to
> >> fix the problem.
> >
> >
> > I understand now.
> >
> > I fixed it.
> >
> > I have a question. \echo probably should not show empty known variables.
> >
> > The data for autocompletion of \echo should be the same as the result of "\set"
>
> Agreed. I think that only the variables having the set values should be
> displayed in "\echo :" case. So I modified complete_from_variables()
> so that the unset variables are not shown in that case but all the
> variables
> are shown in the tab-completion of "\set".
>
> >> >> +else if (strcmp(prev2_wd, "\\set") == 0)
> >> >> +{
> >> >> +if (strcmp(prev_wd, "AUTOCOMMIT") == 0)
> >> >>
> >> >> ISTM that some psql variables like IGNOREEOF are not there. Why not?
> >> >
> >> >
> >> > yes, there are not complete for DBNAME, ENCODING, FETCH_COUNT,
> >> > HISTCONTROL,
> >> > HISTFILE, HISTSIZE, HOST, IGNOREEOFF, PROMPT*,USER,  VERSION
> >> >
> >> > There are more reasons:
> >> >
> >> > * parameter is not an enum (string, number or both): FETCH_COUNT,
> PROMPT,
> >> > HISTSIZE, ..
> >> >
> >> > * variable is pseudo read-only - a change has no effect: DBNAME,
> >> > ENCODING, VERSION
> >>
> >> So HISTCONTROL should be there because it doesn't have such reasons at
> >> all?
> >>
> >
> > yes
>
> ISTM that you forgot to add HISTCONTROL to your patch. So I just added
> that.
>
> I added the tab-completion for "\unset" command. Like "\echo :", only
> the variables having the set values should be displayed in "\unset" case.
>

perfect


>
> I changed complete_from_variables() so that it checks the memory size of
> the variable array even when appending the known variables. If the memory
> size is not enough, it's dynamically increased. Practically this check
> would
> not be required because the initial size of the array is sufficiently larger
> than the number of the known variables. I added this as a safeguard.
>
> Typo: IGNOREEOFF -> IGNOREEOF
>
> I removed the value "none" from the value list of "ECHO" because it's not
> documented and a user might get confused when he or she sees the
> undocumented
> value "none". Thought?
>

Isn't it better to fix the doc? I don't know any reason why we should not
support "none".

I looked at the code; you removed the check against duplicate varnames in the
list. Is that OK?

Regards

Pavel



>
> Regards,
>
> --
> Fujii Masao
>


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Josh Berkus
On 08/08/2014 08:02 AM, Tom Lane wrote:
> 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.
> 
> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me.  We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I would rather ship late than ship a noncompressible JSONB.

Once we ship 9.4, many users are going to load 100's of GB into JSONB
fields.  Even if we fix the compressibility issue in 9.5, those users
won't be able to fix the compression without rewriting all their data,
which could be prohibitive.  And we'll be in a position where we have
to support the 9.4 JSONB format/compression technique for years so that
users aren't blocked from upgrading.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Specifying the unit in storage parameter

2014-08-08 Thread Josh Berkus
On 08/07/2014 08:32 PM, Fujii Masao wrote:
> This is not user-friendly. I'd like to propose the attached patch which
> introduces the infrastructure which allows us to specify the unit when
> setting INTEGER storage parameter like autovacuum_vacuum_cost_delay.
> Comment? Review?

No review, but thank you for doing this!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Bruce Momjian
On Fri, Aug  8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
> 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
betas and report the problem JSONB columns.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Teodor Sigaev

Curious idea: we could swap the JEntry array and the values: values at the
beginning of the type will be caught by pg_lzcompress. But we will need to
know the offset of the JEntry array, so the header will grow to 8 bytes
(actually, it will be a varlena header!)


Maybe I wasn't clear: the jsonb type will start with the string collection
instead of the JEntry array; the JEntry array will be placed at the end of
the object/array. So pg_lzcompress will find repeatable 4-byte pieces in
the first 1024 bytes of the jsonb.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/




Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org

2014-08-08 Thread Joshua D. Drake


On 08/07/2014 10:12 PM, Stephen Frost wrote:


If you want the specific version, update your deb line.  Don't complain
because you used the Debian repo that follows the Debian guidelines and
didn't like what you got- that's not going to change.


May I have a link? Because I would argue that the "latest" version of
the library for 9.2 is the library that ships with 9.2.9, not the one
that ships with 9.3.5.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Teodor Sigaev

value-to-be-compressed.  (first_success_by is 1024 in the default set of
compression parameters.)


Curious idea: we could swap the JEntry array and the values: values at the
beginning of the type will be caught by pg_lzcompress. But we will need to
know the offset of the JEntry array, so the header will grow to 8 bytes
(actually, it will be a varlena header!)



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/




Re: [HACKERS] replication commands and log_statements

2014-08-08 Thread Fujii Masao
On Sat, Aug 9, 2014 at 12:29 AM, Robert Haas  wrote:
> On Fri, Aug 8, 2014 at 11:00 AM, Bruce Momjian  wrote:
>> On Fri, Aug  8, 2014 at 08:51:13AM -0400, Robert Haas wrote:
>>> On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen  
>>> wrote:
>>> > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote:
>>> >> That is, we log replication commands only when log_statement is set to
>>> >> all. Neither new parameter is introduced nor log_statement is
>>> >> redefined as a list.
>>> >
>>> > That sounds good to me.
>>>
>>> It sounds fairly unprincipled to me.  I liked the idea of making
>>> log_statement a list, but if we aren't gonna do that, I think this
>>> should be a separate parameter.
>>
>> I am unclear there is enough demand for a separate replication logging
>> parameter --- using log_statement=all made sense to me.
>
> Most people don't want to turn on log_statement=all because it
> produces too much log volume.
>
> See, for example:
> http://bonesmoses.org/2014/08/05/on-postgresql-logging-verbosity/
>
> But logging replication commands is quite low-volume, so it is not
> hard to imagine someone wanting to log all replication commands but
> not all SQL statements.

You can do that by executing
"ALTER ROLE  SET log_statement TO 'all'".
If you don't use the replication user to execute SQL statements,
no SQL statements are logged in that setting.

Regards,

-- 
Fujii Masao




Re: [HACKERS] A worst case for qsort

2014-08-08 Thread Peter Geoghegan
On Fri, Aug 8, 2014 at 5:54 AM, Robert Haas  wrote:
> Well, I'm not sure why you're having a hard time imagining it.
> Presorted input is a common case in general; that's why we have a
> check for it.  That check adds overhead in the non-pre-sorted case to
> improve the pre-sorted case, and nobody's ever argued for removing it
> that I can recall.

I think that there has been a fair amount of skepticism - e.g., [1]

But that's beside the point. What I mean is that I think that the
intersection of those three things - pre-sorted input, with all
differences after the first 8 bytes, and with a user requirement to
sort using the column - is fairly rare in practice. It's not
impossible, but it is fairly rare. If that was the only case that was
actually regressed, even taking into account fmgr elision, I think
that would be quite reasonable. It wouldn't be massively regressed
either, and it's a case that's already very fast relative to other
systems anyway, if you're lucky enough to get it.
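
(For reference, the pre-sorted check being discussed is essentially this
pattern - a simplified sketch, not the actual pg_qsort/tuplesort code:)

#include <stdlib.h>

/*
 * One linear scan before sorting: costs up to n-1 extra comparisons in
 * the non-sorted case, but returns immediately when the input is already
 * in order.
 */
void
sort_with_presort_check(void *base, size_t nel, size_t width,
						int (*cmp) (const void *, const void *))
{
	char	   *a = base;
	size_t		i;

	for (i = 1; i < nel; i++)
	{
		if (cmp(a + (i - 1) * width, a + i * width) > 0)
			break;				/* out of order: fall through to a real sort */
	}
	if (nel < 2 || i == nel)
		return;					/* already sorted */

	qsort(base, nel, width, cmp);
}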

You'd better have exactly sorted tuples, though. It's been my
observation that one slight difference can drastically alter the
outcome [2].

[1] http://www.postgresql.org/message-id/18033.1361789...@sss.pgh.pa.us
[2] 
http://www.postgresql.org/message-id/caeylb_w++uhrcwprzg9tybvf7sn-c1s9olbabvavpgdep2d...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Andrew Dunstan


On 08/08/2014 11:54 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 08/08/2014 11:18 AM, Tom Lane wrote:

That's not really the issue here, I think.  The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.

It would certainly be worth validating that changing this would fix the
problem.
I don't know how invasive that would be - I suspect (without looking
very closely) not terribly much.

I took a quick look and saw that this wouldn't be that easy to get around.
As I'd suspected upthread, there are places that do random access into a
JEntry array, such as the binary search in findJsonbValueFromContainer().
If we have to add up all the preceding lengths to locate the corresponding
value part, we lose the performance advantages of binary search.  AFAICS
that's applied directly to the on-disk representation.  I'd thought
perhaps there was always a transformation step to build a pointer list,
but nope.





It would be interesting to know what the performance hit would be if we 
calculated the offsets/pointers on the fly, especially if we could cache 
it somehow. The main benefit of binary search is in saving on 
comparisons, especially of strings, ISTM, and that could still be 
available - this would just be a bit of extra arithmetic.


cheers

andrew





Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Andrew Dunstan


On 08/08/2014 12:04 PM, John W Higgins wrote:



Would an answer be to switch the location of the jsonb "header" data 
to the end of the field as opposed to the beginning of the field? That 
would allow pglz to see what it wants to see early on and go to work 
when possible?


Add an offset at the top of the field to show where to look - but then 
it would be the same in terms of functionality outside of that? Or 
pretty close?




That might make building up jsonb structures piece by piece, as we do,
difficult.


cheers

andrew







Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
John W Higgins  writes:
> Would an answer be to switch the location of the jsonb "header" data to the
> end of the field as opposed to the beginning of the field? That would allow
> pglz to see what it wants to see early on and go to work when possible?

Hm, might work.  Seems a bit odd, but it would make pglz_compress happier.

OTOH, the big-picture issue here is that jsonb is generating
noncompressible data in the first place.  Putting it somewhere else in the
Datum doesn't change the fact that we're going to have bloated storage,
even if we dodge the early-exit problem.  (I suspect the compression
disadvantage vs text/plain-json that I showed yesterday is coming largely
from that offset array.)  But I don't currently see how to avoid that and
still preserve the fast binary-search key lookup property, which is surely
a nice thing to have.

regards, tom lane




Re: [HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread Tom Lane
David Rowley  writes:
> I wasn't quite sure if it was possible to include the same column twice in
> a foreign key, so I tested

> create table r1 (a int);
> create table r2 (b int);
> create unique index on r2(b,b);
> alter table r1 add constraint r2_b_fkey foreign key (a,a) references r2
> (b,b);
> ERROR:  cache lookup failed for opclass 0

Ouch.

> The attached seems to fix the problem, but the whole thing makes me wonder
> if this is even meant to be allowed? I was thinking that this might be a
> good time to disallow this altogether, since it's already broken and looks
> like it has been for about 11 years

We've gone out of our way in the past to allow duplicate index columns
(eg commit cfc5008a51f4), so I'm not sure why we'd not allow such indexes
to be used as foreign key references.  The example you posted above does
look pretty pointless, but something like this is perhaps less so:

regression=# create table r1 (a int, c int);
CREATE TABLE
regression=# create table r2 (b int);
CREATE TABLE
regression=# create unique index on r2(b,b);
CREATE INDEX
regression=# alter table r1 add constraint r2_b_fkey foreign key (a,c) 
references r2
(b,b);
ERROR:  cache lookup failed for opclass 0

especially when using nondefault FK match rules.

regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread John W Higgins
On Fri, Aug 8, 2014 at 8:02 AM, Tom Lane  wrote:

> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>
> > I'm rather disinclined to change the on-disk format because of this
> > specific test, that feels a bit like the tail wagging the dog to me,
> > especially as I do hope that some day we'll figure out a way to use a
> > better compression algorithm than pglz.
>
> I'm unimpressed by that argument too, for a number of reasons:
>
> 1. The real problem here is that jsonb is emitting quite a bit of
> fundamentally-nonrepetitive data, even when the user-visible input is very
> repetitive.  That's a compression-unfriendly transformation by anyone's
> measure.  Assuming that some future replacement for pg_lzcompress() will
> nonetheless be able to compress the data strikes me as mostly wishful
> thinking.  Besides, we'd more than likely have a similar early-exit rule
> in any substitute implementation, so that we'd still be at risk even if
> it usually worked.
>

Would an answer be to switch the location of the jsonb "header" data to the
end of the field as opposed to the beginning of the field? That would allow
pglz to see what it wants to see early on and go to work when possible?

Add an offset at the top of the field to show where to look - but then it
would be the same in terms of functionality outside of that? Or pretty
close?

John


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
Andrew Dunstan  writes:
> On 08/08/2014 11:18 AM, Tom Lane wrote:
>> That's not really the issue here, I think.  The problem is that a
>> relatively minor aspect of the representation, namely the choice to store
>> a series of offsets rather than a series of lengths, produces
>> nonrepetitive data even when the original input is repetitive.

> It would certainly be worth validating that changing this would fix the 
> problem.
> I don't know how invasive that would be - I suspect (without looking 
> very closely) not terribly much.

I took a quick look and saw that this wouldn't be that easy to get around.
As I'd suspected upthread, there are places that do random access into a
JEntry array, such as the binary search in findJsonbValueFromContainer().
If we have to add up all the preceding lengths to locate the corresponding
value part, we lose the performance advantages of binary search.  AFAICS
that's applied directly to the on-disk representation.  I'd thought
perhaps there was always a transformation step to build a pointer list,
but nope.

regards, tom lane




[HACKERS] Defining a foreign key with a duplicate column is broken

2014-08-08 Thread David Rowley
In master I've been testing some new code that I'm working on around
foreign keys.
I wasn't quite sure if it was possible to include the same column twice in
a foreign key, so I tested

create table r1 (a int);
create table r2 (b int);
create unique index on r2(b,b);
alter table r1 add constraint r2_b_fkey foreign key (a,a) references r2
(b,b);
ERROR:  cache lookup failed for opclass 0

Which is not quite any of the messages that I would have expected

I've tracked this down to a small bug in transformFkeyCheckAttrs(),
where opclasses[1] won't get set because the test if (attnums[0] ==
indexStruct->indkey.values[1]) matches and then breaks out of the inner
loop. Instead, opclasses[0] gets set twice.

The attached seems to fix the problem, but the whole thing makes me wonder
if this is even meant to be allowed? I was thinking that this might be a
good time to disallow this altogether, since it's already broken and looks
like it has been for about 11 years

Disallowing it would simplify some code in my semi/anti join removal patch
that I posted here
http://www.postgresql.org/message-id/CAApHDvpCBEfuc5tD=vniepAv0pU5m=q=foqzcodmheei7oq...@mail.gmail.com

Any thoughts?

Regards

David Rowley


transformFkeyCheckAttrs_fix.patch
Description: Binary data



Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Andrew Dunstan


On 08/08/2014 11:18 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 08/07/2014 11:17 PM, Tom Lane wrote:

I looked into the issue reported in bug #11109.  The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.



Back when this structure was first presented at pgCon 2013, I wondered
if we shouldn't extract the strings into a dictionary, because of key
repetition, and convinced myself that this shouldn't be necessary
because in significant cases TOAST would take care of it.

That's not really the issue here, I think.  The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.



It would certainly be worth validating that changing this would fix the 
problem.


I don't know how invasive that would be - I suspect (without looking 
very closely) not terribly much.



2. Are we going to ship 9.4 without fixing this?  I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.

Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me.  We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.





Yeah, it would be a bit painful, but after all finding out this sort of 
thing is why we have betas.



cheers

andrew




Re: [HACKERS] replication commands and log_statements

2014-08-08 Thread Robert Haas
On Fri, Aug 8, 2014 at 11:00 AM, Bruce Momjian  wrote:
> On Fri, Aug  8, 2014 at 08:51:13AM -0400, Robert Haas wrote:
>> On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen  
>> wrote:
>> > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote:
>> >> That is, we log replication commands only when log_statement is set to
>> >> all. Neither new parameter is introduced nor log_statement is
>> >> redefined as a list.
>> >
>> > That sounds good to me.
>>
>> It sounds fairly unprincipled to me.  I liked the idea of making
>> log_statement a list, but if we aren't gonna do that, I think this
>> should be a separate parameter.
>
> I am unclear there is enough demand for a separate replication logging
> parameter --- using log_statement=all made sense to me.

Most people don't want to turn on log_statement=all because it
produces too much log volume.

See, for example:
http://bonesmoses.org/2014/08/05/on-postgresql-logging-verbosity/

But logging replication commands is quite low-volume, so it is not
hard to imagine someone wanting to log all replication commands but
not all SQL statements.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
Andrew Dunstan  writes:
> On 08/07/2014 11:17 PM, Tom Lane wrote:
>> I looked into the issue reported in bug #11109.  The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.

> Ouch.

> Back when this structure was first presented at pgCon 2013, I wondered 
> if we shouldn't extract the strings into a dictionary, because of key 
> repetition, and convinced myself that this shouldn't be necessary 
> because in significant cases TOAST would take care of it.

That's not really the issue here, I think.  The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
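
To illustrate with a toy example (not the actual jsonb code): for ten
identical 5-byte elements, the two representations look like this on a
little-endian machine, and only the second gives the compressor any
repeated 4-byte sequences to latch onto.

#include <stdint.h>
#include <stdio.h>

static void
dump_bytes(const char *label, const void *p, size_t len)
{
	const unsigned char *b = p;
	size_t		i;

	printf("%-9s", label);
	for (i = 0; i < len; i++)
		printf("%02x", b[i]);
	printf("\n");
}

int
main(void)
{
	/* ten identical 5-byte elements, described two different ways */
	uint32_t	offsets[10];
	uint32_t	lengths[10];
	int			i;

	for (i = 0; i < 10; i++)
	{
		lengths[i] = 5;				/* 05000000 repeated over and over */
		offsets[i] = (i + 1) * 5;	/* 05000000 0a000000 0f000000 ... */
	}

	dump_bytes("offsets:", offsets, sizeof(offsets));
	dump_bytes("lengths:", lengths, sizeof(lengths));
	return 0;
}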

> Maybe we should have pglz_compress() look at the *last* 1024 bytes if it 
> can't find anything worth compressing in the first, for values larger 
> than a certain size.

Not possible with anything like the current implementation, since it's
just an on-the-fly status check not a trial compression.

> It's worth noting that this is a fairly pathological case. AIUI the 
> example you constructed has an array with 100k string elements. I don't 
> think that's typical. So I suspect that unless I've misunderstood the 
> statement of the problem we're going to find that almost all the jsonb 
> we will be storing is still compressible.

Actually, the 100K-string example I constructed *did* compress.  Larry's
example that's not compressing is only about 12kB.  AFAICS, the threshold
for trouble is in the vicinity of 256 array or object entries (resulting
in a 1kB JEntry array).  That doesn't seem especially high.  There is a
probabilistic component as to whether the early-exit case will actually
fire, since any chance hash collision will probably result in some 3-byte
offset prefix getting compressed.  But the fact that a beta tester tripped
over this doesn't leave me with a warm feeling about the odds that it
won't happen much in the field.

regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> I looked into the issue reported in bug #11109.  The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.
>> This interacts poorly with the code in pglz_compress() that gives up if
>> it's found nothing compressible in the first first_success_by bytes of a
>> value-to-be-compressed.  (first_success_by is 1024 in the default set of
>> compression parameters.)

> I haven't looked at this in any detail, so take this with a grain of
> salt, but what about teaching pglz_compress about using an offset
> farther into the data, if the incoming data is quite a bit larger than
> 1k?  This is just a test to see if it's worthwhile to keep going, no?

Well, the point of the existing approach is that it's a *nearly free*
test to see if it's worthwhile to keep going; there's just one if-test
added in the outer loop of the compression code.  (cf commit ad434473ebd2,
which added that along with some other changes.)  AFAICS, what we'd have
to do to do it as you suggest would be to execute compression on some subset
of the data and then throw away that work entirely.  I do not find that
attractive, especially when for most datatypes there's no particular
reason to look at one subset instead of another.
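
To make the shape of that concrete, a toy paraphrase (emphatically not
the real pg_lzcompress code): scan for 4-byte matches against earlier
input, and if nothing has matched by the time first_success_by bytes
have been consumed, give up.  The give-up test itself is a single
comparison per iteration of the outer loop.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

bool
toy_looks_compressible(const unsigned char *src, size_t slen,
					   size_t first_success_by)
{
	size_t		pos;
	bool		found_match = false;

	for (pos = 0; pos + 4 <= slen; pos++)
	{
		size_t		back;

		if (!found_match && pos >= first_success_by)
			return false;		/* nothing compressible so far: bail out */

		/* naive search for an earlier occurrence of the next 4 bytes */
		for (back = 0; back < pos; back++)
		{
			if (memcmp(src + back, src + pos, 4) == 0)
			{
				found_match = true;
				break;
			}
		}
	}
	return found_match;
}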

> I'm rather disinclined to change the on-disk format because of this
> specific test, that feels a bit like the tail wagging the dog to me,
> especially as I do hope that some day we'll figure out a way to use a
> better compression algorithm than pglz.

I'm unimpressed by that argument too, for a number of reasons:

1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive.  That's a compression-unfriendly transformation by anyone's
measure.  Assuming that some future replacement for pg_lzcompress() will
nonetheless be able to compress the data strikes me as mostly wishful
thinking.  Besides, we'd more than likely have a similar early-exit rule
in any substitute implementation, so that we'd still be at risk even if
it usually worked.

2. Are we going to ship 9.4 without fixing this?  I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.

Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me.  We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.

regards, tom lane




Re: [HACKERS] replication commands and log_statements

2014-08-08 Thread Bruce Momjian
On Fri, Aug  8, 2014 at 08:51:13AM -0400, Robert Haas wrote:
> On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen  
> wrote:
> > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote:
> >> That is, we log replication commands only when log_statement is set to
> >> all. Neither new parameter is introduced nor log_statement is
> >> redefined as a list.
> >
> > That sounds good to me.
> 
> It sounds fairly unprincipled to me.  I liked the idea of making
> log_statement a list, but if we aren't gonna do that, I think this
> should be a separate parameter.

I am unclear there is enough demand for a separate replication logging
parameter --- using log_statement=all made sense to me.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-08-08 Thread Michael Paquier
On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao  wrote:
> The attached patch introduces...
A patch perhaps? :)
-- 
Michael




Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-08-08 Thread Fujii Masao
On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao  wrote:
> On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane  wrote:
>> Fujii Masao  writes:
>>> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas  wrote:
 Should we try to install some hack around fastupdate for 9.4?  I fear
 the divergence between reasonable values of work_mem and reasonable
 sizes for that list is only going to continue to get bigger.  I'm sure
 there's somebody out there who has work_mem = 16GB, and stuff like
 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
 appeal of large values.
>>
>>> Controlling the threshold of the size of pending list only by GUC doesn't
>>> seem reasonable. Users may want to increase the threshold only for the
>>> GIN index which can be updated heavily, and decrease it otherwise. So
>>> I think that it's better to add new storage parameter for GIN index to 
>>> control
>>> the threshold, or both storage parameter and GUC.
>>
>> Yeah, -1 for a GUC.  A GIN-index-specific storage parameter seems more
>> appropriate.
>
> The attached patch introduces the GIN index storage parameter
> "PENDING_LIST_CLEANUP_SIZE", which specifies the maximum size of
> the GIN pending list. If it's not set, work_mem is used as that maximum
> size instead. So this patch doesn't break existing applications that
> currently use work_mem as the threshold for the cleanup operation of
> the pending list; they only have to leave PENDING_LIST_CLEANUP_SIZE unset.
>
> This is an index storage parameter, and allows us to specify each
> threshold per GIN index. So, for example, we can increase the threshold
> only for the GIN index which can be updated heavily, and decrease it 
> otherwise.
>
> This patch uses another patch that I proposed (*1) as an infrastructure.
> Please apply that infrastructure patch first if you apply this patch.
>
> (*1)
> http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com
>
> Regards,
>
> --
> Fujii Masao

Sorry, I forgot to attach the patch. This time it's attached.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/gin.sgml
--- b/doc/src/sgml/gin.sgml
***
*** 728,735 
 from the indexed item). As of PostgreSQL 8.4,
 GIN is capable of postponing much of this work by inserting
 new tuples into a temporary, unsorted list of pending entries.
!When the table is vacuumed, or if the pending list becomes too large
!(larger than ), the entries are moved to the
 main GIN data structure using the same bulk insert
 techniques used during initial index creation.  This greatly improves
 GIN index update speed, even counting the additional
--- 728,736 
 from the indexed item). As of PostgreSQL 8.4,
 GIN is capable of postponing much of this work by inserting
 new tuples into a temporary, unsorted list of pending entries.
!When the table is vacuumed, or if the pending list becomes larger than
!PENDING_LIST_CLEANUP_SIZE (or
! if not set), the entries are moved to the
 main GIN data structure using the same bulk insert
 techniques used during initial index creation.  This greatly improves
 GIN index update speed, even counting the additional
***
*** 812,829 

  

!
 
  
   During a series of insertions into an existing GIN
   index that has FASTUPDATE enabled, the system will clean up
   the pending-entry list whenever the list grows larger than
!  work_mem.  To avoid fluctuations in observed response time,
!  it's desirable to have pending-list cleanup occur in the background
!  (i.e., via autovacuum).  Foreground cleanup operations can be avoided by
!  increasing work_mem or making autovacuum more aggressive.
!  However, enlarging work_mem means that if a foreground
!  cleanup does occur, it will take even longer.
  
 

--- 813,839 

  

!PENDING_LIST_CLEANUP_SIZE and
!
 
  
   During a series of insertions into an existing GIN
   index that has FASTUPDATE enabled, the system will clean up
   the pending-entry list whenever the list grows larger than
!  PENDING_LIST_CLEANUP_SIZE (if not set, work_mem
!  is used as that threshold, instead). To avoid fluctuations in observed
!  response time, it's desirable to have pending-list cleanup occur in the
!  background (i.e., via autovacuum).  Foreground cleanup operations
!  can be avoided by increasing PENDING_LIST_CLEANUP_SIZE
!  (or work_mem) or making autovacuum more aggressive.
!  However, enlarging the threshold of the cleanup operation means that
!  if a foreground cleanup does occur, it will take even longer.
! 
! 
!  PENDING_LIST_CLEANUP_SIZE is an index storage parameter,
!  and allows each GIN index to have its own cleanup threshold.
!  For example, it's possible to increase the threshold only for the GIN
!  index which can be updated heavily, and decreas

PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-08-08 Thread Fujii Masao
On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane  wrote:
> Fujii Masao  writes:
>> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas  wrote:
>>> Should we try to install some hack around fastupdate for 9.4?  I fear
>>> the divergence between reasonable values of work_mem and reasonable
>>> sizes for that list is only going to continue to get bigger.  I'm sure
>>> there's somebody out there who has work_mem = 16GB, and stuff like
>>> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
>>> appeal of large values.
>
>> Controlling the threshold of the size of pending list only by GUC doesn't
>> seem reasonable. Users may want to increase the threshold only for the
>> GIN index which can be updated heavily, and decrease it otherwise. So
>> I think that it's better to add new storage parameter for GIN index to 
>> control
>> the threshold, or both storage parameter and GUC.
>
> Yeah, -1 for a GUC.  A GIN-index-specific storage parameter seems more
> appropriate.

The attached patch introduces the GIN index storage parameter
"PENDING_LIST_CLEANUP_SIZE", which specifies the maximum size of
the GIN pending list. If it's not set, work_mem is used as that maximum
size instead. So this patch doesn't break existing applications that
currently use work_mem as the threshold for the cleanup operation of
the pending list; they only have to leave PENDING_LIST_CLEANUP_SIZE unset.

This is an index storage parameter, and allows us to specify each
threshold per GIN index. So, for example, we can increase the threshold
only for the GIN index which can be updated heavily, and decrease it otherwise.

This patch uses another patch that I proposed (*1) as an infrastructure.
Please apply that infrastructure patch first if you apply this patch.

(*1)
http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com

Regards,

-- 
Fujii Masao




Re: [HACKERS] [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

2014-08-08 Thread MauMau
I've tracked down the real root cause.  The fix is very simple.  Please 
check the attached one-liner patch.


The cause is that the temporary relations are truncated unconditionally 
regardless of whether they are accessed in the transaction or not.  That is, 
the following sequence of steps results in the hang:


1. A session creates a temporary table with ON COMMIT DELETE ROWS.  It adds 
the temporary table to the list of relations that should be truncated at 
transaction commit.


2. The session receives a sinval catchup signal (SIGUSR1) from another 
session.  It starts a transaction and processes sinval messages in the 
SIGUSR1 signal handler.  No WAL is output while processing the sinval 
messages.


3. When the transaction commits, the list of temporary relations is checked
to see if they need to be truncated.


4. The temporary table created in step 1 is truncated.  To truncate a 
relation, Access Exclusive lock is acquired on it.  When hot standby is 
used, acquiring an Access Exclusive lock generates a WAL record 
(RM_STANDBY_ID, XLOG_STANDBY_LOCK).


5. The transaction waits on a latch for a reply from a synchronous standby, 
because it wrote some WAL.  But the latch wait never returns, because the 
latch needs to receive SIGUSR1 but the SIGUSR1 handler is already in 
progress from step 2.



The correct behavior is for the transaction not to truncate the temporary 
table in step 4, because the transaction didn't use the temporary table.


I confirmed that the fix is already in 9.3 and 9.5devel, so I just copied 
the code fragment from 9.5devel to 9.2.9.  The attached patch is for 9.2.9. 
I didn't check 9.4 and other versions.  Why wasn't the fix applied to 9.2?


Finally, I found a very easy way to reproduce the problem:

1. On terminal session 1, start psql and run:
 CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS;
Leave the psql session open.

2. On terminal session 2, run:
 pgbench -c8 -t500 -s1 -n -f test.sql dbname
[test.sql]
CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS;
DROP TABLE t;

3. On the psql session on terminal session 1, run any SQL statement.  It 
doesn't reply.  The backend is stuck at SyncRepWaitForLSN().


Regards
MauMau


sinval_catchup_hang_v3.patch
Description: Binary data



Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Andrew Dunstan


On 08/07/2014 11:17 PM, Tom Lane wrote:

I looked into the issue reported in bug #11109.  The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed.  (first_success_by is 1024 in the default set of
compression parameters.)


[snip]


There is plenty of compressible data once we get into the repetitive
strings in the payload part --- but that starts at offset 944, and up to
that point there is nothing that pg_lzcompress can get a handle on.  There
are, by definition, no sequences of 4 or more repeated bytes in that area.
I think in principle pg_lzcompress could decide to compress the 3-byte
sequences consisting of the high-order 24 bits of each offset; but it
doesn't choose to do so, probably because of the way its lookup hash table
works:

  * pglz_hist_idx -
  *
  * Computes the history table slot for the lookup by the next 4
  * characters in the input.
  *
  * NB: because we use the next 4 characters, we are not guaranteed to
  * find 3-character matches; they very possibly will be in the wrong
  * hash list.  This seems an acceptable tradeoff for spreading out the
  * hash keys more.

For jsonb header data, the "next 4 characters" are *always* different, so
only a chance hash collision can result in a match.  There is therefore a
pretty good chance that no compression will occur before it gives up
because of first_success_by.

I'm not sure if there is any easy fix for this.  We could possibly change
the default first_success_by value, but I think that'd just be postponing
the problem to larger jsonb objects/arrays, and it would hurt performance
for genuinely incompressible data.  A somewhat painful, but not yet
out-of-the-question, alternative is to change the jsonb on-disk
representation.  Perhaps the JEntry array could be defined as containing
element lengths instead of element ending offsets.  Not sure though if
that would break binary searching for JSON object keys.





Ouch.

Back when this structure was first presented at pgCon 2013, I wondered 
if we shouldn't extract the strings into a dictionary, because of key 
repetition, and convinced myself that this shouldn't be necessary 
because in significant cases TOAST would take care of it.


Maybe we should have pglz_compress() look at the *last* 1024 bytes if it 
can't find anything worth compressing in the first, for values larger 
than a certain size.


It's worth noting that this is a fairly pathological case. AIUI the 
example you constructed has an array with 100k string elements. I don't 
think that's typical. So I suspect that unless I've misunderstood the 
statement of the problem we're going to find that almost all the jsonb 
we will be storing is still compressible.


cheers

andrew




Re: [HACKERS] A worst case for qsort

2014-08-08 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Aug 7, 2014 at 5:52 PM, Peter Geoghegan  wrote:
> > On Thu, Aug 7, 2014 at 2:34 PM, Rod Taylor  wrote:
> >> This one is frequently sorted as batch operations against the files are
> >> performed in alphabetical order to reduce conflict issues that a random
> >> ordering may cause between jobs.
> >
> > Sure. There are cases out there. But, again, I have a hard time
> > imagining why you'd expect those to be pre-sorted in practice, ...
> 
> Well, I'm not sure why you're having a hard time imagining it.
> Presorted input is a common case in general; that's why we have a
> check for it.  That check adds overhead in the non-pre-sorted case to
> improve the pre-sorted case, and nobody's ever argued for removing it
> that I can recall.

Agreed.  This is not all that uncommon to happen in practice.

That said, this is perhaps another good case where it'd be extremely
handy to have anonymized statistics from real systems, which would
allow us to more easily go *look* at this specific question.

There are organizations out there running many thousands of PG instances
who have expressed interest in supporting exactly that kind of
statistics gathering (indeed, I'm pretty sure Peter is familiar with
one... ;) - what we need is an architecture, design, and implementation
to make it happen.

I'm guessing you (Robert and Peter, especially) have already been
thinking about what we could do to make the above wish/dream a reality.
Perhaps if we could get the initial requirements down, someone would be
able to have time to work on making it happen; it would be really great
to see progress on this front for 9.5.  Or, if existing products
implement such metrics collection already, perhaps some numbers could
be shared with the community to help address this (and other)
questions.

Thanks,

Stephen




Re: [HACKERS] Minmax indexes

2014-08-08 Thread Heikki Linnakangas

Another race condition:

If a new tuple is inserted into the range while summarization runs, it's
possible that the new tuple isn't included in the index tuple that the
summarization calculated, nor does the insertion itself update it.


1. There is no index tuple for page range 1-10
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10, 
so it doesn't update it.
5. The summarization finishes scanning pages 5-10, and inserts the new 
index tuple. The summarization didn't see the newly inserted heap tuple, 
and hence it's not included in the calculated index tuple.


One idea is to do the summarization in two stages. First, insert a 
placeholder tuple, with no real value in it. A query considers the 
placeholder tuple the same as a missing tuple, ie. always considers it a 
match. An insertion updates the placeholder tuple with the value 
inserted, as if it was a regular mmtuple. After summarization has 
finished scanning the page range, it turns the placeholder tuple into a 
regular tuple, by unioning the placeholder value with the value formed 
by scanning the heap.
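
A toy model of that two-stage idea, for a single integer column (purely
illustrative, not the actual minmax-index code; the names are made up):

#include <stdbool.h>

typedef struct
{
	bool		is_placeholder; /* inserted before the heap scan starts */
	bool		has_data;		/* any value absorbed yet? */
	int			min;
	int			max;
} Summary;

/* what a concurrent insertion does to the (placeholder) summary */
void
summary_absorb(Summary *s, int value)
{
	if (!s->has_data || value < s->min)
		s->min = value;
	if (!s->has_data || value > s->max)
		s->max = value;
	s->has_data = true;
}

/* a placeholder (or an empty summary) must be treated as a match */
bool
summary_might_match(const Summary *s, int lo, int hi)
{
	if (s->is_placeholder || !s->has_data)
		return true;
	return s->max >= lo && s->min <= hi;
}

/* union the scanned summary into the placeholder, then promote it */
void
summary_finish(Summary *placeholder, const Summary *scanned)
{
	if (scanned->has_data)
	{
		summary_absorb(placeholder, scanned->min);
		summary_absorb(placeholder, scanned->max);
	}
	placeholder->is_placeholder = false;
}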


- Heikki





[HACKERS] pgcrypto: PGP armor headers

2014-08-08 Thread Marko Tiikkaja

Hi,

Currently there's no way to generate or extract armor headers from the 
PGP armored format in pgcrypto.  I've written a patch to add the 
support.  For example:


local:marko=#* select armor('zooka', array['Version', 'Comment'], 
array['Created by pgcrypto', 'PostgreSQL, the database']);

   armor
---
 -BEGIN PGP MESSAGE-  +
 Version: Created by pgcrypto +
 Comment: PostgreSQL, the database+
  +
 em9va2E= +
 =D5cR+
 -END PGP MESSAGE-+

local:marko=#* select pgp_armor_header(armor('zooka', array['Version', 
'Comment'], array['Created by pgcrypto', 'PostgreSQL, the database']), 
'Comment');

 pgp_armor_header
--
 PostgreSQL, the database
(1 row)


.marko
*** a/contrib/pgcrypto/Makefile
--- b/contrib/pgcrypto/Makefile
***
*** 26,32  MODULE_big   = pgcrypto
  OBJS  = $(SRCS:.c=.o) $(WIN32RES)
  
  EXTENSION = pgcrypto
! DATA = pgcrypto--1.1.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql
  PGFILEDESC = "pgcrypto - cryptographic functions"
  
  REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \
--- 26,32 
  OBJS  = $(SRCS:.c=.o) $(WIN32RES)
  
  EXTENSION = pgcrypto
! DATA = pgcrypto--1.2.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql
  PGFILEDESC = "pgcrypto - cryptographic functions"
  
  REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \
*** a/contrib/pgcrypto/expected/pgp-armor.out
--- b/contrib/pgcrypto/expected/pgp-armor.out
***
*** 102,104  em9va2E=
--- 102,362 
  -END PGP MESSAGE-
  ');
  ERROR:  Corrupt ascii-armor
+ -- corrupt
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ foo:
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+ ERROR:  Corrupt ascii-armor
+ -- empty
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ foo: 
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+  pgp_armor_header 
+ --
+  
+ (1 row)
+ 
+ -- simple
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ foo: bar
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+  pgp_armor_header 
+ --
+  bar
+ (1 row)
+ 
+ -- uninteresting keys, part 1
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ foo: found
+ bar: ignored
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+  pgp_armor_header 
+ --
+  found
+ (1 row)
+ 
+ -- uninteresting keys, part 2
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ bar: ignored
+ foo: found
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+  pgp_armor_header 
+ --
+  found
+ (1 row)
+ 
+ -- uninteresting keys, part 3
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ bar: ignored
+ foo: found
+ bar: ignored
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'foo');
+  pgp_armor_header 
+ --
+  found
+ (1 row)
+ 
+ -- insane keys, part 1
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ insane:key : 
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'insane:key ');
+  pgp_armor_header 
+ --
+  
+ (1 row)
+ 
+ -- insane keys, part 2
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ insane:key : text value here
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'insane:key ');
+  pgp_armor_header 
+ --
+  text value here
+ (1 row)
+ 
+ -- long value
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ long: this value is more than 76 characters long, but it should still parse correctly as that''s permitted by RFC 4880
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'long');
+                                                pgp_armor_header
+ -------------------------------------------------------------------------------------------------------------------
+  this value is more than 76 characters long, but it should still parse correctly as that's permitted by RFC 4880
+ (1 row)
+ 
+ -- long value, split up
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ long: this value is more than 76 characters long, but it should still 
+ long: parse correctly as that''s permitted by RFC 4880
+ 
+ em9va2E=
+ =
+ -END PGP MESSAGE-
+ ', 'long');
+                                                pgp_armor_header
+ -------------------------------------------------------------------------------------------------------------------
+  this value is more than 76 characters long, but it should still parse correctly as that's permitted by RFC 4880
+ (1 row)
+ 
+ -- long value, split up, part 2
+ select pgp_armor_header('
+ -BEGIN PGP MESSAGE-
+ long: this value is more than 
+ long: 76 characters long, but it should still 

Re: [HACKERS] A worst case for qsort

2014-08-08 Thread Robert Haas
On Thu, Aug 7, 2014 at 5:52 PM, Peter Geoghegan  wrote:
> On Thu, Aug 7, 2014 at 2:34 PM, Rod Taylor  wrote:
>> This one is frequently sorted as batch operations against the files are
>> performed in alphabetical order to reduce conflict issues that a random
>> ordering may cause between jobs.
>
> Sure. There are cases out there. But, again, I have a hard time
> imagining why you'd expect those to be pre-sorted in practice, ...

Well, I'm not sure why you're having a hard time imagining it.
Presorted input is a common case in general; that's why we have a
check for it.  That check adds overhead in the non-pre-sorted case to
improve the pre-sorted case, and nobody's ever argued for removing it
that I can recall.
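
For reference, the check in question amounts to one linear scan before any
partitioning happens; a simplified standalone sketch (not the exact code in
src/port/qsort.c) looks like this:

#include <stdio.h>
#include <stdlib.h>

typedef int (*cmp_fn) (const void *a, const void *b);

/* Return 1 if the array is already sorted according to cmp. */
static int
already_sorted(const char *base, size_t n, size_t es, cmp_fn cmp)
{
	size_t		i;

	for (i = 1; i < n; i++)
	{
		if (cmp(base + (i - 1) * es, base + i * es) > 0)
			return 0;			/* out of order: fall through to the sort */
	}
	return 1;					/* O(n) early exit on pre-sorted input */
}

static void
sort_with_presort_check(void *base, size_t n, size_t es, cmp_fn cmp)
{
	if (n < 2 || already_sorted(base, n, es, cmp))
		return;					/* the pre-sorted fast path */
	qsort(base, n, es, cmp);	/* stand-in for the actual quicksort body */
}

static int
cmp_int(const void *a, const void *b)
{
	const int  *ia = (const int *) a;
	const int  *ib = (const int *) b;

	return (*ia > *ib) - (*ia < *ib);
}

int
main(void)
{
	int			sorted[] = {1, 2, 3, 4};	/* pays one scan, skips the sort */
	int			shuffled[] = {3, 1, 4, 2};	/* pays the scan plus the sort */

	sort_with_presort_check(sorted, 4, sizeof(int), cmp_int);
	sort_with_presort_check(shuffled, 4, sizeof(int), cmp_int);
	printf("%d %d %d %d\n", shuffled[0], shuffled[1], shuffled[2], shuffled[3]);
	return 0;
}

The trade-off under discussion is exactly that every non-pre-sorted input
pays for that up-front scan (up to the first out-of-order pair) in exchange
for the O(n) win on pre-sorted input.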

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] replication commands and log_statements

2014-08-08 Thread Robert Haas
On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen  wrote:
> At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote:
>> That is, we log replication commands only when log_statement is set to
>> all. Neither new parameter is introduced nor log_statement is
>> redefined as a list.
>
> That sounds good to me.

It sounds fairly unprincipled to me.  I liked the idea of making
log_statement a list, but if we aren't gonna do that, I think this
should be a separate parameter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql: show only failed queries

2014-08-08 Thread Fujii Masao
On Fri, Aug 8, 2014 at 4:31 AM, Pavel Stehule  wrote:
>
>
>
> 2014-08-07 7:10 GMT+02:00 Fujii Masao :
>
>> On Thu, Aug 7, 2014 at 6:26 AM, Pavel Stehule 
>> wrote:
>> > Hello
>> >
>> > updated version patch in attachment
>>
>> Thanks! But ISTM you forgot to attach the patch.
>
>
> grr .. I am sorry

No problem. Thanks for the patch! Attached is the revised version of the patch.

>> >> +/* all psql known variables are included in list by default */
>> >> +for (known_varname = known_varnames; *known_varname;
>> >> known_varname++)
>> >> +varnames[nvars++] = pg_strdup(*known_varname);
>> >>
>> >> Don't we need to append both prefix and suffix to even known variables?
>> >
>> >
>> > ??? I am not sure if I understand well - probably system "read only"
>> > variables such as DBNAME, USER, VERSION should not be there
>>
>> I had that question because complete_from_variables() is also called by
>> the tab-completion of "\echo :" and it shows a half-baked variables list.
>> So I thought we probably need to change complete_from_variables() more to
>> fix the problem.
>
>
> I understand now.
>
> I fixed it.
>
> I have a question: \echo probably should not show empty known variables.
>
> The data for autocompletion of \echo should be the same as the result of "\set"

Agreed. I think that only the variables that have been set should be
displayed in the "\echo :" case. So I modified complete_from_variables()
so that unset variables are not shown in that case, while all the variables
are still shown in the tab-completion of "\set".

>> >> +else if (strcmp(prev2_wd, "\\set") == 0)
>> >> +{
>> >> +if (strcmp(prev_wd, "AUTOCOMMIT") == 0)
>> >>
>> >> ISTM that some psql variables like IGNOREEOF are not there. Why not?
>> >
>> >
>> > yes, there is no completion for DBNAME, ENCODING, FETCH_COUNT,
>> > HISTCONTROL, HISTFILE, HISTSIZE, HOST, IGNOREEOFF, PROMPT*, USER, VERSION
>> >
>> > There are more reasons:
>> >
>> > * the parameter is not an enum (string, number or both): FETCH_COUNT,
>> > PROMPT, HISTSIZE, ..
>> >
>> > * the variable is pseudo read-only - changing it has no effect: DBNAME,
>> > ENCODING, VERSION
>>
>> So HISTCONTROL should be there because it doesn't have such reasons at
>> all?
>>
>
> yes

ISTM that you forgot to add HISTCONTROL to your patch. So I just added that.

I added tab-completion for the "\unset" command. Like "\echo :", only
the variables that have been set should be displayed in the "\unset" case.

I changed complete_from_variables() so that it checks the size of the
variable array even when appending the known variables. If the array is not
large enough, it is dynamically grown. In practice this check should not be
needed, because the initial size of the array is sufficiently larger than
the number of known variables; I added it as a safeguard.
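
For reference, the safeguard is just the usual array-doubling idiom;
roughly like this standalone toy (plain malloc/realloc, not the code
actually added to tab-complete.c):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Grow the NULL-terminated result array whenever another entry would
 * overflow it, then append "prefix || varname || suffix".
 */
static void
append_variable_names(char ***varnames, int *nvars, int *maxvars,
					  const char *varname,
					  const char *prefix, const char *suffix)
{
	char	   *entry;

	if (*nvars >= *maxvars)
	{
		*maxvars *= 2;
		*varnames = realloc(*varnames,
							((size_t) *maxvars + 1) * sizeof(char *));
	}

	entry = malloc(strlen(prefix) + strlen(varname) + strlen(suffix) + 1);
	sprintf(entry, "%s%s%s", prefix, varname, suffix);
	(*varnames)[(*nvars)++] = entry;
	(*varnames)[*nvars] = NULL;		/* keep the array NULL-terminated */
}

int
main(void)
{
	int			nvars = 0;
	int			maxvars = 1;		/* deliberately tiny to force a grow */
	char	  **varnames = calloc(maxvars + 1, sizeof(char *));
	int			i;

	append_variable_names(&varnames, &nvars, &maxvars, "ECHO", ":", "");
	append_variable_names(&varnames, &nvars, &maxvars, "HISTCONTROL", ":", "");

	for (i = 0; varnames[i] != NULL; i++)
		printf("%s\n", varnames[i]);
	return 0;
}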

Typo: IGNOREEOFF -> IGNOREEOF

I removed the value "none" from the value list of "ECHO" because it's not
documented and a user might get confused when he or she sees the undocumented
value "none". Thought?

Regards,

-- 
Fujii Masao
*** a/src/bin/psql/tab-complete.c
--- b/src/bin/psql/tab-complete.c
***
*** 813,820  static char *_complete_from_query(int is_schema_query,
  	 const char *text, int state);
  static char *complete_from_list(const char *text, int state);
  static char *complete_from_const(const char *text, int state);
  static char **complete_from_variables(const char *text,
! 		const char *prefix, const char *suffix);
  static char *complete_from_files(const char *text, int state);
  
  static char *pg_strdup_keyword_case(const char *s, const char *ref);
--- 813,823 
  	 const char *text, int state);
  static char *complete_from_list(const char *text, int state);
  static char *complete_from_const(const char *text, int state);
+ static void append_variable_names(char ***varnames, int *nvars,
+   int *maxvars, const char *varname,
+   const char *prefix, const char *suffix);
  static char **complete_from_variables(const char *text,
! 	  const char *prefix, const char *suffix, bool need_value);
  static char *complete_from_files(const char *text, int state);
  
  static char *pg_strdup_keyword_case(const char *s, const char *ref);
***
*** 925,935  psql_completion(const char *text, int start, int end)
  	else if (text[0] == ':' && text[1] != ':')
  	{
  		if (text[1] == '\'')
! 			matches = complete_from_variables(text, ":'", "'");
  		else if (text[1] == '"')
! 			matches = complete_from_variables(text, ":\"", "\"");
  		else
! 			matches = complete_from_variables(text, ":", "");
  	}
  
  	/* If no previous word, suggest one of the basic sql commands */
--- 928,938 
  	else if (text[0] == ':' && text[1] != ':')
  	{
  		if (text[1] == '\'')
! 			matches = complete_from_variables(text, ":'", "'", true);
  		else if (text[1] == '"')
! 			matches = complete_from_variables(text, ":\"", "\"", true);
  		else
! 			matches = complete_from_variables(text, ":"

Re: [HACKERS] inherit support for foreign tables

2014-08-08 Thread Etsuro Fujita
(2014/08/06 20:43), Etsuro Fujita wrote:
>> (2014/06/30 22:48), Tom Lane wrote:
>>> I wonder whether it isn't time to change that.  It was coded like that
>>> originally only because calculating the values would've been a waste of
>>> cycles at the time.  But this is at least the third place where it'd be
>>> useful to have attr_needed for child rels.

> Attached is a WIP patch for that.

I've revised the patch.

Changes:

* Make the code more readable
* Revise the comments
* Cleanup

Please find attached an updated version of the patch.

Thanks,

Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 799,806  check_selective_binary_conversion(RelOptInfo *baserel,
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	pull_varattnos((Node *) baserel->reltargetlist, baserel->relid,
!    &attrs_used);
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)
--- 799,811 
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	for (i = baserel->min_attr; i <= baserel->max_attr; i++)
! 	{
! 		if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr]))
! 			attrs_used =
! bms_add_member(attrs_used,
! 			   i - FirstLowInvalidHeapAttributeNumber);
! 	}
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
***
*** 577,588  set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
  		childrel->has_eclass_joins = rel->has_eclass_joins;
  
  		/*
! 		 * Note: we could compute appropriate attr_needed data for the child's
! 		 * variables, by transforming the parent's attr_needed through the
! 		 * translated_vars mapping.  However, currently there's no need
! 		 * because attr_needed is only examined for base relations not
! 		 * otherrels.  So we just leave the child's attr_needed empty.
  		 */
  
  		/*
  		 * Compute the child's size.
--- 577,585 
  		childrel->has_eclass_joins = rel->has_eclass_joins;
  
  		/*
! 		 * Compute the child's attr_needed.
  		 */
+ 		adjust_appendrel_attr_needed(rel, childrel, childRTE, appinfo);
  
  		/*
  		 * Compute the child's size.
***
*** 2173,2178  remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
--- 2170,2176 
  {
  	Bitmapset  *attrs_used = NULL;
  	ListCell   *lc;
+ 	int			i;
  
  	/*
  	 * Do nothing if subquery has UNION/INTERSECT/EXCEPT: in principle we
***
*** 2193,2204  remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  	 * Collect a bitmap of all the output column numbers used by the upper
  	 * query.
  	 *
! 	 * Add all the attributes needed for joins or final output.  Note: we must
! 	 * look at reltargetlist, not the attr_needed data, because attr_needed
! 	 * isn't computed for inheritance child rels, cf set_append_rel_size().
! 	 * (XXX might be worth changing that sometime.)
  	 */
! 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
  
  	/* Add all the attributes used by un-pushed-down restriction clauses. */
  	foreach(lc, rel->baserestrictinfo)
--- 2191,2205 
  	 * Collect a bitmap of all the output column numbers used by the upper
  	 * query.
  	 *
! 	 * Add all the attributes needed for joins or final output.
  	 */
! 	for (i = rel->min_attr; i <= rel->max_attr; i++)
! 	{
! 		if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
! 			attrs_used =
! bms_add_member(attrs_used,
! 			   i - FirstLowInvalidHeapAttributeNumber);
! 	}
  
  	/* Add all the attributes used by un-pushed-down restriction clauses. */
  	foreach(lc, rel->baserestrictinfo)
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
***
*** 1768,1779  check_index_only(RelOptInfo *rel, IndexOptInfo *index)
  	 * way out.
  	 */
  
! 	/*
! 	 * Add all the attributes needed for joins or final output.  Note: we must
! 	 * look at reltargetlist, not the attr_needed data, because attr_needed
! 	 * isn't computed for inheritance child rels.
! 	 */
! 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, rel->baserestrictinfo)
--- 1768,1781 
  	 * way out.
  	 */
  
! 	/* Add all the attributes needed for joins or final output. */
! 	for (i = rel->min_attr; i <= rel->max_attr; i++)
! 	{
! 		if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
! 			attrs_used =
! bms_add_member(attrs_used,
! 			   i - FirstLowInvalidHeapAttributeNumber);
! 	}
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, rel->baserestrictinfo)
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
***
*** 109,114  static Bitmapset *translate_col_privs(cons

Re: [HACKERS] Minmax indexes

2014-08-08 Thread Heikki Linnakangas
I think there's a race condition in mminsert, if two backends insert a 
tuple to the same heap page range concurrently. mminsert does this:


1. Fetch the MMtuple for the page range
2. Check if any of the stored datums need updating
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at phase 3 at the same time, with 
different values. For example, backend A wants to update the minimum to 
contain 10, and backend B wants to update it to 5. Now, if backend B 
gets to update the tuple first, to 5, backend A will update the tuple to 
10 when it gets the lock, which is wrong.


The simplest solution would be to get the buffer lock in exclusive mode 
to begin with, so that you don't need to release it between steps 2 and 
5. That might be a significant hit on concurrency, though, when most of 
the insertions don't in fact have to update the value. Another idea is 
to re-check the updated values after acquiring the lock in exclusive 
mode, to see if they match the previous values.
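
To illustrate the second idea, here is a standalone toy of the re-check
pattern (pthread mutexes standing in for buffer locks; none of this is the
actual minmax code). Because the value is re-checked after the exclusive
lock is taken, backend A's 10 can never overwrite backend B's 5, whichever
order the locks are granted in:

/* build with: cc recheck.c -pthread */
#include <pthread.h>
#include <stdio.h>

static int	range_min = 20;		/* stored minimum for the page range */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void
insert_key(int key)
{
	int			need_update;

	/* Phase 1: cheap check, conceptually under a share lock. */
	pthread_mutex_lock(&page_lock);
	need_update = key < range_min;
	pthread_mutex_unlock(&page_lock);

	if (!need_update)
		return;

	/*
	 * Phase 2: take the lock "in exclusive mode" and re-check, because a
	 * concurrent backend may have updated the tuple in between.  Only
	 * write if our value still improves on what is stored.
	 */
	pthread_mutex_lock(&page_lock);
	if (key < range_min)
		range_min = key;
	pthread_mutex_unlock(&page_lock);
}

static void *
worker(void *arg)
{
	insert_key(*(int *) arg);
	return NULL;
}

int
main(void)
{
	pthread_t	a,
				b;
	int			ka = 10,
				kb = 5;

	pthread_create(&a, NULL, worker, &ka);
	pthread_create(&b, NULL, worker, &kb);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	printf("range_min = %d\n", range_min);	/* always 5, never 10 */
	return 0;
}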


- Heikki





Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-08 Thread Jeff Davis
On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote:
> Comparing the median times, that's about a 3% regression.  For this
> particular case, we might be able to recapture that by replacing the
> bespoke memory-tracking logic in tuplesort.c with use of this new
> facility.  I'm not sure whether there are other cases that we might
> also want to test; I think stuff that runs all on the server side is
> likely to show up problems more clearly than pgbench.  Maybe a
> PL/pgsql loop that does something allocation-intensive on each
> iteration, for example, like parsing a big JSON document.

I wasn't able to reproduce your results on my machine. At -s 300, with
maintenance_work_mem set high enough to do internal sort, it took about
40s and I heard some disk activity, so I didn't think it was a valid
result. I went down to -s 150, and it took around 5.3s on both master
and memory-accounting.

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.
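
With the opt-in flag, a call site that wants accounting asks for it
explicitly; intended usage would look roughly like this (a sketch only --
AllocSetContextCreateTracked and its track_mem argument come from the
patch, the HashAgg wiring is invented):

/* Sketch of opting in to memory usage tracking for a hash table context. */
MemoryContext hashagg_ctx;

hashagg_ctx = AllocSetContextCreateTracked(CurrentMemoryContext,
										   "HashAgg table",
										   ALLOCSET_DEFAULT_MINSIZE,
										   ALLOCSET_DEFAULT_INITSIZE,
										   ALLOCSET_DEFAULT_MAXSIZE,
										   true);	/* track_mem */

/* Any context created under hashagg_ctx inherits track_mem = true. */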

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 430,435  randomize_mem(char *ptr, size_t size)
--- 432,440 
   * minContextSize: minimum context size
   * initBlockSize: initial allocation block size
   * maxBlockSize: maximum allocation block size
+  *
+  * The flag determining whether this context tracks memory usage is inherited
+  * from the parent context.
   */
  MemoryContext
  AllocSetContextCreate(MemoryContext parent,
***
*** 438,443  AllocSetContextCreate(MemoryContext parent,
--- 443,468 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
+ 	return AllocSetContextCreateTracked(
+ 		parent, name, minContextSize, initBlockSize, maxBlockSize,
+ 		(parent == NULL) ? false : parent->track_mem);
+ }
+ 
+ /*
+  * AllocSetContextCreateTracked
+  *		Create a new AllocSet context.
+  *
+  * Implementation for AllocSetContextCreate, but also allows the caller to
+  * specify whether memory usage should be tracked or not.
+  */
+ MemoryContext
+ AllocSetContextCreateTracked(MemoryContext parent,
+ 			 const char *name,
+ 			 Size minContextSize,
+ 			 Size initBlockSize,
+ 			 Size maxBlockSize,
+ 			 bool track_mem)
+ {
  	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
***
*** 445,451  AllocSetContextCreate(MemoryContext parent,
  			 sizeof(AllocSetContext),
  			 &AllocSetMethods,
  			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 470,477 
  			 sizeof(AllocSetContext),
  			 &AllocSetMethods,
  			 parent,
! 			 name,
! 			 track_mem);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 526,534 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 619,625 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block->endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 632,637  AllocSetDelete(MemoryContext context)
--- 662,668 
  	{
  		AllocBlock	next = block->next;
  
+ 		update_allocation(context, -(block->endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 678,683  AllocSetAlloc(MemoryContext context, Size size)
--- 709,717 
  	 errmsg("out of memory"),
  	 errdetail("Failed on request of size %zu.", size)));
  		}
+ 
+ 		update_allocation(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***
*** 873,878  AllocSetAlloc(MemoryContext context, Size size)
--- 907,914 
  	 errdetail("Failed on request of size %zu.", size)));
  		}
  
+ 		update_allocation(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 976,981  AllocSetFree(MemoryContext context, void *pointer

Re: [HACKERS] pg_receivexlog add synchronous mode

2014-08-08 Thread Fujii Masao
On Thu, Aug 7, 2014 at 5:24 PM,   wrote:
>> Okay, applied the patch.
>>
>> I heavily modified your patch based on the master that the refactoring
>> patch has been applied. Attached is that modified version. Could you
>> review that?
>
> Thank you for the patch.
> I did a review of the patch.
>
> No problem in the patch.

Thanks for the review! Applied the patch.

Regards,

-- 
Fujii Masao




[HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-08-08 Thread Guillaume Lelarge
Hi,

As part of our monitoring work for our customers, we stumbled upon an issue
with our customers' servers that have a wal_keep_segments setting higher
than 0.

We have a monitoring script that checks the number of WAL files in the
pg_xlog directory, according to the setting of three parameters
(checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
We usually add a percentage to the usual formula:

greatest(
  (2 + checkpoint_completion_target) * checkpoint_segments + 1,
  checkpoint_segments + wal_keep_segments + 1
)

And we have lots of alerts from the script for customers who set their
wal_keep_segments setting higher than 0.

So we started to question this sentence of the documentation:

There will always be at least one WAL segment file, and will normally not
be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1
or checkpoint_segments + wal_keep_segments + 1 files.

(http://www.postgresql.org/docs/9.3/static/wal-configuration.html)

While doing some tests, it appears the formula would be more like:

wal_keep_segments + (2 + checkpoint_completion_target) *
checkpoint_segments + 1

But after reading the source code (src/backend/access/transam/xlog.c), the
right formula seems to be:

wal_keep_segments + 2 * checkpoint_segments + 1

Here is how we arrived at this formula...

CreateCheckPoint(..) is responsible, among other things, for deleting and
recycling old WAL files. From src/backend/access/transam/xlog.c, master
branch, line 8363:

/*
 * Delete old log files (those no longer needed even for previous
 * checkpoint or the standbys in XLOG streaming).
 */
if (_logSegNo)
{
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
RemoveOldXlogFiles(_logSegNo, recptr);
}

KeepLogSeg(...) function takes care of wal_keep_segments. From
src/backend/access/transam/xlog.c, master branch, line 8792:

/* compute limit for wal_keep_segments first */
if (wal_keep_segments > 0)
{
/* avoid underflow, don't go below 1 */
if (segno <= wal_keep_segments)
segno = 1;
else
segno = segno - wal_keep_segments;
}

IOW, the segment number (segno) is decremented according to the setting of
wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via
_logSegNo. RemoveOldXlogFiles() gets this segment number so that it can
remove or recycle all files before this segment number. This function gets
the number of WAL files to recycle with the XLOGfileslop constant, which is
defined as:

/*
 * XLOGfileslop is the maximum number of preallocated future XLOG segments.
 * When we are done with an old XLOG segment file, we will recycle it as a
 * future XLOG segment as long as there aren't already XLOGfileslop future
 * segments; else we'll delete it.  This could be made a separate GUC
 * variable, but at present I think it's sufficient to hardwire it as
 * 2*CheckPointSegments+1.  Under normal conditions, a checkpoint will free
 * no more than 2*CheckPointSegments log segments, and we want to recycle all
 * of them; the +1 allows boundary cases to happen without wasting a
 * delete/create-segment cycle.
 */
#define XLOGfileslop	(2*CheckPointSegments + 1)

(in src/backend/access/transam/xlog.c, master branch, line 100)

IOW, PostgreSQL will keep wal_keep_segments WAL files before the current
WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones.
Hence the formula:

wal_keep_segments + 2 * checkpoint_segments + 1

And this is what we usually find in our customers' servers. We may find
more WAL files, depending on the write activity of the cluster, but on
average we get this number of WAL files.
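
To put numbers on it: with the default checkpoint_segments = 3 and, say,
wal_keep_segments = 32, the documented formula gives 3 + 32 + 1 = 36
segments, while the formula above gives 32 + 2 * 3 + 1 = 39, i.e. three
more 16MB files than the documentation suggests. A trivial sketch of the
bound (invented helper, assuming the default 16MB segment size):

#include <stdio.h>

/* Toy upper bound on the number of files in pg_xlog. */
static int
max_wal_files(int checkpoint_segments, int wal_keep_segments)
{
	return wal_keep_segments + 2 * checkpoint_segments + 1;
}

int
main(void)
{
	int			files = max_wal_files(3, 32);

	printf("expect at most ~%d WAL files (~%d MB)\n", files, files * 16);
	return 0;
}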

AFAICT, the documentation is wrong about the usual number of WAL files in
the pg_xlog directory. But I may be wrong, in which case, the documentation
isn't clear enough for me, and should be fixed so that others can't
misinterpret it like I may have done.

Any comments? Did I miss something, or should we fix the documentation?

Thanks.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com