[HACKERS] Support for N synchronous standby servers
Hi all,

Please find attached a patch to add support for synchronous replication to multiple standby servers. This is controlled by the addition of a new GUC parameter called synchronous_standby_num, which makes the server wait for transaction commit on the first N standbys defined in synchronous_standby_names. The implementation is really straightforward, and only needed a couple of modifications in walsender.c for pg_stat_get_wal_senders and in syncrep.c.

When a commit is cancelled manually by the user, or when ProcDiePending shows up, the message returned to the user does not show the list of walsenders on which the commit has not been confirmed, only that the commit is partially confirmed. I have not done anything about that, but let me know if it would be useful; it would need a scan of the walsenders to get their application_name.

Thanks,
-- Michael

From 3dfff90032c38daba43e1e0c4d3221053d6386ac Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Sat, 9 Aug 2014 14:49:24 +0900
Subject: [PATCH] Add parameter synchronous_standby_num

This makes possible support of synchronous replication on a number of standby nodes equal to the new parameter. The synchronous standbys are chosen in the order they are listed in synchronous_standby_names.
---
 doc/src/sgml/config.sgml            | 32 ---
 doc/src/sgml/high-availability.sgml | 18 -
 src/backend/replication/syncrep.c   | 81 ++---
 src/backend/replication/walsender.c | 74 -
 src/backend/utils/misc/guc.c        | 10 +
 src/include/replication/syncrep.h   |  1 +
 6 files changed, 175 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index be5c25b..c40de16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support synchronous replication, as described in . -At any one time there will be at most one active synchronous standby; -transactions waiting for commit will be allowed to proceed after -this standby server confirms receipt of their data. -The synchronous standby will be the first standby named in this list -that is both currently connected and streaming data in real-time -(as shown by a state of streaming in the +At any one time there will be at a number of active synchronous standbys +defined by synchronous_standby_num; transactions waiting +for commit will be allowed to proceed after those standby servers +confirms receipt of their data. The synchronous standbys will be +the first entries named in this list that are both currently connected +and streaming data in real-time (as shown by a state of +streaming in the pg_stat_replication view). Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,25 @@ include_dir 'conf.d'
+ + synchronous_standby_num (integer) + + synchronous_standby_num configuration parameter + + + + +Specifies the number of standbys that support +synchronous replication, as described in +, and listed as the first +elements of . + + +Default value is 1. + + + + vacuum_defer_cleanup_age (integer)
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..085d51b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply messages each time a new batch of WAL data is written to disk, unless wal_receiver_status_interval is set to zero on the standby.
-If the standby is the first matching standby, as specified in -synchronous_standby_names on the primary, the reply -messages from that standby will be used to wake users waiting for -confirmation that the commit record has been received. These parameters -allow the administrator to specify which standby servers should be -synchronous standbys. Note that the configuration of synchronous +If the standby is the first synchronous_standby_num matching +standbys, as specified in synchronous_standby_names on the +primary, the reply messages from that standby will be used to wake users +waiting for confirmation that the commit record has been received. These +parameters allow the administrator to specify which standby servers should +be synchronous standbys. Note that the configuration of synchronous replication is mainly on the master. Named standbys must be directly connected to the master; the master knows nothing about downstream standby servers using c
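For illustration, a minimal C sketch of the quorum rule the patch describes: wait until the first synchronous_standby_num active standbys, taken in synchronous_standby_names order, have all confirmed the commit LSN. The names used here (SyncStandbyCandidate, commit_is_confirmed, the simplified XLogRecPtr) are hypothetical stand-ins, not code from the patch.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* simplified stand-in for the real type */

typedef struct SyncStandbyCandidate
{
    bool        is_connected;
    bool        is_streaming;           /* "streaming" state in pg_stat_replication */
    XLogRecPtr  flush_lsn;              /* last WAL position the standby confirmed */
} SyncStandbyCandidate;

/*
 * candidates[] is assumed to be sorted by position in
 * synchronous_standby_names; return true once the first num_sync active
 * entries have all confirmed commit_lsn.
 */
static bool
commit_is_confirmed(const SyncStandbyCandidate *candidates, int ncandidates,
                    int num_sync, XLogRecPtr commit_lsn)
{
    int         confirmed = 0;

    for (int i = 0; i < ncandidates && confirmed < num_sync; i++)
    {
        const SyncStandbyCandidate *c = &candidates[i];

        if (!c->is_connected || !c->is_streaming)
            continue;                   /* only a potential standby, skip it */
        if (c->flush_lsn < commit_lsn)
            return false;               /* a chosen synchronous standby is still behind */
        confirmed++;
    }
    return confirmed >= num_sync;
}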
Re: [HACKERS] 9.4 pg_restore --help changes
Peter Eisentraut-2 wrote
> 9.3 pg_restore --help output:
>
>   -I, --index=NAME             restore named index
>   -n, --schema=NAME            restore only objects in this schema
>   -P, --function=NAME(args)    restore named function
>   -t, --table=NAME             restore named table(s)
>   -T, --trigger=NAME           restore named trigger
>   --section=SECTION            restore named section (pre-data, data, or post-data)
>
> 9.4 pg_restore --help output:
>
>   -I, --index=NAME             restore named indexes
>   -n, --schema=NAME            restore only objects in these schemas
>   -P, --function=NAME(args)    restore named functions
>   -t, --table=NAME             restore named tables
>   -T, --trigger=NAME           restore named triggers
>   --section=SECTION            restore named sections (pre-data, data, or post-data)
>
> This is a valuable feature change, but I think the help output is
> unhelpful. The left side has a singular placeholder, but the right side
> talks about objects in plural. How do I actually specify multiple
> indexes? Is it --index=foo,bar? It's --index=foo --index=bar, but
> that's unclear from the help.
>
> I think it would be clearer to write something like
>
>   -I, --index=NAME             restore named indexes (repeat for multiple)
>
> or perhaps make a note at the bottom
>
>   The options -I, -n, -P, -t, -T, --section can be combined and
>   specified multiple times to select multiple objects.

"each* restores named "

For the schema version how about: "each* restores named schema - including objects"

Then at the end:

* You may specify a flag multiple times to restore multiple entries of the given type.

Though to be fair, is there some platform standard that we can attempt to emulate here? It seems pointless to create our own design if we can borrow something more universal.

David J.
Re: [HACKERS] jsonb format is pessimal for toast compression
akapila wrote > On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane < > tgl@.pa > > wrote: >> >> Stephen Frost < > sfrost@ > > writes: >> > What about considering how large the object is when we are analyzing if >> > it compresses well overall? >> >> Hmm, yeah, that's a possibility: we could redefine the limit at which >> we bail out in terms of a fraction of the object size instead of a fixed >> limit. However, that risks expending a large amount of work before we >> bail, if we have a very large incompressible object --- which is not >> exactly an unlikely case. Consider for example JPEG images stored as >> bytea, which I believe I've heard of people doing. Another issue is >> that it's not real clear that that fixes the problem for any fractional >> size we'd want to use. In Larry's example of a jsonb value that fails >> to compress, the header size is 940 bytes out of about 12K, so we'd be >> needing to trial-compress about 10% of the object before we reach >> compressible data --- and I doubt his example is worst-case. >> >> >> 1. The real problem here is that jsonb is emitting quite a bit of >> >> fundamentally-nonrepetitive data, even when the user-visible input is > very >> >> repetitive. That's a compression-unfriendly transformation by >> anyone's >> >> measure. >> >> > I disagree that another algorithm wouldn't be able to manage better on >> > this data than pglz. pglz, from my experience, is notoriously bad a >> > certain data sets which other algorithms are not as poorly impacted by. >> >> Well, I used to be considered a compression expert, and I'm going to >> disagree with you here. It's surely possible that other algorithms would >> be able to get some traction where pglz fails to get any, > > During my previous work in this area, I had seen that some algorithms > use skipping logic which can be useful for incompressible data followed > by compressible data or in general as well. Random thought from the sideline... This particular data type has the novel (within PostgreSQL) design of both a (feature oriented - and sizeable) header and a payload. Is there some way to add that model into the storage system so that, at a higher level, separate attempts are made to compress each section and then the compressed (or not) results and written out adjacently and with a small header indicating the length of the stored header and other meta data like whether each part is compressed and even the type that data represents? For reading back into memory the header-payload generic type is populated and then the header and payload decompressed - as needed - then the two parts are fed into the appropriate type constructor that understands and accepts the two pieces. Just hoping to spark an idea here - I don't know enough about the internals to even guess how close I am to something feasible. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/jsonb-format-is-pessimal-for-toast-compression-tp5814162p5814299.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
On Sat, Aug 9, 2014 at 3:34 PM, Tom Lane wrote: > David Rowley writes: > > On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane wrote: > >> So I'm thinking you're right: we should rewrite this code so that only > >> indexes having all columns distinct can match, thereby ensuring that > there > >> is only one possible interpretation of the equality semantics for the > >> foreign key. > > > I've attached a patch which disallows these, though I was not quite sure > > about the new error message since likely this is something that should be > > backpacked? I wasn't sure on the policy for new translations in a minor > > release. > > There's no need for a new error message I think, because we should just > ignore such indexes. After all, there might be a valid matching index > later on. > > hmm, but if the user attempts to define the foreign key that contains a duplicate column in the REFERENCES part, then we'll never "find" any indexes, so there's no point in looking at all. That's why I did the duplicate checks as a pre-check before the loop over the indexes. Though that's not to say that we couldn't throw the "there is no unique constraint matching given keys for referenced table ..." error there. I've attached a version of the patch that's a little smarter when it comes to doing the duplicate checks in the attnums array... For what it's worth. Regards David Rowley transformFkeyCheckAttrs_fix_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
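As a rough illustration of the duplicate check being discussed, here is a small C helper that detects a repeated attribute number, which transformFkeyCheckAttrs() could use either to reject a REFERENCES list naming the same column twice or to skip unique indexes with repeated columns. The helper name is made up for this sketch; it is not taken from either version of the patch.

#include <stdbool.h>
#include <stdint.h>

typedef int16_t AttrNumber;     /* simplified stand-in for the backend typedef */

/* Return true if any attribute number occurs more than once in the array. */
static bool
has_duplicate_attnums(const AttrNumber *attnums, int natts)
{
    for (int i = 1; i < natts; i++)
        for (int j = 0; j < i; j++)
            if (attnums[i] == attnums[j])
                return true;
    return false;
}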
Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?
Thanks to all for the great info. We are new to postgresql and this discussion has both instructed us and increased our respect for the database and the community. I am seeing a behavior that I don’t understand and hopefully you guys can clear it up. I am using AWS postgresql db.m3.2xlarge and using pgadmin III 1.18 comparing against AWS oracle on db.m3.2xlarge using sql developer and TOAD. I am running a query with 30 tables in the from clause, getting 137 columns back (this is our most basic query, they get a lot more more complex). It returns back 4800 rows. In oracle 1st run takes 3.92 seconds, 2nd .38 seconds. Scrolling to end takes and extra 1.5 seconds for total of 5.5. Using pgadmin, I run the query. Looking at the lower right hand I can see the time going up. It stops at 8200 ms or close to it every time, then it takes an extra 6 seconds before it displays the rows on the screen. 2nd, 3rd, etc. runs all take about same amount of time 8 sec plus 6 sec I then changed it to return only 1 column back. In oracle/sqldeveloper identical behavior as before, same time. In postgresql it now goes down to 1.8 seconds for 1st, 2nd, etc. runs. I then change it so that I am asking for the sum of 1 column. In oracle time goes down to .2 seconds and postgresql now goes down to .2 seconds also. I then change it back to get the full result set and behavior goes back to original, oracle .38 since its cached, postgresql 8 seconds. Of the 30 tables 6 are 10-50 gigs in size. Our setting are "shared_buffers";"7639832kB" "effective_cache_size";"15279664kB" "allow_system_table_mods";"off" "application_name";"pgAdmin III - Query Tool" "archive_command";"/etc/rds/dbbin/pgscripts/rds_wal_archive %p" "archive_mode";"on" "archive_timeout";"5min" "array_nulls";"on" "authentication_timeout";"1min" "autovacuum";"on" "autovacuum_analyze_scale_factor";"0.1" "autovacuum_analyze_threshold";"50" "autovacuum_freeze_max_age";"2" "autovacuum_max_workers";"3" "autovacuum_multixact_freeze_max_age";"4" "autovacuum_naptime";"1min" "autovacuum_vacuum_cost_delay";"20ms" "autovacuum_vacuum_cost_limit";"-1" "autovacuum_vacuum_scale_factor";"0.2" "autovacuum_vacuum_threshold";"50" "backslash_quote";"safe_encoding" "bgwriter_delay";"200ms" "bgwriter_lru_maxpages";"100" "bgwriter_lru_multiplier";"2" "block_size";"8192" "bonjour";"off" "bonjour_name";"" "bytea_output";"escape" "check_function_bodies";"on" "checkpoint_completion_target";"0.9" "checkpoint_segments";"16" "checkpoint_timeout";"5min" "checkpoint_warning";"30s" "client_encoding";"UNICODE" "client_min_messages";"notice" "commit_delay";"0" "commit_siblings";"5" "constraint_exclusion";"partition" "cpu_index_tuple_cost";"0.005" "cpu_operator_cost";"0.0025" "cpu_tuple_cost";"0.01" "cursor_tuple_fraction";"0.1" "DateStyle";"ISO, MDY" "db_user_namespace";"off" "deadlock_timeout";"1s" "debug_assertions";"off" "debug_pretty_print";"on" "debug_print_parse";"off" "debug_print_plan";"off" "debug_print_rewritten";"off" "default_statistics_target";"100" "default_tablespace";"" "default_text_search_config";"pg_catalog.simple" "default_transaction_deferrable";"off" "default_transaction_isolation";"read committed" "default_transaction_read_only";"off" "default_with_oids";"off" "effective_cache_size";"15279664kB" "effective_io_concurrency";"1" "enable_bitmapscan";"on" "enable_hashagg";"on" "enable_hashjoin";"on" "enable_indexonlyscan";"on" "enable_indexscan";"on" "enable_material";"on" "enable_mergejoin";"on" "enable_nestloop";"on" "enable_seqscan";"on" "enable_sort";"on" 
"enable_tidscan";"on" "escape_string_warning";"on" "event_source";"PostgreSQL" "exit_on_error";"off" "extra_float_digits";"0" "from_collapse_limit";"8" "fsync";"on" "full_page_writes";"on" "geqo";"on" "geqo_effort";"5" "geqo_generations";"0" "geqo_pool_size";"0" "geqo_seed";"0" "geqo_selection_bias";"2" "geqo_threshold";"12" "gin_fuzzy_search_limit";"0" "hot_standby";"off" "hot_standby_feedback";"off" "ignore_checksum_failure";"off" "ignore_system_indexes";"off" "integer_datetimes";"on" "IntervalStyle";"postgres" "join_collapse_limit";"8" "krb_caseins_users";"off" "krb_srvname";"postgres" "lc_collate";"en_US.UTF-8" "lc_ctype";"en_US.UTF-8" "lc_messages";"" "lc_monetary";"C" "lc_numeric";"C" "lc_time";"C" "listen_addresses";"*" "lo_compat_privileges";"off" "local_preload_libraries";"" "lock_timeout";"0" "log_autovacuum_min_duration";"-1" "log_checkpoints";"on" "log_connections";"off" "log_destination";"stderr" "log_disconnections";"off" "log_duration";"off" "log_error_verbosity";"default" "log_executor_stats";"off" "log_file_mode";"0644" "log_hostname";"on" "log_line_prefix";"%t:%r:%u@%d:[%p]:" "log_lock_waits";"off" "log_min_duration_statement";"-1" "log_min_error_statement";"error" "log_min_messages";"warning" "log_parser_stats";"off" "log_planner_stats";"off" "log_rotation_age";"1h" "log_rotation_size";"10MB" "log_statement";"none" "log_statement_stats";"off" "log_temp_files";"-1" "lo
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
David Rowley writes: > On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane wrote: >> So I'm thinking you're right: we should rewrite this code so that only >> indexes having all columns distinct can match, thereby ensuring that there >> is only one possible interpretation of the equality semantics for the >> foreign key. > I've attached a patch which disallows these, though I was not quite sure > about the new error message since likely this is something that should be > backpacked? I wasn't sure on the policy for new translations in a minor > release. There's no need for a new error message I think, because we should just ignore such indexes. After all, there might be a valid matching index later on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
On Sat, Aug 9, 2014 at 12:13 PM, Tom Lane wrote:
>
> So I'm thinking you're right: we should rewrite this code so that only
> indexes having all columns distinct can match, thereby ensuring that there
> is only one possible interpretation of the equality semantics for the
> foreign key.
>

I've attached a patch which disallows these, though I was not quite sure about the new error message, since this is likely something that should be back-patched? I wasn't sure about the policy for new translations in a minor release.

The wording of the comment above the code which disallows the duplicate column references could likely do with a more meaningful explanation of the reasons why we disallow these; I thought your words would be better at doing this, so I didn't try very hard on that part.

Is this roughly what you had in mind?

Regards

David Rowley

transformFkeyCheckAttrs_fix_v2.patch
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > > > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > > > jsonb is still within the bounds of reason. > > > > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 > > betas and report the problem JSONB columns. > > That is *not* a good solution.. If you change the JSONB binary format, and we can't read the old format, that is the only option. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] jsonb format is pessimal for toast compression
On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane wrote:
>
> Stephen Frost writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit. However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case. Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing. Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use. In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.
>
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is very
> >> repetitive. That's a compression-unfriendly transformation by anyone's
> >> measure.
>
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz. pglz, from my experience, is notoriously bad a
> > certain data sets which other algorithms are not as poorly impacted by.
>
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here. It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any,

During my previous work in this area, I have seen that some algorithms use skipping logic, which can be useful for incompressible data followed by compressible data, or in general as well. One such technique could be: if we don't find any match for the first 4 bytes, skip 4 bytes; if we don't find a match in the next 8 bytes either, skip 8 bytes; and keep doing the same until we find the first match, at which point we go back to the beginning of the data. We could follow this logic until we have actually compared a total of first_success_by bytes.

There can be caveats in this particular skipping scheme, but I just wanted to mention the skipping idea in general as a way to reduce the number of situations where we bail out even though there is a lot of compressible data.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
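A sketch of how such a skipping probe might look, with doubling strides, before committing to full compression. The helper signature is invented for this example (a real patch would work inside pg_lzcompress.c's history-matching loop), so treat it purely as an illustration of the idea.

#include <stdbool.h>
#include <stddef.h>

/* Caller-supplied probe: does any history match begin at this position? */
typedef bool (*match_probe_fn) (const char *data, size_t len, size_t pos);

/*
 * Probe the input at doubling strides (4, 8, 16, ...).  If any probe finds
 * a match, report success so that real compression can restart from the
 * beginning; otherwise give up once probe_budget bytes have been covered,
 * much like the existing first_success_by cutoff.
 */
static bool
worth_compressing(const char *data, size_t len, size_t probe_budget,
                  match_probe_fn find_match_at)
{
    size_t      pos = 0;
    size_t      stride = 4;

    while (pos < len && pos < probe_budget)
    {
        if (find_match_at(data, len, pos))
            return true;
        pos += stride;
        stride *= 2;
    }
    return false;
}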
Re: [HACKERS] 9.4 pg_restore --help changes
On Sat, Aug 9, 2014 at 10:44 AM, Peter Eisentraut wrote:
> -I, --index=NAME             restore named indexes (repeat for multiple)

A single --index entry restores only one index, so what about something like this:

  -I, --index=NAME             restore named index (repeat for multiple entries)

> or perhaps make a note at the bottom
>
>   The options -I, -n, -P, -t, -T, --section can be combined and
>   specified multiple times to select multiple objects.
>
> Other ideas?

Specifying both would make it clearer to the user; the note at the bottom is a good thing as well.

Regards,
-- Michael
[HACKERS] 9.4 pg_restore --help changes
9.3 pg_restore --help output:

  -I, --index=NAME             restore named index
  -n, --schema=NAME            restore only objects in this schema
  -P, --function=NAME(args)    restore named function
  -t, --table=NAME             restore named table(s)
  -T, --trigger=NAME           restore named trigger
  --section=SECTION            restore named section (pre-data, data, or post-data)

9.4 pg_restore --help output:

  -I, --index=NAME             restore named indexes
  -n, --schema=NAME            restore only objects in these schemas
  -P, --function=NAME(args)    restore named functions
  -t, --table=NAME             restore named tables
  -T, --trigger=NAME           restore named triggers
  --section=SECTION            restore named sections (pre-data, data, or post-data)

This is a valuable feature change, but I think the help output is unhelpful. The left side has a singular placeholder, but the right side talks about objects in plural. How do I actually specify multiple indexes? Is it --index=foo,bar? It's --index=foo --index=bar, but that's unclear from the help.

I think it would be clearer to write something like

  -I, --index=NAME             restore named indexes (repeat for multiple)

or perhaps make a note at the bottom

  The options -I, -n, -P, -t, -T, --section can be combined and
  specified multiple times to select multiple objects.

Other ideas?
Re: [HACKERS] jsonb format is pessimal for toast compression
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > I agree that we need to avoid changing jsonb's on-disk representation. > > ... post-release, I assume you mean. Yes. > > Have I missed where a good suggestion has been made about how to do that > > which preserves the binary-search capabilities and doesn't make the code > > much more difficult? > > We don't have one yet, but we've only been thinking about this for a few > hours. Fair enough. > > Trying to move the header to the end just for the > > sake of this doesn't strike me as a good solution as it'll make things > > quite a bit more complicated. Is there a way we could interleave the > > likely-compressible user data in with the header instead? > > Yeah, I was wondering about that too, but I don't immediately see how to > do it without some sort of preprocessing step when we read the object > (which'd be morally equivalent to converting a series of lengths into a > pointer array). Binary search isn't going to work if the items it's > searching in aren't all the same size. > > Having said that, I am not sure that a preprocessing step is a > deal-breaker. It'd be O(N), but with a pretty darn small constant factor, > and for plausible sizes of objects I think the binary search might still > dominate. Worth investigation perhaps. For my part, I'm less concerned about a preprocessing step which happens when we store the data and more concerned about ensuring that we're able to extract data quickly. Perhaps that's simply because I'm used to writes being more expensive than reads, but I'm not alone in that regard either. I doubt I'll have time in the next couple of weeks to look into this and if we're going to want this change for 9.4, we really need someone working on it sooner than later. (to the crowd)- do we have any takers for this investigation? Thanks, Stephen signature.asc Description: Digital signature
[HACKERS] psql output change in 9.4
This is 9.3:

peter=# \a
Output format is unaligned.
peter=# \a
Output format is aligned.
peter=# \x
Expanded display is on.
peter=# \x
Expanded display is off.

This is new in 9.4:

peter=# \a
Output format (format) is unaligned.
peter=# \a
Output format (format) is aligned.
peter=# \x
Expanded display (expanded) is on.
peter=# \x
Expanded display (expanded) is off.

What is the point of that change? I suppose it is so that you can use \pset without arguments to show all settings:

peter=# \pset
Border style (border) is 1.
Target width (columns) unset.
Expanded display (expanded) is off.
...

But those are unrelated features, and the changed output doesn't make any sense in the contexts I show above. I think this should be reverted, and the \pset output should be implemented separately.
Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org
* Tom Lane (t...@sss.pgh.pa.us) wrote: > We only bump the SO version when we make a low-level ABI break; but even > without doing that we could potentially have changed library behavior in > ways that could negatively impact some applications. We should definitely be paying attention for such cases as I'd generally feel that they're bug fixes we need to address.. > So I think there's > some merit in Josh's position: he doesn't want to have to re-QA his > applications against some new version of libpq, but if you force him to > move to 9.3 libpq, he's going to need to do that if he wants to be strict > about it. "Force" is a bit strong here, in my opinion. There's a (really, rather trivial) way to get the libpq version which is shipped with a specific PG major version- simple add that major version to the end of the 'deb' line in your sources.list file. > If the Debian guidelines think that only SO major version need > be considered, they're wrong, at least for the way we've been treating > that. The SO major version should be sufficient for applications to operate normally. If that isn't the case then I agree that we need to review the changes we are making to see if the SO should be bumped. Note that Debian's viewpoint on this is more along the lines of "why build against an old version if the latest one, whose major SO is the same, exists?" This is largely to avoid the hell of versioned Build-Depends and having to support multiple sets of -dev packages concurrently (consider that builds generally only look for the unversioned '.so' file also..). > At the same time, there would be more merit in Josh's position if he > could point to any *actual* libpq changes that might pose application > compatibility problems. I don't recall that we've made many such, > so the above argument might just be hypothetical. I don't believe it's hypothetical from Josh's perspective, but I didn't follow the threads completely to confirm that there was a real issue. If there is a real issue here, I'd most likely vote to fix it and backpatch it as a bug, though it's not clear if that would be considered 'good enough' for this case. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] jsonb format is pessimal for toast compression
Stephen Frost writes: > I agree that we need to avoid changing jsonb's on-disk representation. ... post-release, I assume you mean. > Have I missed where a good suggestion has been made about how to do that > which preserves the binary-search capabilities and doesn't make the code > much more difficult? We don't have one yet, but we've only been thinking about this for a few hours. > Trying to move the header to the end just for the > sake of this doesn't strike me as a good solution as it'll make things > quite a bit more complicated. Is there a way we could interleave the > likely-compressible user data in with the header instead? Yeah, I was wondering about that too, but I don't immediately see how to do it without some sort of preprocessing step when we read the object (which'd be morally equivalent to converting a series of lengths into a pointer array). Binary search isn't going to work if the items it's searching in aren't all the same size. Having said that, I am not sure that a preprocessing step is a deal-breaker. It'd be O(N), but with a pretty darn small constant factor, and for plausible sizes of objects I think the binary search might still dominate. Worth investigation perhaps. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
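To give a feel for how small that constant factor is, here is a minimal sketch of the preprocessing step Tom describes: if the on-disk entries carried lengths rather than offsets, one O(N) pass could rebuild an offset array suitable for binary search. The types are simplified stand-ins, not the actual JEntry structures.

#include <stdint.h>
#include <stdlib.h>

/*
 * Convert an array of entry lengths into start offsets; offsets[nentries]
 * holds the total size, so entry i spans offsets[i] .. offsets[i+1].
 * Returns NULL on allocation failure.
 */
static uint32_t *
lengths_to_offsets(const uint32_t *lengths, int nentries)
{
    uint32_t   *offsets = malloc(sizeof(uint32_t) * (nentries + 1));
    uint32_t    pos = 0;

    if (offsets == NULL)
        return NULL;
    for (int i = 0; i < nentries; i++)
    {
        offsets[i] = pos;
        pos += lengths[i];
    }
    offsets[nentries] = pos;
    return offsets;
}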
Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org
Stephen Frost writes: > * Joshua D. Drake (j...@commandprompt.com) wrote: >> On 08/07/2014 10:12 PM, Stephen Frost wrote: >>> If you want the specific version, update your deb line. Don't complain >>> because you used the Debian repo that follows the Debian guidelines and >>> didn't like what you got- that's not going to change. >> May I have a link? Because I would argue that the "latest" version >> of the library for 9.2, is the library that ships with 9.2.9, not >> the one that ships with 9.3.5. > Really? The latest version of libpq SO version 5 that we ship is 9.2.9? > I certainly don't feel that way. > Next you'll be asking us to bump the major SO for every major release of > PG. > Note that the last time we changed the SO major version was back in 2006 > and that was primairly because we were removing symbols that people > weren't supposed to be using anyway. Prior to that, it was 2005, or the > 8.0/8.1 timeframe... We only bump the SO version when we make a low-level ABI break; but even without doing that we could potentially have changed library behavior in ways that could negatively impact some applications. So I think there's some merit in Josh's position: he doesn't want to have to re-QA his applications against some new version of libpq, but if you force him to move to 9.3 libpq, he's going to need to do that if he wants to be strict about it. If the Debian guidelines think that only SO major version need be considered, they're wrong, at least for the way we've been treating that. At the same time, there would be more merit in Josh's position if he could point to any *actual* libpq changes that might pose application compatibility problems. I don't recall that we've made many such, so the above argument might just be hypothetical. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > What about considering how large the object is when we are analyzing if > > it compresses well overall? > > Hmm, yeah, that's a possibility: we could redefine the limit at which > we bail out in terms of a fraction of the object size instead of a fixed > limit. However, that risks expending a large amount of work before we > bail, if we have a very large incompressible object --- which is not > exactly an unlikely case. Consider for example JPEG images stored as > bytea, which I believe I've heard of people doing. Another issue is > that it's not real clear that that fixes the problem for any fractional > size we'd want to use. In Larry's example of a jsonb value that fails > to compress, the header size is 940 bytes out of about 12K, so we'd be > needing to trial-compress about 10% of the object before we reach > compressible data --- and I doubt his example is worst-case. Agreed- I tried to allude to that in my prior mail, there's clearly a concern that we'd make things worse in certain situations. Then again, at least for that case, we could recommend changing the storage type to EXTERNAL. > >> 1. The real problem here is that jsonb is emitting quite a bit of > >> fundamentally-nonrepetitive data, even when the user-visible input is very > >> repetitive. That's a compression-unfriendly transformation by anyone's > >> measure. > > > I disagree that another algorithm wouldn't be able to manage better on > > this data than pglz. pglz, from my experience, is notoriously bad a > > certain data sets which other algorithms are not as poorly impacted by. > > Well, I used to be considered a compression expert, and I'm going to > disagree with you here. It's surely possible that other algorithms would > be able to get some traction where pglz fails to get any, but that doesn't > mean that presenting them with hard-to-compress data in the first place is > a good design decision. There is no scenario in which data like this is > going to be friendly to a general-purpose compression algorithm. It'd > be necessary to have explicit knowledge that the data consists of an > increasing series of four-byte integers to be able to do much with it. > And then such an assumption would break down once you got past the > header ... I've wondered previously as to if we, perhaps, missed the boat pretty badly by not providing an explicitly versioned per-type compression capability, such that we wouldn't be stuck with one compression algorith for all types, and would be able to version compression types in a way that would allow us to change them over time, provided the newer code always understands how to decode X-4 (or whatever) versions back. I do agree that it'd be great to represent every type in a highly compressable way for the sake of the compression algorithm, but I've not seen any good suggestions for how to make that happen and I've got a hard time seeing how we could completely change the jsonb storage format, retain the capabilities it has today, make it highly compressible, and get 9.4 out this calendar year. I expect we could trivially add padding into the jsonb header to make it compress better, for the sake of this particular check, but then we're going to always be compression jsonb, even when the user data isn't actually terribly good for compression, spending a good bit of CPU time while we're at it. > > Perhaps another options would be a new storage type which basically says > > "just compress it, no matter what"? 
We'd be able to make that the > > default for jsonb columns too, no? > > Meh. We could do that, but it would still require adding arguments to > toast_compress_datum() that aren't there now. In any case, this is a > band-aid solution; and as Josh notes, once we ship 9.4 we are going to > be stuck with jsonb's on-disk representation pretty much forever. I agree that we need to avoid changing jsonb's on-disk representation. Have I missed where a good suggestion has been made about how to do that which preserves the binary-search capabilities and doesn't make the code much more difficult? Trying to move the header to the end just for the sake of this doesn't strike me as a good solution as it'll make things quite a bit more complicated. Is there a way we could interleave the likely-compressible user data in with the header instead? I've not looked, but it seems like that's the only reasonable approach to address this issue in this manner. If that's simply done, then great, but it strikes me as unlikely to be.. I'll just throw out a bit of a counter-point to all this also though- we don't try to focus on making our on-disk representation of data, generally, very compressible even though there are filesystems, such as ZFS, which might benefit from certain rearrangements of our on-disk formats (no, I don't have any specific recommendations in this vein, but I certainly don't see anyone else asking after it or as
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 08:45 PM, Tom Lane wrote: Perhaps another options would be a new storage type which basically says "just compress it, no matter what"? We'd be able to make that the default for jsonb columns too, no? Meh. We could do that, but it would still require adding arguments to toast_compress_datum() that aren't there now. In any case, this is a band-aid solution; and as Josh notes, once we ship 9.4 we are going to be stuck with jsonb's on-disk representation pretty much forever. Yeah, and almost any other solution is likely to mean non-jsonb users potentially paying a penalty for fixing this for jsonb. So if we can adjust the jsonb layout to fix this problem I think we should do so. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Stephen Frost writes: > What about considering how large the object is when we are analyzing if > it compresses well overall? Hmm, yeah, that's a possibility: we could redefine the limit at which we bail out in terms of a fraction of the object size instead of a fixed limit. However, that risks expending a large amount of work before we bail, if we have a very large incompressible object --- which is not exactly an unlikely case. Consider for example JPEG images stored as bytea, which I believe I've heard of people doing. Another issue is that it's not real clear that that fixes the problem for any fractional size we'd want to use. In Larry's example of a jsonb value that fails to compress, the header size is 940 bytes out of about 12K, so we'd be needing to trial-compress about 10% of the object before we reach compressible data --- and I doubt his example is worst-case. >> 1. The real problem here is that jsonb is emitting quite a bit of >> fundamentally-nonrepetitive data, even when the user-visible input is very >> repetitive. That's a compression-unfriendly transformation by anyone's >> measure. > I disagree that another algorithm wouldn't be able to manage better on > this data than pglz. pglz, from my experience, is notoriously bad a > certain data sets which other algorithms are not as poorly impacted by. Well, I used to be considered a compression expert, and I'm going to disagree with you here. It's surely possible that other algorithms would be able to get some traction where pglz fails to get any, but that doesn't mean that presenting them with hard-to-compress data in the first place is a good design decision. There is no scenario in which data like this is going to be friendly to a general-purpose compression algorithm. It'd be necessary to have explicit knowledge that the data consists of an increasing series of four-byte integers to be able to do much with it. And then such an assumption would break down once you got past the header ... > Perhaps another options would be a new storage type which basically says > "just compress it, no matter what"? We'd be able to make that the > default for jsonb columns too, no? Meh. We could do that, but it would still require adding arguments to toast_compress_datum() that aren't there now. In any case, this is a band-aid solution; and as Josh notes, once we ship 9.4 we are going to be stuck with jsonb's on-disk representation pretty much forever. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
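For concreteness, a sketch of what a fraction-based bail-out limit could look like, clamped between the current fixed value and the full input size. The helper name and the clamping choices are assumptions made for this example, not a concrete proposal for pglz's parameters.

#include <stddef.h>

#define DEFAULT_FIRST_SUCCESS_BY 1024   /* the current fixed early-exit limit */

/*
 * Early-exit limit as a fraction of the input, never below the fixed
 * default and never beyond the input itself.  A large incompressible
 * object still pays for trial-compressing the whole fraction, which is
 * exactly the risk raised above.
 */
static size_t
first_success_by_for(size_t source_len, double fraction)
{
    size_t      limit = (size_t) (source_len * fraction);

    if (limit < DEFAULT_FIRST_SUCCESS_BY)
        limit = DEFAULT_FIRST_SUCCESS_BY;
    if (limit > source_len)
        limit = source_len;
    return limit;
}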
Re: [HACKERS] jsonb format is pessimal for toast compression
* Josh Berkus (j...@agliodbs.com) wrote: > On 08/08/2014 08:02 AM, Tom Lane wrote: > > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > > jsonb is still within the bounds of reason. > > > > Considering all the hype that's built up around jsonb, shipping a design > > with a fundamental performance handicap doesn't seem like a good plan > > to me. We could perhaps band-aid around it by using different compression > > parameters for jsonb, although that would require some painful API changes > > since toast_compress_datum() doesn't know what datatype it's operating on. > > I would rather ship late than ship a noncompressable JSONB. > > One we ship 9.4, many users are going to load 100's of GB into JSONB > fields. Even if we fix the compressability issue in 9.5, those users > won't be able to fix the compression without rewriting all their data, > which could be prohibitive. And we'll be in a position where we have > to support the 9.4 JSONB format/compression technique for years so that > users aren't blocked from upgrading. Would you accept removing the binary-search capability from jsonb just to make it compressable? I certainly wouldn't. I'd hate to ship late also, but I'd be willing to support that if we can find a good solution to keep both compressability and binary-search (and provided it doesn't delay us many months..). Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] jsonb format is pessimal for toast compression
* Bruce Momjian (br...@momjian.us) wrote: > On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > > jsonb is still within the bounds of reason. > > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 > betas and report the problem JSONB columns. That is *not* a good solution.. Thanks, Stephen
Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org
* Joshua D. Drake (j...@commandprompt.com) wrote:
> On 08/07/2014 10:12 PM, Stephen Frost wrote:
> >If you want the specific version, update your deb line. Don't complain
> >because you used the Debian repo that follows the Debian guidelines and
> >didn't like what you got- that's not going to change.
>
> May I have a link? Because I would argue that the "latest" version
> of the library for 9.2, is the library that ships with 9.2.9, not
> the one that ships with 9.3.5.

Really? The latest version of libpq SO version 5 that we ship is 9.2.9? I certainly don't feel that way. Next you'll be asking us to bump the major SO for every major release of PG.

Note that the last time we changed the SO major version was back in 2006 and that was primarily because we were removing symbols that people weren't supposed to be using anyway. Prior to that, it was 2005, or the 8.0/8.1 timeframe...

Thanks,
Stephen
Re: [HACKERS] jsonb format is pessimal for toast compression
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> I looked into the issue reported in bug #11109. The problem appears to be > >> that jsonb's on-disk format is designed in such a way that the leading > >> portion of any JSON array or object will be fairly incompressible, because > >> it consists mostly of a strictly-increasing series of integer offsets. > >> This interacts poorly with the code in pglz_compress() that gives up if > >> it's found nothing compressible in the first first_success_by bytes of a > >> value-to-be-compressed. (first_success_by is 1024 in the default set of > >> compression parameters.) > > > I haven't looked at this in any detail, so take this with a grain of > > salt, but what about teaching pglz_compress about using an offset > > farther into the data, if the incoming data is quite a bit larger than > > 1k? This is just a test to see if it's worthwhile to keep going, no? > > Well, the point of the existing approach is that it's a *nearly free* > test to see if it's worthwhile to keep going; there's just one if-test > added in the outer loop of the compression code. (cf commit ad434473ebd2, > which added that along with some other changes.) AFAICS, what we'd have > to do to do it as you suggest would to execute compression on some subset > of the data and then throw away that work entirely. I do not find that > attractive, especially when for most datatypes there's no particular > reason to look at one subset instead of another. Ah, I see- we were using the first block as it means we can reuse the work done on it if we decide to continue with the compression. Makes sense. We could possibly arrange to have the amount attempted depend on the data type, but you point out that we can't do that without teaching lower components about types, which is less than ideal. What about considering how large the object is when we are analyzing if it compresses well overall? That is- for a larger object, make a larger effort to compress it. There's clearly a pessimistic case which could arise from that, but it may be better than the current situation. There's a clear risk that such an algorithm may well be very type specific, meaning that we make things worse for some types (eg: bytea's which end up never compressing well we'd likely spend more CPU time trying than we do today). > 1. The real problem here is that jsonb is emitting quite a bit of > fundamentally-nonrepetitive data, even when the user-visible input is very > repetitive. That's a compression-unfriendly transformation by anyone's > measure. Assuming that some future replacement for pg_lzcompress() will > nonetheless be able to compress the data strikes me as mostly wishful > thinking. Besides, we'd more than likely have a similar early-exit rule > in any substitute implementation, so that we'd still be at risk even if > it usually worked. I agree that jsonb ends up being nonrepetitive in part, which is why I've been trying to push the discussion in the direction of making it more likely for the highly-compressible data to be considered rather than the start of the jsonb object. I don't care for our compression algorithm having to be catered to in this regard in general though as the exact same problem could, and likely does, exist in some real life bytea-using PG implementations. I disagree that another algorithm wouldn't be able to manage better on this data than pglz. 
pglz, from my experience, is notoriously bad at certain data sets which other algorithms are not as poorly impacted by.

> 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

I'd really hate to ship 9.4 without a fix for this, but I have a similarly hard time with shipping 9.4 without the binary search component.

> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me. We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I don't like the idea of shipping with this handicap either. Perhaps another option would be a new storage type which basically says "just compress it, no matter what"? We'd be able to make that the default for jsonb columns too, no?

Again- I'll admit this is shooting from the hip, but I wanted to get these thoughts out and I won't have much more time tonight.

Thanks!
Stephen
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
I wrote: > David Rowley writes: >> The attached seems to fix the problem, but the whole thing makes me wonder >> if this is even meant to be allowed? I was thinking that this might be a >> good time to disallow this altogether, since it's already broken and looks >> like it has been for about 11 years > We've gone out of our way in the past to allow duplicate index columns > (eg commit cfc5008a51f4), so I'm not sure why we'd not allow such indexes > to be used as foreign key references. I poked at this some more and changed my mind. Per SQL standard, the referent of a foreign key has to match some unique or pkey constraint, and *we do not allow duplicate columns in constraint syntax*: regression=# create temp table pp (f1 int, unique(f1,f1)); ERROR: column "f1" appears twice in unique constraint LINE 1: create temp table pp (f1 int, unique(f1,f1)); ^ which restriction I believe is also per spec. It's true that using CREATE UNIQUE INDEX syntax, you can make an index with duplicate columns, but there seems no a-priori reason why we have to allow such an index to be the referent of a foreign key. The main reason *not* to allow it is that such an index might conceivably have different opclasses for the index columns containing the same attnum, and in that case it becomes ambiguous which opclass (ie, which definition of equality) applies to each column of the foreign key. We could make some arbitrary definition of which one to select but I think that would be stretching generality well past the point of usefulness. Worse, if the different opclasses do have different definitions of equality and we pick the wrong one, it's not real clear that it mightn't be possible for a referencing row to be considered equal to multiple rows in the referenced table (that are distinct according to the rules used by the index). So I'm thinking you're right: we should rewrite this code so that only indexes having all columns distinct can match, thereby ensuring that there is only one possible interpretation of the equality semantics for the foreign key. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan wrote:
> Robert pointed out a case where varying character case of an English
> word did not alter the primary level bytes (and thus the poor man's
> normalized key was the same). He acknowledged that most of the entropy
> of the first 8 bytes of the string went into the first 8 bytes of the
> blob/key. This can actually be very useful to the optimization in some
> cases. In particular, with most Latin alphabets you'll see the same
> pattern when diacritics are used. This means that even though the
> original string has (say) an accented character that would take 2
> bytes to store in UTF-8, the weight in the primary level is the same
> as an equivalent unaccented character (and so only takes one byte to
> store at that level, with differences only in subsequent levels).
> Whole strxfrm() blobs are the same length regardless of how many
> accents appear in otherwise equivalent original Latin string, and you
> get exactly the same high concentrations of entropy in the first 8
> bytes in pretty much all Latin alphabets (the *absence* of diacritics
> is stored in later weight levels too, even with the "en_US.UTF-8"
> collation).

There are many other interesting cases where en_US.UTF-8, and presumably all other collations, concentrate much more entropy into leading bytes than might be expected. Consider:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc"
"abc" -> 0c0d0e0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc"
"# abc" -> 0c0d0e01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "* abc"
"* abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes)

and, to show you what this looks like when the primary weights/original codepoints appear backwards:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba"
"cba" -> 0e0d0c0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba"
"# cba" -> 0e0d0c01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "* cba"
"* cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes)

Evidently, the implementation always places primary weights first, corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and later "\0e\0d\0c") - no matter how many "extraneous" characters are placed in front. They're handled later. Spaces don't appear in the primary weight level at all:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c"
"a b c" -> 0c0d0e01090909010909090102350235 (16 bytes)

Lots of punctuation-type characters will not affect the primary weight level:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%@!()\/#-+,:^~? a b c"
"%@!()\/#-+,:^~? a b c" -> 0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235 (48 bytes)

Some non-alphabetic ASCII characters will affect the primary level, though. For example:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c"
"1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes)

There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e", a primary weight "\03" corresponding to '1'.

-- Peter Geoghegan
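A small standalone demo of how those leading primary-weight bytes can be exploited: build an 8-byte prefix of each strxfrm() blob and compare prefixes with memcmp() before falling back to a full comparison. This only illustrates the "poor man's normalized key" idea in the thread, not code from any patch, and it assumes the glibc en_US.UTF-8 locale is installed.

#include <locale.h>
#include <stdio.h>
#include <string.h>

#define PREFIX_LEN 8

/* Fill prefix[] with the first bytes of strxfrm(s), zero-padded. */
static void
make_prefix(const char *s, unsigned char prefix[PREFIX_LEN])
{
    char        buf[256];
    size_t      len;

    memset(prefix, 0, PREFIX_LEN);
    len = strxfrm(buf, s, sizeof(buf));
    if (len >= sizeof(buf))
        return;                 /* blob didn't fit; demo inputs are short, so not expected */
    memcpy(prefix, buf, len < PREFIX_LEN ? len : PREFIX_LEN);
}

int
main(void)
{
    unsigned char a[PREFIX_LEN];
    unsigned char b[PREFIX_LEN];

    setlocale(LC_COLLATE, "en_US.UTF-8");
    make_prefix("# abc", a);
    make_prefix("* cba", b);

    /*
     * With the blobs shown above, the prefixes already differ ("abc" vs
     * "cba" primary weights), so no full comparison is needed.  Equal
     * prefixes would mean "undecided": fall back to strcoll().
     */
    printf("prefix comparison: %d\n", memcmp(a, b, PREFIX_LEN));
    return 0;
}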
Re: [HACKERS] jsonb format is pessimal for toast compression
I was not complaining; I think JSONB is awesome. But I am one of those people who would like to put 100's of GB (or more) JSON files into Postgres and I am concerned about file size and possible future changes to the format. On Fri, Aug 8, 2014 at 7:10 PM, Peter Geoghegan wrote: > On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus wrote: > > One we ship 9.4, many users are going to load 100's of GB into JSONB > > fields. Even if we fix the compressability issue in 9.5, those users > > won't be able to fix the compression without rewriting all their data, > > which could be prohibitive. And we'll be in a position where we have > > to support the 9.4 JSONB format/compression technique for years so that > > users aren't blocked from upgrading. > > FWIW, if we take the delicious JSON data as representative, a table > storing that data as jsonb is 1374 MB in size. Whereas an equivalent > table with the data typed using the original json datatype (but with > white space differences more or less ignored, because it was created > using a jsonb -> json cast), the same data is 1352 MB. > > Larry's complaint is valid; this is a real problem, and I'd like to > fix it before 9.4 is out. However, let us not lose sight of the fact > that JSON data is usually a poor target for TOAST compression. With > idiomatic usage, redundancy is very much more likely to appear across > rows, and not within individual Datums. Frankly, we aren't doing a > very good job there, and doing better requires an alternative > strategy. > > -- > Peter Geoghegan >
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus wrote: > One we ship 9.4, many users are going to load 100's of GB into JSONB > fields. Even if we fix the compressability issue in 9.5, those users > won't be able to fix the compression without rewriting all their data, > which could be prohibitive. And we'll be in a position where we have > to support the 9.4 JSONB format/compression technique for years so that > users aren't blocked from upgrading. FWIW, if we take the delicious JSON data as representative, a table storing that data as jsonb is 1374 MB in size. Whereas an equivalent table with the data typed using the original json datatype (but with white space differences more or less ignored, because it was created using a jsonb -> json cast), the same data is 1352 MB. Larry's complaint is valid; this is a real problem, and I'd like to fix it before 9.4 is out. However, let us not lose sight of the fact that JSON data is usually a poor target for TOAST compression. With idiomatic usage, redundancy is very much more likely to appear across rows, and not within individual Datums. Frankly, we aren't doing a very good job there, and doing better requires an alternative strategy. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 12:41 PM, Ants Aasma wrote: > I don't think binary search is the main problem here. Objects are > usually reasonably sized, while arrays are more likely to be huge. To > make matters worse, jsonb -> int goes from O(1) to O(n). I don't think it's true that arrays are more likely to be huge. That regression would be bad, but jsonb -> int is not the most compelling operator by far. The indexable operators (in particular, @>) don't support subscripting arrays like that, and with good reason. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 06:17 AM, Tom Lane wrote: > I looked into the issue reported in bug #11109. The problem appears to be > that jsonb's on-disk format is designed in such a way that the leading > portion of any JSON array or object will be fairly incompressible, because > it consists mostly of a strictly-increasing series of integer offsets. How hard and how expensive would it be to teach pg_lzcompress to apply a delta filter on suitable data, so that instead of the integers their deltas would be fed to the "real" compressor? -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
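The delta-filter idea is easy to see in miniature. Below is a standalone sketch (plain C, not pg_lzcompress, and purely illustrative): delta-encoding a strictly increasing offset series turns it into a run of identical values, which is exactly the kind of repetition an LZ-style compressor can exploit.

#include <stdint.h>
#include <stdio.h>

#define N 8

int
main(void)
{
	uint32_t	offsets[N];
	uint32_t	deltas[N];
	int			i;

	/* a strictly increasing series, like a run of JEntry end offsets */
	for (i = 0; i < N; i++)
		offsets[i] = (i + 1) * 13;

	/* the delta filter: keep differences instead of absolute values */
	for (i = 0; i < N; i++)
		deltas[i] = offsets[i] - (i > 0 ? offsets[i - 1] : 0);

	for (i = 0; i < N; i++)
		printf("offset %08x   delta %08x\n", offsets[i], deltas[i]);

	/*
	 * Every delta is 0000000d, so the filtered stream compresses
	 * trivially, while the raw offsets never repeat.
	 */
	return 0;
}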
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan wrote: >> I took a quick look and saw that this wouldn't be that easy to get around. >> As I'd suspected upthread, there are places that do random access into a >> JEntry array, such as the binary search in findJsonbValueFromContainer(). >> If we have to add up all the preceding lengths to locate the corresponding >> value part, we lose the performance advantages of binary search. AFAICS >> that's applied directly to the on-disk representation. I'd thought >> perhaps there was always a transformation step to build a pointer list, >> but nope. > > It would be interesting to know what the performance hit would be if we > calculated the offsets/pointers on the fly, especially if we could cache it > somehow. The main benefit of binary search is in saving on comparisons, > especially of strings, ISTM, and that could still be available - this would > just be a bit of extra arithmetic. I don't think binary search is the main problem here. Objects are usually reasonably sized, while arrays are more likely to be huge. To make matters worse, jsonb -> int goes from O(1) to O(n). Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev wrote: > Curious idea: we could swap JEntry array and values: values in the >> begining of type will be catched by pg_lzcompress. But we will need to >> know offset of JEntry array, so header will grow up till 8 bytes >> (actually, it will be a varlena header!) >> > > May be I wasn't clear:jsonb type will start from string collection instead > of JEntry array, JEntry array will be placed at the end of object/array. > so, pg_lzcompress will find repeatable 4-byte pieces in first 1024 bytes of > jsonb. Another idea I have: storing an offset in each JEntry is not necessary to get the benefit of binary search. Namely, what if we store offsets in every 8th JEntry and lengths in the others? The speed of binary search will be about the same: the only overhead is calculating offsets within the 8-entry chunk. But lengths will probably repeat. -- With best regards, Alexander Korotkov.
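This is not the jsonb code, only a standalone sketch of the arithmetic Alexander describes, with hypothetical names and layout: every 8th entry keeps an absolute end offset, the rest keep lengths, so recovering a start position costs at most seven extra additions while the stored values become repetitive enough to compress.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STRIDE 8				/* hypothetical: every 8th entry stores an offset */

/*
 * Start offset of element i: walk left from i-1, adding stored lengths,
 * until an entry that stores an absolute end offset is reached.
 */
static uint32_t
element_offset(const uint32_t *entries, const bool *is_offset, int i)
{
	uint32_t	off = 0;
	int			j;

	for (j = i - 1; j >= 0; j--)
	{
		off += entries[j];
		if (is_offset[j])
			break;				/* entries[j] was an absolute end offset */
	}
	return off;
}

int
main(void)
{
	uint32_t	entries[20];
	bool		is_offset[20];
	uint32_t	end = 0;
	int			i;

	/* 20 elements, each 13 bytes long */
	for (i = 0; i < 20; i++)
	{
		end += 13;
		is_offset[i] = (i % STRIDE) == STRIDE - 1;
		entries[i] = is_offset[i] ? end : 13;	/* offset every 8th, length otherwise */
	}

	for (i = 0; i < 20; i++)
		printf("element %2d starts at %u\n", i,
			   (unsigned) element_offset(entries, is_offset, i));
	return 0;
}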
Re: [HACKERS] psql: show only failed queries
Hi 2014-08-08 13:58 GMT+02:00 Fujii Masao : > On Fri, Aug 8, 2014 at 4:31 AM, Pavel Stehule > wrote: > > > > > > > > 2014-08-07 7:10 GMT+02:00 Fujii Masao : > > > >> On Thu, Aug 7, 2014 at 6:26 AM, Pavel Stehule > >> wrote: > >> > Hello > >> > > >> > updated version patch in attachment > >> > >> Thanks! But ISTM you forgot to attached the patch. > > > > > > grr .. I am sorry > > No problem. Thanks for the patch! Attached is the revised version of the > patch. > > >> >> +/* all psql known variables are included in list by default */ > >> >> +for (known_varname = known_varnames; *known_varname; > >> >> known_varname++) > >> >> +varnames[nvars++] = pg_strdup(*known_varname); > >> >> > >> >> Don't we need to append both prefix and suffix to even known > variables? > >> > > >> > > >> > ??? I am not sure if I understand well - probably system "read only" > >> > variables as DBNAME, USER, VERSION should not be there > >> > >> I had that question because complete_from_variables() is also called by > >> the > >> tab-completion of "\echo :" and it shows half-baked variables list. So > >> I thought probably we need to change complete_from_variables() more to > >> fix the problem. > > > > > > I understand now. > > > > I fixed it. > > > > I have a question.\echo probably should not to show empty known variable. > > > > data for autocomplete for \echo should be same as result of "\set" > > Agreed. I think that only the variables having the set values should be > displayed in "\echo :" case. So I modified complete_from_variables() > so that the unset variables are not shown in that case but all the > variables > are shown in the tab-completion of "\set". > > >> >> +else if (strcmp(prev2_wd, "\\set") == 0) > >> >> +{ > >> >> +if (strcmp(prev_wd, "AUTOCOMMIT") == 0) > >> >> > >> >> ISTM that some psql variables like IGNOREEOF are not there. Why not? > >> > > >> > > >> > yes, there are not complete for DBNAME, ENCODING, FETCH_COUNT, > >> > HISTCONTROL, > >> > HISTFILE, HISTSIZE, HOST, IGNOREEOFF, PROMPT*,USER, VERSION > >> > > >> > There are more reasons: > >> > > >> > * paremeter is not a enum (string, number or both): FETCH_COUNT, > PROMPT, > >> > HISTSIZE, .. > >> > > >> > * variable is pseudo read only - change has not any effect: DBNAME, > >> > ENCODING, VERSION > >> > >> So HISTCONTROL should be there because it doesn't have such reasons at > >> all? > >> > > > > yes > > ISTM that you forgot to add HISTCONTROL to your patch. So I just added > that. > > I added the tab-completion for "\unset" command. Like "\echo :", only > the variables having the set values should be displayed in "\unset" case. > perfect > > I changed complete_from_variables() so that it checks the memory size of > the variable array even when appending the known variables. If the memory > size is not enough, it's dynamically increased. Practically this check > would > not be required because the initial size of the array is enough larger than > the number of the known variables. I added this as the safe-guard. > > Typo: IGNOREEOFF -> IGNOREEOF > > I removed the value "none" from the value list of "ECHO" because it's not > documented and a user might get confused when he or she sees the > undocumented > value "none". Thought? > isn't better to fix doc? I don't know any reason why we should not to support "none" I looked to code, you removed a check against duplicate varname in list. Is it ok? Regards Pavel > > Regards, > > -- > Fujii Masao >
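The resizing safe-guard described for complete_from_variables() follows a common pattern. Here is a hedged standalone sketch of that pattern with hypothetical names (append_varname, nvars, maxvars), not the actual tab-complete.c code: the candidate array is doubled on demand and kept NULL-terminated, so appending the known variables stays safe even if the initial allocation is too small.

#include <stdlib.h>
#include <string.h>

static char **
append_varname(char **varnames, int *nvars, int *maxvars, const char *name)
{
	if (*nvars >= *maxvars)
	{
		/* double the capacity; error checking omitted in this sketch */
		*maxvars *= 2;
		varnames = realloc(varnames, (*maxvars + 1) * sizeof(char *));
	}
	varnames[(*nvars)++] = strdup(name);
	varnames[*nvars] = NULL;	/* keep the list NULL-terminated for the caller */
	return varnames;
}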
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 08:02 AM, Tom Lane wrote: > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > jsonb is still within the bounds of reason. > > Considering all the hype that's built up around jsonb, shipping a design > with a fundamental performance handicap doesn't seem like a good plan > to me. We could perhaps band-aid around it by using different compression > parameters for jsonb, although that would require some painful API changes > since toast_compress_datum() doesn't know what datatype it's operating on. I would rather ship late than ship a noncompressible JSONB. Once we ship 9.4, many users are going to load 100's of GB into JSONB fields. Even if we fix the compressibility issue in 9.5, those users won't be able to fix the compression without rewriting all their data, which could be prohibitive. And we'll be in a position where we have to support the 9.4 JSONB format/compression technique for years so that users aren't blocked from upgrading. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Specifying the unit in storage parameter
On 08/07/2014 08:32 PM, Fujii Masao wrote: > This is not user-friendly. I'd like to propose the attached patch which > introduces the infrastructure which allows us to specify the unit when > setting INTEGER storage parameter like autovacuum_vacuum_cost_delay. > Comment? Review? No review, but thank you for doing this! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > jsonb is still within the bounds of reason. FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 betas and report the problem JSONB columns. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Curious idea: we could swap the JEntry array and the values: values at the beginning of the type would be caught by pg_lzcompress. But we will need to know the offset of the JEntry array, so the header will grow to 8 bytes (actually, it will be a varlena header!) Maybe I wasn't clear: the jsonb type will start with the string collection instead of the JEntry array; the JEntry array will be placed at the end of the object/array. So pg_lzcompress will find repeatable 4-byte pieces in the first 1024 bytes of the jsonb. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org
On 08/07/2014 10:12 PM, Stephen Frost wrote: If you want the specific version, update your deb line. Don't complain because you used the Debian repo that follows the Debian guidelines and didn't like what you got- that's not going to change. May I have a link? Because I would argue that the "latest" version of the library for 9.2, is the library that ships with 9.2.9, not the one that ships with 9.3.5. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, @cmdpromptinc "If we send our children to Caesar for their education, we should not be surprised when they come back as Romans." -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) Curious idea: we could swap the JEntry array and the values: values at the beginning of the type would be caught by pg_lzcompress. But we will need to know the offset of the JEntry array, so the header will grow to 8 bytes (actually, it will be a varlena header!) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On Sat, Aug 9, 2014 at 12:29 AM, Robert Haas wrote: > On Fri, Aug 8, 2014 at 11:00 AM, Bruce Momjian wrote: >> On Fri, Aug 8, 2014 at 08:51:13AM -0400, Robert Haas wrote: >>> On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen >>> wrote: >>> > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote: >>> >> That is, we log replication commands only when log_statement is set to >>> >> all. Neither new parameter is introduced nor log_statement is >>> >> redefined as a list. >>> > >>> > That sounds good to me. >>> >>> It sounds fairly unprincipled to me. I liked the idea of making >>> log_statement a list, but if we aren't gonna do that, I think this >>> should be a separate parameter. >> >> I am unclear there is enough demand for a separate replication logging >> parameter --- using log_statement=all made sense to me. > > Most people don't want to turn on log_statement=all because it > produces too much log volume. > > See, for example: > http://bonesmoses.org/2014/08/05/on-postgresql-logging-verbosity/ > > But logging replication commands is quite low-volume, so it is not > hard to imagine someone wanting to log all replication commands but > not all SQL statements. You can do that by executing "ALTER ROLE SET log_statement TO 'all'". If you don't use the replication user to execute SQL statements, no SQL statements are logged in that setting. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A worst case for qsort
On Fri, Aug 8, 2014 at 5:54 AM, Robert Haas wrote: > Well, I'm not sure why you're having a hard time imagining it. > Presorted input is a common case in general; that's why we have a > check for it. That check adds overhead in the non-pre-sorted case to > improve the pre-sorted case, and nobody's ever argued for removing it > that I can recall. I think that there has been a fair amount of skepticism - e.g., [1] But that's beside the point. What I mean is that I think that the intersection of those three things - pre-sorted input, with all differences after the first 8 bytes, and with a user requirement to sort using the column - is fairly rare in practice. It's not impossible, but it is fairly rare. If that was the only case that was actually regressed, even taking into account fmgr elision, I think that would be quite reasonable. It wouldn't be massively regressed either, and it's a case that's already very fast relative to other systems anyway, if you're lucky enough to get it. You'd better have exactly sorted tuples, though. It's been my observation that one slight difference can drastically alter the outcome [2]. [1] http://www.postgresql.org/message-id/18033.1361789...@sss.pgh.pa.us [2] http://www.postgresql.org/message-id/caeylb_w++uhrcwprzg9tybvf7sn-c1s9olbabvavpgdep2d...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 11:54 AM, Tom Lane wrote: Andrew Dunstan writes: On 08/08/2014 11:18 AM, Tom Lane wrote: That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. It would certainly be worth validating that changing this would fix the problem. I don't know how invasive that would be - I suspect (without looking very closely) not terribly much. I took a quick look and saw that this wouldn't be that easy to get around. As I'd suspected upthread, there are places that do random access into a JEntry array, such as the binary search in findJsonbValueFromContainer(). If we have to add up all the preceding lengths to locate the corresponding value part, we lose the performance advantages of binary search. AFAICS that's applied directly to the on-disk representation. I'd thought perhaps there was always a transformation step to build a pointer list, but nope. It would be interesting to know what the performance hit would be if we calculated the offsets/pointers on the fly, especially if we could cache it somehow. The main benefit of binary search is in saving on comparisons, especially of strings, ISTM, and that could still be available - this would just be a bit of extra arithmetic. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 12:04 PM, John W Higgins wrote: Would an answer be to switch the location of the jsonb "header" data to the end of the field as opposed to the beginning of the field? That would allow pglz to see what it wants to see early on and go to work when possible? Add an offset at the top of the field to show where to look - but then it would be the same in terms of functionality outside of that? Or pretty close? That might make building up jsonb structures piece by piece as we do difficult. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
John W Higgins writes: > Would an answer be to switch the location of the jsonb "header" data to the > end of the field as opposed to the beginning of the field? That would allow > pglz to see what it wants to see early on and go to work when possible? Hm, might work. Seems a bit odd, but it would make pglz_compress happier. OTOH, the big-picture issue here is that jsonb is generating noncompressible data in the first place. Putting it somewhere else in the Datum doesn't change the fact that we're going to have bloated storage, even if we dodge the early-exit problem. (I suspect the compression disadvantage vs text/plain-json that I showed yesterday is coming largely from that offset array.) But I don't currently see how to avoid that and still preserve the fast binary-search key lookup property, which is surely a nice thing to have. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Defining a foreign key with a duplicate column is broken
David Rowley writes: > I wasn't quite sure if it was possible to include the same column twice in > a foreign key, so I tested > create table r1 (a int); > create table r2 (b int); > create unique index on r2(b,b); > alter table r1 add constraint r2_b_fkey foreign key (a,a) references r2 > (b,b); > ERROR: cache lookup failed for opclass 0 Ouch. > The attached seems to fix the problem, but the whole thing makes me wonder > if this is even meant to be allowed? I was thinking that this might be a > good time to disallow this altogether, since it's already broken and looks > like it has been for about 11 years We've gone out of our way in the past to allow duplicate index columns (eg commit cfc5008a51f4), so I'm not sure why we'd not allow such indexes to be used as foreign key references. The example you posted above does look pretty pointless, but something like this is perhaps less so: regression=# create table r1 (a int, c int); CREATE TABLE regression=# create table r2 (b int); CREATE TABLE regression=# create unique index on r2(b,b); CREATE INDEX regression=# alter table r1 add constraint r2_b_fkey foreign key (a,c) references r2 (b,b); ERROR: cache lookup failed for opclass 0 especially when using nondefault FK match rules. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 8:02 AM, Tom Lane wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > > I'm rather disinclined to change the on-disk format because of this > > specific test, that feels a bit like the tail wagging the dog to me, > > especially as I do hope that some day we'll figure out a way to use a > > better compression algorithm than pglz. > > I'm unimpressed by that argument too, for a number of reasons: > > 1. The real problem here is that jsonb is emitting quite a bit of > fundamentally-nonrepetitive data, even when the user-visible input is very > repetitive. That's a compression-unfriendly transformation by anyone's > measure. Assuming that some future replacement for pg_lzcompress() will > nonetheless be able to compress the data strikes me as mostly wishful > thinking. Besides, we'd more than likely have a similar early-exit rule > in any substitute implementation, so that we'd still be at risk even if > it usually worked. > Would an answer be to switch the location of the jsonb "header" data to the end of the field as opposed to the beginning of the field? That would allow pglz to see what it wants to see early on and go to work when possible? Add an offset at the top of the field to show where to look - but then it would be the same in terms of functionality outside of that? Or pretty close? John
Re: [HACKERS] jsonb format is pessimal for toast compression
Andrew Dunstan writes: > On 08/08/2014 11:18 AM, Tom Lane wrote: >> That's not really the issue here, I think. The problem is that a >> relatively minor aspect of the representation, namely the choice to store >> a series of offsets rather than a series of lengths, produces >> nonrepetitive data even when the original input is repetitive. > It would certainly be worth validating that changing this would fix the > problem. > I don't know how invasive that would be - I suspect (without looking > very closely) not terribly much. I took a quick look and saw that this wouldn't be that easy to get around. As I'd suspected upthread, there are places that do random access into a JEntry array, such as the binary search in findJsonbValueFromContainer(). If we have to add up all the preceding lengths to locate the corresponding value part, we lose the performance advantages of binary search. AFAICS that's applied directly to the on-disk representation. I'd thought perhaps there was always a transformation step to build a pointer list, but nope. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Defining a foreign key with a duplicate column is broken
In master I've been testing some new code that I'm working on around foreign keys. I wasn't quite sure if it was possible to include the same column twice in a foreign key, so I tested: create table r1 (a int); create table r2 (b int); create unique index on r2(b,b); alter table r1 add constraint r2_b_fkey foreign key (a,a) references r2 (b,b); ERROR: cache lookup failed for opclass 0 Which is not quite any of the messages that I would have expected. I've tracked this down to a small bug in transformFkeyCheckAttrs() where opclasses[1] won't get set because if (attnums[0] == indexStruct->indkey.values[1]) will match and then break out of the inner loop. Instead opclasses[0] gets set twice. The attached seems to fix the problem, but the whole thing makes me wonder if this is even meant to be allowed? I was thinking that this might be a good time to disallow this altogether, since it's already broken and looks like it has been for about 11 years. Disallowing it would simplify some code in my semi/anti join removal patch that I posted here: http://www.postgresql.org/message-id/CAApHDvpCBEfuc5tD=vniepAv0pU5m=q=foqzcodmheei7oq...@mail.gmail.com Any thoughts? Regards David Rowley transformFkeyCheckAttrs_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 11:18 AM, Tom Lane wrote: Andrew Dunstan writes: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. It would certainly be worth validating that changing this would fix the problem. I don't know how invasive that would be - I suspect (without looking very closely) not terribly much. 2. Are we going to ship 9.4 without fixing this? I definitely don't see replacing pg_lzcompress as being on the agenda for 9.4, whereas changing jsonb is still within the bounds of reason. Considering all the hype that's built up around jsonb, shipping a design with a fundamental performance handicap doesn't seem like a good plan to me. We could perhaps band-aid around it by using different compression parameters for jsonb, although that would require some painful API changes since toast_compress_datum() doesn't know what datatype it's operating on. Yeah, it would be a bit painful, but after all finding out this sort of thing is why we have betas. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On Fri, Aug 8, 2014 at 11:00 AM, Bruce Momjian wrote: > On Fri, Aug 8, 2014 at 08:51:13AM -0400, Robert Haas wrote: >> On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen >> wrote: >> > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote: >> >> That is, we log replication commands only when log_statement is set to >> >> all. Neither new parameter is introduced nor log_statement is >> >> redefined as a list. >> > >> > That sounds good to me. >> >> It sounds fairly unprincipled to me. I liked the idea of making >> log_statement a list, but if we aren't gonna do that, I think this >> should be a separate parameter. > > I am unclear there is enough demand for a separate replication logging > parameter --- using log_statement=all made sense to me. Most people don't want to turn on log_statement=all because it produces too much log volume. See, for example: http://bonesmoses.org/2014/08/05/on-postgresql-logging-verbosity/ But logging replication commands is quite low-volume, so it is not hard to imagine someone wanting to log all replication commands but not all SQL statements. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Andrew Dunstan writes: > On 08/07/2014 11:17 PM, Tom Lane wrote: >> I looked into the issue reported in bug #11109. The problem appears to be >> that jsonb's on-disk format is designed in such a way that the leading >> portion of any JSON array or object will be fairly incompressible, because >> it consists mostly of a strictly-increasing series of integer offsets. > Ouch. > Back when this structure was first presented at pgCon 2013, I wondered > if we shouldn't extract the strings into a dictionary, because of key > repetition, and convinced myself that this shouldn't be necessary > because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. > Maybe we should have pglz_compress() look at the *last* 1024 bytes if it > can't find anything worth compressing in the first, for values larger > than a certain size. Not possible with anything like the current implementation, since it's just an on-the-fly status check not a trial compression. > It's worth noting that this is a fairly pathological case. AIUI the > example you constructed has an array with 100k string elements. I don't > think that's typical. So I suspect that unless I've misunderstood the > statement of the problem we're going to find that almost all the jsonb > we will be storing is still compressible. Actually, the 100K-string example I constructed *did* compress. Larry's example that's not compressing is only about 12kB. AFAICS, the threshold for trouble is in the vicinity of 256 array or object entries (resulting in a 1kB JEntry array). That doesn't seem especially high. There is a probabilistic component as to whether the early-exit case will actually fire, since any chance hash collision will probably result in some 3-byte offset prefix getting compressed. But the fact that a beta tester tripped over this doesn't leave me with a warm feeling about the odds that it won't happen much in the field. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> I looked into the issue reported in bug #11109. The problem appears to be >> that jsonb's on-disk format is designed in such a way that the leading >> portion of any JSON array or object will be fairly incompressible, because >> it consists mostly of a strictly-increasing series of integer offsets. >> This interacts poorly with the code in pglz_compress() that gives up if >> it's found nothing compressible in the first first_success_by bytes of a >> value-to-be-compressed. (first_success_by is 1024 in the default set of >> compression parameters.) > I haven't looked at this in any detail, so take this with a grain of > salt, but what about teaching pglz_compress about using an offset > farther into the data, if the incoming data is quite a bit larger than > 1k? This is just a test to see if it's worthwhile to keep going, no? Well, the point of the existing approach is that it's a *nearly free* test to see if it's worthwhile to keep going; there's just one if-test added in the outer loop of the compression code. (cf commit ad434473ebd2, which added that along with some other changes.) AFAICS, what we'd have to do to do it as you suggest would to execute compression on some subset of the data and then throw away that work entirely. I do not find that attractive, especially when for most datatypes there's no particular reason to look at one subset instead of another. > I'm rather disinclined to change the on-disk format because of this > specific test, that feels a bit like the tail wagging the dog to me, > especially as I do hope that some day we'll figure out a way to use a > better compression algorithm than pglz. I'm unimpressed by that argument too, for a number of reasons: 1. The real problem here is that jsonb is emitting quite a bit of fundamentally-nonrepetitive data, even when the user-visible input is very repetitive. That's a compression-unfriendly transformation by anyone's measure. Assuming that some future replacement for pg_lzcompress() will nonetheless be able to compress the data strikes me as mostly wishful thinking. Besides, we'd more than likely have a similar early-exit rule in any substitute implementation, so that we'd still be at risk even if it usually worked. 2. Are we going to ship 9.4 without fixing this? I definitely don't see replacing pg_lzcompress as being on the agenda for 9.4, whereas changing jsonb is still within the bounds of reason. Considering all the hype that's built up around jsonb, shipping a design with a fundamental performance handicap doesn't seem like a good plan to me. We could perhaps band-aid around it by using different compression parameters for jsonb, although that would require some painful API changes since toast_compress_datum() doesn't know what datatype it's operating on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On Fri, Aug 8, 2014 at 08:51:13AM -0400, Robert Haas wrote: > On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen > wrote: > > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote: > >> That is, we log replication commands only when log_statement is set to > >> all. Neither new parameter is introduced nor log_statement is > >> redefined as a list. > > > > That sounds good to me. > > It sounds fairly unprincipled to me. I liked the idea of making > log_statement a list, but if we aren't gonna do that, I think this > should be a separate parameter. I am unclear there is enough demand for a separate replication logging parameter --- using log_statement=all made sense to me. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao wrote: > The attached patch introduces... A patch perhaps? :) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao wrote: > On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane wrote: >> Fujii Masao writes: >>> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas wrote: Should we try to install some hack around fastupdate for 9.4? I fear the divergence between reasonable values of work_mem and reasonable sizes for that list is only going to continue to get bigger. I'm sure there's somebody out there who has work_mem = 16GB, and stuff like 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the appeal of large values. >> >>> Controlling the threshold of the size of pending list only by GUC doesn't >>> seem reasonable. Users may want to increase the threshold only for the >>> GIN index which can be updated heavily, and decrease it otherwise. So >>> I think that it's better to add new storage parameter for GIN index to >>> control >>> the threshold, or both storage parameter and GUC. >> >> Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more >> appropriate. > > The attached patch introduces the GIN index storage parameter > "PENDING_LIST_CLEANUP_SIZE" which specifies the maximum size of > GIN pending list. If it's not set, work_mem is used as that maximum size, > instead. So this patch doesn't break the existing application which > currently uses work_mem as the threshold of cleanup operation of > the pending list. It has only not to set PENDING_LIST_CLEANUP_SIZE. > > This is an index storage parameter, and allows us to specify each > threshold per GIN index. So, for example, we can increase the threshold > only for the GIN index which can be updated heavily, and decrease it > otherwise. > > This patch uses another patch that I proposed (*1) as an infrastructure. > Please apply that infrastructure patch first if you apply this patch. > > (*1) > http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com > > Regards, > > -- > Fujii Masao Sorry, I forgot to attached the patch This time, attached. Regards, -- Fujii Masao *** a/doc/src/sgml/gin.sgml --- b/doc/src/sgml/gin.sgml *** *** 728,735 from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. !When the table is vacuumed, or if the pending list becomes too large !(larger than ), the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional --- 728,736 from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. !When the table is vacuumed, or if the pending list becomes larger than !PENDING_LIST_CLEANUP_SIZE (or ! if not set), the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional *** *** 812,829 ! During a series of insertions into an existing GIN index that has FASTUPDATE enabled, the system will clean up the pending-entry list whenever the list grows larger than ! work_mem. To avoid fluctuations in observed response time, ! it's desirable to have pending-list cleanup occur in the background ! (i.e., via autovacuum). Foreground cleanup operations can be avoided by ! increasing work_mem or making autovacuum more aggressive. ! 
However, enlarging work_mem means that if a foreground ! cleanup does occur, it will take even longer. --- 813,839 !PENDING_LIST_CLEANUP_SIZE and ! During a series of insertions into an existing GIN index that has FASTUPDATE enabled, the system will clean up the pending-entry list whenever the list grows larger than ! PENDING_LIST_CLEANUP_SIZE (if not set, work_mem ! is used as that threshold, instead). To avoid fluctuations in observed ! response time, it's desirable to have pending-list cleanup occur in the ! background (i.e., via autovacuum). Foreground cleanup operations ! can be avoided by increasing PENDING_LIST_CLEANUP_SIZE ! (or work_mem) or making autovacuum more aggressive. ! However, enlarging the threshold of the cleanup operation means that ! if a foreground cleanup does occur, it will take even longer. ! ! ! PENDING_LIST_CLEANUP_SIZE is an index storage parameter, ! and allows each GIN index to have its own cleanup threshold. ! For example, it's possible to increase the threshold only for the GIN ! index which can be updated heavily, and decreas
PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane wrote: > Fujii Masao writes: >> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas wrote: >>> Should we try to install some hack around fastupdate for 9.4? I fear >>> the divergence between reasonable values of work_mem and reasonable >>> sizes for that list is only going to continue to get bigger. I'm sure >>> there's somebody out there who has work_mem = 16GB, and stuff like >>> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the >>> appeal of large values. > >> Controlling the threshold of the size of pending list only by GUC doesn't >> seem reasonable. Users may want to increase the threshold only for the >> GIN index which can be updated heavily, and decrease it otherwise. So >> I think that it's better to add new storage parameter for GIN index to >> control >> the threshold, or both storage parameter and GUC. > > Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more > appropriate. The attached patch introduces the GIN index storage parameter "PENDING_LIST_CLEANUP_SIZE" which specifies the maximum size of GIN pending list. If it's not set, work_mem is used as that maximum size, instead. So this patch doesn't break the existing application which currently uses work_mem as the threshold of cleanup operation of the pending list. It has only not to set PENDING_LIST_CLEANUP_SIZE. This is an index storage parameter, and allows us to specify each threshold per GIN index. So, for example, we can increase the threshold only for the GIN index which can be updated heavily, and decrease it otherwise. This patch uses another patch that I proposed (*1) as an infrastructure. Please apply that infrastructure patch first if you apply this patch. (*1) http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Should smgrtruncate() avoid sending sinval message for temp relations
I've tracked down the real root cause. The fix is very simple. Please check the attached one-liner patch. The cause is that the temporary relations are truncated unconditionally regardless of whether they are accessed in the transaction or not. That is, the following sequence of steps result in the hang: 1. A session creates a temporary table with ON COMMIT DELETE ROWS. It adds the temporary table to the list of relations that should be truncated at transaction commit. 2. The session receives a sinval catchup signal (SIGUSR1) from another session. It starts a transaction and processes sinval messages in the SIGUSR1 signal handler. No WAL is output while processing the sinval messages. 3. When the transaction commits, the list of temporary relations are checked to see if they need to be truncated. 4. The temporary table created in step 1 is truncated. To truncate a relation, Access Exclusive lock is acquired on it. When hot standby is used, acquiring an Access Exclusive lock generates a WAL record (RM_STANDBY_ID, XLOG_STANDBY_LOCK). 5. The transaction waits on a latch for a reply from a synchronous standby, because it wrote some WAL. But the latch wait never returns, because the latch needs to receive SIGUSR1 but the SIGUSR1 handler is already in progress from step 2. The correct behavior is for the transaction not to truncate the temporary table in step 4, because the transaction didn't use the temporary table. I confirmed that the fix is already in 9.3 and 9.5devel, so I just copied the code fragment from 9.5devel to 9.2.9. The attached patch is for 9.2.9. I didn't check 9.4 and other versions. Why wasn't the fix applied to 9.2? Finally, I found a very easy way to reproduce the problem: 1. On terminal session 1, start psql and run: CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS; Leave the psql session open. 2. On terminal session 2, run: pgbench -c8 -t500 -s1 -n -f test.sql dbname [test.sql] CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS; DROP TABLE t; 3. On the psql session on terminal session 1, run any SQL statement. It doesn't reply. The backend is stuck at SyncRepWaitForLSN(). Regards MauMau sinval_catchup_hang_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) [snip] There is plenty of compressible data once we get into the repetitive strings in the payload part --- but that starts at offset 944, and up to that point there is nothing that pg_lzcompress can get a handle on. There are, by definition, no sequences of 4 or more repeated bytes in that area. I think in principle pg_lzcompress could decide to compress the 3-byte sequences consisting of the high-order 24 bits of each offset; but it doesn't choose to do so, probably because of the way its lookup hash table works: * pglz_hist_idx - * * Computes the history table slot for the lookup by the next 4 * characters in the input. * * NB: because we use the next 4 characters, we are not guaranteed to * find 3-character matches; they very possibly will be in the wrong * hash list. This seems an acceptable tradeoff for spreading out the * hash keys more. For jsonb header data, the "next 4 characters" are *always* different, so only a chance hash collision can result in a match. There is therefore a pretty good chance that no compression will occur before it gives up because of first_success_by. I'm not sure if there is any easy fix for this. We could possibly change the default first_success_by value, but I think that'd just be postponing the problem to larger jsonb objects/arrays, and it would hurt performance for genuinely incompressible data. A somewhat painful, but not yet out-of-the-question, alternative is to change the jsonb on-disk representation. Perhaps the JEntry array could be defined as containing element lengths instead of element ending offsets. Not sure though if that would break binary searching for JSON object keys. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. Maybe we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first, for values larger than a certain size. It's worth noting that this is a fairly pathological case. AIUI the example you constructed has an array with 100k string elements. I don't think that's typical. So I suspect that unless I've misunderstood the statement of the problem we're going to find that almost all the jsonb we will be storing is still compressible. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A worst case for qsort
* Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Aug 7, 2014 at 5:52 PM, Peter Geoghegan wrote: > > On Thu, Aug 7, 2014 at 2:34 PM, Rod Taylor wrote: > >> This one is frequently sorted as batch operations against the files are > >> performed in alphabetical order to reduce conflict issues that a random > >> ordering may cause between jobs. > > > > Sure. There are cases out there. But, again, I have a hard time > > imagining why you'd expect those to be pre-sorted in practice, ... > > Well, I'm not sure why you're having a hard time imagining it. > Presorted input is a common case in general; that's why we have a > check for it. That check adds overhead in the non-pre-sorted case to > improve the pre-sorted case, and nobody's ever argued for removing it > that I can recall. Agreed. This is not all that uncommon to happen in practice. That said- this is perhaps another good case for where it'd be extremely handy to have anonymized statistics information from real systems which would allow us to more easily go *look* at this specific question. There are organizations out there who run many thousands of PG instances who have expressed interest in supporting exactly that kind of statistics gathering (indeed, I'm pretty sure Peter is familiar with one... ;) - what we need is an architecture, design, and implementation to make it happen.. I'm guessing you (Robert and Peter, especially) have already been thinking about what we could do to make the above wish/dream a reality. Perhaps if we could get the initial requirements down, someone would be able to have time to work on making it happen; it would be really great to see progress on this front for 9.5. Or, if existing products implement such metrics collection already, perhaps some numbers could be shared with the community to help address this (and other) questions. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Minmax indexes
Another race condition: If a new tuple is inserted to the range while summarization runs, it's possible that the new tuple isn't included in the tuple that the summarization calculated, nor does the insertion itself udpate it. 1. There is no index tuple for page range 1-10 2. Summarization begins. It scans pages 1-5. 3. A new insertion inserts a heap tuple to page 1. 4. The insertion sees that there is no index tuple covering range 1-10, so it doesn't update it. 5. The summarization finishes scanning pages 5-10, and inserts the new index tuple. The summarization didn't see the newly inserted heap tuple, and hence it's not included in the calculated index tuple. One idea is to do the summarization in two stages. First, insert a placeholder tuple, with no real value in it. A query considers the placeholder tuple the same as a missing tuple, ie. always considers it a match. An insertion updates the placeholder tuple with the value inserted, as if it was a regular mmtuple. After summarization has finished scanning the page range, it turns the placeholder tuple into a regular tuple, by unioning the placeholder value with the value formed by scanning the heap. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pgcrypto: PGP armor headers
Hi, Currently there's no way to generate or extract armor headers from the PGP armored format in pgcrypto. I've written a patch to add the support. For example: local:marko=#* select armor('zooka', array['Version', 'Comment'], array['Created by pgcrypto', 'PostgreSQL, the database']); armor --- -BEGIN PGP MESSAGE- + Version: Created by pgcrypto + Comment: PostgreSQL, the database+ + em9va2E= + =D5cR+ -END PGP MESSAGE-+ local:marko=#* select pgp_armor_header(armor('zooka', array['Version', 'Comment'], array['Created by pgcrypto', 'PostgreSQL, the database']), 'Comment'); pgp_armor_header -- PostgreSQL, the database (1 row) .marko *** a/contrib/pgcrypto/Makefile --- b/contrib/pgcrypto/Makefile *** *** 26,32 MODULE_big = pgcrypto OBJS = $(SRCS:.c=.o) $(WIN32RES) EXTENSION = pgcrypto ! DATA = pgcrypto--1.1.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql PGFILEDESC = "pgcrypto - cryptographic functions" REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \ --- 26,32 OBJS = $(SRCS:.c=.o) $(WIN32RES) EXTENSION = pgcrypto ! DATA = pgcrypto--1.2.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql PGFILEDESC = "pgcrypto - cryptographic functions" REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \ *** a/contrib/pgcrypto/expected/pgp-armor.out --- b/contrib/pgcrypto/expected/pgp-armor.out *** *** 102,104 em9va2E= --- 102,362 -END PGP MESSAGE- '); ERROR: Corrupt ascii-armor + -- corrupt + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + foo: + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + ERROR: Corrupt ascii-armor + -- empty + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + foo: + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + pgp_armor_header + -- + + (1 row) + + -- simple + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + foo: bar + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + pgp_armor_header + -- + bar + (1 row) + + -- uninteresting keys, part 1 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + foo: found + bar: ignored + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + pgp_armor_header + -- + found + (1 row) + + -- uninteresting keys, part 2 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + bar: ignored + foo: found + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + pgp_armor_header + -- + found + (1 row) + + -- uninteresting keys, part 3 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + bar: ignored + foo: found + bar: ignored + + em9va2E= + = + -END PGP MESSAGE- + ', 'foo'); + pgp_armor_header + -- + found + (1 row) + + -- insane keys, part 1 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + insane:key : + + em9va2E= + = + -END PGP MESSAGE- + ', 'insane:key '); + pgp_armor_header + -- + + (1 row) + + -- insane keys, part 2 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + insane:key : text value here + + em9va2E= + = + -END PGP MESSAGE- + ', 'insane:key '); + pgp_armor_header + -- + text value here + (1 row) + + -- long value + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + long: this value is more than 76 characters long, but it should still parse correctly as that''s permitted by RFC 4880 + + em9va2E= + = + -END PGP MESSAGE- + ', 'long'); + pgp_armor_header + - + this value is more than 76 characters long, but it should still parse correctly as that's permitted by RFC 4880 + (1 row) + + -- long value, split up + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + long: this value is more than 76 characters long, but it should still + long: parse correctly as that''s permitted by RFC 4880 + + em9va2E= + = + -END PGP MESSAGE- + ', 
'long'); + pgp_armor_header + - + this value is more than 76 characters long, but it should still parse correctly as that's permitted by RFC 4880 + (1 row) + + -- long value, split up, part 2 + select pgp_armor_header(' + -BEGIN PGP MESSAGE- + long: this value is more than + long: 76 characters long, but it should still
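As background for the examples in this patch, RFC 4880 ASCII armor has a simple layout: optional "Key: Value" armor header lines immediately after the BEGIN line, a blank line, the base64-encoded data, and a CRC-24 line starting with '='. The following standalone sketch (not the pgcrypto code) prints the same armored message as the armor('zooka', ...) example at the top of the thread:

#include <stdio.h>

int
main(void)
{
	/* armor headers as key/value pairs, as in the armor() example above */
	const char *keys[] = {"Version", "Comment"};
	const char *values[] = {"Created by pgcrypto", "PostgreSQL, the database"};
	int			nkeys = 2;
	int			i;

	puts("-----BEGIN PGP MESSAGE-----");
	for (i = 0; i < nkeys; i++)
		printf("%s: %s\n", keys[i], values[i]);	/* armor header lines */
	puts("");					/* blank line separates headers from data */
	puts("em9va2E=");			/* base64 of "zooka" */
	puts("=D5cR");				/* CRC-24 line, taken from the example output */
	puts("-----END PGP MESSAGE-----");
	return 0;
}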
Re: [HACKERS] A worst case for qsort
On Thu, Aug 7, 2014 at 5:52 PM, Peter Geoghegan wrote: > On Thu, Aug 7, 2014 at 2:34 PM, Rod Taylor wrote: >> This one is frequently sorted as batch operations against the files are >> performed in alphabetical order to reduce conflict issues that a random >> ordering may cause between jobs. > > Sure. There are cases out there. But, again, I have a hard time > imagining why you'd expect those to be pre-sorted in practice, ... Well, I'm not sure why you're having a hard time imagining it. Presorted input is a common case in general; that's why we have a check for it. That check adds overhead in the non-pre-sorted case to improve the pre-sorted case, and nobody's ever argued for removing it that I can recall. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
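The check under discussion is a linear pre-pass over the input. The sketch below is generic, not the actual PostgreSQL qsort code: already-ordered input returns after O(n) comparisons, while every other input pays for a partial scan before falling through to the ordinary sort.

#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

static void
sort_ints(int *a, size_t n)
{
	size_t		i;

	/* pre-sorted check: bail out as soon as one out-of-order pair is seen */
	for (i = 1; i < n; i++)
		if (cmp_int(&a[i - 1], &a[i]) > 0)
			break;
	if (i >= n)
		return;					/* input was already sorted */

	qsort(a, n, sizeof(int), cmp_int);	/* otherwise sort as usual */
}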
Re: [HACKERS] replication commands and log_statements
On Thu, Aug 7, 2014 at 12:07 PM, Abhijit Menon-Sen wrote: > At 2014-08-07 23:22:43 +0900, masao.fu...@gmail.com wrote: >> That is, we log replication commands only when log_statement is set to >> all. Neither new parameter is introduced nor log_statement is >> redefined as a list. > > That sounds good to me. It sounds fairly unprincipled to me. I liked the idea of making log_statement a list, but if we aren't gonna do that, I think this should be a separate parameter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
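To make the alternatives concrete, the choice is roughly between the following postgresql.conf styles (a sketch only; neither the list syntax nor the parameter name below exists today):

# current proposal: replication commands are logged only at this level
log_statement = 'all'

# the "list" idea: log_statement accepts categories (hypothetical syntax)
#log_statement = 'ddl, replication'

# a separate parameter (hypothetical name)
#log_replication_commands = on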
Re: [HACKERS] psql: show only failed queries
On Fri, Aug 8, 2014 at 4:31 AM, Pavel Stehule wrote: > > > > 2014-08-07 7:10 GMT+02:00 Fujii Masao : > >> On Thu, Aug 7, 2014 at 6:26 AM, Pavel Stehule >> wrote: >> > Hello >> > >> > updated version patch in attachment >> >> Thanks! But ISTM you forgot to attached the patch. > > > grr .. I am sorry No problem. Thanks for the patch! Attached is the revised version of the patch. >> >> +/* all psql known variables are included in list by default */ >> >> +for (known_varname = known_varnames; *known_varname; >> >> known_varname++) >> >> +varnames[nvars++] = pg_strdup(*known_varname); >> >> >> >> Don't we need to append both prefix and suffix to even known variables? >> > >> > >> > ??? I am not sure if I understand well - probably system "read only" >> > variables as DBNAME, USER, VERSION should not be there >> >> I had that question because complete_from_variables() is also called by >> the >> tab-completion of "\echo :" and it shows half-baked variables list. So >> I thought probably we need to change complete_from_variables() more to >> fix the problem. > > > I understand now. > > I fixed it. > > I have a question.\echo probably should not to show empty known variable. > > data for autocomplete for \echo should be same as result of "\set" Agreed. I think that only the variables having the set values should be displayed in "\echo :" case. So I modified complete_from_variables() so that the unset variables are not shown in that case but all the variables are shown in the tab-completion of "\set". >> >> +else if (strcmp(prev2_wd, "\\set") == 0) >> >> +{ >> >> +if (strcmp(prev_wd, "AUTOCOMMIT") == 0) >> >> >> >> ISTM that some psql variables like IGNOREEOF are not there. Why not? >> > >> > >> > yes, there are not complete for DBNAME, ENCODING, FETCH_COUNT, >> > HISTCONTROL, >> > HISTFILE, HISTSIZE, HOST, IGNOREEOFF, PROMPT*,USER, VERSION >> > >> > There are more reasons: >> > >> > * paremeter is not a enum (string, number or both): FETCH_COUNT, PROMPT, >> > HISTSIZE, .. >> > >> > * variable is pseudo read only - change has not any effect: DBNAME, >> > ENCODING, VERSION >> >> So HISTCONTROL should be there because it doesn't have such reasons at >> all? >> > > yes ISTM that you forgot to add HISTCONTROL to your patch. So I just added that. I added the tab-completion for "\unset" command. Like "\echo :", only the variables having the set values should be displayed in "\unset" case. I changed complete_from_variables() so that it checks the memory size of the variable array even when appending the known variables. If the memory size is not enough, it's dynamically increased. Practically this check would not be required because the initial size of the array is enough larger than the number of the known variables. I added this as the safe-guard. Typo: IGNOREEOFF -> IGNOREEOF I removed the value "none" from the value list of "ECHO" because it's not documented and a user might get confused when he or she sees the undocumented value "none". Thought? Regards, -- Fujii Masao *** a/src/bin/psql/tab-complete.c --- b/src/bin/psql/tab-complete.c *** *** 813,820 static char *_complete_from_query(int is_schema_query, const char *text, int state); static char *complete_from_list(const char *text, int state); static char *complete_from_const(const char *text, int state); static char **complete_from_variables(const char *text, ! 
const char *prefix, const char *suffix); static char *complete_from_files(const char *text, int state); static char *pg_strdup_keyword_case(const char *s, const char *ref); --- 813,823 const char *text, int state); static char *complete_from_list(const char *text, int state); static char *complete_from_const(const char *text, int state); + static void append_variable_names(char ***varnames, int *nvars, + int *maxvars, const char *varname, + const char *prefix, const char *suffix); static char **complete_from_variables(const char *text, ! const char *prefix, const char *suffix, bool need_value); static char *complete_from_files(const char *text, int state); static char *pg_strdup_keyword_case(const char *s, const char *ref); *** *** 925,935 psql_completion(const char *text, int start, int end) else if (text[0] == ':' && text[1] != ':') { if (text[1] == '\'') ! matches = complete_from_variables(text, ":'", "'"); else if (text[1] == '"') ! matches = complete_from_variables(text, ":\"", "\""); else ! matches = complete_from_variables(text, ":", ""); } /* If no previous word, suggest one of the basic sql commands */ --- 928,938 else if (text[0] == ':' && text[1] != ':') { if (text[1] == '\'') ! matches = complete_from_variables(text, ":'", "'", true); else if (text[1] == '"') ! matches = complete_from_variables(text, ":\"", "\"", true); else ! matches = complete_from_variables(text, ":"
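For reference, the array-growth safeguard described above could look roughly like this (a sketch based on the declaration added in the diff; the body and the use of pg_realloc/psprintf are assumptions, not necessarily the committed code):

static void
append_variable_names(char ***varnames, int *nvars, int *maxvars,
                      const char *varname,
                      const char *prefix, const char *suffix)
{
    if (*nvars >= *maxvars)
    {
        /* safeguard: grow the candidate array if the initial size was too small */
        *maxvars *= 2;
        *varnames = (char **) pg_realloc(*varnames,
                                         ((*maxvars) + 1) * sizeof(char *));
    }
    (*varnames)[(*nvars)++] = psprintf("%s%s%s", prefix, varname, suffix);
}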
Re: [HACKERS] inherit support for foreign tables
(2014/08/06 20:43), Etsuro Fujita wrote: >> (2014/06/30 22:48), Tom Lane wrote: >>> I wonder whether it isn't time to change that. It was coded like that >>> originally only because calculating the values would've been a waste of >>> cycles at the time. But this is at least the third place where it'd be >>> useful to have attr_needed for child rels. > Attached is a WIP patch for that. I've revised the patch. Changes: * Make the code more readable * Revise the comments * Cleanup Please find attached an updated version of the patch. Thanks, Best regards, Etsuro Fujita *** a/contrib/file_fdw/file_fdw.c --- b/contrib/file_fdw/file_fdw.c *** *** 799,806 check_selective_binary_conversion(RelOptInfo *baserel, } /* Collect all the attributes needed for joins or final output. */ ! pull_varattnos((Node *) baserel->reltargetlist, baserel->relid, ! &attrs_used); /* Add all the attributes used by restriction clauses. */ foreach(lc, baserel->baserestrictinfo) --- 799,811 } /* Collect all the attributes needed for joins or final output. */ ! for (i = baserel->min_attr; i <= baserel->max_attr; i++) ! { ! if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr])) ! attrs_used = ! bms_add_member(attrs_used, ! i - FirstLowInvalidHeapAttributeNumber); ! } /* Add all the attributes used by restriction clauses. */ foreach(lc, baserel->baserestrictinfo) *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *** *** 577,588 set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, childrel->has_eclass_joins = rel->has_eclass_joins; /* ! * Note: we could compute appropriate attr_needed data for the child's ! * variables, by transforming the parent's attr_needed through the ! * translated_vars mapping. However, currently there's no need ! * because attr_needed is only examined for base relations not ! * otherrels. So we just leave the child's attr_needed empty. */ /* * Compute the child's size. --- 577,585 childrel->has_eclass_joins = rel->has_eclass_joins; /* ! * Compute the child's attr_needed. */ + adjust_appendrel_attr_needed(rel, childrel, childRTE, appinfo); /* * Compute the child's size. *** *** 2173,2178 remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel) --- 2170,2176 { Bitmapset *attrs_used = NULL; ListCell *lc; + int i; /* * Do nothing if subquery has UNION/INTERSECT/EXCEPT: in principle we *** *** 2193,2204 remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel) * Collect a bitmap of all the output column numbers used by the upper * query. * ! * Add all the attributes needed for joins or final output. Note: we must ! * look at reltargetlist, not the attr_needed data, because attr_needed ! * isn't computed for inheritance child rels, cf set_append_rel_size(). ! * (XXX might be worth changing that sometime.) */ ! pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used); /* Add all the attributes used by un-pushed-down restriction clauses. */ foreach(lc, rel->baserestrictinfo) --- 2191,2205 * Collect a bitmap of all the output column numbers used by the upper * query. * ! * Add all the attributes needed for joins or final output. */ ! for (i = rel->min_attr; i <= rel->max_attr; i++) ! { ! if (!bms_is_empty(rel->attr_needed[i - rel->min_attr])) ! attrs_used = ! bms_add_member(attrs_used, ! i - FirstLowInvalidHeapAttributeNumber); ! } /* Add all the attributes used by un-pushed-down restriction clauses. 
*/ foreach(lc, rel->baserestrictinfo) *** a/src/backend/optimizer/path/indxpath.c --- b/src/backend/optimizer/path/indxpath.c *** *** 1768,1779 check_index_only(RelOptInfo *rel, IndexOptInfo *index) * way out. */ ! /* ! * Add all the attributes needed for joins or final output. Note: we must ! * look at reltargetlist, not the attr_needed data, because attr_needed ! * isn't computed for inheritance child rels. ! */ ! pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used); /* Add all the attributes used by restriction clauses. */ foreach(lc, rel->baserestrictinfo) --- 1768,1781 * way out. */ ! /* Add all the attributes needed for joins or final output. */ ! for (i = rel->min_attr; i <= rel->max_attr; i++) ! { ! if (!bms_is_empty(rel->attr_needed[i - rel->min_attr])) ! attrs_used = ! bms_add_member(attrs_used, ! i - FirstLowInvalidHeapAttributeNumber); ! } /* Add all the attributes used by restriction clauses. */ foreach(lc, rel->baserestrictinfo) *** a/src/backend/optimizer/prep/prepunion.c --- b/src/backend/optimizer/prep/prepunion.c *** *** 109,114 static Bitmapset *translate_col_privs(cons
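Incidentally, the same attr_needed loop now appears in file_fdw, remove_unused_subquery_outputs() and check_index_only(), so it might be worth factoring into a small helper along these lines (name and placement are only a suggestion, not part of the patch):

static Bitmapset *
attrs_needed_for_rel(RelOptInfo *rel)
{
    Bitmapset  *attrs_used = NULL;
    int         i;

    /* Collect every attribute that attr_needed marks as used. */
    for (i = rel->min_attr; i <= rel->max_attr; i++)
    {
        if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
            attrs_used = bms_add_member(attrs_used,
                                        i - FirstLowInvalidHeapAttributeNumber);
    }
    return attrs_used;
}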
Re: [HACKERS] Minmax indexes
I think there's a race condition in mminsert, if two backends insert a tuple into the same heap page range concurrently. mminsert does this:

1. Fetch the MMtuple for the page range.
2. Check if any of the stored datums need updating.
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at step 3 at the same time, with different values. For example, backend A wants to update the minimum to contain 10, and backend B wants to update it to 5. Now, if backend B gets to update the tuple first, to 5, backend A will update the tuple to 10 when it gets the lock, which is wrong. The simplest solution would be to get the buffer lock in exclusive mode to begin with, so that you don't need to release it between steps 2 and 5. That might be a significant hit on concurrency, though, when most of the insertions don't in fact have to update the value. Another idea is to re-check the updated values after acquiring the lock in exclusive mode, to see if they match the previous values. - Heikki
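In pseudo-code, the re-check variant could look like this (a sketch only; the mm_* function and field names are made up for illustration and are not the patch's code):

/* we decided under shared lock that an update looks necessary */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

/* re-read the stored values now that we hold the exclusive lock */
mmtup = mm_fetch_tuple(buf, rangeStart);
if (new_value < mmtup->min_value)        /* does it still extend the range? */
    mm_update_min(buf, mmtup, new_value);
else if (new_value > mmtup->max_value)
    mm_update_max(buf, mmtup, new_value);

LockBuffer(buf, BUFFER_LOCK_UNLOCK);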
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote: > Comparing the median times, that's about a 3% regression. For this > particular case, we might be able to recapture that by replacing the > bespoke memory-tracking logic in tuplesort.c with use of this new > facility. I'm not sure whether there are other cases that we might > also want to test; I think stuff that runs all on the server side is > likely to show up problems more clearly than pgbench. Maybe a > PL/pgsql loop that does something allocation-intensive on each > iteration, for example, like parsing a big JSON document. I wasn't able to reproduce your results on my machine. At -s 300, with maintenance_work_mem set high enough to do internal sort, it took about 40s and I heard some disk activity, so I didn't think it was a valid result. I went down to -s 150, and it took around 5.3s on both master and memory-accounting. Either way, it's better to be conservative. Attached is a version of the patch with opt-in memory usage tracking. Child contexts inherit the setting from their parent. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 242,247 typedef struct AllocChunkData --- 242,249 #define AllocChunkGetPointer(chk) \ ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) + static void update_allocation(MemoryContext context, int64 size); + /* * These functions implement the MemoryContext API for AllocSet contexts. */ *** *** 430,435 randomize_mem(char *ptr, size_t size) --- 432,440 * minContextSize: minimum context size * initBlockSize: initial allocation block size * maxBlockSize: maximum allocation block size + * + * The flag determining whether this context tracks memory usage is inherited + * from the parent context. */ MemoryContext AllocSetContextCreate(MemoryContext parent, *** *** 438,443 AllocSetContextCreate(MemoryContext parent, --- 443,468 Size initBlockSize, Size maxBlockSize) { + return AllocSetContextCreateTracked( + parent, name, minContextSize, initBlockSize, maxBlockSize, + (parent == NULL) ? false : parent->track_mem); + } + + /* + * AllocSetContextCreateTracked + * Create a new AllocSet context. + * + * Implementation for AllocSetContextCreate, but also allows the caller to + * specify whether memory usage should be tracked or not. + */ + MemoryContext + AllocSetContextCreateTracked(MemoryContext parent, + const char *name, + Size minContextSize, + Size initBlockSize, + Size maxBlockSize, + bool track_mem) + { AllocSet context; /* Do the type-independent part of context creation */ *** *** 445,451 AllocSetContextCreate(MemoryContext parent, sizeof(AllocSetContext), &AllocSetMethods, parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 470,477 sizeof(AllocSetContext), &AllocSetMethods, parent, ! name, ! track_mem); /* * Make sure alloc parameters are reasonable, and save them. 
*** *** 500,505 AllocSetContextCreate(MemoryContext parent, --- 526,534 errdetail("Failed while creating memory context \"%s\".", name))); } + + update_allocation((MemoryContext) context, blksize); + block->aset = context; block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block->endptr = ((char *) block) + blksize; *** *** 590,595 AllocSetReset(MemoryContext context) --- 619,625 else { /* Normal case, release the block */ + update_allocation(context, -(block->endptr - ((char*) block))); #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block->freeptr - ((char *) block)); #endif *** *** 632,637 AllocSetDelete(MemoryContext context) --- 662,668 { AllocBlock next = block->next; + update_allocation(context, -(block->endptr - ((char*) block))); #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block->freeptr - ((char *) block)); #endif *** *** 678,683 AllocSetAlloc(MemoryContext context, Size size) --- 709,717 errmsg("out of memory"), errdetail("Failed on request of size %zu.", size))); } + + update_allocation(context, blksize); + block->aset = set; block->freeptr = block->endptr = ((char *) block) + blksize; *** *** 873,878 AllocSetAlloc(MemoryContext context, Size size) --- 907,914 errdetail("Failed on request of size %zu.", size))); } + update_allocation(context, blksize); + block->aset = set; block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block->endptr = ((char *) block) + blksize; *** *** 976,981 AllocSetFree(MemoryContext context, void *pointer
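The body of update_allocation() is cut off in the excerpt above; conceptually it only has to do something like the following (a sketch; the total_allocated field name is an assumption, and the real patch may also roll the delta up to parent contexts):

static void
update_allocation(MemoryContext context, int64 size)
{
    /* no-op unless this context opted in to tracking */
    if (!context->track_mem)
        return;

    /*
     * "size" is signed: positive when a block is allocated, negative when
     * one is released, matching the callers above.
     */
    context->total_allocated += size;
}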
Re: [HACKERS] pg_receivexlog add synchronous mode
On Thu, Aug 7, 2014 at 5:24 PM, wrote: >> Okay, applied the patch. >> >> I heavily modified your patch based on the master that the refactoring >> patch has been applied. Attached is that modified version. Could you >> review that? > > Thank you for the patch. > I did a review of the patch. > > No problem in the patch. Thanks for the review! Applied the patch. Regards, -- Fujii Masao
[HACKERS] Maximum number of WAL files in the pg_xlog directory
Hi, As part of our monitoring work for our customers, we stumbled upon an issue on our customers' servers that have wal_keep_segments set higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:

greatest(
  (2 + checkpoint_completion_target) * checkpoint_segments + 1,
  checkpoint_segments + wal_keep_segments + 1
)

And we get lots of alerts from the script for customers who set wal_keep_segments higher than 0. So we started to question this sentence of the documentation: "There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files." (http://www.postgresql.org/docs/9.3/static/wal-configuration.html)

While doing some tests, it appeared to be more like:

wal_keep_segments + (2 + checkpoint_completion_target) * checkpoint_segments + 1

But after reading the source code (src/backend/access/transam/xlog.c), the right formula seems to be:

wal_keep_segments + 2 * checkpoint_segments + 1

Here is how we arrived at this formula... CreateCheckPoint(..) is responsible, among other things, for deleting and recycling old WAL files. From src/backend/access/transam/xlog.c, master branch, line 8363:

/*
 * Delete old log files (those no longer needed even for previous
 * checkpoint or the standbys in XLOG streaming).
 */
if (_logSegNo)
{
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, recptr);
}

The KeepLogSeg(...) function takes care of wal_keep_segments. From src/backend/access/transam/xlog.c, master branch, line 8792:

/* compute limit for wal_keep_segments first */
if (wal_keep_segments > 0)
{
    /* avoid underflow, don't go below 1 */
    if (segno <= wal_keep_segments)
        segno = 1;
    else
        segno = segno - wal_keep_segments;
}

IOW, the segment number (segno) is decremented according to the setting of wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via _logSegNo. RemoveOldXlogFiles() gets this segment number so that it can remove or recycle all files before this segment number. This function gets the number of WAL files to recycle from the XLOGfileslop constant, which is defined as:

/*
 * XLOGfileslop is the maximum number of preallocated future XLOG segments.
 * When we are done with an old XLOG segment file, we will recycle it as a
 * future XLOG segment as long as there aren't already XLOGfileslop future
 * segments; else we'll delete it. This could be made a separate GUC
 * variable, but at present I think it's sufficient to hardwire it as
 * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free
 * no more than 2*CheckPointSegments log segments, and we want to recycle all
 * of them; the +1 allows boundary cases to happen without wasting a
 * delete/create-segment cycle.
 */
#define XLOGfileslop    (2*CheckPointSegments + 1)

(in src/backend/access/transam/xlog.c, master branch, line 100)

IOW, PostgreSQL will keep wal_keep_segments WAL files before the current WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones. Hence the formula:

wal_keep_segments + 2 * checkpoint_segments + 1

And this is what we usually find in our customers' servers. We may find more WAL files, depending on the write activity of the cluster, but on average we get this number of WAL files.
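To make that concrete: with checkpoint_segments = 3 and wal_keep_segments = 8, the documented formula predicts at most 12 files (3 + 8 + 1), while the formula above allows up to 15 (8 + 2*3 + 1). A monitoring query based on the latter could be as simple as the following sketch (it just reads the live settings; adjust to taste):

SELECT current_setting('wal_keep_segments')::int
       + 2 * current_setting('checkpoint_segments')::int
       + 1 AS expected_max_wal_files;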
AFAICT, the documentation is wrong about the usual number of WAL files in the pg_xlog directory. But I may be wrong, in which case the documentation isn't clear enough for me, and should be fixed so that others can't misinterpret it like I may have done. Any comments? Did I miss something, or should we fix the documentation? Thanks. -- Guillaume. http://blog.guillaume.lelarge.info http://www.dalibo.com