Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 20.12.2011 11:20, Magnus Hagander wrote: 2011/12/20 Tomas Vondra t...@fuzzy.cz: I haven't updated the docs yet - let's see if the patch is acceptable at all first. Again, without having reviewed the code, this looks like a feature we'd want, so please add some docs, and then submit it for the next commitfest! I've added the docs (see the attachment) and rebased to current head. Tomas diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index a12a9a2..3635c3f 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -691,6 +691,26 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_db_temp_files/function(typeoid/type)/literal/entry + entrytypebigint/type/entry + entry + Number of temporary files written for the database. All temporary files are + counted, regardless of why the temporary file was created (sorting or hash + join) or file size (log_temp_files does not affect this). + /entry + /row + + row + entryliteralfunctionpg_stat_get_db_temp_bytes/function(typeoid/type)/literal/entry + entrytypebigint/type/entry + entry + Amount of data written to temporary files for the database. All temporary + files are counted, regardless of why the temporary file was created (sorting + or hash join) or file size (log_temp_files does not affect this). 
+ /entry + /row + + row entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 2253ca8..55d20dc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -574,6 +574,8 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, pg_stat_get_db_conflict_all(D.oid) AS conflicts, +pg_stat_get_db_temp_files(D.oid) AS temp_files, +pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes, pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM pg_database D; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 24f4cde..97c7004 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -286,7 +286,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len); static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len); static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); - +static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); /* * Public functions called from postmaster follow @@ -1339,6 +1339,29 @@ pgstat_report_recovery_conflict(int reason) pgstat_send(msg, sizeof(msg)); } + +/* + * pgstat_report_tempfile() - + * + * Tell the collector about a temporary file. 
+ * + */ +void +pgstat_report_tempfile(size_t filesize) +{ + PgStat_MsgTempFile msg; + + if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts) + return; + + pgstat_setheader(msg.m_hdr, PGSTAT_MTYPE_TEMPFILE); + msg.m_databaseid = MyDatabaseId; + msg.m_filesize = filesize; + pgstat_send(msg, sizeof(msg)); +} + +; + /* -- * pgstat_ping() - * @@ -3185,6 +3208,10 @@ PgstatCollectorMain(int argc, char *argv[]) pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len); break; + case PGSTAT_MTYPE_TEMPFILE: + pgstat_recv_tempfile((PgStat_MsgTempFile *) msg, len); + break; + default: break; } @@ -3266,6 +3293,8 @@ pgstat_get_db_entry(Oid databaseid, bool create) result-n_conflict_snapshot = 0; result-n_conflict_bufferpin = 0; result-n_conflict_startup_deadlock = 0; + result-n_temp_files = 0; + result-n_temp_bytes = 0; result-stat_reset_timestamp = GetCurrentTimestamp(); @@ -4177,6 +4206,8 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len) dbentry-n_tuples_updated = 0; dbentry-n_tuples_deleted = 0; dbentry-last_autovac_time = 0; + dbentry-n_temp_bytes = 0; + dbentry-n_temp_files = 0; dbentry-stat_reset_timestamp = GetCurrentTimestamp(); @@ -4403,6 +4434,24 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len
[HACKERS] WIP: explain analyze with 'rows' but not timing
Hi all, most of the time I use auto_explain, all I need is the duration of the query and the plan with estimates and actual row counts. And it would be handy to be able to catch long-running queries with estimates that are significantly off (say 100x lower or higher than the actual row counts). The gettimeofday() calls are not exactly cheap in some cases, so why pay that price when all you need is the number of rows? The patch attached does this: 1) adds INSTRUMENT_ROWS, a new InstrumentOption - counts rows without timing (no gettimeofday() calls) - if you want timing info, use INSTRUMENT_TIMER 2) adds a new option TIMING to EXPLAIN, i.e. EXPLAIN (ANALYZE ON, TIMING ON) SELECT ... 3) adds auto_explain.log_rows_only (false by default) - if you set this to 'true', then the instrumentation will just count rows, without calling gettimeofday() It works quite well, except for one tiny issue - when log_rows_only is set to false (so that auto_explain requires timing), it silently overrides the EXPLAIN option. So even when the user explicitly disables timing (TIMING OFF), it's overridden and the explain collects the timing data. I could probably hide the timing info, but that'd make the issue even worse (the user would not notice that the timing was actually enabled). Maybe the right thing would be to explicitly disable timing for queries executed with EXPLAIN (TIMING OFF). Any other ideas how to make this work reasonably? The patch does not implement any checks (how far the estimate is from reality) yet; that'll be round two. 
regards Tomas diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index b320698..34015b8 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -23,6 +23,7 @@ static intauto_explain_log_min_duration = -1; /* msec or -1 */ static bool auto_explain_log_analyze = false; static bool auto_explain_log_verbose = false; static bool auto_explain_log_buffers = false; +static bool auto_explain_log_rows_only = false; static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT; static bool auto_explain_log_nested_statements = false; @@ -133,6 +134,17 @@ _PG_init(void) NULL, NULL); + DefineCustomBoolVariable(auto_explain.log_rows_only, +Do not collect timing data, just row number., +NULL, + auto_explain_log_rows_only, +false, +PGC_SUSET, +0, +NULL, +NULL, +NULL); + EmitWarningsOnPlaceholders(auto_explain); /* Install hooks. */ @@ -170,7 +182,11 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags) /* Enable per-node instrumentation iff log_analyze is required. 
*/ if (auto_explain_log_analyze (eflags EXEC_FLAG_EXPLAIN_ONLY) == 0) { - queryDesc-instrument_options |= INSTRUMENT_TIMER; + if (auto_explain_log_rows_only) + queryDesc-instrument_options |= INSTRUMENT_ROWS; + else + queryDesc-instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS); + if (auto_explain_log_buffers) queryDesc-instrument_options |= INSTRUMENT_BUFFERS; } diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index e38de5c..4e3f343 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -133,6 +133,8 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString, es.costs = defGetBoolean(opt); else if (strcmp(opt-defname, buffers) == 0) es.buffers = defGetBoolean(opt); + else if (strcmp(opt-defname, timing) == 0) + es.timing = defGetBoolean(opt); else if (strcmp(opt-defname, format) == 0) { char *p = defGetString(opt); @@ -229,6 +231,7 @@ ExplainInitState(ExplainState *es) /* Set default options. */ memset(es, 0, sizeof(ExplainState)); es-costs = true; + es-timing = true; /* Prepare output buffer. */ es-str = makeStringInfo(); } @@ -355,8 +358,11 @@ ExplainOnePlan(PlannedStmt *plannedstmt, ExplainState *es, int eflags; int
Re: [HACKERS] WIP: explain analyze with 'rows' but not timing
On 23.12.2011 14:57, Robert Haas wrote: 2011/12/22 Tomas Vondra t...@fuzzy.cz: The gettimeofday() calls are not exactly cheap in some cases, so why pay that price when all you need is the number of rows? Fair point. The patch attached does this: 1) adds INSTRUMENT_ROWS, a new InstrumentOption - counts rows without timing (no gettimeofday() calls) - if you want timing info, use INSTRUMENT_TIMER 2) adds a new option TIMING to EXPLAIN, i.e. EXPLAIN (ANALYZE ON, TIMING ON) SELECT ... 3) adds auto_explain.log_rows_only (false by default) - if you set this to 'true', then the instrumentation will just count rows, without calling gettimeofday() This seems like an unnecessarily confusing interface, because you've named the auto_explain option differently from the EXPLAIN option and given it (almost) the opposite sense: timing=off means the same thing as log_rows_only=on. I think the EXPLAIN (TIMING) option is good the way you have it, but then just have auto_explain.log_timing, with a default value of on. Makes sense. I've updated the patch to reflect this, so the option is called auto_explain.log_timing and is true by default. I'll add the patch to the next commitfest. One thing I'm wondering about is that the InstrumentOptions are not exclusive - INSTRUMENT_TIMER means 'collect timing and row counts' while INSTRUMENT_ROWS means 'collect row counts'. Wouldn't it be better to redefine INSTRUMENT_TIMER so that it collects just timing info? I.e. to get the current behaviour, you'd have to do this instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS) It's quite a trivial change in explain.c; the problem I have with that is that it might break extensions. 
Tomas diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index b320698..4b52b26 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -23,6 +23,7 @@ static intauto_explain_log_min_duration = -1; /* msec or -1 */ static bool auto_explain_log_analyze = false; static bool auto_explain_log_verbose = false; static bool auto_explain_log_buffers = false; +static bool auto_explain_log_timing = false; static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT; static bool auto_explain_log_nested_statements = false; @@ -133,6 +134,17 @@ _PG_init(void) NULL, NULL); + DefineCustomBoolVariable(auto_explain.log_timing, +Collect timing data, not just row counts., +NULL, + auto_explain_log_timing, +true, +PGC_SUSET, +0, +NULL, +NULL, +NULL); + EmitWarningsOnPlaceholders(auto_explain); /* Install hooks. */ @@ -170,7 +182,12 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags) /* Enable per-node instrumentation iff log_analyze is required. 
*/ if (auto_explain_log_analyze (eflags EXEC_FLAG_EXPLAIN_ONLY) == 0) { - queryDesc-instrument_options |= INSTRUMENT_TIMER; + if (auto_explain_log_timing) + queryDesc-instrument_options |= INSTRUMENT_TIMER; + else + queryDesc-instrument_options |= INSTRUMENT_ROWS; + + if (auto_explain_log_buffers) queryDesc-instrument_options |= INSTRUMENT_BUFFERS; } diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml index 8a9c9de..4488956 100644 --- a/doc/src/sgml/ref/explain.sgml +++ b/doc/src/sgml/ref/explain.sgml @@ -40,6 +40,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] replaceable class=parameterstatement/replac VERBOSE [ replaceable class=parameterboolean/replaceable ] COSTS [ replaceable class=parameterboolean/replaceable ] BUFFERS [ replaceable class=parameterboolean/replaceable ] +TIMING [ replaceable class=parameterboolean/replaceable ] FORMAT { TEXT | XML | JSON | YAML } /synopsis /refsynopsisdiv @@ -172,6 +173,20 @@ ROLLBACK; /varlistentry varlistentry +termliteralTIMING/literal/term +listitem + para + Include information on timing for each node. Specifically, include the + actual startup time and time spent in the node. This may involve a lot + of gettimeofday calls, and on some systems this means significant + slowdown
Re: [HACKERS] WIP: explain analyze with 'rows' but not timing
On 23.12.2011 16:14, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: One thing I'm wondering about is that the InstrumentOptions are not exclusive - INSTRUMENT_TIMER means 'collect timing and row counts' while INSTRUMENT_ROWS means 'collect row counts'. Wouldn't it be better to redefine INSTRUMENT_TIMER so that it collects just timing info? I.e. to get the current behaviour, you'd have to do this instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS) It's quite a trivial change in explain.c; the problem I have with that is that it might break extensions. I'm not especially concerned by that angle --- we make bigger API changes all the time. But you could change the name, eg instrument_options |= (INSTRUMENT_TIMING | INSTRUMENT_ROWS) and then #define INSTRUMENT_TIMER as the OR of the two real bits for backward compatibility. OK, that seems like a good solution. But is it worth the additional complexity in explain.c? The motivation for this patch was that collecting timing data often causes performance issues, and in some cases it's not needed. But is this true for row counts? Are there machines where collecting row counts is above the noise level? I've never seen that, but that's not proof of nonexistence. If the overhead of this is negligible, then I could just hide the row counts from the output. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
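Tom's backward-compatibility suggestion can be illustrated with a small sketch (illustrative flag values, not the real PostgreSQL headers):

```c
#include <assert.h>

/* The two "real" bits, with the old name kept as an alias: */
#define INSTRUMENT_TIMING  (1 << 0)   /* collect timing data */
#define INSTRUMENT_ROWS    (1 << 1)   /* collect row counts */
/* Backward-compatible alias: extensions passing INSTRUMENT_TIMER keep
 * getting timing *and* row counts, exactly as before the split. */
#define INSTRUMENT_TIMER   (INSTRUMENT_TIMING | INSTRUMENT_ROWS)

static int
needs_timing(int instrument_options)
{
    return (instrument_options & INSTRUMENT_TIMING) != 0;
}

static int
needs_rows(int instrument_options)
{
    return (instrument_options & INSTRUMENT_ROWS) != 0;
}
```

Old code keeps working unchanged, while new code can request either bit on its own.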
Re: [HACKERS] WIP: explain analyze with 'rows' but not timing
On 23.12.2011 22:37, Pavel Stehule wrote: 2011/12/23 Tom Lane t...@sss.pgh.pa.us: Tomas Vondra t...@fuzzy.cz writes: The motivation for this patch was that collecting timing data often causes performance issues and in some cases it's not needed. But is this true for row counts? Perhaps more to the point, is there a use case for collecting timing data without row counts? I find it hard to visualize a valid reason. yes - searching for bad predictions But that's the purpose of collecting row counts without timing data. TL is asking about the opposite case - collecting timing data without row counts. I can't imagine such a use case ... Tomas
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
On 2 January 2012, 14:11, Pavel Stehule wrote: Hello, this is a relatively simple patch that adds the possibility to skip nonexistent tables. It is necessary for silent cleanup when a dump is loaded. Just a curious question - what use case is solved by this? Under what circumstances do you get an ALTER TABLE without a preceding CREATE TABLE? I can't think of such a scenario ... Or is this meant for scripts written manually, so that it's possible to do ALTER if the table already exists and CREATE if it does not? Tomas
Re: [HACKERS] pgstat wait timeout
On 4 January 2012, 13:17, pratikchirania wrote: I have installed a RAM disk and pointed the parameter: #stats_temp_directory = 'B:\pg_stat_tmp' I also tried #stats_temp_directory = 'B:/pg_stat_tmp' But, still there is no file created in the RAM disk. The previous stat file is touched even after the change is made. (I have restarted the service after effecting the change) You have to remove the '#' at the beginning; with it there, the whole line is commented out. Tomas
[HACKERS] easy way of copying regex_t ?
Hi all, I've been working on an extension that moves the ispell dictionaries to a shared memory segment. It's almost complete; the last FIXME is about copying the regex_t structure (stored in AFFIX). According to regex.h the structure is fairly complex and not exactly easy to understand, so I'd like to know if anyone here has already implemented that, or something that may serve the same purpose. Any ideas? kind regards Tomas
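Not an answer from the thread, but one way to sidestep the copy problem entirely: keep the pattern text (plain, trivially copyable data) in the shared segment and recompile lazily in each backend, rather than deep-copying the opaque regex_t. A sketch using the POSIX <regex.h> API as a stand-in (PostgreSQL ships its own regex implementation, so the real types and calls differ; SharedAffix and the field layout are invented for illustration):

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical affix entry: the pattern text lives in shared memory,
 * the compiled regex_t is rebuilt per process on first use. */
typedef struct
{
    char    pattern[64];   /* shared, copyable plain data */
    int     cflags;        /* flags for regcomp() */
    int     compiled;      /* per-process: has re been built yet? */
    regex_t re;            /* per-process compiled form */
} SharedAffix;

static int
affix_matches(SharedAffix *a, const char *word)
{
    if (!a->compiled)
    {
        if (regcomp(&a->re, a->pattern, a->cflags) != 0)
            return 0;      /* compilation failed: treat as no match */
        a->compiled = 1;
    }
    return regexec(&a->re, word, 0, NULL, 0) == 0;
}
```

The price is one regcomp() per backend instead of a deep copy, which may be acceptable if a dictionary is attached once per connection anyway.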
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
On 13.1.2012 18:07, Josh Berkus wrote: Eric's review follows: Compiling on Ubuntu 10.04 LTS AMD64 on a GoGrid virtual machine from 2012-01-12 git checkout. Patch applied fine. 'make check' results in failures when this patch is put into place. 6 of 128 tests failed. Here are the relevant failures. parallel group (2 tests): create_view create_index create_index ... FAILED parallel group (9 tests): create_cast create_aggregate drop_if_exists typed_table vacuum constraints create_table_like triggers inherit inherit ... FAILED parallel group (20 tests): select_distinct_on btree_index update random select_distinct select_having namespace delete case hash_index union select_implicit select_into transactions portals subselect arrays aggregates join prepared_xacts join ... FAILED aggregates ... FAILED parallel group (15 tests): combocid portals_p2 advisory_lock xmlmap tsdicts guc functional_deps dependency select_views cluster tsearch window foreign_data foreign_key bitmapops foreign_data ... FAILED parallel group (19 tests): limit prepare copy2 conversion xml plancache returning temp sequence without_oid with rowtypes truncate polymorphism domain largeobject rangefuncs alter_table plpgsql alter_table ... FAILED Fixed. The default value of the TIMING option did not work as intended; it was set to true even for plain EXPLAIN (without ANALYZE), in which case the EXPLAIN failed. 
Tomas diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index 61da6a2..9fc0ae1 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -23,6 +23,7 @@ static intauto_explain_log_min_duration = -1; /* msec or -1 */ static bool auto_explain_log_analyze = false; static bool auto_explain_log_verbose = false; static bool auto_explain_log_buffers = false; +static bool auto_explain_log_timing = false; static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT; static bool auto_explain_log_nested_statements = false; @@ -133,6 +134,17 @@ _PG_init(void) NULL, NULL); + DefineCustomBoolVariable(auto_explain.log_timing, +Collect timing data, not just row counts., +NULL, + auto_explain_log_timing, +true, +PGC_SUSET, +0, +NULL, +NULL, +NULL); + EmitWarningsOnPlaceholders(auto_explain); /* Install hooks. */ @@ -170,7 +182,12 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags) /* Enable per-node instrumentation iff log_analyze is required. 
*/ if (auto_explain_log_analyze (eflags EXEC_FLAG_EXPLAIN_ONLY) == 0) { - queryDesc-instrument_options |= INSTRUMENT_TIMER; + if (auto_explain_log_timing) + queryDesc-instrument_options |= INSTRUMENT_TIMER; + else + queryDesc-instrument_options |= INSTRUMENT_ROWS; + + if (auto_explain_log_buffers) queryDesc-instrument_options |= INSTRUMENT_BUFFERS; } diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml index 8a9c9de..4488956 100644 --- a/doc/src/sgml/ref/explain.sgml +++ b/doc/src/sgml/ref/explain.sgml @@ -40,6 +40,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] replaceable class=parameterstatement/replac VERBOSE [ replaceable class=parameterboolean/replaceable ] COSTS [ replaceable class=parameterboolean/replaceable ] BUFFERS [ replaceable class=parameterboolean/replaceable ] +TIMING [ replaceable class=parameterboolean/replaceable ] FORMAT { TEXT | XML | JSON | YAML } /synopsis /refsynopsisdiv @@ -172,6 +173,20 @@ ROLLBACK; /varlistentry varlistentry +termliteralTIMING/literal/term +listitem + para + Include information on timing for each node. Specifically, include the + actual startup time and time spent in the node. This may involve a lot + of gettimeofday calls, and on some systems this means significant + slowdown of the query. + This parameter may only be used when literalANALYZE/literal is also + enabled. It defaults to
Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 20.12.2011 19:59, Tomas Vondra wrote: On 20.12.2011 11:20, Magnus Hagander wrote: 2011/12/20 Tomas Vondra t...@fuzzy.cz: I haven't updated the docs yet - let's see if the patch is acceptable at all first. Again, without having reviewed the code, this looks like a feature we'd want, so please add some docs, and then submit it for the next commitfest! I've added the docs (see the attachment) and rebased to current head. Tomas Fixed a failing regression test (check of pg_stat_database structure). Tomas diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index a12a9a2..3635c3f 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -691,6 +691,26 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_db_temp_files/function(typeoid/type)/literal/entry + entrytypebigint/type/entry + entry + Number of temporary files written for the database. All temporary files are + counted, regardless of why the temporary file was created (sorting or hash + join) or file size (log_temp_files does not affect this). + /entry + /row + + row + entryliteralfunctionpg_stat_get_db_temp_bytes/function(typeoid/type)/literal/entry + entrytypebigint/type/entry + entry + Amount of data written to temporary files for the database. All temporary + files are counted, regardless of why the temporary file was created (sorting + or hash join) or file size (log_temp_files does not affect this). 
+ /entry + /row + + row entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 50ba20c..3a0dca3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -574,6 +574,8 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, pg_stat_get_db_conflict_all(D.oid) AS conflicts, +pg_stat_get_db_temp_files(D.oid) AS temp_files, +pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes, pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM pg_database D; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 323d42b..6a2ff1d 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -286,7 +286,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len); static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len); static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); - +static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); /* * Public functions called from postmaster follow @@ -1339,6 +1339,29 @@ pgstat_report_recovery_conflict(int reason) pgstat_send(msg, sizeof(msg)); } + +/* + * pgstat_report_tempfile() - + * + * Tell the collector about a temporary file. 
+ * + */ +void +pgstat_report_tempfile(size_t filesize) +{ + PgStat_MsgTempFile msg; + + if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts) + return; + + pgstat_setheader(msg.m_hdr, PGSTAT_MTYPE_TEMPFILE); + msg.m_databaseid = MyDatabaseId; + msg.m_filesize = filesize; + pgstat_send(msg, sizeof(msg)); +} + +; + /* -- * pgstat_ping() - * @@ -3185,6 +3208,10 @@ PgstatCollectorMain(int argc, char *argv[]) pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len); break; + case PGSTAT_MTYPE_TEMPFILE: + pgstat_recv_tempfile((PgStat_MsgTempFile *) msg, len); + break; + default: break; } @@ -3266,6 +3293,8 @@ pgstat_get_db_entry(Oid databaseid, bool create) result-n_conflict_snapshot = 0; result-n_conflict_bufferpin = 0; result-n_conflict_startup_deadlock = 0; + result-n_temp_files = 0; + result-n_temp_bytes = 0; result-stat_reset_timestamp = GetCurrentTimestamp(); @@ -4177,6 +4206,8 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len) dbentry-n_tuples_updated = 0; dbentry-n_tuples_deleted = 0; dbentry-last_autovac_time = 0; + dbentry-n_temp_bytes = 0; + dbentry-n_temp_files = 0; dbentry-stat_reset_timestamp
Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 20 January 2012, 13:23, Magnus Hagander wrote: On Tue, Jan 17, 2012 at 21:39, Tomas Vondra t...@fuzzy.cz wrote: On 20.12.2011 19:59, Tomas Vondra wrote: On 20.12.2011 11:20, Magnus Hagander wrote: 2011/12/20 Tomas Vondra t...@fuzzy.cz: I haven't updated the docs yet - let's see if the patch is acceptable at all first. Again, without having reviewed the code, this looks like a feature we'd want, so please add some docs, and then submit it for the next commitfest! I've added the docs (see the attachment) and rebased to current head. Tomas Fixed a failing regression test (check of pg_stat_database structure). I'm wondering if this (and also my deadlocks stats patch that's in the queue) should, instead of inventing new pgstats messages, add fields to the tabstat message. It sounds like that one is just for tables, but it's already the one collecting info about commits and rollbacks, and it's already sent on every commit. Hmmm, I'm not against that, but I'd recommend changing the message name to something that reflects reality. If it's not just about table statistics, it should not be named 'tabstats' IMHO. Or maybe split that into two messages, both sent at commit time. I do like the idea of not sending the message for each temp file, though. One thing that worries me is long-running transactions (think about a batch process that runs for several hours within a single transaction). By sending the data only at commit, such transactions would not be accounted properly. So I'd suggest sending the message either at commit time or after collecting enough data (increment a counter whenever the struct is updated and send a message when the counter reaches N, for a reasonable value of N, say 20). But maybe it already works that way - I haven't checked the current 'tabstat' implementation. Adding two fields to that one would add some extra bytes on every send, but I wonder if that would ever affect performance, given the total size of the packet? 
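The "at commit or every N updates" idea could look roughly like this (a standalone sketch; LocalDbStats, FLUSH_THRESHOLD and the function names are made up for illustration, not the pgstat API):

```c
#include <assert.h>
#include <stddef.h>

#define FLUSH_THRESHOLD 20   /* the "N" from the discussion; purely illustrative */

typedef struct
{
    long temp_files;   /* counters accumulated locally in the backend */
    long temp_bytes;
    int  pending;      /* updates since the last message */
} LocalDbStats;

static long messages_sent = 0;   /* stand-in for actual collector traffic */

static void
flush_stats(LocalDbStats *s)
{
    if (s->pending > 0)
    {
        messages_sent++;   /* here a real pgstat message would be sent */
        s->pending = 0;
    }
}

static void
report_temp_file(LocalDbStats *s, size_t filesize)
{
    s->temp_files++;
    s->temp_bytes += (long) filesize;
    /* Flush early so hours-long transactions still show up in the stats. */
    if (++s->pending >= FLUSH_THRESHOLD)
        flush_stats(s);
}

static void
on_commit(LocalDbStats *s)
{
    flush_stats(s);   /* whatever is left goes out at commit time */
}
```

Batching keeps message traffic bounded while the periodic flush avoids the "huge spike at final commit" problem discussed below in the thread for long-running transactions.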
And it would certainly be lower overhead in the cases that there *have* been temp tables used. It's not about temp tables, it's about temp files. Which IMHO implies that there would be exactly 0.01% performance difference because temporary files are quite expensive. Tomas
Re: [HACKERS] Finer Extension dependencies
On 21 January 2012, 18:20, Dimitri Fontaine wrote: Hi, Thank you for reviewing this patch! Hitoshi Harada umi.tan...@gmail.com writes: Next, some questions: - Why is the finer dependency needed? Do you have a tangible example that struggles with the dependency granularity? I feel quite good about the existing dependency on extensions, as a developer of several of them. The problem is not yet very apparent only because extensions are very new. The main thing we address with this patch is depending on a feature that appeared while developing an extension, or one that gets removed down the line. It allows depending on features, avoiding the need to compare version numbers and maintain a list of which version number provides which feature. This feature has been requested by several extension users, even before 9.1 was released. It's also about several extensions providing the same sort of functionality implemented in different ways. Think about extensions that allow you to send e-mails right from the database. One could do that directly, another could implement queuing, and another could be optimized for a specific SMTP server. With features, you could just say 'require = mail' and use whichever such extension is installed. Yes, this needs a common API. I personally see this as a major step towards fully-fledged package management, similar to what's available in modern Linux distributions (apt, yum, portage, ...). Tomas
Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 21 January 2012, 18:13, Magnus Hagander wrote: On Sat, Jan 21, 2012 at 18:02, Tomas Vondra t...@fuzzy.cz wrote: On 20 January 2012, 13:23, Magnus Hagander wrote: I'm wondering if this (and also my deadlocks stats patch that's in the queue) should, instead of inventing new pgstats messages, add fields to the tabstat message. It sounds like that one is just for tables, but it's already the one collecting info about commits and rollbacks, and it's already sent on every commit. Hmmm, I'm not against that, but I'd recommend changing the message name to something that reflects reality. If it's not just about table statistics, it should not be named 'tabstats' IMHO. Or maybe split that into two messages, both sent at commit time. Yes, renaming it might be a good idea. If you split it into two messages, that would defeat much of the point. Not really, if combined with the 'send after each N updates or at transaction end' approach, because then those parts might be sent independently (thus not transmitting the whole message). Though I just looked at the tabstat code again, and we already split that message up at regular intervals. Which would make it quite weird to have global counters in it as well. But instead of there, perhaps we need a general non-table message, carrying more than one type of data, sent out at the same time. There is other stuff in the queue for it. I'm not convinced either way - I'm not against the original way in your patch either. I just wanted to throw the idea out there, and was hoping somebody else would have an opinion too :-) Let's do that in a separate patch. It seems like a good thing to do, but I don't think it should be mixed with this patch. I do like the idea of not sending the message for each temp file, though. One thing that worries me is long-running transactions (think about a batch process that runs for several hours within a single transaction). By sending the data only at commit, such transactions would not be accounted properly. 
So I'd suggest sending the message either at commit By that argument they are *already* not accounted properly, because the number of rows and those counters are wrong. By sending the temp file data in the middle of the transaction, you won't be able to correlate those numbers with the temp file usage. I'm not saying the other use case isn't more common, but whichever way you do it, it's going to be inconsistent with *something*. The question is whether the patch should be modified to send the message only at commit (i.e. just like tabstats) or if it can remain the way it is now. And maybe we could change the way all those messages are sent (e.g. to the regular submissions I've proposed in the previous post) to make them more consistent. I'm not asking for 100% consistency. What I don't like is a batch process consuming resources but not being reflected in the stats until the final commit, with a huge spike in the stats right after it, although the activity was actually spread over several hours. Tomas
Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 26 January 2012, 14:44, Magnus Hagander wrote: On Sat, Jan 21, 2012 at 20:12, Tomas Vondra t...@fuzzy.cz wrote: On 21 January 2012, 18:13, Magnus Hagander wrote: On Sat, Jan 21, 2012 at 18:02, Tomas Vondra t...@fuzzy.cz wrote: Though I just looked at the tabstat code again, and we already split that message up at regular intervals. Which would make it quite weird to have global counters in it as well. But instead of there, perhaps we need a general non-table message, carrying more than one type of data, sent out at the same time. There is other stuff in the queue for it. I'm not convinced either way - I'm not against the original way in your patch either. I just wanted to throw the idea out there, and was hoping somebody else would have an opinion too :-) Let's do that in a separate patch. It seems like a good thing to do, but I don't think it should be mixed with this patch. Yeah, I'm not sure we even want to do that yet, when we go down this road. We might eventually, of course. Yes, that's one of the reasons why I suggested doing that in a separate patch (which might be rejected if we find it's a bad idea). I've cleaned up the patch a bit and applied it - thanks! Other than cosmetics, I changed the text in the description around a bit (so if that's worse now, that's purely my fault) and added docs to the section about pg_stat_database that you'd forgotten. Great. Thanks for the fixes. Also, I'd like to point you in the direction of the src/include/catalog/find_unused_oids and duplicate_oids scripts. The oids you'd picked conflicted with the pg_extension catalog from a year ago. Easy enough to fix, and there are often conflicts with more recent oids that the committer has to deal with anyway, but cleaning those up before submission always helps a little bit. For the next one :-) Aha! I've been wondering how the committers identify duplicate oids etc., but I was unaware of those scripts. 
Tomas
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
On 3 February 2012, 17:12, Robert Haas wrote: On Sat, Jan 21, 2012 at 10:32 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Jan 13, 2012 at 3:07 PM, Tomas Vondra t...@fuzzy.cz wrote: Fixed. The default value of the TIMING option did not work as intended, it was set to true even for plain EXPLAIN (without ANALYZE). In that case the EXPLAIN failed. I've applied this over the "show Heap Fetches in EXPLAIN ANALYZE output for index-only scans" patch. It applied and built and passes installcheck. I have not done a full review of this, but I want to say that I want this very much. Not only can I get the row counts, but also the Heap Fetches and the output of BUFFERS, without the overhead of timing. I haven't looked at contrib aspects of it at all. Thank you for making this. +1. After looking at this a bit, I think we can make the interface a bit more convenient. I'd like to propose that we call the EXPLAIN option ROWS rather than TIMING, and that we provide the row-count information if *either* ROWS or ANALYZE is specified. Then, if you want rows without timing, you can say this: EXPLAIN (ROWS) query... Rather than, as in the approach taken by the current patch: EXPLAIN (ANALYZE, TIMING off) query... ...which is a little more long-winded. This also has the nice advantage of making the name of the EXPLAIN option match the name of the INSTRUMENT flag. Revised patch along those lines attached. Barring objections, I'll commit this version. I don't think changing the EXPLAIN syntax this way is a good idea. I think that one option should not silently enable/disable others, so ROWS should not enable ANALYZE automatically. It should throw an error just like the TIMING option does when invoked without ANALYZE (IIRC). Which leads to syntax like this EXPLAIN (ANALYZE, ROWS) and that's a bit strange because it mentions ROWS but actually manipulates TIMING behind the curtains. You can't actually do EXPLAIN (ANALYZE, ROWS off) because row counts are always collected.
So I think the syntax I proposed makes more sense. kind regards Tomas
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
On 3.2.2012 18:28, Robert Haas wrote: On Fri, Feb 3, 2012 at 12:08 PM, Tomas Vondra t...@fuzzy.cz wrote: I don't think changing the EXPLAIN syntax this way is a good idea. I think that one option should not silently enable/disable others, so ROWS should not enable ANALYZE automatically. I didn't propose that. The point is that the desired behavior (however we name it) is a SUBSET of what ANALYZE does. So we can either: 1. Have ANALYZE enable all the behavior, and have another option (TIMING) that can be used to turn some of it back off again. 2. Have ANALYZE enable all the behavior, and have another option (ROWS) that enables just the subset of it that we want. I prefer #2 to #1, because I think it's a bit confusing to have an option whose effect is to partially disable another option. YMMV, of course. OK, thanks for the explanation. I don't like the idea of subsets as it IMHO makes it less obvious what options are enabled. For example this EXPLAIN (ROWS) query... does not immediately show it's actually going to do ANALYZE. I prefer to keep the current 'ANALYZE' definition, i.e. collecting both row counts and timing data (which is what 99% of people want anyway), and an option to disable the timing. And the BUFFERS option currently works exactly like that, so defining ROWS the way you proposed would be inconsistent with the current behavior. Sure, we could redefine BUFFERS as a subset, so you could do EXPLAIN (ROWS) ... instead of ... EXPLAIN (ANALYZE, TIMING off) EXPLAIN (BUFFERS) ... instead of ... EXPLAIN (ANALYZE, BUFFERS on) but what if someone wants both at the same time? Maybe he could do EXPLAIN (ROWS, BUFFERS) and treat that as a union of those subsets. I don't think it's worth it. I surely can live with both solutions (mine or the one you proposed). kind regards Tomas
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
On 3.2.2012 22:51, Robert Haas wrote: On Fri, Feb 3, 2012 at 2:56 PM, Tomas Vondra t...@fuzzy.cz wrote: OK, thanks for the explanation. I don't like the idea of subsets as it IMHO makes it less obvious what options are enabled. For example this EXPLAIN (ROWS) query... does not immediately show it's actually going to do ANALYZE. Well, it isn't, if ANALYZE means rows + timing... Yeah, sure. It depends on how you define ANALYZE. I see that as 'stats collected at runtime' (in contrast to a query plan, which is just a ... well, plan). but what if someone wants both at the same time? Maybe he could do EXPLAIN (ROWS, BUFFERS) and treat that as a union of those subsets. I don't think it's worth it. Yeah, I forgot that we'd have to allow that, though I don't think it would be a big deal to fix that. I surely can live with both solutions (mine or the one you proposed). Let's wait and see if anyone else has an opinion. Sure. And thanks for your feedback! Tomas
Re: [HACKERS] performance-test farm
On 12.5.2011 08:54, Greg Smith wrote: Tomas Vondra wrote: The idea is that buildfarm systems that are known to have a) reasonable hardware and b) no other concurrent work going on could also do performance tests. The main benefit of this approach is it avoids duplicating all of the system management and source code building work needed for any sort of thing like this; just leverage the buildfarm parts when they solve similar enough problems. Someone has actually done all that already; source code was last sync'd to the build farm master at the end of March: https://github.com/greg2ndQuadrant/client-code By far the #1 thing needed to move this forward from where it's stuck at now is someone willing to dig into the web application side of this. We're collecting useful data. It now needs to be uploaded to the server, saved, and then reports of what happened generated. Eventually graphs of performance results over time will be straightforward to generate. But the whole idea requires someone else (not Andrew, who has enough to do) to sit down and figure out how to extend the web UI with these new elements. Hi, I'd like to revive this thread. A few days ago we finally got our buildfarm member working (it's called magpie) - it's spending ~2h a day chewing on the buildfarm tasks, so we can use the other 22h to do some useful work. I suppose most of you are busy with 9.2 features, but I'm not, so I'd like to spend my time on this. Now that I had to set up the buildfarm member I'm somewhat aware of how the buildfarm works. I've checked the PGBuildFarm/server-code and greg2ndQuadrant/client-code repositories and while I certainly am not a perl whiz, I believe I can tweak it to handle the performance-related results too. What is the current state of this effort? Is there someone else working on that?
If not, I propose this (for starters):

* add a new page "Performance results" to the menu, with a list of members that uploaded performance results
* for each member, there will be a list of tests along with a running average for each test, the last test, and an indicator whether it improved, got worse or stayed the same
* for each member/test, a history of runs will be displayed, along with a simple graph

I'm not quite sure how to define which members will run the performance tests - I see two options:

* for each member, add a "run performance tests" flag so that we can choose which members are supposed to be safe, OR
* run the tests on all members (if enabled in build-farm.conf) and then decide which results are relevant based on data describing the environment (collected when running the tests)

I'm also wondering if

* using the buildfarm infrastructure is the right thing to do, if it can provide some 'advanced features' (see below)
* we should use the current buildfarm members (although maybe not all of them)
* it can handle one member running the tests with different settings (various shared_buffers/work_mem sizes, number of clients etc.) and various hw configurations (for example magpie contains a regular SATA drive as well as an SSD - it would be nice to run two sets of tests, one for the spinner, one for the SSD)
* this can handle 'pushing' a list of commits to test (instead of just testing the HEAD) so that we can ask the members to run the tests for particular commits in the past (I consider this to be a very handy feature)

regards Tomas
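The per-test summary proposed above (a running average, the last result, and an improved/worse/same indicator) could work roughly as follows; a minimal Python sketch, where the summarize() helper and the 5% noise threshold are my assumptions, not anything from the buildfarm code:

```python
# Hypothetical sketch of the proposed "Performance results" summary:
# for each test, compare the latest run against the running average of
# the previous runs. The 5% noise band is an assumed threshold.

def summarize(results, threshold=0.05):
    """results: list of benchmark numbers (e.g. tps), oldest first."""
    history, latest = results[:-1], results[-1]
    avg = sum(history) / len(history)
    if latest > avg * (1 + threshold):
        trend = "improved"
    elif latest < avg * (1 - threshold):
        trend = "got worse"
    else:
        trend = "same"
    return avg, latest, trend

avg, latest, trend = summarize([1200, 1180, 1210, 1050])
print(f"avg={avg:.0f} last={latest} -> {trend}")
```

A real implementation would also have to decide how many historical runs feed the average and how to treat configuration changes between runs.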
[HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)
Hi, I've noticed a locale-specific bug in regression tests that I discovered thanks to the new magpie buildfarm member (testing cs_CZ locale). The problem is in foreign_data where the output is sorted by a column, and cs_CZ behaves differently from C and en_US. More precisely, in C it's true that ('s4' < 'sc') but that's not true in cs_CZ (and supposedly some other locales). I've fixed this by replacing 'sc' with 't0' which seems to fix the ordering (and should work with other locales too). See the patch attached. kind regards Tomas diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out index 2b3eddf..3f6bce2 100644 --- a/src/test/regress/expected/foreign_data.out +++ b/src/test/regress/expected/foreign_data.out @@ -631,25 +631,25 @@ DROP SERVER s7; -- CREATE FOREIGN TABLE CREATE SCHEMA foreign_schema; -CREATE SERVER sc FOREIGN DATA WRAPPER dummy; +CREATE SERVER t0 FOREIGN DATA WRAPPER dummy; CREATE FOREIGN TABLE ft1 ();-- ERROR ERROR: syntax error at or near ; LINE 1: CREATE FOREIGN TABLE ft1 (); ^ CREATE FOREIGN TABLE ft1 () SERVER no_server; -- ERROR ERROR: server no_server does not exist -CREATE FOREIGN TABLE ft1 (c1 serial) SERVER sc; -- ERROR +CREATE FOREIGN TABLE ft1 (c1 serial) SERVER t0; -- ERROR NOTICE: CREATE FOREIGN TABLE will create implicit sequence ft1_c1_seq for serial column ft1.c1 ERROR: default values on foreign tables are not supported -CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;-- ERROR +CREATE FOREIGN TABLE ft1 () SERVER t0 WITH OIDS;-- ERROR ERROR: syntax error at or near WITH OIDS -LINE 1: CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS; +LINE 1: CREATE FOREIGN TABLE ft1 () SERVER t0 WITH OIDS; ^ CREATE FOREIGN TABLE ft1 ( c1 integer NOT NULL, c2 text, c3 date -) SERVER sc OPTIONS (delimiter ',', quote ''); +) SERVER t0 OPTIONS (delimiter ',', quote ''); COMMENT ON FOREIGN TABLE ft1 IS 'ft1'; COMMENT ON COLUMN ft1.c1 IS 'ft1.c1'; \d+ ft1 @@ -659,14 +659,14 @@ COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
c1 | integer | not null | plain| ft1.c1 c2 | text| | extended | c3 | date| | plain| -Server: sc +Server: t0 Has OIDs: no \det+ List of foreign tables Schema | Table | Server | Options +---++ - public | ft1 | sc | {delimiter=,,quote=\} + public | ft1 | t0 | {delimiter=,,quote=\} (1 row) CREATE INDEX id_ft1_c2 ON ft1 (c2); -- ERROR @@ -737,7 +737,7 @@ Foreign table foreign_schema.foreign_table_1 c7 | integer | c8 | text| c10 | integer | -Server: sc +Server: t0 -- Information schema SELECT * FROM information_schema.foreign_data_wrappers ORDER BY 1, 2; @@ -761,7 +761,7 @@ SELECT * FROM information_schema.foreign_servers ORDER BY 1, 2; regression | s5 | regression | foo | | 15.0 | regress_test_role regression | s6 | regression | foo | | 16.0 | regress_test_indirect regression | s8 | regression | postgresql| || foreign_data_user - regression | sc | regression | dummy | || foreign_data_user + regression | t0 | regression | dummy | || foreign_data_user regression | t1 | regression | foo | || regress_test_indirect regression | t2 | regression | foo | || regress_test_role (7 rows) @@ -823,7 +823,7 @@ SELECT * FROM information_schema.role_usage_grants WHERE object_type LIKE 'FOREI SELECT * FROM information_schema.foreign_tables ORDER BY 1, 2, 3; foreign_table_catalog | foreign_table_schema | foreign_table_name | foreign_server_catalog | foreign_server_name ---+--+++- - regression| foreign_schema | foreign_table_1| regression | sc + regression| foreign_schema | foreign_table_1| regression | t0 (1 row)
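The collation difference behind this fix can be illustrated outside the regression tests. A small Python sketch: byte-wise comparison (as in the C locale) puts 's4' before 'sc', while a locale-aware collation such as cs_CZ may not; since cs_CZ may not be installed on a given system, the locale switch is only attempted:

```python
# In the C locale, strings compare byte-wise, so 's4' < 'sc'
# (the digit '4' = 0x34 sorts before the letter 'c' = 0x63).
# Locale-aware collations such as cs_CZ may order these differently,
# which is exactly what broke the ORDER BY in the regression test.
import locale

print('s4' < 'sc')  # byte-wise (C-locale-like) comparison: True

try:
    locale.setlocale(locale.LC_COLLATE, 'cs_CZ.UTF-8')
    # strcoll > 0 would mean 's4' sorts after 'sc' under cs_CZ
    print(locale.strcoll('s4', 'sc'))
except locale.Error:
    print('cs_CZ locale not installed on this system')
```

This is why regression tests that ORDER BY string columns should avoid names whose relative order depends on the locale.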
Re: [HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)
On 7.3.2012 17:56, Robert Haas wrote: On Tue, Mar 6, 2012 at 1:59 PM, Tomas Vondra t...@fuzzy.cz wrote: I've noticed a locale-specific bug in regression tests that I discovered thanks to the new magpie buildfarm member (testing cs_CZ locale). The problem is in foreign_data where the output is sorted by a column, and cs_CZ behaves differently from C and en_US. More precisely, in C it's true that ('s4' < 'sc') but that's not true in cs_CZ (and supposedly some other locales). I've fixed this by replacing 'sc' with 't0' which seems to fix the ordering (and should work with other locales too). See the patch attached. This was fixed on master in commit 3e9a2672d25aed15ae6b4a09decbd8927d069868, but that picked the name s0 rather than t0. I suggest we make the same naming decision in the back-branch to avoid future confusion... Yes, that's a better solution - I hadn't noticed that commit. Should I prepare a modified patch or is it possible to apply the fix from master to this branch? Tomas
Re: [HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)
On 7.3.2012 21:39, Robert Haas wrote: On Wed, Mar 7, 2012 at 3:08 PM, Tomas Vondra t...@fuzzy.cz wrote: On 7.3.2012 17:56, Robert Haas wrote: On Tue, Mar 6, 2012 at 1:59 PM, Tomas Vondra t...@fuzzy.cz wrote: I've noticed a locale-specific bug in regression tests that I discovered thanks to the new magpie buildfarm member (testing cs_CZ locale). The problem is in foreign_data where the output is sorted by a column, and cs_CZ behaves differently from C and en_US. More precisely, in C it's true that ('s4' < 'sc') but that's not true in cs_CZ (and supposedly some other locales). I've fixed this by replacing 'sc' with 't0' which seems to fix the ordering (and should work with other locales too). See the patch attached. This was fixed on master in commit 3e9a2672d25aed15ae6b4a09decbd8927d069868, but that picked the name s0 rather than t0. I suggest we make the same naming decision in the back-branch to avoid future confusion... Yes, that's a better solution - I hadn't noticed that commit. Should I prepare a modified patch or is it possible to apply the fix from master to this branch? I tried to cherry-pick it, but there were conflicts, so I guess someone will need to go through and adjust. It's probably only 10 minutes work, but if you don't mind doing it, I'd be grateful. Ok, so here's a fixed patch. I haven't used the 3e9a2672 commit directly, because there seem to be additional changes. I've simply renamed the 'sc' to 's0' and fixed the differences in output.
Tomas diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out index 2b3eddf..2d7e884 100644 --- a/src/test/regress/expected/foreign_data.out +++ b/src/test/regress/expected/foreign_data.out @@ -631,25 +631,25 @@ DROP SERVER s7; -- CREATE FOREIGN TABLE CREATE SCHEMA foreign_schema; -CREATE SERVER sc FOREIGN DATA WRAPPER dummy; +CREATE SERVER s0 FOREIGN DATA WRAPPER dummy; CREATE FOREIGN TABLE ft1 ();-- ERROR ERROR: syntax error at or near ; LINE 1: CREATE FOREIGN TABLE ft1 (); ^ CREATE FOREIGN TABLE ft1 () SERVER no_server; -- ERROR ERROR: server no_server does not exist -CREATE FOREIGN TABLE ft1 (c1 serial) SERVER sc; -- ERROR +CREATE FOREIGN TABLE ft1 (c1 serial) SERVER s0; -- ERROR NOTICE: CREATE FOREIGN TABLE will create implicit sequence ft1_c1_seq for serial column ft1.c1 ERROR: default values on foreign tables are not supported -CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;-- ERROR +CREATE FOREIGN TABLE ft1 () SERVER s0 WITH OIDS;-- ERROR ERROR: syntax error at or near WITH OIDS -LINE 1: CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS; +LINE 1: CREATE FOREIGN TABLE ft1 () SERVER s0 WITH OIDS; ^ CREATE FOREIGN TABLE ft1 ( c1 integer NOT NULL, c2 text, c3 date -) SERVER sc OPTIONS (delimiter ',', quote ''); +) SERVER s0 OPTIONS (delimiter ',', quote ''); COMMENT ON FOREIGN TABLE ft1 IS 'ft1'; COMMENT ON COLUMN ft1.c1 IS 'ft1.c1'; \d+ ft1 @@ -659,14 +659,14 @@ COMMENT ON COLUMN ft1.c1 IS 'ft1.c1'; c1 | integer | not null | plain| ft1.c1 c2 | text| | extended | c3 | date| | plain| -Server: sc +Server: s0 Has OIDs: no \det+ List of foreign tables Schema | Table | Server | Options +---++ - public | ft1 | sc | {delimiter=,,quote=\} + public | ft1 | s0 | {delimiter=,,quote=\} (1 row) CREATE INDEX id_ft1_c2 ON ft1 (c2); -- ERROR @@ -737,7 +737,7 @@ Foreign table foreign_schema.foreign_table_1 c7 | integer | c8 | text| c10 | integer | -Server: sc +Server: s0 -- Information schema SELECT * FROM 
information_schema.foreign_data_wrappers ORDER BY 1, 2; @@ -757,11 +757,11 @@ SELECT * FROM information_schema.foreign_data_wrapper_options ORDER BY 1, 2, 3; SELECT * FROM information_schema.foreign_servers ORDER BY 1, 2; foreign_server_catalog | foreign_server_name | foreign_data_wrapper_catalog | foreign_data_wrapper_name | foreign_server_type | foreign_server_version | authorization_identifier +-+--+---+-++-- + regression | s0 | regression | dummy | || foreign_data_user regression | s4 | regression | foo | oracle || foreign_data_user regression | s5 | regression
[HACKERS] proposal : cross-column stats
Hi everyone, one of the sessions I attended at PgDay last week was Heikki's session about statistics in PostgreSQL. One of the issues he mentioned (and one I regularly run into) is the absence of cross-column stats. When the columns are not independent, this usually results in poor estimates (and then in suboptimal plans). I was thinking about this issue before, but that session was the last impulse that pushed me to try to hack up a proof of concept. So here it is ;-)

Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what improvements are possible, what issues are there. Anyway this is by no means a perfect or complete solution - it's just a starting point.

Short introduction

Say we have a table with two INT columns - col_a and col_b, and we want to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of multiplication. When the columns are dependent, the estimate may be way off. Let's assume there are histograms with 5 bins for both columns. What I propose is basically building a 2D histogram. It kind of resembles a contingency table. So we have a table like this

  col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
  ==============||===================================
       20%      ||   4% |   4% |   4% |   4% |   4% |
       20%      ||   4% |   4% |   4% |   4% |   4% |
       20%      ||   4% |   4% |   4% |   4% |   4% |
       20%      ||   4% |   4% |   4% |   4% |   4% |
       20%      ||   4% |   4% |   4% |   4% |   4% |
  ==============||===================================

where each column / row represents a bin in the original histograms, and each cell represents an expected number of rows in it (for really independent columns, it's 4%). For dependent columns the actual values may be very different, of course - e.g.
for strongly dependent columns it might be like this

  col_b \ col_a ||  20% |  20% |  20% |  20% |  20% |
  ==============||===================================
       20%      ||  20% |   0% |   0% |   0% |   0% |
       20%      ||   0% |  20% |   0% |   0% |   0% |
       20%      ||   0% |   0% |  20% |   0% |   0% |
       20%      ||   0% |   0% |   0% |  20% |   0% |
       20%      ||   0% |   0% |   0% |   0% |  20% |
  ==============||===================================

To estimate the number of rows matching the condition, you'd sum the estimates for cells matching the condition. I.e. when the condition on col_a matches the lowest 20% (the first histogram bin) and the condition on col_b matches the lowest 20% of values, this corresponds to the first cell (20% of rows). The current optimizer estimates this to be 4% as it believes the columns are independent. I'm not sure whether I've explained it well enough, but the essence of the proposal is to build N-dimensional histograms (where N is the number of columns covered by the statistics) just as we are building histograms today.

--- Proof of concept ---

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment). It creates two tables - test_table and cross_stats. The former contains the data (in two columns), the latter is used to store cross-column statistics of test_table.
Then there are several PL/pgSQL functions - two of them are really important:

collect_stats(p_bins_a INT, p_bins_b INT) - this one is used to collect the stats, the parameters represent the number of bins for the columns (sides of the contingency table) - collect_stats(10,10) will build a contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT) - computes the estimate for the condition listed above (ranges on both columns)

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000), round(random()*1000) FROM generate_series(1,10);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see the current estimated and actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table WHERE (col_a BETWEEN 30 AND 129) AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on the contingency table

   SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - col_a/col_b contain the range from the query, then there are the actual number of matching rows and the current estimate. And finally the new estimate based on a contingency table with various numbers of bins.

A) independent columns (proof that it produces just as good estimates as the current code)

   col_a     | col_b     | actual | expected | 10x10 | 20x20 |
   [50,70]   | [50,70]   |     41 |       40 |    41 |    47 |
   [50,250]  | [50,250]  |   4023 |     4024 |  4436 |  3944 |
   [50,250]  | [750,950] |   4023 |     3955 |  4509 |
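The cell-summing estimate described in the proposal can be sketched compactly in Python. This is a toy sketch under assumptions of mine (equi-width bins, uniform data); the actual PoC is in PL/pgSQL and uses the columns' histograms:

```python
# Sketch of the 2D-histogram estimate: bin both columns, count rows per
# (bin_a, bin_b) cell, then estimate a range condition by summing the
# counts of all cells falling inside it. Equi-width bins for simplicity.
import random

random.seed(1)
rows = [(random.randrange(1000), random.randrange(1000)) for _ in range(10000)]

BINS, WIDTH = 10, 100  # 10 bins covering values 0..999

counts = [[0] * BINS for _ in range(BINS)]
for a, b in rows:
    counts[a // WIDTH][b // WIDTH] += 1

def estimate(a_lo, a_hi, b_lo, b_hi):
    """Estimated row count for (col_a BETWEEN a_lo AND a_hi)
       AND (col_b BETWEEN b_lo AND b_hi), summing whole cells."""
    total = 0
    for i in range(a_lo // WIDTH, a_hi // WIDTH + 1):
        for j in range(b_lo // WIDTH, b_hi // WIDTH + 1):
            total += counts[i][j]
    return total

actual = sum(1 for a, b in rows if a <= 199 and b <= 199)
print(estimate(0, 199, 0, 199), actual)
```

When the query range does not align with bin boundaries, partial cells have to be scaled (or fully counted, as a worst-case bound), which is where most of the estimation error comes from.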
Re: [HACKERS] proposal : cross-column stats
Hi! On 12.12.2010 15:17, Martijn van Oosterhout wrote: Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what improvements are possible, what issues are there. Anyway this is by no means a perfect or complete solution - it's just a starting point. It looks like you handled most of the issues. Just a few points: - This is obviously applicable to more than just integers, probably anything with a b-tree operator class. What you've coded seems to rely on calculations on the values. Have you thought about how it could work for, for example, strings? Yes, I know, I just forgot to address this in my previous e-mail. The contingency tables have a really nice feature - they are based on splitting the sets into groups (~ bins of the histograms for each column). And this can be done if you can sort the values, you really don't need any calculations. So it should work with strings. And another thing I somehow forgot is handling the case when there is no histogram, just MCV. That's mostly the same - each of the values might be a separate group, or the values might be grouped to form less groups, etc. The classic failure case has always been: postcodes and city names. Strongly correlated, but in a way that the computer can't easily see. Not that I suggest you fix this, but it's food for thought. Though strictly speaking this is a different kind of correlation than what you're looking at. Hmmm, I see. I think the proposal does not fix this particular case, although it might improve the situation a little bit (limit the error between expected and observed number of rows). The problem is that once we get to a cell-level of the contingency table, there is no additional (more detailed) information. So we're stuck with the multiplication estimate, or something like that.
I was thinking about it actually, and I think we could collect some more info - a correlation coefficient for each bin, or something like that. But that was not part of my proposal, and I'm not sure how to do that. 2) I really don't think we should collect stats for all combinations of columns of a table - I do like the Oracle-like approach where a DBA has to enable cross-column stats using an ALTER TABLE (for a particular list of columns). The only exception might be columns from a multi-column index. It might be quite efficient I guess? In the past it has been suggested to only do it for multi-column indexes, but I find these days I find in some situations I prefer to make individual indexes and let the bitmap scan code combine them. So perhaps it would be best to let it be configured by the DBA. Yes, I prefer individual indexes too. The idea behind collecting cross-column stats for multi-column indexes was that maybe we could 'append' this to the current functionality (building the index or something like that) so that it does not introduce significant performance problems. 3) There are independence tests for contingency tables (e.g. Pearson's Chi-squared test), so that it's easy to find out whether the columns are independent. In that case we can just throw away these stats and use the simple estimation. http://mathworld.wolfram.com/Chi-SquaredTest.html I think this would be good to include, if possible. Actually, I wonder if the existing stats collection code could be altered to attempt to calculate the correlation between columns as part of its other work. I guess that would be rather expensive - to compute correlation you need two passes, and you need to do that for each pair or columns. So I'd be surprised if it is possible (and effective). Another thing is that you can compute correlation only for numeric columns, so it's not possible to do that for city/ZIP code mentioned above. 
More precisely - it's possible to do that (if you map strings to numbers somehow), but I doubt you'll get useful results as the assignment is rather random. Well, you could ask the governments to assign the ZIP codes to cities in strictly alphabetical order, but I guess they'll say no. 4) Or we could store just those cells where expected and observed values differ significantly (may help if most of the values are independent, but there's a small glitch somewhere). Compressing that grid would be useful, given that for many dimensions most of the grid will be not interesting. In fact, storing the 20 largest values may be enough. Worth an experiment. Not exactly just the largest values - rather values that are significantly different from the expected values. Generally there are two interesting cases: expected << observed - the optimizer may choose an index scan, although the seq scan would be better; expected >> observed - the optimizer may choose a seq scan, although the index scan would be better. regards Tomas
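The Pearson chi-squared test mentioned earlier (point 3) can be computed directly from the collected contingency table; a hand-rolled sketch, where the resulting statistic would be compared against the chi-squared distribution with (rows-1)*(cols-1) degrees of freedom to decide whether the cross-column stats can be thrown away:

```python
# Pearson's chi-squared statistic for a contingency table:
#   chi2 = sum over cells of (observed - expected)^2 / expected
# where expected[i][j] = row_total[i] * col_total[j] / grand_total.
# chi2 == 0 means the table is exactly what independence predicts.

def chi_squared(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = rows[i] * cols[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

independent = [[25, 25], [25, 25]]  # observed == expected everywhere
dependent   = [[50, 0], [0, 50]]    # perfectly correlated columns
print(chi_squared(independent), chi_squared(dependent))
```

For the independent table the statistic is 0; for the perfectly correlated one it is large, which is the signal that the cross-column stats carry real information.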
Re: [HACKERS] proposal : cross-column stats
On 12.12.2010 15:43, Heikki Linnakangas wrote: The classic failure case has always been: postcodes and city names. Strongly correlated, but in a way that the computer can't easily see. Yeah, and that's actually analogous to the example I used in my presentation. The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ? AND city = ? should be the selectivity of postcode = ? alone. The measurement we need is implicativeness: How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample. Yes, those issues are a righteous punishment for breaking BCNF rules ;-) I'm not sure it's solvable using the contingency tables, as it requires knowledge about dependencies between individual values (working with cells is not enough, although it might improve the estimates). Well, maybe we could collect these stats (number of cities for a given ZIP code and number of ZIP codes for a given city). Collecting good stats about this is a bit tricky, but possible. What about collecting this for the MCVs from both columns? Tomas
Re: [HACKERS] proposal : cross-column stats
On 12.12.2010 17:33, Florian Pflug wrote: On Dec 12, 2010, at 15:43, Heikki Linnakangas wrote: The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ? AND city = ? should be the selectivity of postcode = ? alone. The measurement we need is implicativeness: How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample. The statistical term for this is conditional probability, written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem which states that P(A|B) = P(A and B) / P(B). Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes P(A) = P(A and B) / P(B), i.e. P(A and B) = P(A)*P(B), which is how we currently compute the selectivity of a clause such as WHERE a=3 AND b=4. I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of b for every one of the most common values of a, and compare that to the overall number of distinct values of b... Good point! Well, I was thinking about this too - generally this means creating a contingency table with the MCV as bins. Then you can compute these interesting probabilities P(A and B). (OK, now I definitely look like some contingency table weirdo, who tries to solve everything with a contingency table. OMG!)
The question is - what are we going to do when the values in the query are not in the MCV list? Are there some heuristics to estimate the probability from the MCVs, or something like that? Could we use some average probability or what? Tomas
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 01:05, Robert Haas wrote: This is a good idea, but I guess the question is what you do next. If you know that the applicability is 100%, you can disregard the restriction clause on the implied column. And if it has no implicatory power, then you just do what we do now. But what if it has some intermediate degree of implicability? Well, I think you've missed the e-mail from Florian Pflug - he actually pointed out that the 'implicativeness' Heikki mentioned is called conditional probability. And conditional probability can be used to express the AND probability we are looking for (selectiveness). For two columns, this is actually pretty straightforward - as Florian wrote, the equation is P(A and B) = P(A|B) * P(B) = P(B|A) * P(A) where P(B) may be estimated from the current histogram, and P(A|B) may be estimated from the contingency table (see the previous mails). And P(A and B) is actually the value we're looking for. Anyway there really is no intermediate degree of applicability, it just gives you the right estimate. And AFAIR this is easily extensible to more than two columns, as P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C) so it's basically a recursion. Well, I hope my statements are really correct - it's been a few years since I gained my degree in statistics ;-) regards Tomas
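The chain rule P(A and B and C) = P(A|(B and C)) * P(B and C) can be checked on a toy dataset; a sketch with made-up predicates standing in for the range conditions, using exact fractions so the identity holds without rounding:

```python
# Verify the chain rule P(A and B and C) = P(A | B and C) * P(B and C)
# on a small synthetic dataset. Probabilities are plain frequencies
# over the rows, so the identity holds exactly.
from fractions import Fraction

rows = [(a, b, c) for a in range(4) for b in range(4) for c in range(4)]

def prob(pred):
    return Fraction(sum(1 for r in rows if pred(r)), len(rows))

A = lambda r: r[0] < 2   # hypothetical condition on col_a
B = lambda r: r[1] < 2   # hypothetical condition on col_b
C = lambda r: r[2] < 1   # hypothetical condition on col_c

p_bc = prob(lambda r: B(r) and C(r))
p_abc = prob(lambda r: A(r) and B(r) and C(r))
p_a_given_bc = p_abc / p_bc   # conditional probability P(A | B and C)

print(p_abc == p_a_given_bc * p_bc)  # True: the chain rule holds
```

The estimator's difficulty is not this identity, of course, but estimating P(A|B and C) from a finite contingency table rather than from the full data.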
Re: [HACKERS] proposal : cross-column stats
P(A|B) = P(A and B) / P(B). Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630). Sometimes that may return nearly 2x the actual selectivity, but in general it's a reasonable estimate. Are there any severe failure cases that produce much worse estimates? regards Tomas
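A quick sketch of the point, with hypothetical selectivities rather than planner code: s1 + s2 - s1*s2 is exact only when the AND term is exact, so a correlated clause pair can inflate the OR estimate by a factor approaching 2:

```python
def or_selectivity(s1, s2, s_and=None):
    """Planner-style estimate for (c1 OR c2): s1 + s2 - P(c1 AND c2).
    The current code approximates P(c1 AND c2) as s1 * s2 (independence)."""
    if s_and is None:
        s_and = s1 * s2  # independence assumption
    return min(1.0, s1 + s2 - s_and)

# Two perfectly correlated clauses (c1 is true exactly when c2 is):
# the true OR selectivity is just 0.4.
s1 = s2 = 0.4
print(or_selectivity(s1, s2))             # 0.64 -- independence overestimates
print(or_selectivity(s1, s2, s_and=0.4))  # 0.4  -- exact with a good AND term
```

This is also why fixing the AND case fixes the OR case for free: the only cross-column term in the OR formula is the AND selectivity.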
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 03:00, Robert Haas wrote: Well, the question is what data you are actually storing. It's appealing to store a measure of the extent to which a constraint on column X constrains column Y, because you'd only need to store O(ncolumns^2) values, which would be reasonably compact and would potentially handle the zip code problem - a classic hard case - rather neatly. But that wouldn't be sufficient to use the above equation, because there A and B need to be things like 'column X has value x', and it's not going to be practical to store a complete set of MCVs for column X for each possible value that could appear in column Y. O(ncolumns^2) values? You mean collecting such stats for each possible pair of columns? Well, I meant something different. The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something? regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 16:34, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630). If you can solve the AND case, the OR case falls out of that. Just replace s1*s2 with a more accurate AND calculation. Oh yeah, now I see - it's just the usual equation P(A or B) = P(A) + P(B) - P(A and B) and we're estimating P(A and B) as P(A)*P(B). regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 16:38, Tom Lane wrote: The reason that this wasn't done years ago is precisely that nobody's figured out how to do it with a tolerable amount of stats data and a tolerable amount of processing time (both at ANALYZE time and during query planning). It's not hard to see what we'd ideally like to do; it's getting from there to something useful in production that's hard. OK, I fully realize that. My plan is to simply (a) find out what statistics we need to collect and how to use them, (b) implement a really stupid, inefficient solution, (c) optimize in iterations, i.e. making it faster, consuming less space etc. It will take time, it won't be perfect for a long time, but every journey starts somewhere. regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 18:59, Joshua Tolley wrote: On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote: Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple years back ). Josh Tolley still looks at it occasionally, though time hasn't permitted any sort of significant work for quite some time. The multicolstat branch on my git.postgresql.org repository will create an empirical copula for each multi-column index, and stick it in pg_statistic. It doesn't yet do anything useful with that information, nor am I convinced it's remotely bug-free. In a brief PGCon discussion with Tom a while back, it was suggested a good place for the planner to use these stats would be clausesel.c, which is responsible for handling clauses such as ...WHERE foo > 4 AND foo < 5. Well, that's good news ;-) I've done a bit of research today, and I've found some quite interesting papers on this topic (probably - I did not have time to read them; in most cases I've read just the title and abstract). [1] Selectivity estimators for multidimensional range queries over real attributes http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914 This seems like a very good starting point. AFAIK it precisely describes what data need to be collected, how to do the math etc. [2] Selectivity Estimation Without the Attribute Value Independence Assumption http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126 This obviously deals with the independence problem. Haven't investigated it further, but it seems worth reading. [3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions and Sort Operations (A Cost Function for Uniformly Partitioned UB-Trees) http://mistral.in.tum.de/results/publications/MB00_ideas.pdf Describes something called a UB-Tree, and shows how it may be used to do estimates.
Might be interesting as an alternative to the traditional histograms. There are more details about UB-Trees at http://mistral.in.tum.de/results/publications/ [4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/ A rather nice collection of papers related to estimation (including some of the papers listed above). Hm, I planned to finally read Understanding MySQL Internals over Xmas ... that obviously won't happen. regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 13.12.2010 22:50, Josh Berkus wrote: Tomas, (a) find out what statistics we need to collect and how to use them, (b) implement a really stupid, inefficient solution, (c) optimize in iterations, i.e. making it faster, consuming less space etc. I'll suggest again how to decide *which* columns to cross: whichever columns are combined in composite indexes. In version 2, allow the DBA to specify combinations. In the unlikely event that correlation could be reduced to a single float number, it would be conceivable for each column to have an array of correlation stats for every other column where correlation was non-random; on most tables (i.e. ones with less than 100 columns) we're not talking about that much storage space. The main cost would be the time spent collecting that info ... I think it's a bit early to discuss this, given the fact that we don't have a working solution yet. But OK, let's discuss these options anyway. 1) collecting them automatically for composite indexes I don't think this is a wise idea. The first versions definitely won't be very efficient, and collecting the data for each composite index means everyone will be hit by this inefficiency, even if he actually does not need it (e.g. the columns are independent, so the current estimates are quite accurate, or he's not using those columns very often in the same WHERE clause). Another reason against this is that many DBAs don't actually use composite indexes - they simply create indexes on each column and let the bitmap index scan work it out. And this would not work for that case. And actually it's not very complicated to allow the DBA to do this; it can be a quite simple PL/pgSQL procedure. 2) collecting correlation for each pair of columns Again, you're effectively forcing everyone to pay the price even though he may not need the feature. Maybe we'll get there one day, but it's not a good idea to do that from the beginning.
And the correlation itself has a very limited use in real life, as it's not possible to compute it for character columns, and it is not very useful for some numerical columns (e.g. ZIP codes). regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 12.12.2010 15:43, Heikki Linnakangas wrote: On 12.12.2010 15:17, Martijn van Oosterhout wrote: On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: Very cool that you're working on this. +1 Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what improvements are possible, what issues are there. Anyway this is by no means a perfect or complete solution - it's just a starting point. It looks like you handled most of the issues. Just a few points: - This is obviously applicable to more than just integers, probably anything with a b-tree operator class. What you've coded seems to rely on calculations on the values. Have you thought about how it could work for, for example, strings? The classic failure case has always been: postcodes and city names. Strongly correlated, but in a way that the computer can't easily see. Yeah, and that's actually analogous to the example I used in my presentation. The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for postcode = ? AND city = ? should be the selectivity of postcode = ? alone. The measurement we need is implicativeness: How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample. That's assuming the combination has any matches. It's possible that the user chooses a postcode and city combination that doesn't exist, but that's no different from a user doing city = 'fsdfsdfsd' on a single column, returning no matches. We should assume that the combination makes sense.
Well, I've read several papers this week, in an attempt to find out what possibilities there are when implementing cross-column statistics. One of the papers seems like a reasonable solution to this particular problem - discrete data + equality queries. The article is A Bayesian Approach to Estimating The Selectivity of Conjunctive Predicates (written by Heimel, Markl and Murthy). It's freely available as a PDF http://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf I do have a PDF with my own notes in it (as annotations), let me know if you're interested ... The basic principle is that instead of 'attribute value independence' (which is the problem with correlated columns) they assume 'uniform correlation', which is a much weaker assumption. In the end, all they need to compute an estimate is the number of distinct values for each of the columns (we already have that in pg_stats) and the number of distinct values for the group of columns in a query. They really don't need any multidimensional histogram or something like that. The funny thing is I've implemented most of the PoC before reading the article - the only difference was the way to combine the estimates. pros and cons -- So let's see the pros: * it's reasonably simple, the overhead should be minimal I guess (we're already estimating distinct values for the columns, so I guess we can do that for multiple columns with reasonable impact) * there are no 'advanced data structures' such as multi-dimensional histograms that are expensive to build and maintain * it consistently produces better estimates than the current estimator based on the attribute value independence assumption, and it's reasonably accurate (I've been playing with it for some time, so we'll need more tests of course) * it's easily extensible to more columns * I guess we could improve the estimates with our own heuristics, to catch the special case when one column is perfectly implied by another one (e.g.
when state is implied by ZIP code) - we can catch that by 'distinct for combination = distinct for column' and use just the estimate for one of the columns. But there are some cons too: * this really is not designed to work with real-valued data, it's a very nice solution for discrete data (namely 'labels' as for example city/state names, ZIP codes etc.) * you need to know the list of columns in advance (there's nothing like 'let's build the stats for columns (a,b,c) and then query just (a,b)') as you need to know the count of distinct values for the queried columns (well, maybe it's possible - I guess we could 'integrate' over all possible values of the omitted column) * it's built for 'equality' queries, although maybe we could modify it for inequalities too (but it won't be as elegant), but I guess the equality queries are much more common in case of discrete data (OK, you can run something like WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48
Re: [HACKERS] proposal : cross-column stats
Hi, I've read about 10 papers about estimation this week, and I have gained some insight. I'm not going to describe each of the papers here, I plan to set up a wiki page about cross-column statistics http://wiki.postgresql.org/wiki/Cross_Columns_Stats and I'll put the list of papers and some basic info there. There was a lot of discussion in this thread, and it's difficult to follow it, so I'll put there some details about the proposal etc. So I'll just briefly describe the interesting things I've learned from those papers. -- instances of the problem -- Generally, there are four quite different cases of the problem. First, the columns may be either real-valued or discrete. And by 'discrete' I mean the value is rather a label than a value. It does not make sense to add or subtract them, etc. So for example city names or zip codes are discrete values (although zip codes are numbers etc). Second, the queries consist of equality or inequality (range) conditions. So actually there are four instances of the problem:

                  | equality conditions | inequality conditions
  discrete values |          A          |           D
  numeric values  |          C          |           B

I think those four problems should be treated separately, although some of the techniques may be common. A) discrete values and equality conditions One of the papers (A Bayesian Approach to Estimating The Selectivity of Conjunctive Predicates) describes a quite interesting solution to this problem - I've already posted a description on how to apply it to the ZIP code / city name problem - see this http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php B) numeric values and inequality conditions Most of the papers dealing with this problem are based on discretization and multi-dimensional histograms to approximate the joint distribution. So I believe my initial proposal was not complete nonsense.
Once we have a working histogram-based solution, we can work on precision and efficiency (how much space is needed to store it, how long does it take to compute an estimate, etc.). There are two ways to do that. First, most of the papers offer an enhanced type of histogram (although the histogram proposed in each paper is always evaluated as the most efficient one, which is a bit suspicious). Anyway there are advanced, quite promising ways to build multi-dimensional histograms. Second, the paper (Selectivity estimation using probabilistic models) describes a solution based on Bayesian networks. That seems really promising (actually it addresses join selectivity estimation too). So yes, I'm quite confident we can solve this issue and improve the efficiency - either by an advanced histogram or using Bayesian networks. C) numeric values and equality conditions OK, I'm not sure how to solve this case. But queries against numeric columns are range queries in most cases, I guess, so maybe that's not that big a deal. D) discrete values and inequality conditions Basically, this can be handled just like numeric values after discretization, i.e. using a histogram. But this is usually E) a combination of discrete / numeric columns I'm not sure how to deal with this. Obviously it's possible to build a multi-dimensional histogram, and estimate as many queries as possible. --- The papers describe some interesting alternative approaches, e.g. based on SVD (singular value decomposition) or random sampling (or kernel estimators, which is an enhanced version of sampling). But there are various issues associated with those solutions (SVD is applicable to 2D only, so we'd be stuck with 2 columns, and random sampling sounds a bit suspicious to me). regards Tomas
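As a toy illustration of the histogram-based approach for case B (made-up correlated data; the papers above use much smarter bucketing than this equi-width sketch): a 2-D histogram approximating the joint distribution captures the correlation that per-column stats miss.

```python
import random
from collections import Counter

random.seed(42)
N = 10_000
# Made-up correlated numeric columns: b tracks a with some noise.
data = []
for _ in range(N):
    a = random.uniform(0, 100)
    b = a + random.gauss(0, 5)
    data.append((a, b))

A_STEP, B_STEP = 10.0, 5.0  # equi-width bucket sizes for the two dimensions

# The 2-D histogram approximating the joint distribution.
hist = Counter((int(a // A_STEP), int(b // B_STEP)) for a, b in data)

def range_selectivity(a_lo, a_hi, b_lo, b_hi):
    """Estimate P(a_lo <= a < a_hi AND b_lo <= b < b_hi) by summing the
    buckets that lie entirely inside the query box (no interpolation for
    partially covered buckets, to keep the sketch simple)."""
    hit = 0
    for (ia, ib), count in hist.items():
        if (a_lo <= ia * A_STEP and (ia + 1) * A_STEP <= a_hi
                and b_lo <= ib * B_STEP and (ib + 1) * B_STEP <= b_hi):
            hit += count
    return hit / N

est = range_selectivity(0, 50, 0, 50)
true_frac = sum(1 for a, b in data if 0 <= a < 50 and 0 <= b < 50) / N
indep = (sum(1 for a, _ in data if a < 50) / N) * \
        (sum(1 for _, b in data if 0 <= b < 50) / N)
# The joint histogram matches the true fraction here, while the
# independence assumption predicts roughly half of it.
print(est, true_frac, indep)
```

The precision/space trade-off mentioned above shows up directly: finer bucket widths improve the estimate but grow the stored histogram.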
Re: [HACKERS] proposal : cross-column stats
On 17.12.2010 19:58, Robert Haas wrote: I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed before, making them accurate requires sampling a significant percentage of the table, whereas all of our other statistics can be computed reasonably accurately by sampling a fixed amount of an arbitrarily large table. So it's possible that relying more heavily on n_distinct could turn out worse overall even if the algorithm is better. Not sure if that's an issue here, just throwing it out there... Yes, you're right - the paper really is based on (estimates of) the number of distinct values for each of the columns as well as for the group of columns. AFAIK it will work with reasonably precise estimates, but the point is you need an estimate of distinct values of the whole group of columns. So when you want to get an estimate for queries on columns (a,b), you need the number of distinct value combinations of these two columns. And I think we're not collecting this right now, so this solution requires scanning the table (or some part of it). I know this is a weak point of the whole solution, but the truth is every cross-column stats solution will have to do something like this. I don't think we'll find a solution with 0 performance impact, without the need to scan a sufficient part of a table. That's why I want to make this optional so that the users will use it only when really needed. Anyway one possible solution might be to allow the user to set these values manually (as in the case when ndistinct estimates are not precise). regards Tomas
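The missing statistic can be sketched like this (a hypothetical zip/city table; a real implementation would sample rather than scan everything, which is exactly where Robert's n_distinct concern applies):

```python
import random

random.seed(1)
# Hypothetical table where zip determines city (the classic hard case):
# dist(zip, city) == dist(zip), which is what the estimator needs to see.
zips = [f"{z:05d}" for z in range(1000)]
city_of = {z: f"city_{int(z) % 100}" for z in zips}
table = [(z, city_of[z]) for z in random.choices(zips, k=50_000)]

# The group ndistinct required by the paper's estimator. Computing it
# exactly is the part that needs a scan of (a large part of) the table;
# per-column ndistinct is already available in pg_stats.
d_zip = len({z for z, _ in table})
d_city = len({c for _, c in table})
d_both = len(set(table))
print(d_zip, d_city, d_both)  # 1000 100 1000 -- dist(A,B) == dist(A)
```

When dist(A,B) comes out equal to dist(A), column B adds no information, which is the special case the heuristic mentioned earlier is meant to catch.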
Re: [HACKERS] proposal : cross-column stats
On 17.12.2010 22:24, Tom Lane wrote: That seems likely to be even more unreliable than our existing single-column estimates :-( regards, tom lane Well, yes :-( I guess this is a place where we could use a multi-column index, if it contains all the interesting columns. We could scan the index, not the whole table. That could save us a tremendous amount of I/O and should be quite precise (unless it's severely bloated). Another thing is that those 'discrete' columns are usually quite stable, i.e. there's usually a limited list of values and it does not change very often. Think about ZIP codes, states, etc. And the combinations are quite stable too (counties do not move to other states, etc.). So I think scanning a reasonable part of a table should be enough in these cases. regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 17.12.2010 22:41, Heikki Linnakangas wrote: On 17.12.2010 23:13, Tomas Vondra wrote: On 17.12.2010 19:58, Robert Haas wrote: I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed before, making them accurate requires sampling a significant percentage of the table, whereas all of our other statistics can be computed reasonably accurately by sampling a fixed amount of an arbitrarily large table. So it's possible that relying more heavily on n_distinct could turn out worse overall even if the algorithm is better. Not sure if that's an issue here, just throwing it out there... Yes, you're right - the paper really is based on (estimates of) the number of distinct values for each of the columns as well as for the group of columns. AFAIK it will work with reasonably precise estimates, but the point is you need an estimate of distinct values of the whole group of columns. So when you want to get an estimate for queries on columns (a,b), you need the number of distinct value combinations of these two columns. And I think we're not collecting this right now, so this solution requires scanning the table (or some part of it). Any idea how sensitive it is to the accuracy of that estimate on distinct value combinations? If we get that off by a factor of ten or a hundred, what kind of an effect does it have on the final cost estimates? Well, not really - I haven't done any experiments with it. For two columns the selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) where A and B are columns, dist(X) is the number of distinct values in column X and sel(X) is the selectivity of column X. dependency on dist(A), dist(B) and dist(A,B) --- So it seems to be proportional to dist(A) and dist(B), and inversely proportional to dist(A,B).
So when increasing dist(A) and dist(B) 10x, you'll get 10x the original estimate. Similarly, by increasing dist(A,B) 10x, you'll get a 10x lower estimate. upper bound --- Unless dist(A) or dist(B) is greater than dist(A,B), the estimated selectivity is bounded by (sel(A) + sel(B)) / 2 I guess we can say that dist(A) = dist(A,B) generally means sel(A,B) = sel(A), i.e. we can use the heuristic I've already offered in the PoC. lower bound --- Not sure what the lower bound is, but it might approach 0 if dist(A) and dist(B) are very small and/or dist(A,B) is huge. regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 18.12.2010 16:59, Florian Pflug wrote: On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns the selectivity equation is (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) where A and B are columns, dist(X) is the number of distinct values in column X and sel(X) is the selectivity of column X. Huh? This is the selectivity estimate for A = x AND B = y? Surely, if A and B are independent, the formula must reduce to sel(A) * sel(B), and I cannot see how that'd work with the formula above. Yes, it's a selectivity estimate for P(A=a and B=b). It's based on conditional probability, as P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a) and the uniform correlation assumption, so that it's possible to replace the conditional probabilities with constants. And those constants are then estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B). Ok, I think I understand this now. The selectivity equation actually *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate for sel(A). Take the clause A = x AND B = y for example. Without knowing anything about x and y, reasonable guesses for sel(A=x) and sel(B=y) are sel(A=x) = 1 / dist(A) sel(B=y) = 1 / dist(B). This is also what we do currently, according to var_eq_non_const() in src/backend/utils/adt/selfuncs.c, if we don't have any additional knowledge about x (Actually, we also factor the probability of A being NULL into this). With these estimates, your formula becomes sel(A=x,B=y) = 1 / dist(A,B). and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus sel(A=x,B=y) = sel(A=x) * sel(B=y). If, however, y is a constant, then we use the MCVs to estimate sel(B=y) (var_eq_const() in src/backend/utils/adt/selfuncs.c). If sel(B=y) ~= 0, we'd currently also conclude that sel(A=x,B=y) ~= 0.
With the uniform correlation approach, we'd instead estimate sel(A=x,B=y) ~= sel(A=x) / dist(B) assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated. If dist(B) is small, this estimate is much worse than what we'd currently get, since we've effectively ignored the information that the restriction B=y alone guarantees that only very few rows will match. Well, I guess you're right. Let's see two examples - uncorrelated and highly correlated data (and then a few comments on what I think you're missing). uncorrelated --- Let there be 10 distinct values in A, 100 distinct values in B, and there are all possible combinations (10x100 = 1000). All the values (and combinations) are uniformly distributed. The current estimate is correct, i.e. sel(A=a) = 1/10 = 1/dist(A) sel(B=b) = 1/100 = 1/dist(B) sel(A=a, B=b) = sel(A) * sel(B) = 1/1000 /* due to AVI */ the proposed estimate is sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B)) = (10 * 1/10 + 100 * 1/100) / 2000 = 1/1000 so actually it produces the same estimate, but that's due to the uniform distribution assumption. Let's say the selectivities for A and B are very different - A is 10x less common, B is 10x more common (let's say we know this). Then the current estimate is still correct, i.e. it gives sel(A=a) = 1/100, sel(B=b) = 1/10, sel(A=a,B=b) = 1/1000 but the new estimate is (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200 So yes, in case of uncorrelated data this overestimates the result. highly correlated --- Again, let dist(A)=10, dist(B)=100. But in this case let A be determined by B, i.e. for each value in B there is a unique value in A - say A = mod(B, 10) or something like that. So now there is dist(A,B)=100. Assuming uniform distribution, the correct estimate is sel(A=a,B=b) = sel(B=b) = 1/100 and the current estimate is sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000 The proposed estimate is (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100 which is correct.
Without the uniform distribution (again, sel(A)=1/100, sel(B)=1/10), we currently get (just as before) sel(A=a,B=b) = 1/10 * 1/100 = 1/1000 which is 100x less than the correct value (sel(B)=1/10). The new estimate yields (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20 which is not as accurate as with uniform distribution, but it's considerably more accurate than the current estimate (1/1000). comments --- I really don't think this is a perfect solution giving 100% accurate results in all cases. But the examples above illustrate that it gives much better estimates in case of correlated data. It seems to me you're missing one very important thing - this was not meant as a new default way to do estimates. It was meant as an option when the user (DBA, developer, ...) realizes the current solution gives really bad estimates (due to correlation). In that case he could create 'cross-column' statistics on those columns
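The estimator and the worked examples above can be checked with a short sketch (plain arithmetic, not planner code):

```python
def uniform_corr_sel(sel_a, sel_b, dist_a, dist_b, dist_ab):
    """Selectivity of (A = a AND B = b) under the uniform correlation
    assumption, as in the examples above:
    (dist(A)*sel(A) + dist(B)*sel(B)) / (2 * dist(A,B))."""
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

# Uncorrelated: dist(A)=10, dist(B)=100, all 1000 combinations exist.
print(uniform_corr_sel(1/10, 1/100, 10, 100, 1000))  # 0.001 == 1/1000
# Highly correlated: A is a function of B, so dist(A,B) = 100.
print(uniform_corr_sel(1/10, 1/100, 10, 100, 100))   # 0.01 == 1/100
# Correlated + non-uniform: sel(A)=1/100, sel(B)=1/10.
print(uniform_corr_sel(1/100, 1/10, 10, 100, 100))   # 0.0505 ~ 1/20
```

Note how the only extra input beyond the per-column stats is dist(A,B), which is exactly the group ndistinct discussed earlier in the thread.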
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
Dne 12.12.2010 03:47, Robert Haas napsal(a): On Sat, Dec 11, 2010 at 4:40 PM, t...@fuzzy.cz wrote: Hello you have to respect pg coding style: a) not too long lines b) not C++ line comments OK, thanks for the notice. I've fixed those two problems. Please add this to the currently-open commitfest. https://commitfest.postgresql.org/action/commitfest_view/open I've done several small changes to the patch, namely - added docs for the functions (in SGML) - added the same thing for background writer So I think now it's 'complete' and I'll add it to the commit fest in a few minutes. regards Tomas diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5fd0213..40c6c6b 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -600,6 +600,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_db_last_stat_reset_time/function(typeoid/type)/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics for this database (as a result of executing + functionpg_stat_reset/function function) + /entry + /row + + row entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry @@ -725,6 +734,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_last_stat_reset_time/function(typeoid/type)/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics on this table (as a result of executing + functionpg_stat_reset/function or functionpg_stat_reset_single_table_counters/function) + /entry + /row + + row entryliteralfunctionpg_stat_get_vacuum_count/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry @@ -879,6 +897,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + 
entryliteralfunctionpg_stat_get_function_last_stat_reset_time/function(typeoid/type)/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics for this function (as a result of executing + functionpg_stat_reset/function or functionpg_stat_reset_single_function_counters/function) + /entry + /row + + row entryliteralfunctionpg_stat_get_xact_function_calls/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry @@ -1064,6 +1091,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_bgwriter_last_stat_reset_time()/function/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics for the background writer (as a result of executing + functionpg_stat_reset_shared('bgwriter')/function) + /entry + /row + + row entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry entrytypebigint/type/entry entry diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 346eaaf..3e1cca3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -310,6 +310,7 @@ CREATE VIEW pg_stat_all_tables AS pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum, pg_stat_get_last_analyze_time(C.oid) as last_analyze, pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze, +pg_stat_get_last_stat_reset_time(C.oid) as last_stat_reset, pg_stat_get_vacuum_count(C.oid) AS vacuum_count, pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count, pg_stat_get_analyze_count(C.oid) AS analyze_count, @@ -502,7 +503,8 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, -pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted +pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, +pg_stat_get_db_last_stat_reset_time(D.oid) AS 
last_stat_reset FROM pg_database D; CREATE VIEW pg_stat_user_functions AS @@ -512,7 +514,8 @@ CREATE VIEW pg_stat_user_functions AS P.proname AS funcname, pg_stat_get_function_calls(P.oid) AS calls, pg_stat_get_function_time(P.oid) / 1000 AS total_time, -pg_stat_get_function_self_time(P.oid) / 1000 AS self_time +pg_stat_get_function_self_time(P.oid) / 1000 AS self_time, +pg_stat_get_function_last_stat_reset_time(P.oid) AS last_stat_reset FROM pg_proc P LEFT JOIN pg_namespace N ON (N.oid =
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 19.12.2010 17:26, Tom Lane wrote: That seems like quite a bizarre definition. What I was envisioning was that we'd track only the time of the last whole-database stats reset.

Well, but that does not quite work. What I need to know is whether the snapshot is 'consistent', i.e. whether I can compare it to the previous snapshot and get meaningful results (so that I can perform some analysis on the difference). And by 'consistent' I mean that none of the counters was reset since the previous snapshot.

The most obvious and most flexible way to do this is keeping track of the last reset for each of the counters, which is exactly what I've done in the patch.

The other possibility I've offered in my previous post is to keep just a per-database timestamp, and set it whenever some stats in the database are reset (table/index/function counters or all stats). It definitely is not as flexible as the previous solution, but it gives me at least some info that something was reset. But I'd have to throw away the whole snapshot - in the previous case I could do analysis at least on the counters that were not reset.

The solution you've offered - keeping only the per-database timestamp, and not updating it when table/index/function stats are reset - is completely useless for me. It simply does not provide an answer to the question Is this snapshot consistent?

Tomas

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 19.12.2010 19:13, Jim Nasby wrote: Is there a reason this info needs to be tracked in the stats table? I know it's the most obvious place, but since we're worried about the size of it, what about tracking it in pg_class or somewhere else?

I guess this is the best place for this kind of info, as it is directly related to the stats items. Well, maybe we could place it in pg_class, but is that a wise idea?

Tomas
Re: [HACKERS] proposal : cross-column stats
On 19.12.2010 21:21, Simon Riggs wrote: On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote: The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases.

The reason that this wasn't done years ago is precisely that nobody's figured out how to do it with a tolerable amount of stats data and a tolerable amount of processing time (both at ANALYZE time and during query planning). It's not hard to see what we'd ideally like to do; it's getting from there to something useful in production that's hard. I think we have to face up to the fact that attempting to derive meaningful cross-column stats will require larger sample sizes.

Amen. If we collect everything we're going to have ~10^9 stats slots with default stats_target 100 and a 100-column table. We should disconnect sample size from histogram size, and we need to make the initial column pairings vastly fewer than all combinations. Manual specification seems like it will be required for the cases where we decide not to include it automatically, so it seems we'll need manual specification anyway. In that case, we should do manual specification first.

Well, not really. The more bins you have, the larger sample you need to get a representative picture of the data. So the histogram size and sample size are naturally connected. And there are some (mostly heuristic) rules to determine how large the sample should be. E.g. when building a contingency table for a chi-squared test, a common rule is that each bin should contain at least 5 values.
So the more bins you have, the larger sample you need. I like the way Oracle does this - you can either let it decide the proper sample size, or you can specify how large the sample should be (what portion of the table).

So the tricky part here is determining the number of bins in the histogram. In the one-dimensional case, stats_target=100 actually means each bin contains about 1% of the data. So I guess we should use a similar approach in the multi-dimensional case too, i.e. let the user determine a desired precision and then derive the number of bins automatically (which is a bit tricky because of the multiple dimensions).

But yeah, in general you're right - this will require larger samples, more complex stats to collect, and we'll need some space to store the collected stats ...

regards Tomas
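The at-least-5-values-per-bin rule mentioned above is exactly why histogram size and sample size cannot be disconnected. A minimal sketch (function name and numbers are mine, for illustration only) of how the required sample grows with the number of bins and dimensions:

```python
def min_sample_size(bins_per_dim, dims, min_per_cell=5):
    """Rule of thumb for chi-squared contingency tables: every cell
    should hold at least `min_per_cell` expected observations, so the
    minimum sample size grows with the total number of cells."""
    cells = bins_per_dim ** dims
    return cells * min_per_cell

print(min_sample_size(100, 1))  # 1-D, 100 bins: 500 rows
print(min_sample_size(100, 2))  # 2-D, 100x100 bins: 50000 rows
print(min_sample_size(100, 3))  # 3-D: 5000000 rows - quickly infeasible
```

Adding a dimension multiplies the number of cells, so the sample has to keep up - which is the point made above about multi-dimensional histograms needing larger samples.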
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 19.12.2010 20:28, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: On 19.12.2010 17:26, Tom Lane wrote: That seems like quite a bizarre definition. What I was envisioning was that we'd track only the time of the last whole-database stats reset.

Well, but that does not quite work. What I need to know is whether the snapshot is 'consistent', i.e. whether I can compare it to the previous snapshot and get meaningful results (so that I can perform some analysis on the difference).

Oh, I see. Yeah, if that's what you want it for then partial resets have to change the timestamp too. I guess if we are careful to document it properly, this won't be too horrid. regards, tom lane

Well, maybe. I'd prefer a timestamp for each item, as that allows me to determine which stats were not reset and analyze at least those data. Plus I won't have time to write the other patch for at least a week, so let's see whether there are serious objections against the current patch. I've never had problems with the pgstat.dat file, but I understand others might, although this adds just 8 bytes to each item.

regards Tomas
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 19.12.2010 23:58, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: Plus I won't have time to write the other patch for at least a week, so let's see whether there are serious objections against the current patch.

If you think this objection is not serious, you're mistaken.

I know there were problems with pgstats.stat and I/O (for example this thread discussing an I/O storm http://archives.postgresql.org/pgsql-hackers/2008-01/msg01095.php). But I thought those issues were already resolved and I have not noticed any such issue recently. Am I missing something?

What is bothering me about that is this: how many of our users ever reset the stats at all? Of those, how many reset the stats for just some tables and not all of them? And of those, how many care about having the database remember when they did it, at a per-table level? I think the answer to each of those questions is not many, and so the fraction of our users who will get a benefit from that added overhead is epsilon cubed. But approximately 100% of our users will be paying the overhead.

Sure, I understand that only a fraction of the users will use the patch I've proposed. That's why I'm saying that the per-database timestamp would be sufficient (although I'd prefer the per-record timestamp).

regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 21.12.2010 16:54, Florian Pflug wrote: Hmmm, maybe we could give this possibility (to identify two separate groups of columns) to the user. So instead of 'build stats for (A,B,C)' the user would say 'build stats for (A,B) and (C)' - this actually represents a priori knowledge of dependencies supplied by the user. In that case we could search for 'implicativeness' between those two groups (and not within the groups), and we could restrict ourselves to 2D (and thus use a more sophisticated formula).

Hm, I hated this idea at first, but I'm starting to like it more and more. It *does* seem rather unrealistic that a user would know that a bunch of columns are correlated, but have no idea in what way...

Yes, that's true. Although sometimes the dependency may be very complicated - but let's restrict ourselves to 2D for now, build something that solves this simplified case, and then we can discuss higher dimensions.

Any examples where this'd be the case would be very much appreciated - maybe we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users know about the dependency between ZIP codes and cities. A natural workaround would be to omit the dependent column from the query, but that's not always possible (e.g. when an ORM is involved, building the queries automatically).

But we should be able to do some basic analysis even when the user supplies a list of columns without such a priori knowledge.

That, I think, overcomplicates things, at least for a first cut. To summarize, I think you've shown quite nicely that the uniform bayesian approach is a very sensible first step towards better estimates in the case of correlated columns. It's statistically sound, and the dist(A,B) estimates it requires are probably a necessary ingredient of any solution to the problem. If we can make it degrade more gracefully when the columns are uncorrelated we should do that, but if we can't, that's still no reason to drop the whole idea.
Agreed. IMHO the uncorrelated case is not a big concern, as the users usually know something's wrong with the columns. We could introduce some 'autodetect' eventually, but let's leave that for the future.

So I guess we should turn our attention to how we'd obtain reasonably good estimates of dist(A,B), and return to the current discussion once the other pieces are in place. I think that maybe it'd be acceptable to scan a large portion of the table to estimate dist(A,B), *if* we only do so once in a while. But even with a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory and spilling them into, say, an on-disk hash table adds even more overhead to the already expensive full scan. Maybe using a bloom filter instead of a hash table could avoid the spilling to disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking about collecting the stats, I was interested primarily in what data we actually need. And my knowledge of the algorithms currently used is very limited :-( But I agree we should at least discuss the possible solutions. Until now I've done something like this

SELECT COUNT(DISTINCT a) AS dist_a, COUNT(DISTINCT b) AS dist_b, COUNT(DISTINCT a || ':' || b) AS dist_ab FROM my_table;

but that's not very efficient.

My plan for the near future (a few weeks) is to build a simple 'module' with the ability to estimate the number of rows for a given condition. This could actually be quite useful as a stand-alone contrib module, as users often ask how to get a row count fast (usually for paging). That may be quite slow when the query returns too many rows, even when there is an index. It may be even much slower than the actual query (as it usually contains a small LIMIT). An estimate is often sufficient, but 'pg_class.reltuples' does not really work with conditions. So this might be an improvement ...
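The COUNT(DISTINCT ...) query above collects three numbers. A small Python sketch of the same computation (names are mine, for illustration), using a set of pairs instead of the `a || ':' || b` string trick - which can miscount when the separator character appears inside the data:

```python
def distinct_counts(rows):
    """Compute dist(A), dist(B) and dist(A,B) for (a, b) tuples in a
    single pass - the same numbers the SQL query above gathers."""
    dist_a, dist_b, dist_ab = set(), set(), set()
    for a, b in rows:
        dist_a.add(a)
        dist_b.add(b)
        dist_ab.add((a, b))
    return len(dist_a), len(dist_b), len(dist_ab)

# ZIP-code-style dependency: the second column is determined by the first.
rows = [(zip_code, zip_code // 10) for zip_code in range(100)]
print(distinct_counts(rows))  # (100, 10, 100): dist(A,B) == dist(A)
```

When dist(A,B) equals dist(A), column B is functionally determined by A - exactly the ZIP-code/city dependency discussed in this thread.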
regards Tomas
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 20.12.2010 00:03, Tom Lane wrote: I wrote: That is not the number of interest. The number of interest is that it's 8 bytes added onto a struct that currently contains 11 of 'em; in other words a 9% increase in the size of the stats file, and consequently about a 9% increase in the cost of updating it.

Wups, sorry, I was looking at the wrong struct. It's actually an addition of 1 doubleword to a struct of 21 of 'em, or about 5%. That's still an awful lot in comparison to the prospective usefulness of the information. regards, tom lane

OK, so here goes the simplified patch - it tracks one reset timestamp for the background writer and for each database.

regards Tomas

PS: I've noticed Magnus posted a patch to track recovery conflicts, adding a new view pg_stat_database_conflicts. I have not checked it yet but it should not influence this patch.
Re: [HACKERS] proposal : cross-column stats
On 21.12.2010 16:54, Florian Pflug wrote: I think that maybe it'd be acceptable to scan a large portion of the table to estimate dist(A,B), *if* we only do so once in a while. But even with a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory and spilling them into, say, an on-disk hash table adds even more overhead to the already expensive full scan. Maybe using a bloom filter instead of a hash table could avoid the spilling to disk, in exchange for a slightly less precise result...

I've read some basics about Bloom filters, and I'm not sure we can use one in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance

To build a Bloom filter with a limited error rate, you need to size it properly (number of hash functions and size of the bit array). But that's exactly the information we're looking for. I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1% error, i.e. about 10MB of memory for each million of elements. That's not much, but this needs to be done for each column separately and for the combination - for N columns you need (at least) N+1 filters.

Sure - increasing the error rate and using a different estimate to build the Bloom filter would significantly decrease the memory requirements.

2) sampling the whole table

A naive approach would be to sample the whole table each time, but for large tables that might be a bit of a problem. So I'm thinking about what alternatives there are ...

One possibility is to build the Bloom filter incrementally and store it on disk, or something like that. I guess this could be done during VACUUM. Anyway there's an issue of how to set the filter size initially (estimate the number of elements), just as in the previous section.
Another possibility is to collect the data from just a small portion of the table and then use the result to estimate the number of distinct values for the whole table. But I'm not sure we can do this reliably - I see many traps in this. But maybe I'm just missing something important.

regards Tomas
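For reference, the "about 10 bits per element with 1% error" figure above comes from the standard Bloom-filter sizing formulas (these are textbook formulas, not anything from the thread's patches):

```python
from math import ceil, log

def bloom_parameters(n, p):
    """Classic Bloom-filter sizing: m = -n*ln(p)/ln(2)^2 bits for n
    elements at false-positive rate p, and k = (m/n)*ln(2) hash
    functions."""
    m = ceil(-n * log(p) / log(2) ** 2)  # total bits in the filter
    k = max(1, round(m / n * log(2)))    # number of hash functions
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(m / 1_000_000)  # ~9.59 bits per element (the "9.6" cited later)
print(k)              # 7 hash functions
```

At 1% error this works out to roughly 1.2 MB per million elements - which matches the bits-versus-bytes correction that comes up later in the thread.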
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 23.12.2010 20:09, Robert Haas wrote: 2010/12/23 Tomas Vondra t...@fuzzy.cz: On 20.12.2010 00:03, Tom Lane wrote: I wrote: That is not the number of interest. The number of interest is that it's 8 bytes added onto a struct that currently contains 11 of 'em; in other words a 9% increase in the size of the stats file, and consequently about a 9% increase in the cost of updating it.

Wups, sorry, I was looking at the wrong struct. It's actually an addition of 1 doubleword to a struct of 21 of 'em, or about 5%. That's still an awful lot in comparison to the prospective usefulness of the information. regards, tom lane

OK, so here goes the simplified patch - it tracks one reset timestamp for the background writer and for each database.

I think you forgot the attachment.

Yes, I did. Thanks! Tomas

From da2154954a547c143cd0d130597411911f339c07 Mon Sep 17 00:00:00 2001
From: vampire vamp...@rimmer.reddwarf
Date: Thu, 23 Dec 2010 18:40:48 +0100
Subject: [PATCH] Track timestamp of the last stats reset (for each database and the background writer).

This is a simplified version of the previous patch, which kept a timestamp for each database, table, index and function and for the background writer. This version does not keep the timestamp for individual objects, but if stats for a table/index/function are reset, the timestamp for the whole database is updated.
--- doc/src/sgml/monitoring.sgml | 18 ++ src/backend/catalog/system_views.sql |6 -- src/backend/postmaster/pgstat.c | 13 + src/backend/utils/adt/pgstatfuncs.c | 26 ++ src/include/catalog/pg_proc.h|4 src/include/pgstat.h |5 + 6 files changed, 70 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5fd0213..99e091d 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -600,6 +600,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entryliteralfunctionpg_stat_get_db_stat_reset_time/function(typeoid/type)/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics for this database (as a result of executing + functionpg_stat_reset/function function) or any object within it (table or index). + /entry + /row + + row entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry @@ -1062,6 +1071,15 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re varnamebgwriter_lru_maxpages/varname parameter /entry /row + + row + entryliteralfunctionpg_stat_get_bgwriter_stat_reset_time()/function/literal/entry + entrytypetimestamptz/type/entry + entry +Time of the last reset of statistics for the background writer (as a result of executing +functionpg_stat_reset_shared('bgwriter')/function) + /entry + /row row entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 346eaaf..da92438 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -502,7 +502,8 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, -pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted 
+pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, + pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM pg_database D; CREATE VIEW pg_stat_user_functions AS @@ -538,7 +539,8 @@ CREATE VIEW pg_stat_bgwriter AS pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, -pg_stat_get_buf_alloc() AS buffers_alloc; +pg_stat_get_buf_alloc() AS buffers_alloc, + pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; CREATE VIEW pg_user_mappings AS SELECT diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 856daa7..66e0b69 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3130,6 +3130,8 @@ pgstat_get_db_entry(Oid databaseid, bool create) result-n_tuples_deleted = 0; result-last_autovac_time = 0; + result-stat_reset_timestamp = GetCurrentTimestamp(); + memset(hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatTabEntry); @@ -3408,6
Re: [HACKERS] proposal : cross-column stats
On 21.12.2010 19:03, Tomas Vondra wrote: My plan for the near future (a few weeks) is to build a simple 'module' with the ability to estimate the number of rows for a given condition. This could actually be quite useful as a stand-alone contrib module, as users often ask how to get a row count fast (usually for paging).

Hm, I've been thinking about where to place this new module - I thought pgfoundry might be the right place, but I've noticed it supports just CVS. Well, that's a rather antiquated SCM I think - I'm very happy with the comfort provided by Git or SVN. Is there some other place commonly used for pg-related projects? I've been thinking about github or something like that ...

And I've finally set up the wiki page about this effort - it's just a summary of this whole thread. http://wiki.postgresql.org/wiki/Cross_Columns_Stats

regards Tomas
Re: [HACKERS] proposal : cross-column stats
On 24.12.2010 13:15, t...@fuzzy.cz wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec 23, 2010, at 20:39, Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1% error, i.e. about 10MB of memory for each million of elements.

Drat. I had expected these numbers to come out quite a bit lower than that, at least for a higher error target. But even with a 10% false positive rate, it's still 4.5MB per 1e6 elements. Still too much to assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8 bits in a byte. 10 bits per element = 1.25MB per million elements.

We are aware of that, but we really needed to do some very rough estimates and it's much easier to do the calculations with 10. Actually according to wikipedia it's not 10 bits per element but 9.6, etc. But it really does not matter if there is 10MB or 20MB of data, it's still a lot of data ...

Oooops, now I see what the problem is. I thought you were pointing something out, but I've actually used 1B = 1b (which is obviously wrong). But Florian already noticed that and fixed the estimates.

Tomas
Re: [HACKERS] proposal : cross-column stats
On 24.12.2010 13:37, Florian Pflug wrote: On Dec 24, 2010, at 11:23, Nicolas Barbier wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec 23, 2010, at 20:39, Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1% error, i.e. about 10MB of memory for each million of elements.

Drat. I had expected these numbers to come out quite a bit lower than that, at least for a higher error target. But even with a 10% false positive rate, it's still 4.5MB per 1e6 elements. Still too much to assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8 bits in a byte. 10 bits per element = 1.25MB per million elements.

Uh, of course. So in the real universe, the numbers are ~1.2MB per 1e6 elements for a false positive rate of 1%, ~0.5MB per 1e6 elements for a false positive rate of 10%. Hm. So for a table with a billion distinct elements, we'd need half a gigabyte per column for the filter. A tuple with two int columns takes at least 24+2*4 = 32 bytes to store, I think, making such a table at least 32GB in size. The filter size would thus be 1/64 of the table size in the worst case.

Yes, but in reality you need three such filters - one for each column, one for the combination. So that is 1.5GB (with 10% error rate) or 3.6GB (with 1% error rate). But this is severely excessive compared to the real needs, as there are usually far fewer distinct values than tuples (these computations assume they are equal).

I was thinking about a simple heuristic to scale the filter properly, something like this:

1) sample a small portion of the table and count the distinct values

2) compute number of distinct values / number of sampled tuples

3) scale this to the whole table and scale the filter

Say 1,000 rows are sampled and contain 50 distinct values, while 20 of the table's distinct values are missing from the sample.
This gives 5% in step (2), and if the table has 1,000,000 tuples you'll get 50,000 in step (3). So the filter needs just 60kB, which is a huge improvement compared to the previous approach (1.2MB).

Obviously this will still lead to overestimates in most cases, and there are probably some other fail cases, but I think it's a reasonable solution. I don't think this can result in an underestimate (which is the case where you lose precision).

And in case we want to build this incrementally (from a VACUUM) we really need to use a somewhat larger filter, because rescaling the filter is not possible AFAIK (without rebuilding it from scratch).

regards Tomas
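The three steps above can be sketched as follows (a toy illustration using the thread's ~9.6-bits-per-element figure; the function name and numbers are mine):

```python
from math import ceil, log

def scaled_filter_size(sample, table_rows, fp_rate=0.01):
    """Heuristic from the text: estimate the distinct-value ratio
    from a sample, scale it to the whole table, and size the Bloom
    filter for that (over)estimate instead of for table_rows."""
    ratio = len(set(sample)) / len(sample)    # step (2)
    ndistinct_est = ceil(ratio * table_rows)  # step (3)
    bits = ceil(-ndistinct_est * log(fp_rate) / log(2) ** 2)
    return ndistinct_est, bits

# 1,000 sampled rows with 50 distinct values, a 1,000,000-row table:
sample = [i % 50 for i in range(1000)]
est, bits = scaled_filter_size(sample, 1_000_000)
print(est)           # 50000 estimated distinct values
print(bits // 8000)  # ~59 kB filter instead of ~1.2 MB
```

As the text notes, this overestimates badly when the real number of distinct values is small (here the sample actually contains only 50 distinct values), but an overestimate only wastes space - it never degrades the filter's error rate.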
Re: [HACKERS] proposal : cross-column stats
On 24.12.2010 04:41, Florian Pflug wrote: The filter size could be derived from the table's statistics target, or be otherwise user-definable. We could also auto-resize once it gets too full. But still, that all seems awfully complex :-(

Using the statistics target is a good idea I think. We could use it to determine the error rate of the filter. Something like

error rate = 10 - (statistics_target - 100) / 100

which gives 1% for statistics target 1000, 10% for statistics target 100; or maybe something like this (where the error rate grows faster for smaller statistics target values)

error rate = 1 + 91000 / statistics_target^2

which gives about 1% for statistics target 1000, 10% for statistics target 100, 36% for statistics target 50. But I guess a 10% error rate is the minimum we need, so it does not make much sense to use lower values.

Another possibility is to collect the data from just a small portion of a table and then use the result to estimate the number of distinct values for the whole table. But I'm not sure we can do this reliably; I see many traps in this.

This is how it works currently. The problem with this approach is that it gives you very few guarantees about how precise the result will be. Extrapolating works very well for things like MCVs and histograms, because there you're by definition interested mostly in values which occur often - and thus with high probability in the relatively few rows you sample. For the number of distinct values, however, this isn't true - if ndistinct is an order of magnitude smaller than the number of rows, relatively few rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

Another idea would be to obtain the ndistinct values from an index somehow. Postgres cannot currently scan an index in physical order, only in logical order, due to locking considerations.
But since we'd only be interested in an estimate, maybe a scan in physical block order would work for ndistinct estimates? Just a wild idea, mind you, I haven't checked at all whether that'd be even remotely feasible.

I was thinking about that too, and I think we could do this using the pageinspect contrib module. Sure, there might be a problem with bloated indexes. And relying on this actually means a multi-column index on all the columns is required - individual indexes are not enough, as we need the number of distinct combinations too.

regards Tomas
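For illustration, a quadratic-decay mapping in the spirit of the statistics-target proposal earlier in this message can be evaluated like this. The constant is chosen so that target 1000 gives roughly 1% and target 100 roughly 10%; it is a sketch of the idea, not an agreed formula:

```python
def bloom_error_rate(statistics_target):
    """Hypothetical mapping from statistics target to Bloom-filter
    error rate (percent).  The quadratic term makes the error grow
    quickly as the target shrinks; the constant 91000 is chosen to
    hit ~1% at target 1000 and ~10% at target 100."""
    return 1.0 + 91000.0 / statistics_target ** 2

for target in (1000, 100, 50):
    print(target, round(bloom_error_rate(target), 1))
# 1000 -> 1.1, 100 -> 10.1, 50 -> 37.4
```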
Re: [HACKERS] proposal : cross-column stats
On 24.12.2010 13:37, Florian Pflug wrote: On Dec 24, 2010, at 11:23, Nicolas Barbier wrote: 2010/12/24 Florian Pflug f...@phlo.org: On Dec 23, 2010, at 20:39, Tomas Vondra wrote: I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1% error, i.e. about 10MB of memory for each million of elements.

Drat. I had expected these numbers to come out quite a bit lower than that, at least for a higher error target. But even with a 10% false positive rate, it's still 4.5MB per 1e6 elements. Still too much to assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8 bits in a byte. 10 bits per element = 1.25MB per million elements.

Uh, of course. So in the real universe, the numbers are ~1.2MB per 1e6 elements for a false positive rate of 1%, ~0.5MB per 1e6 elements for a false positive rate of 10%. Hm. So for a table with a billion distinct elements, we'd need half a gigabyte per column for the filter. A tuple with two int columns takes at least 24+2*4 = 32 bytes to store, I think, making such a table at least 32GB in size. The filter size would thus be 1/64 of the table size in the worst case.

I've read several papers about estimating the number of distinct values, and I have some bad as well as good news. I don't want this thread to grow infinitely, and it's a rather separate topic, so I'll start a new thread for ndistinct estimates.

regards Tomas
[HACKERS] estimating # of distinct values
Hi everyone, about two weeks ago I started a thread about cross-column stats. One of the proposed solutions is based on the number of distinct values, and the truth is the current distinct estimator has some problems. I've done some small research - I have read about 20 papers on this - and I'd like to present a short summary here, possible solutions etc. The simple truth is:

1) sampling-based estimators are a dead end

2) there are very interesting stream-based estimators

3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki (http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the most interesting papers etc., so just a very short explanation here.

1) sampling-based estimators are a dead end

The paper from Charikar and Chaudhuri states (and proves) that for each sampling-based estimator, there's a data set where the estimator produces arbitrary error (with non-negligible probability). So replacing one sampling-based estimator with another generally does not improve the situation - it fixes one dataset, breaks another one. The error is directly related to the size of the sample, so the truth is that to get a good distinct estimate you need to scan a significant portion of the table (almost all of it). So estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators

A very interesting idea comes from the field of data streams. These estimators are based on a single pass through the data, followed by incremental updates. A very nice thing is that these algorithms don't need much space, as they don't store the actual values. The idea behind this is similar to the Bloom filter, i.e. set bits of a bitmap using a hash of the value. But in this case it's not required to be able to answer questions like 'is this an element of the set X?' so it's actually even more space-efficient.
For an introduction, see the first paper, published in 1985 by Flajolet (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100). One of the recent articles (published in June 2010) actually presents an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1) update complexity. The e means precision, i.e. the estimate will be within [(1-e)F, (1+e)F] where F is the real value. The paper on the self-learning bitmap states that to track 10^9 distinct values with 1% relative error you need about 61 kilobits of space (which is absolutely awesome). The optimal algorithm needs a bit more space, I think. A very interesting solution is distinct sampling, which somehow extends Wegman's adaptive sampling approach. 3) the stream-based estimators have some disadvantages Not everything is perfect, though - the most serious disadvantage is that those algorithms (usually) don't handle removal of elements. Inserts are easy to handle, but deletes are hard (just as in the case of Bloom filters). So using this stream-based approach might lead to degradation in case of heavily updated tables :-( I see two possible solutions here: (a) use counters instead of bits, and track actual counts - but this is tricky, especially with 'probabilistic' algorithms, and needs much more space (but still, 32x 61kb is just 250kB) (b) count how many deletes/updates were performed, and if needed rebuild the whole bitmap But despite these disadvantages, there really is no other way to enhance the estimates. I don't think this should be the default behavior - just as in the case of cross-column stats, this should be optional when the current estimator does not work well. regards Tomas
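For a concrete feel of the 1985 Flajolet approach mentioned above, here is a minimal sketch of Probabilistic Counting with Stochastic Averaging (PCSA) in Python. The hash function (SHA-1) and the number of bitmaps are illustrative choices, not the paper's exact parameters:

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def _hash(value):
    # 64-bit hash from SHA-1; the paper assumes a uniform hash,
    # SHA-1 is just a convenient stand-in here
    digest = hashlib.sha1(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def rho(x):
    """Position of the least significant 1-bit (0-based)."""
    return (x & -x).bit_length() - 1 if x else 63

class PCSA:
    """Sketch of Probabilistic Counting with Stochastic Averaging
    (Flajolet & Martin, 1985) -- not the benchmarked implementation."""
    def __init__(self, nmaps=64):
        self.nmaps = nmaps
        self.bitmaps = [0] * nmaps

    def add(self, value):
        h = _hash(value)
        j = h % self.nmaps                      # pick a bitmap
        self.bitmaps[j] |= 1 << rho(h // self.nmaps)

    def estimate(self):
        # R_j = index of the first zero bit in bitmap j
        total_r = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total_r += r
        return int(self.nmaps / PHI * 2 ** (total_r / self.nmaps))

s = PCSA()
for i in range(100_000):
    s.add(i)
print(s.estimate())  # roughly 100,000 (about 10% expected error with 64 maps)
```

Note how little state this keeps: 64 small bitmaps, regardless of how many distinct values flow through.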
Re: [HACKERS] estimating # of distinct values
On 27.12.2010 22:46, Robert Haas wrote: 2010/12/27 Tomas Vondra t...@fuzzy.cz: But despite these disadvantages, there really is no other way to enhance the estimates. I don't think this should be the default behavior - just as in the case of cross-column stats, this should be optional when the current estimator does not work well. This is going to be a lot of work to implement, so before you do it, we should try to reach a consensus that (a) it's part of an overall strategy that the community generally supports and (b) we have consensus on the design for this part. Yes, it is going to be a lot of work. But I don't think we need a consensus of the whole community prior to building something that works. I plan to build a very simple prototype soon, so let's talk about this later. I've started these two threads mainly as 'brainstorming' - to present what I've learned from the various papers, propose a possible solution, highlight issues I see, and get ideas on how to solve them. With respect to (a), I have to admit I've found the discussion on cross-column stats to be quite difficult to follow. I'd like to see a very simple description of exactly what information we're going to store, under what circumstances we'll store it, and how we'll use it to compute selectivity estimates. Yes, I know it was difficult to follow that discussion. That's why I created a 'summary page' on the wiki http://wiki.postgresql.org/wiki/Cross_Columns_Stats Generally we need to gather two types of stats: (a) multi-dimensional histogram - this can be done in about the same way we create single-dimensional histograms, so that's not a big problem (although more data might be necessary to get accurate histograms) (b) # of distinct values - this is the tricky part, as described in this thread With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems.
Well, it's not going to be easy, that's for sure. My big plan is something like this: (a) Build a simple 'contrib-like' module that allows manual collection of stats and computing of estimates (not integrated with the optimizer in any way). This should help us better understand what stats we need etc. (b) Minimal integration with the core - still manual collection of stats, the optimizer uses the stats if available. (c) Enhancements - automatic updates of the stats, etc. And all this should be implemented so that you don't have to pay unless you want to use the new stats. regards Tomas
Re: [HACKERS] estimating # of distinct values
On 28.12.2010 00:04, Kevin Grittner wrote: Tom Lane t...@sss.pgh.pa.us wrote: Well, first, those scans occur only once every few hundred million transactions, which is not likely a suitable timescale for maintaining statistics. I was assuming that the pass of the entire table was priming for the incremental updates described at the start of this thread. I'm not clear on how often the base needs to be updated for the incremental updates to keep the numbers close enough. Well, that really depends on the workload. If you never remove all occurrences of a given value (e.g. a ZIP code), you don't need to rescan the table at all. All you need is to build the stats once and then update them incrementally - and I believe this could be handled by autovacuum. And second, we keep on having discussions about rejiggering the whole tuple-freezing strategy. Even if piggybacking on those scans looked useful, it'd be unwise to assume it'll continue to work the same way it does now. Sure, it might need to trigger its own scan in the face of heavy deletes anyway, since the original post points out that the algorithm handles inserts better than deletes, but as long as we currently have some sequential pass of the data, it seemed sane to piggyback on it when possible. And maybe we should be considering things like this when we weigh the pros and cons of rejiggering. This issue of correlated values comes up pretty often. Again, there are two types of stats - one of them needs to scan the whole table (estimate of distinct values), the other one does not (multi-dimensional histograms). These two cases are independent - you don't necessarily need both. Better ndistinct estimates would actually solve some of the current issues, it's not just a matter of cross-column stats. regards Tomas
Re: [HACKERS] estimating # of distinct values
On 30.12.2010 15:43, Florian Pflug wrote: On Dec27, 2010, at 23:49 , Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems. I had one random thought on that -- it seemed like a large concern was that there would need to be at least an occasional scan of the entire table to rebuild the distinct value information. I believe we could actually avoid that. First, the paper An Optimal Algorithm for the Distinct Elements Problem actually contains an algorithm which *does* handle deletes - it's called the L_0 estimate there. Hmmm, that's interesting. I know there's a part about L_0 estimation, but that's about estimating the Hamming norm of a vector - so I've ignored it as I thought we can't use it to estimate the number of distinct values. But if it really handles deletions and if we can use it, then it's really interesting. Second, as Tomas pointed out, the stream-based estimator is essentially a simplified version of a bloom filter. It starts out with a field of N zero bits, and sets K of them to 1 for each value v in the stream. Which bits are set to 1 depends on some hash function(s) H_i(v). It's then easy to compute how many 1-bits you'd expect to find in the bit field after seeing D distinct values, and by reversing that you can estimate D from the number of 1-bits in the bit field. No, I didn't say the stream-based estimators are simplified versions of a Bloom filter. I said the approach is very similar - all the algorithms use bitmaps and hash functions, but the algorithms (Bloom filter vs. probabilistic counting and adaptive sampling) are very different. The Bloom filter is much more straightforward. The other algorithms are much more sophisticated, which allows them to use less space.
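The bit-field scheme Florian describes (set bits per value, count the 1-bits, invert the expectation to recover D) is, for K = 1, exactly linear counting. A hedged sketch, with an illustrative bitmap size and hash choice:

```python
import hashlib
import math

def _bit(value, nbits):
    digest = hashlib.sha1(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % nbits

def linear_count(stream, nbits=1 << 16):
    """Estimate the number of distinct values by setting one bit per
    value and inverting the expected fill ratio (linear counting --
    the K = 1 special case of the scheme described above)."""
    bitmap = bytearray(nbits // 8)
    for v in stream:
        h = _bit(v, nbits)
        bitmap[h >> 3] |= 1 << (h & 7)
    ones = sum(bin(b).count("1") for b in bitmap)
    zeros = nbits - ones
    # E[zeros] = nbits * (1 - 1/nbits)^D  =>  solve for D
    return int(math.log(zeros / nbits) / math.log(1 - 1 / nbits))

print(linear_count(range(10_000)))  # close to 10,000
```

The inversion only works while the bitmap is not saturated, which is why these estimators size the bitmap to the largest cardinality they need to track.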
To avoid having to rescan large tables, instead of storing one such bit field, we'd store one per B pages of data. We'd then only need to scan a range of B pages around every updated or deleted tuple, and could afterwards compute a new global estimate of D by combining the individual bit fields with bitwise OR. I don't think this could help. 1) This works just with the Bloom filters, not with the other algorithms (you can't combine the segments using bitmap OR). 2) With heavily modified tables the updates are usually 'spread' through the whole table, so you'll have to rebuild all the segments anyway. Since the need to regularly VACUUM tables hit by updates or deletes won't go away any time soon, we could piggy-back the bit field rebuilding onto VACUUM to avoid a second scan. Well, I guess it's a bit more complicated. First of all, there's a local VACUUM when doing HOT updates. Second, you need to handle inserts too (what if the table just grows?). But I'm not a VACUUM expert, so maybe I'm wrong and this is the right place to handle rebuilds of distinct stats. I was thinking about something else - we could 'attach' the rebuild to an actual seq scan if the amount of changes reaches some threshold (since the last rebuild). Only in case the amount of changes reaches a higher threshold would we rebuild the stats on our own. Something like IF (# of updates + deletes > 5%) THEN wait for seq scan IF (# of updates + deletes > 10%) THEN rebuild the stats I've found a nice ppt describing how Oracle does this: http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt and there's a PDF version http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf According to this, Oracle is using the probabilistic counting approach (see slide 26). And once they find out there were too many changes in the table, they rebuild the whole thing. I'm not saying we should do exactly the same, just that this might be a good direction.
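The two-threshold rule sketched in this mail could look like the following; the function name and the 5%/10% thresholds are illustrative, not from any patch:

```python
def rebuild_action(n_changed, n_total, soft=0.05, hard=0.10):
    """Decision rule sketched above: piggy-back a stats rebuild on
    the next seq scan once changes exceed a soft threshold, and force
    a standalone rebuild past a hard threshold."""
    ratio = n_changed / n_total
    if ratio > hard:
        return "rebuild now"
    if ratio > soft:
        return "rebuild on next seq scan"
    return "keep current stats"

print(rebuild_action(7, 100))   # rebuild on next seq scan
print(rebuild_action(15, 100))  # rebuild now
```

The point of the soft threshold is that a seq scan is going to read the whole table anyway, so the rebuild comes almost for free there.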
regards Tomas
Re: [HACKERS] estimating # of distinct values
On 7.1.2011 20:56, Jim Nasby wrote: On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote: Another thing I'm not sure about is where to store those intermediate stats (used to get the current estimate, updated incrementally). I was thinking about pg_stats but I'm not sure it's the right place - depending on the algorithm, this may be a few kilobytes up to several megabytes. And it's not needed except when updating it. Any ideas? If you're using it essentially as a queue, maybe a resource fork makes sense? A resource fork? Not sure what you mean, could you describe it in more detail? regards Tomas
Re: [HACKERS] estimating # of distinct values
Hi, a short update regarding the ndistinct stream estimators - I've implemented the estimators described in the papers I've mentioned in my previous posts. If you want to try it, the sources are available on github at http://tvondra.github.com/pg_estimator (ignore the page, I have to update it, just download the sources). You can build the estimators as standalone benchmarks (build.sh) or as aggregate functions (build-agg.sh) - I guess the latter is much easier to play with. The installation is described on my blog, along with two simple benchmarks http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/ Feel free to play with it with different data, and if you notice something strange just let me know. Several comments regarding the implemented estimators: 1) PCSA, Probabilistic Counting (both described in the 1985 paper) - quite good performance, especially with respect to the memory requirements (95 and 495 B) 2) Adaptive Sampling (1990), Self-Learning Bitmap (2009) - very different algorithms, almost the same performance (especially with a large number of distinct values) - my personal favourites, as the memory requirements are still very reasonable (compared to the Bloom Counter) 3) Rough Estimator (2010) - this algorithm is described as an optimal algorithm (not exactly - it's a building block of the optimal algorithm), but the results are not as good as with (1) and (2) - one of the problems is it needs k-wise independent hash functions, and the MD5 hashing scheme used with the other algorithms probably does not meet this condition :-( - I've tried to implement such hash functions on my own (using a prime field and polynomials in modular arithmetic), but the results were quite bad - if you know how to get such hash functions, let me know.
- this algorithm is much more complex than the other algorithms, so if someone could do a review, that would be nice (maybe there is a bug resulting in the big estimate error) 4) Bloom Counter - basically a Bloom filter + a counter (incremented when a new element is added to the filter) - due to the design (no false negatives, possible false positives) it's significantly biased (gives underestimates), but I think that can be easily fixed with a coefficient (depending on the number of items / false positive error rate) - needs significantly more memory to get similar performance to the other algorithms (a Bloom counter that can track 1 million distinct values needs about 1.2MB of memory, while an S-Bitmap that tracks 10 million items needs just 65kB) - so unless we can use the special feature of the Bloom filter (i.e. the ability to detect items that are in the data), this is a waste of memory I know Google uses Bloom filters in BigTable to limit I/O, i.e. before doing the I/O they ask the Bloom filter if there are such data - if the answer is no, they don't have to do the I/O (false negatives not allowed), if the answer is yes they do the I/O. Not sure if we can / want to do something like this in PostgreSQL, as it significantly depends on the workload (and in most cases we know the data are there, so we have to do the I/O anyway). But I guess it might be a good idea for a contrib module (maintaining the Bloom filter on your own and making the decision in the application). The tricky part here is to keep the Bloom filter up to date. I'll try to fix the optimal estimator (not sure how to do that right now), and I'll investigate the 'relation fork' proposed as a way to store the estimator. regards Tomas
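A minimal sketch of the Bloom counter idea described above (a filter plus a counter, incremented only for apparently-new elements). Sizes and the hash construction are illustrative; the underestimation bias mentioned in the mail shows up because false positives suppress increments:

```python
import hashlib

class BloomCounter:
    """Bloom filter + counter: count a value only if at least one of
    its bits was previously unset. No false negatives, so repeats are
    never recounted; false positives cause undercounting."""
    def __init__(self, nbits=1 << 20, nhashes=7):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)
        self.count = 0

    def _positions(self, value):
        # derive nhashes positions from one SHA-256 digest
        digest = hashlib.sha256(repr(value).encode()).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(digest[4*i:4*i+4], "big") % self.nbits

    def add(self, value):
        new = False
        for h in self._positions(value):
            byte, bit = h >> 3, 1 << (h & 7)
            if not self.bits[byte] & bit:
                new = True
                self.bits[byte] |= bit
        if new:
            self.count += 1

bc = BloomCounter()
for i in range(50_000):
    bc.add(i % 10_000)   # 10,000 distinct values, seen 5 times each
print(bc.count)          # close to (never above) 10,000
```

With the filter this sparsely filled the bias is negligible; it grows as the filter saturates, which is where the correction coefficient mentioned above would come in.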
Re: [HACKERS] estimating # of distinct values
On 9.1.2011 13:58, Jim Nasby wrote: A resource fork? Not sure what you mean, could you describe it in more detail? Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README: OK, I think using relation forks seems like a good solution. I've done some basic research and I think these are the basic steps when adding a new fork: 1) define a new item in the ForkNum enum (relfilenode.h) - this should be something like DISTINCT_FORK I guess 2) modify the ForkNames (catalog.c) and relpathbackend so that the proper filename is assigned to the fork And then it will be accessed through smgr (smgrcreate etc.). Am I right, or is there something else I need to do? There are a few open questions though: 1) Forks are 'per relation' but the distinct estimators are 'per column' (or 'per group of columns'), so I'm not sure whether the file should contain all the estimators for the table, or if there should be one fork for each estimator. The former is a bit difficult to manage, the latter somehow breaks the current fork naming convention. 2) Where to keep the info that there is an estimator for a column? I guess we could put this into pg_attribute (it's one boolean). But what about the estimators for groups of columns? Because that's why I'm building this - to get distinct estimates for groups of columns. I guess we'll need a new system catalog to track this? (The same will be true for multi-dimensional histograms anyway.) 3) I'm still not sure how to manage the updates, i.e. how to track the new values. One possibility might be to do that synchronously - whenever a new item is inserted into the table, check if there's an estimator and update it. Updating the estimators is quite efficient (e.g. the bitmap manages to do 10,000,000 inserts in 9 seconds on my ancient workstation), although there might be issues with locking etc. The other possibility is to update the estimator asynchronously, i.e.
store the new values somewhere (or just the ctid of the row), and then process them periodically. I'm not sure how to intercept the new rows and where to store them. In another fork? Somewhere else? regards Tomas
Re: [HACKERS] estimating # of distinct values
On 18.1.2011 18:12, Robert Haas wrote: On Tue, Jan 18, 2011 at 4:53 AM, t...@fuzzy.cz wrote: So the most important question is how to intercept the new/updated rows, and where to store them. I think each backend should maintain its own private list of new records and forward them only in case of commit. Does that sound reasonable? At the risk of sounding demoralizing, nothing about this proposal sounds very promising to me, and that sounds like a particularly bad idea. What happens if the transaction updates a billion records? Or even a million records? Are you going to store all of those in backend-local memory until commit time? Or spool them in a potentially-gigantic disk file somewhere? That memory allocation - or file - could grow to be larger than the size of the entire database in the worst case. And COMMIT could take an awfully long time if it has to spool megabytes or gigabytes of data off to some other process. And what happens if there's a crash after the COMMIT but before all that data is sent? The estimates become permanently wrong? Yes, I was aware of this problem (the amount of memory consumed with lots of updates), and I kind of hoped someone would come up with a reasonable solution. I was thinking about it and I believe at least one of the algorithms (the adaptive sampling) might handle merging, i.e. the backends would not need to store the list of values, just a private copy of the estimator, and update it. And at the end (after commit), the estimator would be merged back into the actual one. So the algorithm would be something like this: 1. create a copy of all distinct estimators influenced by the INSERT 2. update the local copy 3. commit and merge the local copies back into the original estimator The amount of copied data is very small, depending on the number of distinct items it can handle and the precision (see adaptive.c for details).
Some examples: 2^48 items + 1% error = 86 kB, 2^48 items + 5% error = 3.4 kB, 2^64 items + 1% error = 115 kB, 2^64 items + 5% error = 4.6 kB - so it's much better than storing all the data. But I really need to check this is possible (right now I'm quite busy organizing a local conference, so maybe next week). Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when the commit succeeds and the write of the merged estimator fails, but I think it might be possible to write the estimator to the xlog or something like that, so it would be replayed during recovery etc. Or is that a stupid idea? And are we doing all of this just to get a more accurate estimate of ndistinct? For the amount of effort that it will probably take to get this working at all, you could probably implement index-only scans and have enough bandwidth left over to tackle global temporary tables. And unless I'm missing something, the chances that the performance consequences will be tolerable are pretty much zero. And it would only benefit the tiny fraction of users for whom bad n_distinct estimates cause bad plans, and then the even tinier percentage of those who can't conveniently fix it by using the manual override that we already have in place - which presumably means people who have gigantic tables that are regularly rewritten with massive changes in the data distribution that affect plan choice. Is that more than the empty set? First of all, I've never proposed to use this as the default behaviour. Actually I've strongly argued against that idea on several occasions. So the user would have to enable it explicitly for some columns; I really am not an idiot who would force everyone to pay for this. So the goal is to help the small fraction of users and keep the current solution for the rest of them. And yes, the main reason why I am working on this is the multi-column stats (in case of discrete columns).
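The three-step commit-time merge described in this mail is natural for bitmap-based estimators, because a backend-local bitmap can be merged into the shared one with a bitwise OR. A sketch using linear counting as a stand-in for the adaptive sampling estimator the mail actually proposes:

```python
import hashlib
import math

NBITS = 1 << 16

def _bit(value):
    h = int.from_bytes(hashlib.sha1(repr(value).encode()).digest()[:8], "big")
    return h % NBITS

def estimate(bitmap):
    """Linear-counting estimate from a bitmap's zero-bit count."""
    zeros = NBITS - sum(bin(b).count("1") for b in bitmap)
    return int(math.log(zeros / NBITS) / math.log(1 - 1 / NBITS))

shared = bytearray(NBITS // 8)   # the "original" estimator

# 1. backend takes a private copy and 2. updates it during the transaction
local = bytearray(shared)
for v in range(5_000):
    b = _bit(v)
    local[b >> 3] |= 1 << (b & 7)

# 3. at commit, merge the local copy back with a bitwise OR
for i in range(len(shared)):
    shared[i] |= local[i]

print(estimate(shared))  # close to 5,000
```

The merge is idempotent and order-independent, which is what makes the "throw the local copy away on abort, OR it in on commit" scheme workable for concurrent backends.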
You can fix the ndistinct estimate by manually overriding the estimate, but for the multi-column stats you need an estimate of ndistinct for the combination of columns, and that's a bit difficult to do manually. The other reason is that I've studied statistics (and I enjoy it, which seems weird to most people). And it's a good way to learn the internals. Maybe the idea here is that this wouldn't fix just ndistinct estimates but would also help with multi-column statistics. Even if that's the case, I think it's almost certainly a dead end from a performance standpoint. Some kind of manual ANALYZE process that can be invoked when needed to scan the entire table would be painful but maybe acceptable for certain people with a problem they can't fix any other way, but doing this automatically for everyone seems like a really bad idea. Yes, as I've mentioned above, the multi-column stats are the reason why I started working on this. And yes, it really should work like this: 1. user realizes there's something wrong as the plans are bad 2. after
Re: [HACKERS] estimating # of distinct values
On 19.1.2011 20:25, Robert Haas wrote: On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote: Yes, I was aware of this problem (the amount of memory consumed with lots of updates), and I kind of hoped someone would come up with a reasonable solution. As far as I can see, periodically sampling some or all of the table is really the only thing that has a chance of working OK. You could try to track changes only when they're not too big, but I think that's still going to have nasty performance consequences. Yes, sampling all the table is the only way to get reliable ndistinct estimates. I'm not sure what you mean by 'tracking changes' - as I've mentioned in the previous post, this might be solved by updating a local copy, which requires constant space (a few kB, see the previous post). Is that acceptable? I don't know; that's up to the user if he wants to pay this price. So the algorithm would be something like this: 1. create a copy of all distinct estimators influenced by the INSERT 2. update the local copy 3. commit and merge the local copies back into the original estimator Well, maybe. But DELETEs seem like a problem. And it's still adding foreground work to every query, which I just have a hard time believing is ever going to work out. Yes, deletes are difficult to handle. My idea was to compute something like ((tuples changed + tuples deleted) / tuples total), and indicate that a rebuild of the estimator is needed if this coefficient changes too much - e.g. log a message or something like that. Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when the commit succeeds and the write of the merged estimator fails, but I think it might be possible to write the estimator to the xlog or something like that, so it would be replayed during recovery etc. Or is that a stupid idea?
It's not stupid, in the sense that that is what you'd need to do if you want to avoid ever having to rescan the table. But it is another thing that I think is going to be way too expensive. Way too expensive? All you need to put into the logfile is a copy of the estimator, which is a few kBs. How is that 'way too expensive'? Sure, it might be expensive when the user does a lot of small changes in separate transactions, that's true, but I guess we could minimize the amount of data written to the xlog by doing a diff of the estimators or something like that. Yes, as I've mentioned above, the multi-column stats are the reason why I started working on this. And yes, it really should work like this: 1. user realizes there's something wrong as the plans are bad 2. after analysis, the user realizes the reason is an imprecise estimate due to dependence between columns 3. the user enables cross-column stats on the columns and checks if it actually solved the issue 4. if the cost outweighs the gains, the user drops the stats Does that sound reasonable? Yes. The only caveat I'm trying to insert is that it's hard for me to see how the proposed approach could ever be cheap enough to be a win. IMHO the question is not 'How expensive is that?' but rather 'How expensive is it compared to the gains?' If the user gets much better estimates and then a much better plan, this may be a completely acceptable price. If we need some kind of more expensive ANALYZE that scans the whole table, and it's optional, sure, why not? But that's a one-time hit. You run it, it sucks, it's over, and now your queries work. Odds are good you may never need to touch it again. Now, if we can convince ourselves that multi-column stats are likely to require constant updating, then maybe there would be more to recommend the stream-based approach, although even then it seems dreadfully complex and expensive to me. But I bet these things are pretty static.
No, the multi-column statistics do not require constant updating. There are cases where sampling is perfectly fine, although you may need a bit larger sample. Generally, if you can use a multi-dimensional histogram, you don't need to scan the whole table. But the multi-dimensional histograms are not applicable to some cases - especially the ZIP code fail case that was repeatedly discussed. So in the case of discrete data we need a different approach - and the solution I've proposed is based on using ndistinct estimates to get the estimates (actually it's based on one of the papers listed on the wiki). If the city and state tend to imply the zip code when there are 10K rows in the table, they probably still will when there are 100K rows in the table. If users with org_id = 1 have a higher karma score on average OK, how will you measure the implicativeness? We have discussed this in another thread and there are a lot of gotchas, although it seems like a really
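The 'diff of the estimators' idea for reducing xlog volume (raised a little earlier in this exchange) amounts to XORing bitmap snapshots: only the changed bits are logged, and replay recovers the new state as old XOR delta. A hedged sketch, not actual xlog code:

```python
def estimator_delta(old, new):
    """XOR-diff of two estimator bitmap snapshots; the delta is mostly
    zero bytes when few bits changed, so it compresses well."""
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(old, delta):
    """Replay: reconstruct the new snapshot from the old one + delta."""
    return bytes(a ^ b for a, b in zip(old, delta))

old = bytes([0b00000001, 0b10000000])
new = bytes([0b00000011, 0b10000000])
delta = estimator_delta(old, new)
print(apply_delta(old, delta) == new)  # True
```

Since stream estimators only ever set bits (no clears, absent a rebuild), the delta here is exactly the newly set bits, which keeps the logged volume proportional to the actual change.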
Re: [HACKERS] estimating # of distinct values
On 19.1.2011 20:46, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: ... I guess I'm just saying I'd think really, really hard before abandoning the idea of periodic sampling. You have to get an awful lot of benefit out of those cross-column stats to make it worth paying a run-time cost for them. I've been trying to not discourage Tomas from blue-sky speculation, but I have to say I agree with Robert that the chance of any usable result from this approach is very very small. Feel free to do some experimentation to prove us wrong --- but don't put a lot of effort into it before that. regards, tom lane Discourage? Not really - I have not expected this to be a simple thing to implement. And yes, it might turn out to be a dead end eventually. OTOH the approach 'it seems really expensive, so let's not try to build it' is a certain dead end, so I'm not going to surrender due to such speculations (although those are valid concerns and I'll have to address them in the future). regards Tomas
Re: [HACKERS] estimating # of distinct values
On 19.1.2011 23:44, Nathan Boley wrote: 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) Not true. You're missing the goal of this effort - to get a ndistinct estimate for a combination of multiple columns. Which is usually pathological in the case of dependent columns. Which is exactly the case when the user will explicitly enable this feature to get better estimates (and thus better plans). 2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executor can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table, we still usually need to scan the entire table. Again, don't think about a single column (although even in that case there are known fail cases). Think about multiple columns and the number of distinct combinations. With the attribute value independence assumption, this is usually significantly underestimated. And that's a problem as it often leads to an index scan instead of a sequential scan (or something like that). Again, in my *very* humble opinion, if the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. What is a marginal gain? The ultimate goal is to build cross-column stats, which in the case of dependent columns usually results in poor plans. How is fixing that a marginal gain? Yes, it may turn out it's not worth it, but it's a bit early to say that without an implementation and some hard data. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( i.e., finding that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different.
But: ( and the literature seems to support me ) Which literature? I've read an awful lot of papers on this topic lately, and I don't remember anything like that. If there's an interesting paper, I'd like to read it. Yes, all the papers state that estimating ndistinct is a particularly tricky task, but that's somehow expected? I've even checked how other databases do this - e.g. Oracle does it very similarly to what I proposed (the user has to enable the feature, it has to be rebuilt from time to time, etc.). I'm not saying we should do everything like Oracle, but they are not clueless idiots. pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far, why don't we just store the full distribution and be done with it? No, I'm not trying to do this just to improve the ndistinct estimate. The goal is to improve the cardinality estimate in the case of dependent columns, and one of the papers depends on good ndistinct estimates. And actually it does not depend on ndistinct for the individual columns only, it depends on ndistinct estimates for the combination of columns. So improving the ndistinct estimates for columns is just a necessary first step (and only if it works reasonably well can we do the next step). regards Tomas
Re: [HACKERS] estimating # of distinct values
Dne 20.1.2011 03:06, Nathan Boley napsal(a): And actually it does not depend on ndistinct for the columns only, it depends on ndistinct estimates for the combination of columns. So improving the ndistinct estimates for columns is just a necessary first step (and only if it works reasonably well, we can do the next step). I think that any approach which depends on precise estimates of ndistinct is not practical. I'm not aware of any other approach to the 'discrete fail case' (where the multi-dimensional histograms are not applicable). If someone finds a better solution, I'll be the first one to throw away this stuff. I am very happy that you've spent so much time on this, and I'm sorry if my previous email came off as combative. My point was only that simple heuristics have served us well in the past and, before we go to the effort of new, complicated schemes, we should see how well similar heuristics work in the multiple column case. I am worried that if the initial plan is too complicated then nothing will happen and, even if something does happen, it will be tough to get it committed ( check the archives for cross column stat threads - there are a lot ). If I've learned one thing over the years in IT, it's not to take critique personally. All the problems mentioned in this thread are valid concerns, pointing out weak points of the approach. And I'm quite happy to receive this feedback - that's why I started it. On the other hand - Jara Cimrman (a famous Czech fictional character, depicted as the best scientist/poet/teacher/traveller/... - see [1]) once said that you can't be really sure you don't get gold by blowing cigarette smoke into a basin drain, until you actually try it. So I'm blowing cigarette smoke into the drain ... It may very well happen that this will be a dead end, but I'll do my best to fix all the issues or to prove that the pros outweigh the cons.
And even if it is eventually rejected, I hope to get -1 from TL to be eligible for that t-shirt ... [1] http://en.wikipedia.org/wiki/Jara_Cimrman regards Tomas
Re: [HACKERS] estimating # of distinct values
Dne 20.1.2011 03:36, Robert Haas napsal(a): On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote: Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when commit succeeds and the write of the merged estimator fails, but I think it might be possible to write the estimator to xlog or something like that. So it would be replayed during recovery etc. Or is it a stupid idea? It's not stupid, in the sense that that is what you'd need to do if you want to avoid ever having to rescan the table. But it is another thing that I think is going to be way too expensive. Way too expensive? All you need to put into the logfile is a copy of the estimator, which is a few kBs. How is that 'way too expensive'? At this point, this is all a matter of religion, right? Neither of us has a working implementation we've benchmarked. But yes, I believe you're going to find that implementing some kind of streaming estimator is going to impose a... pulls number out of rear end 6% performance penalty, even after you've optimized the living daylights out of it. Now you might say... big deal, it improves my problem queries by 100x. OK, but if you could get the same benefit by doing an occasional full table scan during off hours, you could have the same performance with a *0%* performance penalty. Even better, the code changes would be confined to ANALYZE rather than spread out all over the system, which has positive implications for robustness and likelihood of commit. Good point. What I was trying to do was to continuously update the estimator with new data - that was the whole idea behind the collecting of new values (which might lead to problems with memory etc. as you've pointed out) and updating a local copy of the estimator (which is a good idea I think). 
But this might be another option - let the user decide if he wants to continuously update the estimates (and pay the price) or do that off hours (and pay almost nothing). That sounds like a very good solution to me. I'm not trying to argue you out of working on this. It's obviously your time to spend, and if it works better than I think it will, great! I'm merely offering you an opinion on what will probably happen if you go this route - namely, it'll carry an unpalatable run-time penalty. That opinion may be worth no more than what you paid for it, but there you have it. Yes, and I appreciate all feedback. But I still believe this can be done so that users that don't need the feature don't pay for it. regards Tomas
Re: [HACKERS] estimating # of distinct values
Dne 20.1.2011 09:10, Heikki Linnakangas napsal(a): It seems that the suggested multi-column selectivity estimator would be more sensitive to ndistinct of the individual columns. Is that correct? How is it biased? If we routinely under-estimate ndistinct of individual columns, for example, does the bias accumulate or cancel itself in the multi-column estimate? I'd like to see some testing of the suggested selectivity estimator with the ndistinct estimates we have. Who knows, maybe it works fine in practice. The estimator for two columns and a query 'A=a AND B=b' is about 0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b)) so it's quite simple. It's not that sensitive to errors in the ndistinct estimates for individual columns, but the problem is in the multi-column ndistinct estimates. It's very likely that with dependent columns (e.g. with the ZIP codes / cities) the distribution is so pathological that the sampling-based estimate will be very off. I guess this was a way too short analysis, but if you can provide more details of the expected tests etc. I'll be happy to provide that. regards Tomas
Re: [HACKERS] estimating # of distinct values
Dne 20.1.2011 11:05, Csaba Nagy napsal(a): Hi Tomas, On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: No, the multi-column statistics do not require constant updating. There are cases where a sampling is perfectly fine, although you may need a bit larger sample. Generally if you can use a multi-dimensional histogram, you don't need to scan the whole table. In the cases where sampling is enough, you can do that to the updates too: do a sampling on the changes, in that you only process every Nth change to make it to the estimator. If you can also dynamically tune the N to grow it as the statistics stabilize, and lower it if you detect high variance, even better. If the analyze process could be decoupled from the backends, and maybe just get the data passed over to be processed asynchronously, then that could be a feasible way to have always up to date statistics when the bottleneck is IO and CPU power is in excess. If that then leads to better plans, it could really be a win exceeding the overhead. OK, this sounds interesting. I'm not sure how to do that but it might be a good solution. What about transactions? If the client inserts data (and it will be sent asynchronously to update the estimator) and then rolls back, is the estimator 'rolled back' or what happens? This was exactly the reason why I initially wanted to collect all the data at the backend (and send them to the estimator at commit time). Which was then replaced by the idea to keep a local estimator copy and merge it back to the original estimator at commit time. If this analyze process (or more of them) could also just get the data from the modified buffers in a cyclic way, so that backends need nothing extra to do, then I don't see any performance disadvantage other than possible extra locking contention on the buffers and non-determinism of the actual time when a change makes it to the statistics. 
Then you just need to get more CPU power and higher memory bandwidth to pay for the accurate statistics. Well, the possible locking contention sounds like quite a significant problem to me :-( The lag between an update and a change to the stats is not that big a deal I guess - we have the same behaviour with the rest of the stats (updated by autovacuum every once in a while). Tomas
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
Dne 30.1.2011 23:22, Robert Haas napsal(a): On Thu, Dec 23, 2010 at 2:41 PM, Tomas Vondra t...@fuzzy.cz wrote: OK, so here goes the simplified patch - it tracks one reset timestamp for a background writer and for each database. I think you forgot the attachment. Yes, I did. Thanks! This patch no longer applies. Please update. I've updated the patch - rebased to the current HEAD. regards Tomas *** a/doc/src/sgml/monitoring.sgml --- b/doc/src/sgml/monitoring.sgml *** *** 663,668 postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re --- 663,677 /row row + entryliteralfunctionpg_stat_get_db_stat_reset_time/function(typeoid/type)/literal/entry + entrytypetimestamptz/type/entry + entry +Time of the last reset of statistics for this database (as a result of executing +functionpg_stat_reset/function function) or any object within it (table or index). + /entry + /row + + row entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry entrytypebigint/type/entry entry *** *** 1125,1130 postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re --- 1134,1148 varnamebgwriter_lru_maxpages/varname parameter /entry /row + + row + entryliteralfunctionpg_stat_get_bgwriter_stat_reset_time()/function/literal/entry + entrytypetimestamptz/type/entry + entry + Time of the last reset of statistics for the background writer (as a result of executing + functionpg_stat_reset_shared('bgwriter')/function) + /entry + /row row entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry *** a/src/backend/catalog/system_views.sql --- b/src/backend/catalog/system_views.sql *** *** 523,529 CREATE VIEW pg_stat_database AS pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, ! 
pg_stat_get_db_conflict_all(D.oid) AS conflicts FROM pg_database D; CREATE VIEW pg_stat_database_conflicts AS --- 523,530 pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(D.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted, ! pg_stat_get_db_conflict_all(D.oid) AS conflicts, ! pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM pg_database D; CREATE VIEW pg_stat_database_conflicts AS *** *** 570,576 CREATE VIEW pg_stat_bgwriter AS pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, ! pg_stat_get_buf_alloc() AS buffers_alloc; CREATE VIEW pg_user_mappings AS SELECT --- 571,578 pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, ! pg_stat_get_buf_alloc() AS buffers_alloc, ! pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; CREATE VIEW pg_user_mappings AS SELECT *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *** *** 3160,3165 pgstat_get_db_entry(Oid databaseid, bool create) --- 3160,3167 result-n_conflict_bufferpin = 0; result-n_conflict_startup_deadlock = 0; + result-stat_reset_timestamp = GetCurrentTimestamp(); + memset(hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatTabEntry); *** *** 3438,3443 pgstat_read_statsfile(Oid onlydb, bool permanent) --- 3440,3451 * load an existing statsfile. */ memset(globalStats, 0, sizeof(globalStats)); + + /* +* Set the current timestamp (will be kept only in case we can't load an +* existing statsfile. +*/ + globalStats.stat_reset_timestamp = GetCurrentTimestamp(); /* * Try to open the status file. 
If it doesn't exist, the backends simply *** *** 4052,4057 pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len) --- 4060,4067 dbentry-n_tuples_deleted = 0; dbentry-last_autovac_time = 0; + dbentry-stat_reset_timestamp = GetCurrentTimestamp(); + memset(hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatTabEntry); *** *** 4083,4088 pgstat_recv_resetsharedcounter
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
Dne 4.2.2011 03:37, Greg Smith napsal(a): Thinking I should start with why I think this patch is neat...most of the servers I deal with are up 24x7 minus small amounts of downtime, presuming everyone does their job right that is. In that environment, having a starting timestamp for when the last stats reset happened lets you quickly compute some figures in per-second terms that are pretty close to actual average activity on the server. Some examples of how I would use this: psql -c "SELECT CAST(buffers_backend * block_size AS numeric) / seconds_uptime / (1024*1024) AS backend_mb_per_sec FROM (SELECT *,extract(epoch from now()-stats_reset) AS seconds_uptime, (SELECT cast(current_setting('block_size') AS int8)) AS block_size FROM pg_stat_bgwriter) AS raw WHERE raw.seconds_uptime > 0" backend_mb_per_sec 4.27150807681618 psql -c "SELECT datname,CAST(xact_commit AS numeric) / seconds_uptime AS commits_per_sec FROM (SELECT *,extract(epoch from now()-stats_reset) AS seconds_uptime FROM pg_stat_database) AS raw WHERE raw.seconds_uptime > 0" datname | commits_per_sec ---+ template1 | 0.0338722604313051 postgres | 0.0363144438470267 gsmith| 0.0820573653236174 pgbench | 0.059147072347085 Now I reset, put some load on the system and check the same stats afterward; watch how close these match up: $ psql -d pgbench -c "select pg_stat_reset()" $ pgbench -j 4 -c 32 -T 30 pgbench transaction type: TPC-B (sort of) scaling factor: 100 query mode: simple number of clients: 32 number of threads: 4 duration: 30 s number of transactions actually processed: 6604 tps = 207.185627 (including connections establishing) tps = 207.315043 (excluding connections establishing) datname | commits_per_sec ---+ pgbench | 183.906308135572 Both these examples work as I expected, and some playing around with the patch didn't find any serious problems with the logic it implements.
One issue though, an oversight I think can be improved upon; watch what happens when I create a new database: $ createdb blank $ psql -c "select datname,stats_reset from pg_stat_database where datname='blank'" datname | stats_reset -+- blank | That's not really what I would hope for here. One major sort of situation I'd like this feature to work against is the one where someone asks for help but has never touched their database stats before, which is exactly what I'm simulating here. In this case that person would be out of luck, the opposite of the experience I'd like a newbie to have at this point. Are you sure about it? Because when I create a database, the field is NULL - that's true. But once I connect to the database, the stats are updated and the field is set (thanks to the logic in pgstat.c). In that case 'stat_reset=NULL' would mean 'no-one ever touched this db', which seems quite reasonable to me. $ createdb testdb1 $ createdb testdb2 $ psql -d testdb1 -c "select stats_reset from pg_stat_database where datname = 'testdb2'" stats_reset - (1 row) $ psql -d testdb2 -c \q $ psql -d testdb1 -c "select stats_reset from pg_stat_database where datname = 'testdb2'" stats_reset --- 2011-02-04 19:03:23.938929+01 (1 row) But maybe I've missed something and it does not work the way I think it does. regards Tomas
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
On 6.2.2011 08:17, Greg Smith wrote: Below is what changed since the last posted version, mainly as feedback for Tomas: -Explained more clearly that pg_stat_reset and pg_stat_reset_single_counters will both touch the database reset time, and that it's initialized upon first connection to the database. -Added the reset time to the list of fields in pg_stat_database and pg_stat_bgwriter. -Fixed some tab/whitespace issues. It looks like you had tab stops set at 8 characters during some points when you were editing non-code files. Also, there were a couple of spots where you used a tab while text in the area used spaces. You can normally see both types of errors if you read a patch; they showed up as misaligned things in the context diff. -Removed some extra blank lines that didn't fit the style of the surrounding code. Basically, all the formatting bits I'm nitpicking about I found just by reading the patch itself; they all stuck right out. I'd recommend a pass of that before submitting things if you want to try and avoid those in the future. Thanks for fixing the issues and for the review in general. I have to tune the IDE a bit to produce the right format and be a bit more careful when creating the patches. regards Tomas
Re: [HACKERS] autovacuum stress-testing our system
Hi! On 26.9.2012 19:18, Jeff Janes wrote: On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012: On 26-09-2012 09:43, Tomas Vondra wrote: 5) splitting the single stat file into multiple pieces - e.g. per database, written separately, so that the autovacuum workers don't need to read all the data even for databases that don't need to be vacuumed. This might be combined with (4). IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those database that were modified. Also, an in-memory map could store that information to speed up the checks. +1 That would help for the case of hundreds of databases, but how much does it help for lots of tables in a single database? It doesn't help that case, but that case doesn't need much help. If you have N statistics-kept objects in total spread over M databases, of which T objects need vacuuming per naptime, the stats file traffic is proportional to N*(M+T). If T is low, then there is generally is no problem if M is also low. Or at least, the problem is much smaller than when M is high for a fixed value of N. I've done some initial hacking on splitting the stat file into multiple smaller pieces over the weekend, and it seems promising (at least with respect to the issues we're having). See the patch attached, but be aware that this is a very early WIP (or rather a proof of concept), so it has many rough edges (read sloppy coding). I haven't even added it to the commitfest yet ... The two main changes are these: (1) The stats file is split into a common db file, containing all the DB Entries, and per-database files with tables/functions. The common file is still called pgstat.stat, the per-db files have the database OID appended, so for example pgstat.stat.12345 etc. 
This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile functions, introducing two new functions: pgstat_read_db_statsfile pgstat_write_db_statsfile that do the trick of reading/writing the stat file for one database. (2) The pgstat_read_statsfile has an additional parameter onlydbs that says that you don't need table/func stats - just the list of db entries. This is used for the autovacuum launcher, which does not need to read the table/func stats (if I'm reading the code in autovacuum.c correctly - it seems to be working as expected). So what are the benefits? (a) When the launcher asks for info about databases, something like this is called in the end: pgstat_read_db_statsfile(InvalidOid, false, true) which means all databases (InvalidOid) and only db info (true). So it reads only the one common file with db entries, not the table/func stats. (b) When a worker asks for stats for a given DB, something like this is called in the end: pgstat_read_db_statsfile(MyDatabaseId, false, false) which reads the common stats file (with db entries) and only the one file for that database. With the current implementation (a single pgstat.stat file), all the data had to be read (and skipped silently) in both cases. That's a lot of CPU time, and we're seeing ~60% of CPU spent on just this (writing/reading a huge statsfile). So with a lot of databases/objects, this pgstat.stat split saves us a lot of CPU ... (c) This should lower the space requirements too - with a single file, you actually need at least 2x the disk space (or RAM, if you're using tmpfs as we are), because you need to keep two versions of the file at the same time (pgstat.stat and pgstat.tmp). Thanks to this split you only need additional space for a copy of the largest piece (with some reasonable safety reserve).
Well, it's a very early patch, so there are rough edges too (a) It does not solve the many-schema scenario at all - that'll need a completely new approach I guess :-( (b) It does not solve the writing part at all - the current code uses a single timestamp (last_statwrite) to decide if a new file needs to be written. That clearly is not enough for multiple files - there should be one timestamp for each database/file. I'm thinking about how to solve this and how to integrate it with pgstat_send_inquiry etc. One way might be adding the timestamp(s) into PgStat_StatDBEntry and the other one is using an array of inquiries for each database. And yet another one I'm thinking about is using a fixed-length array of timestamps (e.g. 256), indexed by mod(dboid,256). That would mean stats for all databases with the same mod(oid,256) would be written at the same time. Seems like over-engineering, though. (c) I'm a bit worried about the number of files - right now there's one
Re: [HACKERS] autovacuum stress-testing our system
On 26.9.2012 18:29, Tom Lane wrote: What seems to me like it could help more is fixing things so that the autovac launcher needn't even launch a child process for databases that haven't had any updates lately. I'm not sure how to do that, but it probably involves getting the stats collector to produce some kind of summary file. Couldn't we use the PgStat_StatDBEntry for this? By splitting the pgstat.stat file into multiple pieces (see my other post in this thread) there's a file with StatDBEntry items only, so maybe it could be used as the summary file ... I've been thinking about this: (a) add a needs_autovacuuming flag to PgStat_(TableEntry|StatDBEntry) (b) when table stats are updated, run a quick check to decide whether the table needs to be processed by autovacuum (vacuumed or analyzed), and if yes then set needs_autovacuuming=true both for the table and the database The worker may read the DB entries from the file and act only on those that need to be processed (those with needs_autovacuuming=true). Maybe the DB-level field might be a counter of tables that need to be processed, and the autovacuum daemon might act on those first? Although the simpler the better I guess. Or did you mean something else? regards Tomas
Re: [HACKERS] pg_trgm partial-match
On 15.11.2012 20:39, Fujii Masao wrote: Hi, I'd like to propose to extend pg_trgm so that it can compare a partial-match query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial' GIN method for pg_trgm. Currently, when the query key is less than three characters, we cannot use a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a partial-match method. In this case, seq scan or index full scan would be executed, and its response time would be very slow. I'd like to alleviate this problem. Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, i.e., if query key contains multibyte characters. In this case, byte length of the trigram string might be larger than three, and its CRC is used as a trigram key instead of the trigram string itself. Because of using CRC, we cannot do a partial-match. Attached patch extends pg_trgm so that it compares a partial-match query key only when KEEPONLYALNUM is enabled. Attached patch is WIP yet. What I should do next is: * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc. * write the regression test Comments? Review? Objection? Hi, I've done a quick review of the current patch: (1) It applies cleanly on the current master and builds fine. (2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented differently (using tabs instead of spaces). (3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef KEEPONLYALNUM, although trgm_pmatch is not used at all. (4) The patch removes some commented-out variables, but there still remain various commented-out variables. Will this be cleaned too? 
(5) I've done basic functionality testing of the patch, and it really seems to work: CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,100) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%'; QUERY PLAN Bitmap Heap Scan on test (cost=1655.96..11757.63 rows=141414 width=33) Recheck Cond: (val ~~ '%aa%'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414 width=0) Index Cond: (val ~~ '%aa%'::text) (4 rows) Without the patch, this gives a seq scan (as expected). Do you expect to update the docs too? IMHO it's worth mentioning that pg_trgm can then handle even patterns shorter than three characters ... regards Tomas Vondra
Re: [HACKERS] too much pgbench init output
On 19.11.2012 11:59, Jeevan Chalke wrote: Hi, I went through the discussion for this patch and here is my review: The main aim of this patch is to reduce the number of log lines. It was also suggested to use an option to provide the interval, but a few of us did not quite agree on that. So the final discussion ended at keeping a 5 sec interval between each log line. However, I see there are two types of users here: 1. Those who like these log lines, so that they can troubleshoot some slowness and all 2. Those who do not like these log lines. Who likes these lines / needs them for something useful? So keeping these in mind, I would rather go for an option which will control this. People falling in category one can set this option very low, whereas users falling under the second category can keep it high. So what option(s) would you expect? Something that tunes the interval length or something else? A switch that'd choose between the old and new behavior might be a good idea, but I'd strongly vote against automagic heuristics. It makes the behavior very difficult to predict and I really don't want to force the users to wonder whether the long delay is due to general slowness of the machine or some clever logic that causes long delays between log messages. That's why I chose a very simple approach with a constant time interval. It does what I was aiming for (less logging) and it's easy to predict. Sure, we could choose a different interval (or make it an option). However, assuming we settled on a 5 sec delay, here are a few comments on the patch attached: Comments: = The patch applies cleanly with some whitespace errors. make and make install went smoothly too. make check was smooth. Rather, it should be smooth since there are NO changes in other parts of the code, just pgbench.c, and we do not have any test-case as well. However, here are a few comments on changes in pgbench.c 1.
Since the final discussion ended at a 5 second interval being good enough, the author used a global int variable for that. Given that it's just a constant, #define would be a better choice. Good point. Although if we add an option to supply different values, a variable is a better match. 2. +/* let's not call the timing for each row, but only each 100 rows */ Why only 100 rows ? Have you done any testing to come up with the number 100 ? To me it seems very low. It will be good to test with 1K or even 10K. On my machine (2.4 GHz Intel core 2 duo Macbook PRO, running Ubuntu in a VM with 4GB RAM, 1067 DDR3), in 5 sec, approx 1M rows were inserted. So checking every 100 rows looks like overkill. Well, the 100 is clearly a magical constant. The goal was to choose a value large enough to amortize the getlocaltime() cost, but small enough to print info even in cases when the performance sucks for some reason. I've seen issues where the speed suddenly dropped to a fraction of the expected value, e.g. 100 rows/second, and in those cases you'd have to wait for a very long time to actually get the next log message (with the suggested 10k step). So 100 seems like a good compromise to me ... 3. Please indent the following block as per the indentation just above that /* used to track elapsed time and estimate of the remaining time */ instr_time start, diff; double elapsed_sec, remaining_sec; int log_interval = 1; OK 4. +/* have ve reached the next interval? */ Do you mean have WE reached... OK 5. While applying the patch, I got a few white-space errors. But I think every patch goes through pgindent, which might take care of this. OK Thanks Thanks for the review. I'll wait for a bit more discussion about the choices before submitting another version of the patch. Tomas
Re: [HACKERS] WIP: index support for regexp search
On 19.11.2012 22:58, Alexander Korotkov wrote: Hi! A new version of the patch is attached. The changes are the following: 1) Right way to convert from pg_wchar to multibyte. 2) Optimization of producing the CNFA-like graph on trigrams (produce smaller, but equivalent, graphs in less time). 3) Comments and refactoring. Hi, thanks for the updated patch. I've done the initial review: 1) Patch applies fine to the master. 2) It's common to use upper-case names for macros, but trgm.h defines the macro iswordchr - I see it's moved from trgm_op.c, but maybe we could make it a bit more correct? 3) I see there are two #ifdef KEEPONLYALNUM blocks right next to each other in trgm.h - maybe it'd be a good idea to join them? 4) The two new method prototypes at the end of trgm.h use different indentation than the rest (spaces only instead of tabs). 5) There are no regression tests / updated docs (yet). 6) It does not compile - I do get a bunch of errors like this trgm_regexp.c:73:2: error: expected specifier-qualifier-list before ‘TrgmStateKey’ trgm_regexp.c: In function ‘addKeys’: trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’ trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’ ... It seems this is caused by the order of typedefs in trgm_regexp.c, namely TrgmState referencing TrgmStateKey before it's defined. Moving TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4 but I think it's not compiler-dependent.)
7) Once fixed, it seems to work CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,100) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*'; QUERY PLAN: Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33) Recheck Cond: (val ~ '.*qqq.*'::text) -> Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100 width=0) Index Cond: (val ~ '.*qqq.*'::text) (4 rows) but I do get a bunch of NOTICE messages with debugging info (no matter if the GIN index is used or not, so it's somewhere in the common regexp code). But I guess that's due to WIP status. regards Tomas
Re: [HACKERS] WIP: store additional info in GIN index
On 18.11.2012 22:54, Alexander Korotkov wrote: Hackers, Patch completely changes storage in posting lists and leaf pages of posting trees. It uses varbyte encoding for BlockNumber and OffsetNumber. BlockNumber are stored incremental in page. Additionally one bit of OffsetNumber is reserved for additional information NULL flag. To be able to find position in leaf data page quickly patch introduces small index in the end of page. Hi, I've tried to apply the patch with the current HEAD, but I'm getting segfaults whenever VACUUM runs (either called directly or from autovac workers). The patch applied cleanly against 9b3ac49e and needed a minor fix when applied on HEAD (because of an assert added to ginRedoCreatePTree), but that shouldn't be a problem. The backtrace always looks like this: Program received signal SIGSEGV, Segmentation fault. 0x004dea3b in processPendingPage (accum=0x7fff15ab8aa0, ka=0x7fff15ab8a70, page=0x7f88774a7ea0 , startoff=1) at ginfast.c:785 785 addInfo = index_getattr(itup, 2, accum->ginstate->tupdesc[curattnum - 1], &addInfoIsNull); (gdb) bt #0 0x004dea3b in processPendingPage (accum=0x7fff15ab8aa0, ka=0x7fff15ab8a70, page=0x7f88774a7ea0 , startoff=1) at ginfast.c:785 #1 0x004df3c6 in ginInsertCleanup (ginstate=0x7fff15ab97c0, vac_delay=1 '\001', stats=0xfb0050) at ginfast.c:909 #2 0x004dbe8c in ginbulkdelete (fcinfo=0x7fff15abbfb0) at ginvacuum.c:747 Reproducing the issue is quite simple: 1) create table messages (id int, txt text, ts tsvector); 2) insert into messages select i, substr(md5(i::text), 0, 4), to_tsvector('english', substr(md5(i::text), 0, 4)) from generate_series(1,10) s(i); 3) vacuum messages 4) ... segfault :-( regards Tomas
Re: [HACKERS] autovacuum stress-testing our system
On 21.11.2012 19:02, Robert Haas wrote: On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra t...@fuzzy.cz wrote: The two main changes are these: (1) The stats file is split into a common db file, containing all the DB Entries, and per-database files with tables/functions. The common file is still called pgstat.stat, the per-db files have the database OID appended, so for example pgstat.stat.12345 etc. This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile functions, introducing two new functions: pgstat_read_db_statsfile and pgstat_write_db_statsfile that do the trick of reading/writing the stat file for one database. (2) The pgstat_read_statsfile has an additional parameter onlydbs that says that you don't need table/func stats - just the list of db entries. This is used for autovacuum launcher, which does not need to read the table/func stats (if I'm reading the code in autovacuum.c correctly - it seems to be working as expected). I'm not an expert on the stats system, but this seems like a promising approach to me. (a) It does not solve the many-schema scenario at all - that'll need a completely new approach I guess :-( We don't need to solve every problem in the first patch. I've got no problem kicking this one down the road. (b) It does not solve the writing part at all - the current code uses a single timestamp (last_statwrite) to decide if a new file needs to be written. That clearly is not enough for multiple files - there should be one timestamp for each database/file. I'm thinking about how to solve this and how to integrate it with pgstat_send_inquiry etc. Presumably you need a last_statwrite for each file, in a hash table or something, and requests need to specify which file is needed. And yet another one I'm thinking about is using a fixed-length array of timestamps (e.g. 256), indexed by mod(dboid,256). That would mean stats for all databases with the same mod(oid,256) would be written at the same time. Seems like over-engineering though.
That seems like an unnecessary kludge. (c) I'm a bit worried about the number of files - right now there's one for each database and I'm thinking about splitting them by type (one for tables, one for functions) which might make it even faster for some apps with a lot of stored procedures etc. But is the large number of files actually a problem? After all, we're using one file per relation fork in the base directory, so this seems like a minor issue. I don't see why one file per database would be a problem. After all, we already have one directory per database inside base/. If the user has so many databases that dirent lookups in a directory of that size are a problem, they're already hosed, and this will probably still work out to a net win. Attached is a v2 of the patch, fixing some of the issues and unclear points from the initial version. The main improvement is that it implements writing only stats for the requested database (set when sending inquiry). There's a dynamic array of requests - for each DB only the last request is kept. I've done a number of changes - most importantly: - added a stats_timestamp field to PgStat_StatDBEntry, keeping the last write of the database (i.e. a per-database last_statwrite), which is used to decide whether the file is stale or not - handling of the 'permanent' flag correctly (used when starting or stopping the cluster) for per-db files - added a very simple header to the per-db files (basically just a format ID and a timestamp) - this is needed for checking of the timestamp of the last write from workers (although maybe we could just read the pgstat.stat, which is now rather small) - a 'force' parameter (true - write all databases, even if they weren't specifically requested) So with the exception of 'multi-schema' case (which was not the aim of this effort), it should solve all the issues of the initial version. There are two blocks of code dealing with clock glitches. I haven't fixed those yet, but that can wait I guess.
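The per-database file split and the per-database last_statwrite described above can be modeled like this. This is a toy Python sketch of the bookkeeping, not the collector's actual C code; `db_stats_filename` and `StatsWriter` are made-up names:

```python
import os
import time

def db_stats_filename(directory, dboid=None):
    """pgstat.stat holds the common DB-entry list; pgstat.stat.<oid>
    holds the per-database table/function stats (layout from the patch)."""
    base = os.path.join(directory, "pgstat.stat")
    return base if dboid is None else "%s.%d" % (base, dboid)

class StatsWriter:
    """Keep a per-database last_statwrite instead of one global timestamp,
    so each per-db file can be judged stale independently."""

    def __init__(self):
        self.last_statwrite = {}   # dboid -> timestamp of last write

    def needs_write(self, dboid, request_time):
        # stale if never written, or last written before the inquiry was sent
        return self.last_statwrite.get(dboid, 0.0) < request_time

    def mark_written(self, dboid, now=None):
        self.last_statwrite[dboid] = time.time() if now is None else now
```

A backend's inquiry carries its request time; the collector rewrites only the per-db files whose `last_statwrite` predates that request, instead of rewriting one monolithic file for every inquiry.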
I've also left there some logging I've used during development (printing inquiries and which file is written and when). The main unsolved problem I'm struggling with is what to do when a database is dropped? Right now, the statfile remains in pg_stat_tmp forever (or until the restart) - is there a good way to remove the file? I'm thinking about adding a message to be sent to the collector from the code that handles DROP TABLE. I've done some very simple performance testing - I've created 1000 databases with 1000 tables each, done ANALYZE on all of them. With only autovacuum running, I've seen this: Without the patch - %CPU %MEM TIME+ COMMAND 18 3.0 0:10.10 postgres: autovacuum launcher process 17 2.6 0:11.44 postgres: stats collector process The I/O
Re: [HACKERS] WIP: store additional info in GIN index
On 4.12.2012 20:12, Alexander Korotkov wrote: Hi! On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz mailto:t...@fuzzy.cz wrote: I've tried to apply the patch with the current HEAD, but I'm getting segfaults whenever VACUUM runs (either called directly or from autovac workers). The patch applied cleanly against 9b3ac49e and needed a minor fix when applied on HEAD (because of an assert added to ginRedoCreatePTree), but that shouldn't be a problem. Thanks for testing! Patch is rebased with HEAD. The bug you reported was fixed. Applies fine, but I get a segfault in dataPlaceToPage at gindatapage.c. The whole backtrace is here: http://pastebin.com/YEPuWeuV The messages written into PostgreSQL log are quite variable - usually it looks like this: 2012-12-04 22:31:08 CET 31839 LOG: database system was not properly shut down; automatic recovery in progress 2012-12-04 22:31:08 CET 31839 LOG: redo starts at 0/68A76E48 2012-12-04 22:31:08 CET 31839 LOG: unexpected pageaddr 0/1BE64000 in log segment 00010069, offset 15089664 2012-12-04 22:31:08 CET 31839 LOG: redo done at 0/69E63638 but I've seen this message too 2012-12-04 22:20:29 CET 31709 LOG: database system was not properly shut down; automatic recovery in progress 2012-12-04 22:20:29 CET 31709 LOG: redo starts at 0/AEAFAF8 2012-12-04 22:20:29 CET 31709 LOG: record with zero length at 0/C7D5698 2012-12-04 22:20:29 CET 31709 LOG: redo done at 0/C7D55E I wasn't able to prepare a simple testcase to reproduce this, so I've attached two files from my fun project where I noticed it. It's a simple DB + a bit of Python for indexing mbox archives inside Pg. 
- create.sql - a database structure with a bunch of GIN indexes on tsvector columns on messages table - load.py - script for parsing mbox archives / loading them into the messages table (warning: it's a bit messy) Usage: 1) create the DB structure $ createdb archives $ psql archives < create.sql 2) fetch some archives (I consistently get SIGSEGV after first three) $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-01.gz $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-02.gz $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-03.gz 3) gunzip and load them using the python script $ gunzip pgsql-hackers.*.gz $ ./load.py --db archives pgsql-hackers.* 4) et voila - a SIGSEGV :-( I suspect this might be related to the fact that the load.py script uses savepoints quite heavily to handle UNIQUE_VIOLATION (duplicate messages). Tomas #!/bin/env python import argparse import datetime import getpass import multiprocessing import os import psycopg2 import psycopg2.extras import psycopg2.errorcodes import quopri import random import re import sys import traceback import UserDict from multiprocessing import Process, JoinableQueue class Message(dict): def __init__(self, message): self.message = message self.body = self.body(message) self.headers = self.headers(message) self.parts = self.parts(message) def __getitem__(self, key): if self.headers.has_key(key.lower()): return self.headers[key.lower()] else: return None def __setitem__(self, key, value): self.headers.update({key.lower() : value}) def __delitem__(self, key): if self.headers.has_key(key.lower()): del self.headers[key.lower()] def keys(self): return self.headers.keys() def get_body(self): return self.body def get_raw(self): return self.message def get_parts(self): return self.parts def get_headers(self): return self.headers def get_content_type(self): if self.headers.has_key('content-type'): return self.headers['content-type'].split(';')[0] else:
return None def __repr__(self): return '%s %s' % (type(self).__name__, self.headers) def is_multipart(self): ctype = self.get_content_type() if ctype != None and re.match('multipart/.*', ctype): return True else: return False def part_boundary(self): if not self.is_multipart(): return None else: r = re.match('.*boundary="?([^"]*)"?', self.headers['content-type'], re.IGNORECASE) if r: return '--' + r.group(1) # FIXME this keeps only the last value - needs to keep a list def headers(self, message): lines = message.split("\n") key = '' value = '' headers = {}; for l in lines: if l == '': if key != '': headers.update({key.lower() : value}) break r = re.match('([a-zA-Z0-9-]*):\s*(.*)', l) if r: if key != '': headers.update({key.lower() : value}) key = r.group(1) value = r.group(2) else: value += ' ' + l.strip() r = re.match('^From .*@.*\s+([a-zA-Z]*\s+[a-zA-Z]*\s+[0-9]+ [0-9]+:[0-9]+:[0-9]+\s+[0-9]{4})$', lines[0]) if r: headers.update({'message-date' : r.group(1)}) r = re.match('^From bouncefilter\s+([a-zA-Z]*\s+[a-zA-Z]*\s+[0
Re: [HACKERS] WIP: store additional info in GIN index
On 5.12.2012 09:10, Alexander Korotkov wrote: On Wed, Dec 5, 2012 at 1:56 AM, Tomas Vondra t...@fuzzy.cz mailto:t...@fuzzy.cz wrote: Thanks for bug report. It is fixed in the attached patch. Hi, I gave it another try and this time it went fine - I didn't get any segfault when loading the data, which is good news. Then I've run a simple benchmarking script, and the results are not as good as I expected, actually I'm getting much worse performance than with the original GIN index. The following table contains the time of loading the data (not a big difference), and number of queries per minute for various number of words in the query. The queries look like this SELECT id FROM messages WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...') so it's really the simplest form of FTS query possible. without patch | with patch loading 750 sec| 770 sec 1 word 1500|1100 2 words 23000|9800 3 words 24000|9700 4 words 16000|7200 I'm not saying this is a perfect benchmark, but the differences (of querying) are pretty huge. Not sure where this difference comes from, but it seems to be quite consistent (I usually get +-10% results, which is negligible considering the huge difference). Is this an expected behaviour that will be fixed by another patch? The database contains ~680k messages from the mailing list archives, i.e. about 900 MB of data (in the table), and the GIN index on tsvector is about 900MB too. So the whole dataset nicely fits into memory (8GB RAM), and it seems to be completely CPU bound (no I/O activity at all). The configuration was exactly the same in both cases shared buffers = 1GB work mem = 64 MB maintenance work mem = 256 MB I can either upload the database somewhere, or provide the benchmarking script if needed. Tomas
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 6.12.2012 05:47, Shigeru Hanada wrote: On Mon, Nov 12, 2012 at 4:36 AM, Tomas Vondra t...@fuzzy.cz wrote: Hi, I've prepared a slightly updated patch, based on the previous review. See it attached. All changes in v3 patch seem good, however I found some places which require cosmetic changes. Please see attached v3.1 patch for those changes. OK, thanks! Oops, you're right. I omitted 1-3 and 1-4 in actual measurement, but IMO loading data is necessary to fill the shared buffer up, because # of buffers which are deleted during COMMIT is major factor of this patch. And, yes, I verified that all shared buffers are used after all loading have been finished. I don't think it's an important factor, as the original code does this: for (i = 0; i < NBuffers; i++) { volatile BufferDesc *bufHdr = BufferDescriptors[i]; ... } in the DropRelFileNodeAllBuffers. That loops through all shared buffers no matter what, so IMHO the performance in this case depends on the total size of the shared buffers. But maybe I'm missing something important. I worry the effect of continue in the loop to the performance. If most of shared buffers are not used at the moment of DROP, the load of DROP doesn't contain the overhead of LockBufHdr and subsequent few lines. Do you expect such situation in your use cases? I still fail to see the issue here - can you give an example of when this would be a problem? The code was like this before the patch, I only replaced the simple comparison to a binary search ... Or do you think that this check means if the buffer is empty? /* buffer does not belong to any of the relations */ if (rnode == NULL) continue; Because it does not - notice that the 'rnode' is a result of the binary search, not a direct check of the buffer. So the following few lines (lock and recheck) will be skipped both for unused buffers and buffers belonging to relations that are not being dropped.
Maybe we could do a simple 'is the buffer empty' check before the bsearch call, but I don't expect that to happen very often in real world (the cache tends to be used). I've done a simple benchmark on my laptop with 2GB shared buffers, it's attached in the drop-test.py (it's a bit messy, but it works). [snip] With those parameters, I got these numbers on the laptop: creating 10000 tables all tables created in 3.298694 seconds dropping 10000 tables one by one ... all tables dropped in 32.692478 seconds creating 10000 tables all tables created in 3.458178 seconds dropping 10000 tables in batches by 100... all tables dropped in 3.28268 seconds So it's 33 seconds vs. 3.3 seconds, i.e. 10x speedup. On AWS we usually get larger differences, as we use larger shared buffers and the memory is significantly slower there. Do you have performance numbers of same test on not-patched PG? This patch aims to improve performance of bulk DROP, so 4th number in the result above should be compared to 4th number of not-patched PG? I've re-run the tests with the current patch on my home workstation, and the results are these (again 10k tables, dropped either one-by-one or in batches of 100). 1) unpatched dropping one-by-one: 15.5 seconds dropping in batches of 100: 12.3 sec 2) patched (v3.1) dropping one-by-one: 32.8 seconds dropping in batches of 100: 3.0 sec The problem here is that when dropping the tables one-by-one, the bsearch overhead is tremendous and significantly increases the runtime. I've done a simple check (if dropping a single table, use the original simple comparison) and I got this: 3) patched (v3.2) dropping one-by-one: 16.0 seconds dropping in batches of 100: 3.3 sec i.e. the best of both. But it seems like an unnecessary complexity to me - if you need to drop a lot of tables you'll probably do that in a transaction anyway.
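For illustration, the batched DROP strategy discussed here — one pass over all shared buffers, with a binary search into the sorted list of dropped relfilenodes instead of a linear scan — looks roughly like this. This is a simplified Python model (real buffer descriptors, pinning and locking are omitted):

```python
from bisect import bisect_left

def drop_rel_buffers(buffers, rnodes):
    """Invalidate every buffer belonging to one of `rnodes`.

    `buffers` is a toy stand-in for the BufferDescriptors array: a list
    of (spcnode, dbnode, relnode) tags, or None for unused slots.
    One pass over all buffers; each used buffer costs O(log n) via
    binary search instead of O(n) over the dropped-relation list.
    Returns the number of buffers invalidated.
    """
    keys = sorted(rnodes)          # sort once, search many times
    dropped = 0
    for i, tag in enumerate(buffers):
        if tag is None:            # buffer not in use, nothing to check
            continue
        pos = bisect_left(keys, tag)
        if pos < len(keys) and keys[pos] == tag:
            buffers[i] = None      # "invalidate" the buffer
            dropped += 1
    return dropped
```

This mirrors the trade-off measured above: the full buffer scan happens once per batch rather than once per dropped relation, so batching 100 drops amortizes it 100-fold, while a single drop pays the extra `sorted()`/`bisect` overhead.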
Tomas diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 2446282..9e92230 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -312,6 +312,10 @@ smgrDoPendingDeletes(bool isCommit) PendingRelDelete *pending; PendingRelDelete *prev; PendingRelDelete *next; + SMgrRelation *srels = (SMgrRelation *) palloc(sizeof(SMgrRelation)); + int nrels = 0, + i = 0, + maxrels = 1; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) @@ -335,14 +339,33 @@ smgrDoPendingDeletes(bool isCommit) SMgrRelation srel; srel = smgropen(pending->relnode, pending->backend); - smgrdounlink(srel, false); - smgrclose(srel); + + /* extend the array if needed (double the size
Re: [HACKERS] Serious problem: media recovery fails after system or PostgreSQL crash
Hi, On 6.12.2012 23:45, MauMau wrote: From: Tom Lane t...@sss.pgh.pa.us Well, that's unfortunate, but it's not clear that automatic recovery is possible. The only way out of it would be if an undamaged copy of the segment was in pg_xlog/ ... but if I recall the logic correctly, we'd not even be trying to fetch from the archive if we had a local copy. No, PG will try to fetch the WAL file from pg_xlog when it cannot get it from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains the following description: http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL WAL segments that cannot be found in the archive will be sought in pg_xlog/; this allows use of recent un-archived segments. However, segments that are available from the archive will be used in preference to files in pg_xlog/. So why don't you use an archive command that does not create such incomplete files? I mean something like this: archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' Until the file is renamed, it's considered 'incomplete'. Tomas
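The same write-to-temp-then-rename trick can be sketched in Python (the function name is illustrative, not from any patch here). The key property is that rename() within a filesystem is atomic, so recovery can never pick up a half-copied segment:

```python
import os

def archive_wal(src_path, archive_dir):
    """Copy a WAL segment into the archive atomically.

    Write to <name>.tmp first, then rename() into place only once the
    copy is complete and durable -- the same idea as:
    archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'
    """
    name = os.path.basename(src_path)
    tmp = os.path.join(archive_dir, name + ".tmp")
    final = os.path.join(archive_dir, name)
    with open(src_path, "rb") as src, open(tmp, "wb") as dst:
        dst.write(src.read())
        dst.flush()
        os.fsync(dst.fileno())   # make sure the bytes hit disk first
    os.rename(tmp, final)        # atomic on POSIX filesystems
    return final
```

Restore commands that look only for the final name will simply not see a segment that is still being copied, which is exactly the "incomplete file" problem described above.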
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 8.12.2012 15:26, Andres Freund wrote: On 2012-12-06 23:38:59 +0100, Tomas Vondra wrote: I've re-run the tests with the current patch on my home workstation, and the results are these (again 10k tables, dropped either one-by-one or in batches of 100). 1) unpatched dropping one-by-one: 15.5 seconds dropping in batches of 100: 12.3 sec 2) patched (v3.1) dropping one-by-one: 32.8 seconds dropping in batches of 100: 3.0 sec The problem here is that when dropping the tables one-by-one, the bsearch overhead is tremendous and significantly increases the runtime. I've done a simple check (if dropping a single table, use the original simple comparison) and I got this: 3) patched (v3.2) dropping one-by-one: 16.0 seconds dropping in batches of 100: 3.3 sec i.e. the best of both. But it seems like an unnecessary complexity to me - if you need to drop a lot of tables you'll probably do that in a transaction anyway. Imo that's still a pretty bad performance difference. And your single-table optimization will probably fall short as soon as the table has indexes and/or a toast table... Why? I haven't checked the code but either those objects are dropped one-by-one (which seems unlikely) or they're added to the pending list and then the new code will kick in, making it actually faster. Yes, the original code might be faster even for 2 or 3 objects, but I can't think of a simple solution to fix this, given that it really depends on CPU/RAM speed and shared_buffers size. + +/* + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so + * that it's suitable for bsearch.
+ */ +static int +rnode_comparator(const void * p1, const void * p2) +{ +RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1; +RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2; + +if (n1.node.relNode < n2.node.relNode) +return -1; +else if (n1.node.relNode > n2.node.relNode) +return 1; + +if (n1.node.dbNode < n2.node.dbNode) +return -1; +else if (n1.node.dbNode > n2.node.dbNode) +return 1; + +if (n1.node.spcNode < n2.node.spcNode) +return -1; +else if (n1.node.spcNode > n2.node.spcNode) +return 1; +else +return 0; +} ISTM that this whole comparator could be replaced by memcmp(). That could quite possibly lower the overhead of the bsearch() in simple cases. We already rely on the fact that RelFileNode's have no padding atm (c.f. buf_internal.h), so a memcmp() seems fine to me. Good point, I'll try that. The original code used a macro that simply compared the fields and I copied that logic, but if we can remove the code ... Thanks for the review! Tomas
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 8.12.2012 15:49, Tomas Vondra wrote: On 8.12.2012 15:26, Andres Freund wrote: On 2012-12-06 23:38:59 +0100, Tomas Vondra wrote: I've re-run the tests with the current patch on my home workstation, and the results are these (again 10k tables, dropped either one-by-one or in batches of 100). 1) unpatched dropping one-by-one:15.5 seconds dropping in batches of 100: 12.3 sec 2) patched (v3.1) dropping one-by-one:32.8 seconds dropping in batches of 100: 3.0 sec The problem here is that when dropping the tables one-by-one, the bsearch overhead is tremendous and significantly increases the runtime. I've done a simple check (if dropping a single table, use the original simple comparison) and I got this: 3) patched (v3.2) dropping one-by-one:16.0 seconds dropping in batches of 100: 3.3 sec i.e. the best of both. But it seems like an unnecessary complexity to me - if you need to drop a lot of tables you'll probably do that in a transaction anyway. Imo that's still a pretty bad performance difference. And your single-table optimization will probably fall short as soon as the table has indexes and/or a toast table... Why? I haven't checked the code but either those objects are droppped one-by-one (which seems unlikely) or they're added to the pending list and then the new code will kick in, making it actually faster. Yes, the original code might be faster even for 2 or 3 objects, but I can't think of a simple solution to fix this, given that it really depends on CPU/RAM speed and shared_buffers size. I've done some test and yes - once there are other objects the optimization falls short. For example for tables with one index, it looks like this: 1) unpatched one by one: 28.9 s 100 batches: 23.9 s 2) patched one by one: 44.1 s 100 batches: 4.7 s So the patched code is by about 50% slower, but this difference quickly disappears with the number of indexes / toast tables etc. I see this as an argument AGAINST such special-case optimization. 
My reasoning is this: * This difference is significant only if you're dropping a table with low number of indexes / toast tables. In reality this is not going to be very frequent. * If you're dropping a single table, it really does not matter - the difference will be like 100 ms vs. 200 ms or something like that. * If you're dropping a lot of tables, you should do that in a transaction anyway, or you should be aware that doing that in a transaction will be faster (we can add this info into DROP TABLE docs). IMHO this is similar to doing a lot of INSERTs - doing all of them in a single transaction is faster. The possibility that someone needs to drop a lot of tables in separate transactions exists in theory, but I don't know of a realistic real-world usage. So I'd vote to ditch that special case optimization. + +/* + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so + * that it's suitable for bsearch. + */ +static int +rnode_comparator(const void * p1, const void * p2) +{ + RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1; + RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2; + + if (n1.node.relNode < n2.node.relNode) + return -1; + else if (n1.node.relNode > n2.node.relNode) + return 1; + + if (n1.node.dbNode < n2.node.dbNode) + return -1; + else if (n1.node.dbNode > n2.node.dbNode) + return 1; + + if (n1.node.spcNode < n2.node.spcNode) + return -1; + else if (n1.node.spcNode > n2.node.spcNode) + return 1; + else + return 0; +} ISTM that this whole comparator could be replaced by memcmp(). That could quite possibly lower the overhead of the bsearch() in simple cases. We already rely on the fact that RelFileNode's have no padding atm (c.f. buf_internal.h), so a memcmp() seems fine to me. Good point, I'll try that. The original code used a macro that simply compared the fields and I copied that logic, but if we can remove the code ...
Hmmm, I've replaced the comparator with this one: static int rnode_comparator(const void * p1, const void * p2) { return memcmp(p1, p2, sizeof(RelFileNode)); } and while it's not significantly faster (actually it's often a bit slower than the original comparator), it's a much simpler code. Tomas
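Why memcmp() is safe here can be illustrated in Python: packing the three fields as fixed-width big-endian unsigned integers makes plain byte comparison order exactly like the field-by-field comparison. (On a little-endian machine C's memcmp() produces a different order than the field comparator, but it is still a consistent total order over padding-free structs, which is all qsort()/bsearch() need as long as both use the same comparator.) The helper names below are illustrative:

```python
import struct

def pack_rnode(spcnode, dbnode, relnode):
    """Pack a RelFileNode-like triple into fixed-width big-endian bytes.

    Comparing the packed bytes with < / == orders exactly like comparing
    the fields in sequence -- the property that lets a single memcmp()
    replace the field-by-field comparator (given no struct padding).
    """
    return struct.pack(">III", spcnode, dbnode, relnode)

def fieldwise_cmp(a, b):
    """The original field-by-field comparator, returning -1/0/1;
    Python tuples already compare lexicographically field by field."""
    return (a > b) - (a < b)
```

The equivalence is easy to spot-check: for any two triples, the byte comparison and the tuple comparison agree.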
Re: [HACKERS] review: pgbench - aggregation of info written into log
On 8.12.2012 16:33, Andres Freund wrote: Hi Tomas, On 2012-11-27 14:55:59 +0100, Pavel Stehule wrote: attached is a v4 of the patch. There are not many changes, mostly some simple tidying up, except for handling the Windows. After a quick look I am not sure what all the talk about windows is about? instr_time.h seems to provide all you need, even for windows? The only issue of gettimeofday() for windows seems to be that it is not all that fast and not too high precision, but that shouldn't be a problem in this case? Could you expand a bit on the problems? Well, in the current pgbench.c, there's this piece of code #ifndef WIN32 /* This is more than we really ought to know about instr_time */ fprintf(logfile, "%d %d %.0f %d %ld %ld\n", st->id, st->cnt, usec, st->use_file, (long) now.tv_sec, (long) now.tv_usec); #else /* On Windows, instr_time doesn't provide a timestamp anyway */ fprintf(logfile, "%d %d %.0f %d 0 0\n", st->id, st->cnt, usec, st->use_file); #endif and the Windows-related discussion in this patch is about the same issue. As I understand it the problem seems to be that INSTR_TIME_SET_CURRENT does not provide the timestamp at all. I was thinking about improving this by combining gettimeofday and INSTR_TIME, but I have no Windows box to test it on :-( * I had a problem with doc The current patch has conflict markers in the sgml source, there seems to have been some unresolved merge. Maybe that's all that causes the errors? Ah, I see. Probably my fault, I'll fix that. What's your problem with setting up the doc toolchain? issues: * empty lines with invisible chars (tabs) + and sometimes empty lines after and before {} * adjustment of start_time + * the desired interval */ + while (agg->start_time + agg_interval < INSTR_TIME_GET_DOUBLE(now)) + agg->start_time = agg->start_time + agg_interval; can skip one interval - so when transaction time will be larger or similar to agg_interval - then results can be strange.
We have to know the real length of the interval. Could you post a patch that addresses these issues? Yes, I'll work on it today/tomorrow and I'll send an improved patch. Tomas
Re: [HACKERS] review: pgbench - aggregation of info written into log
Hi, attached is a v5 of this patch. Details below: On 8.12.2012 16:33, Andres Freund wrote: Hi Tomas, On 2012-11-27 14:55:59 +0100, Pavel Stehule wrote: attached is a v4 of the patch. There are not many changes, mostly some simple tidying up, except for handling the Windows. After a quick look I am not sure what all the talk about windows is about? instr_time.h seems to provide all you need, even for windows? The only issue of gettimeofday() for windows seems to be that it is not all that fast and not too high precision, but that shouldn't be a problem in this case? Could you expand a bit on the problems? As explained in the previous message, this is an existing problem with unavailable timestamp. I'm not very fond of adding Linux-only features, but fixing that was not the goal of this patch. * I had a problem with doc The current patch has conflict markers in the sgml source, there seems to have been some unresolved merge. Maybe that's all that causes the errors? What's your problem with setting up the doc toolchain? Yeah, my fault because of incomplete merge. But the real culprit was a missing refsect2. Fixed. issues: * empty lines with invisible chars (tabs) + and sometimes empty lines after and before {} Fixed (I've removed the lines etc.) * adjustment of start_time + * the desired interval */ + while (agg->start_time + agg_interval < INSTR_TIME_GET_DOUBLE(now)) + agg->start_time = agg->start_time + agg_interval; can skip one interval - so when transaction time will be larger or similar to agg_interval - then results can be strange. We have to know the real length of the interval. Could you post a patch that addresses these issues? So, in the end I've rewritten the section advancing the start_time. Now it works so that when skipping an interval (because of a very long transaction), it will print lines even for those empty intervals.
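The rewritten interval advancement can be modeled like this — illustrative Python, with toy integer timestamps and microsecond latencies rather than the patch's instr_time arithmetic:

```python
def log_intervals(agg_interval, start_time, events, log):
    """Emit one aggregate line per interval, including empty ones.

    `events` is a sorted list of (timestamp, latency_us) transaction
    completions. When a long transaction spans whole intervals, a
    zero-count line is still printed for each skipped interval instead
    of silently jumping the start_time forward.
    """
    cur = start_time
    cnt, total = 0, 0
    for ts, latency in events:
        # flush a line for every boundary crossed, even with cnt == 0
        while ts >= cur + agg_interval:
            log("%d %d %d" % (cur, cnt, total))
            cur += agg_interval
            cnt, total = 0, 0
        cnt += 1
        total += latency
    log("%d %d %d" % (cur, cnt, total))    # flush the final interval
```

With a 1.5-second transaction and a 1-second interval this produces alternating empty and non-empty lines, matching the sample output shown below in the thread.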
So for example with a transaction file containing a single query SELECT pg_sleep(1.5); and an interval length of 1 second, you'll get something like this: 1355009677 0 0 0 0 0 1355009678 1 1501028 2253085056784 1501028 1501028 1355009679 0 0 0 0 0 1355009680 1 1500790 2252370624100 1500790 1500790 1355009681 1 1500723 2252169522729 1500723 1500723 1355009682 0 0 0 0 0 1355009683 1 1500719 2252157516961 1500719 1500719 1355009684 1 1500703 2252109494209 1500703 1500703 1355009685 0 0 0 0 0 which is IMHO the best way to deal with this. I've fixed several minor issues, added a few comments. regards Tomas diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index e376452..5a03796 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -150,6 +150,7 @@ char *index_tablespace = NULL; #define naccounts 100000 bool use_log; /* log transaction latencies to a file */ +int agg_interval; /* log aggregates instead of individual transactions */ bool is_connect; /* establish connection for each transaction */ bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ @@ -245,6 +246,18 @@ typedef struct char *argv[MAX_ARGS]; /* command word list */ } Command; +typedef struct +{ + + long start_time; /* when does the interval start */ + int cnt; /* number of transactions */ + double min_duration; /* min/max durations */ + double max_duration; + double sum; /* sum(duration), sum(duration^2) - for estimates */ + double sum2; + +} AggVals; + static Command **sql_files[MAX_FILES]; /* SQL script files */ static int num_files; /* number of script files */ static int num_commands = 0; /* total number of Command structs */ @@ -377,6 +390,10 @@ usage(void) -l write transaction times to log file\n --sampling-rate NUM\n fraction of transactions to log (e.g.
0.01 for 1%% sample)\n +#ifndef WIN32 +--aggregate-interval NUM\n + aggregate data over NUM seconds\n +#endif -M simple|extended|prepared\n protocol for submitting queries to server (default: simple)\n -n do not run VACUUM before tests\n @@ -830,9 +847,25 @@ clientDone(CState *st, bool ok) return false; /* always false */ } +static +void agg_vals_init(AggVals * aggs, instr_time start) +{ + /* basic counters */ + aggs->cnt = 0; /* number of transactions */ + aggs->sum =
Re: [HACKERS] too much pgbench init output
On 20.11.2012 08:22, Jeevan Chalke wrote: Hi, On Tue, Nov 20, 2012 at 12:08 AM, Tomas Vondra t...@fuzzy.cz wrote: On 19.11.2012 11:59, Jeevan Chalke wrote: Hi, I went through the discussion for this patch and here is my review: The main aim of this patch is to reduce the number of log lines. It was also suggested to use an option to provide the interval, but a few of us did not quite agree on it. So the final discussion ended at keeping a 5 sec interval between each log line. However, I see, there are two types of users here: 1. Those who like these log lines, so that they can troubleshoot some slowness and all 2. Those who do not like these log lines. Who likes these lines / needs them for something useful? No idea. I fall in the second category. But from the discussion, I believe some people may need detailed (or a lot more) output. I've read the thread again and my impression is that no one really needs or likes those lines, but (1) it's rather pointless to print a message every 100k rows, as it usually fills the console with garbage (2) it's handy to have regular updates of the progress I don't think there are people (in the thread) that require to keep the current amount of log messages. But there might be users who actually use the current logs to do something (although I can't imagine what). If we want to do this in a backwards compatible way, we should probably use a new option (e.g. -q) to enable the new (less verbose) logging. Do we want to allow both types of logging, or shall we keep only the new one? If both, which one should be the default one? So keeping these in mind, I would rather go for an option which will control this. People falling in category one can set this option very low whereas users falling under the second category can keep it high. So what option(s) would you expect? Something that tunes the interval length or something else? Interval length. Well, I can surely imagine something like --interval N.
A switch that'd choose between the old and new behavior might be a good idea, but I'd strongly vote against automagic heuristics. It makes the behavior very difficult to predict and I really don't want to force the users to wonder whether the long delay is due to general slowness of the machine or some clever logic that causes long delays between log messages. That's why I chose a very simple approach with a constant time interval. It does what I was aiming for (less logging) and it's easy to predict. Sure, we could choose a different interval (or make it an option). I am preferring an option for choosing an interval, say from 1 second to 10 seconds. Um, why not allow an arbitrary integer? Why say 1 to 10 seconds? BTW, what if we put one log message every 10% (or 5%) with time taken (time taken for the last 10% (or 5%) and cumulative) over 5 seconds? This will have only 10 (or 20) lines per pgbench initialisation. And since we are showing the time taken for each block, if any slowness happens, one can easily find a block by looking at the timings and troubleshoot it. Though 10% or 5% is again a debatable number, keeping it constant will eliminate the requirement of an option. That's what I originally proposed in September (see the messages from 17/9), and Alvaro was not really excited about this. Attached is a patch with fixed whitespace / indentation errors etc. Otherwise it's the same as the previous version.
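The elapsed/remaining estimate used by the patch is a simple linear extrapolation from the average per-row cost so far; a sketch in Python (the format string follows the patch's fprintf, the function name is mine):

```python
def progress_line(done, total, elapsed_sec):
    """Remaining time estimated from the average cost per row so far:
    remaining = rows_left * (elapsed / rows_done)."""
    remaining_sec = (total - done) * elapsed_sec / done
    pct = (done * 100) // total
    return ("%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)."
            % (done, total, pct, elapsed_sec, remaining_sec))
```

Calling this only every 100 rows (and printing only when a 5-second boundary is crossed) keeps the timing overhead negligible, which is what the patch does.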
Tomas diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index e376452..334ce4c 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -135,6 +135,11 @@ int unlogged_tables = 0; double sample_rate = 0.0; /* + * logging steps (seconds between log messages) + */ +int log_step_seconds = 5; + +/* * tablespace selection */ char *tablespace = NULL; @@ -1362,6 +1367,11 @@ init(bool is_no_vacuum) char sql[256]; int i; + /* used to track elapsed time and estimate of the remaining time */ + instr_time start, diff; + double elapsed_sec, remaining_sec; + int log_interval = 1; + if ((con = doConnect()) == NULL) exit(1); @@ -1430,6 +1440,8 @@ init(bool is_no_vacuum) } PQclear(res); + INSTR_TIME_SET_CURRENT(start); + for (i = 0; i < naccounts * scale; i++) { int j = i + 1; @@ -1441,10 +1453,27 @@ init(bool is_no_vacuum) exit(1); } - if (j % 100000 == 0) - fprintf(stderr, "%d of %d tuples (%d%%) done.\n", - j, naccounts * scale
Re: [HACKERS] New statistics for WAL buffer dirty writes
On 29.10.2012 04:58, Satoshi Nagayasu wrote: 2012/10/24 1:12, Alvaro Herrera wrote: Satoshi Nagayasu escribió: With this patch, the walwriter process and each backend process would sum up dirty writes and send them to the stats collector. So the value could be saved in the stats file, and could be kept on restart. The statistics can be retrieved using the pg_stat_get_xlog_dirty_writes() function, and can be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter represents the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? I think the answer to the two last questions is yes. It doesn't seem to make sense, to me, to have a single reset timing for what are effectively two separate things. Please submit an updated patch to the next CF. I'm marking this one returned with feedback. Thanks. I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time. I've done a quick review of the v4 patch: 1) applies fine on HEAD, compiles fine 2) make installcheck fails because of a difference in the 'rules' test suite (there's a new view pg_stat_walwriter - see the attached patch for a fixed version of expected/rules.out) 3) I do agree with Alvaro that using the same struct for two separate components (bgwriter and walwriter) seems a bit awkward. For example you need to have two separate stat_reset fields, the reset code becomes much more verbose (because you need to list individual fields) etc. So I'd vote to either split this into two structures or keep it as a single structure (although with two views on top of it).
4) Are there any other fields that might be interesting? Right now there's just dirty_writes but I guess there are other values. E.g. how much data was actually written etc.? Tomas diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index e376452..334ce4c 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -135,6 +135,11 @@ int unlogged_tables = 0; double sample_rate = 0.0; /* + * logging steps (seconds between log messages) + */ +int log_step_seconds = 5; + +/* * tablespace selection */ char *tablespace = NULL; @@ -1362,6 +1367,11 @@ init(bool is_no_vacuum) char sql[256]; int i; + /* used to track elapsed time and estimate of the remaining time */ + instr_time start, diff; + double elapsed_sec, remaining_sec; + int log_interval = 1; + if ((con = doConnect()) == NULL) exit(1); @@ -1430,6 +1440,8 @@ init(bool is_no_vacuum) } PQclear(res); + INSTR_TIME_SET_CURRENT(start); + for (i = 0; i < naccounts * scale; i++) { int j = i + 1; @@ -1441,10 +1453,27 @@ init(bool is_no_vacuum) exit(1); } - if (j % 100000 == 0) - fprintf(stderr, "%d of %d tuples (%d%%) done.\n", - j, naccounts * scale, - (int) (((int64) j * 100) / (naccounts * scale))); + /* let's not call the timing for each row, but only each 100 rows */ + if (j % 100 == 0 || j == scale * naccounts) + { + INSTR_TIME_SET_CURRENT(diff); + INSTR_TIME_SUBTRACT(diff, start); + + elapsed_sec = INSTR_TIME_GET_DOUBLE(diff); + remaining_sec = (scale * naccounts - j) * elapsed_sec / j; + + /* have we reached the next interval?
*/ + if (elapsed_sec >= log_interval * log_step_seconds) { + + fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n", + j, naccounts * scale, + (int) (((int64) j * 100) / (naccounts * scale)), elapsed_sec, remaining_sec); + + /* skip to the next interval */ + log_interval = (int) ceil(elapsed_sec / log_step_seconds); + } + } + } if (PQputline(con, "\\.\n")) { diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b7df8ce..bf9acc5 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@
Re: [HACKERS] CommitFest #3 and upcoming schedule
On 9.12.2012 16:56, Simon Riggs wrote: On 16 November 2012 07:20, Greg Smith g...@2ndquadrant.com wrote: Project guidelines now ask each patch submitter to review patches of the same number and approximate complexity as they submit. If you've submitted some items to the CommitFest, please look at the open list and try to find something you can review. The deadline for 9.3 is looming and many patches have not yet been reviewed. I'm sending a public reminder to all patch authors that they should review other people's patches if they expect their own to be reviewed. Simply put, if you don't help each other by reviewing other patches then the inevitable result will be that your patch will be neither reviewed nor committed. IMHO many of the patches that are currently marked as needs review and have no reviewers were actually reviewed or are being discussed thoroughly on the list, but this information was not propagated to the CF page. Not sure how to fix this except for updating patches that I've reviewed and urging the others to do the same. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CommitFest #3 and upcoming schedule
On 9.12.2012 22:41, Jeff Janes wrote: On Sun, Dec 9, 2012 at 10:47 AM, Tomas Vondra t...@fuzzy.cz wrote: IMHO many of the patches that are currently marked as needs review and have no reviewers were actually reviewed or are being discussed thoroughly on the list, but this information was not propagated to the CF page. Should active discussion on the hackers list prevent someone from doing a review? I know I am reluctant to review a patch when it seems it is still being actively redesigned/debated by others. Maybe a new status of needs design consensus would be useful. IMHO introducing new statuses won't improve the state. Moreover, reaching a design consensus is a natural part of the review process. I see those discussions as a part of the review process, so an active discussion does not mean 'no review' (although the CF page states needs review or no reviewer for such patches). There's nothing wrong with doing yet another review for a patch, but in most cases I tend to agree with the points already raised in the discussion, so it's not really productive. Thus I share the same reluctance to do more reviews for those actively discussed patches. My point is that some of the idle patches are actually quite active in the background - it's just that no one updated the CF page. And I see many such patches moved forward over the last few days. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 10.12.2012 16:38, Andres Freund wrote: On 2012-12-08 17:07:38 +0100, Tomas Vondra wrote: I've done some tests and yes - once there are other objects, the optimization falls short. For example for tables with one index, it looks like this:

1) unpatched
   one by one: 28.9 s
   100 batches: 23.9 s
2) patched
   one by one: 44.1 s
   100 batches: 4.7 s

So the patched code is about 50% slower, but this difference quickly disappears with the number of indexes / toast tables etc. I see this as an argument AGAINST such special-case optimization. My reasoning is this: * This difference is significant only if you're dropping a table with a low number of indexes / toast tables. In reality this is not going to be very frequent. * If you're dropping a single table, it really does not matter - the difference will be like 100 ms vs. 200 ms or something like that. I don't particularly buy that argument. There are good reasons (like avoiding deadlocks, long transactions) to drop multiple tables in individual transactions. Not that I have a good plan to how to work around that though :( Yeah, if you need to drop the tables one by one for some reason, you can't get rid of the overhead this way :-( OTOH in the example above the overhead is ~50%, i.e. 1.5 ms / table with a single index. Each such associated relation (index, TOAST table, ...) means a relation that needs to be dropped and on my machine, once I reach ~5 relations there's almost no difference as the overhead is balanced by the gains. Not sure how to fix that in an elegant way, though :-( Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Serious problem: media recovery fails after system or PostgreSQL crash
On 8.12.2012 03:08, Jeff Janes wrote: On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra t...@fuzzy.cz wrote: Hi, On 6.12.2012 23:45, MauMau wrote: From: Tom Lane t...@sss.pgh.pa.us Well, that's unfortunate, but it's not clear that automatic recovery is possible. The only way out of it would be if an undamaged copy of the segment was in pg_xlog/ ... but if I recall the logic correctly, we'd not even be trying to fetch from the archive if we had a local copy. No, PG will try to fetch the WAL file from pg_xlog when it cannot get it from the archive. XLogFileReadAnyTLI() does that. Also, the PG manual contains the following description: http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL WAL segments that cannot be found in the archive will be sought in pg_xlog/; this allows use of recent un-archived segments. However, segments that are available from the archive will be used in preference to files in pg_xlog/. So why don't you use an archive command that does not create such incomplete files? I mean something like this: archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' Until the file is renamed, it's considered 'incomplete'. Wouldn't having the incomplete file be preferable over having none of it at all? It seems to me you need considerable expertise to figure out how to do optimal recovery (i.e. losing the least transactions) in this situation, and that that expertise cannot be automated. Do you trust a partial file from a good hard drive, or a complete file from a partially melted pg_xlog? It clearly is a rather complex issue, no doubt about that. And yes, reliability of the devices with pg_xlog on them is an important detail. Although if the WAL is not written in a reliable way, you're hosed anyway I guess. The recommended archive command is based on the assumption that the local pg_xlog is intact (e.g. because it's located on a reliable RAID1 array), which seems to be the assumption of the OP too.
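The write-then-rename trick in that archive_command can be illustrated in Python; rename() within one filesystem is atomic on POSIX, so a reader of the archive directory sees either no file or a complete one (the helper name is mine, and this glosses over the fsync and error handling a production archive script would need):

```python
import os
import shutil

def archive_wal(wal_path, archive_dir):
    """Equivalent of: cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f
    The segment is first copied under a temporary name, then renamed
    into place, so a partially-written file is never visible under its
    final name."""
    fname = os.path.basename(wal_path)
    tmp_path = os.path.join(archive_dir, fname + ".tmp")
    final_path = os.path.join(archive_dir, fname)
    shutil.copyfile(wal_path, tmp_path)
    os.rename(tmp_path, final_path)  # atomic within one filesystem
    return final_path
```

A restore_command that ignores *.tmp files then never picks up a half-copied segment.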
In my opinion it's more likely to encounter an incomplete copy of a WAL segment in the archive than a corrupted local WAL. And if it really is corrupted, it would be identified during replay. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
Hi, I've updated the patch to include the optimization described in the previous post, i.e. if the number of relations is below a certain threshold, use a simple for loop; for large numbers of relations use bsearch calls. This is done by a new constant BSEARCH_LIMIT, which is set to 10 in the patch. Then I've modified the 'drop-test' script to take yet another argument - the number of indexes for each table. So for example this

./drop-test.py 10000 100 3 'dbname=test'

means create 10000 tables, 3 indexes for each of them, and then drop them in batches of 100 tables. Then I've run the test with 0, 1, 2, ... 11 indexes for each table for (a) unpatched HEAD, (b) patch v3.1 (without the optimization) and (c) patch v3.3 (with BSEARCH_LIMIT=10), and I got these results:

1) dropping one-by-one
----------------------

This is the really interesting part - the issue with v3.1 is that for a single table, it's ~2x slower than unpatched PostgreSQL.

            0   1   2   3   4   5   6   7   8   9  10  11
unpatched  16  28  40  52  63  75  87  99 110 121 135 147
v3.1       33  43  46  56  58  60  63  72  75  76  79  80
v3.3       16  20  23  25  29  33  36  40  44  47  79  82

The values are durations in seconds, rounded to integer values. I've run the test repeatedly and there's very small variance in the numbers. The v3.3 improves that and it's actually even faster than unpatched PostgreSQL. How can this happen? I believe it's because the code is rewritten from

  for each relation (r) in the drop list
      DropRelFileNodeAllBuffers(r)
          for each shared buffer
              check and invalidate

to

  copy the relations from the drop list to an array (a)
  DropRelFileNodeAllBuffers(a)
      for each shared buffer
          for each relation (r) in the array
              check and invalidate

At least that's the only explanation I was able to come up with. Yet another interesting observation is that even v3.1 is about as fast as the unpatched code once there are 3 or more indexes (or TOAST tables).
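The shape of the loop rewrite, plus the BSEARCH_LIMIT cut-over, can be modelled in Python (bisect standing in for bsearch; the buffer representation is obviously simplified and the names are mine):

```python
from bisect import bisect_left

BSEARCH_LIMIT = 10  # same threshold as the patch

def drop_rel_buffers(buffer_rels, dropped_rels):
    """One pass over the shared buffers, checking each buffer against
    the whole array of dropped relations, instead of one full buffer
    scan per relation.  Small arrays use a plain loop, larger ones use
    binary search, mirroring the BSEARCH_LIMIT logic."""
    rels = sorted(dropped_rels)
    invalidated = 0
    for rel in buffer_rels:
        if len(rels) <= BSEARCH_LIMIT:
            hit = rel in rels                       # simple for loop
        else:
            i = bisect_left(rels, rel)              # bsearch
            hit = i < len(rels) and rels[i] == rel
        if hit:
            invalidated += 1  # a real implementation would invalidate the buffer
    return invalidated
```

For a drop list of R relations and B buffers, this turns O(R*B) buffer scans into a single O(B*log R) scan once the bsearch path kicks in.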
So my opinion is that the optimized patch works as expected, and that even without the optimization the performance would be acceptable for most real-world cases.

2) dropping in transaction
--------------------------

This is mostly to verify that the code did not break anything, because the optimization should not kick in in this case at all. And that seems to be the case:

            0   1   2   3   4   5   6   7   8   9  10  11
unpatched  13  24  35  46  58  69  81  92 103 115 127 139
v3.1        3   5   7   8  10  12  14  15  16  18  20  21
v3.3        3   4   6   7   8  10  11  13  14  15  18  20

The differences between v3.1 and v3.3 are mostly due to rounding etc. Attached is the v3.3 patch and the testing script I've been using for the tests above. Feel free to run the tests on your hardware, with your shared buffers size etc. I've run that on a 4-core i5-2500 CPU with 2GB shared buffers. Tomas #!/bin/env python import datetime import psycopg2 import sys if __name__ == '__main__': if len(sys.argv) < 5: print "ERROR: not enough parameters" print "HINT: drop-test.py num-of-tables drop-num num-of-indexes 'connection string'" sys.exit(1) ntables = int(sys.argv[1]) nlimit = int(sys.argv[2]) nindex = int(sys.argv[3]) connstr = str(sys.argv[4]) debug = False conn = psycopg2.connect(connstr) cur = conn.cursor() # print 'creating %s tables' % (ntables,) start = datetime.datetime.now() for i in range(ntables): cur.execute('CREATE TABLE tab_%s (id INT)' % (i,)) for j in range(nindex): cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i)) conn.commit(); if (i % 1000 == 0) and debug: print ' tables created: %s' % (i,) conn.commit() end = datetime.datetime.now() # print ' all tables created in %s seconds' % ((end-start).total_seconds(),) # set to autocommit mode conn.autocommit = True start = datetime.datetime.now() # print 'dropping %s tables one by one ...'
% (ntables,) for i in range(ntables): cur.execute('DROP TABLE tab_%s' % (i,)) if (i % 1000 == 0) and debug: print ' tables dropped: %s' % (i,) end = datetime.datetime.now() print 'dropped one-by-one in %s seconds' % ((end-start).total_seconds(),) # cancel the autocommit mode conn.autocommit = False # recreate tables # print 'creating %s tables' % (ntables,) start = datetime.datetime.now() for i in range(ntables): cur.execute('CREATE TABLE tab_%s (id INT)' % (i,)) for j in range(nindex): cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i)) conn.commit(); if (i % 1000 == 0) and debug: print ' tables created: %s' % (i,) conn.commit() end = datetime.datetime.now() # print ' all tables created in %s seconds' % ((end-start).total_seconds(),) # drop the tables in batches
Re: [HACKERS] too much pgbench init output
Hi, attached is a new version of the patch that (a) converts the 'log_step_seconds' variable to a constant (and does not allow changing it using a command-line option etc.) (b) keeps the current logging as a default (c) adds a -q switch that enables the new logging with a 5-second interval I'm still not convinced there should be yet another knob for tuning the log interval - opinions? Tomas On 11.12.2012 10:23, Jeevan Chalke wrote: On Sun, Dec 9, 2012 at 8:11 AM, Tomas Vondra t...@fuzzy.cz wrote: On 20.11.2012 08:22, Jeevan Chalke wrote: Hi, On Tue, Nov 20, 2012 at 12:08 AM, Tomas Vondra t...@fuzzy.cz wrote: On 19.11.2012 11:59, Jeevan Chalke wrote: Hi, I went through the discussion for this patch and here is my review: The main aim of this patch is to reduce the number of log lines. It was also suggested to use an option to provide the interval, but a few of us did not quite agree on it. So the final discussion ended at keeping a 5 sec interval between each log line. However, I see, there are two types of users here: 1. Those who like these log lines, so that they can troubleshoot some slowness and all 2. Those who do not like these log lines. Who likes these lines / needs them for something useful? No idea. I fall in the second category. But from the discussion, I believe some people may need detailed (or a lot more) output. I've read the thread again and my impression is that no one really needs or likes those lines, but (1) it's rather pointless to print a message every 100k rows, as it usually fills the console with garbage (2) it's handy to have regular updates of the progress I don't think there are people (in the thread) that require to keep the current amount of log messages. But there might be users who actually use the current logs to do something (although I can't imagine what). If we want to do this in a backwards compatible way, we should probably use a new option (e.g.
-q) to enable the new (less verbose) logging. Do we want to allow both types of logging, or shall we keep only the new one? If both, which one should be the default one? Both options are fine with me, but the default should be the current behaviour. So keeping these in mind, I would rather go for an option which will control this. People falling in category one can set this option very low whereas users falling under the second category can keep it high. So what option(s) would you expect? Something that tunes the interval length or something else? Interval length. Well, I can surely imagine something like --interval N. +1 A switch that'd choose between the old and new behavior might be a good idea, but I'd strongly vote against automagic heuristics. It makes the behavior very difficult to predict and I really don't want to force the users to wonder whether the long delay is due to general slowness of the machine or some clever logic that causes long delays between log messages. That's why I chose a very simple approach with a constant time interval. It does what I was aiming for (less logging) and it's easy to predict. Sure, we could choose a different interval (or make it an option). I am preferring an option for choosing an interval, say from 1 second to 10 seconds. Um, why not allow an arbitrary integer? Why say 1 to 10 seconds? Hmm.. actually, I have no issues with any number there. Just put 1..10 as we hard-coded it to 5. No particular reason as such. BTW, what if we put one log message every 10% (or 5%) with time taken (time taken for the last 10% (or 5%) and cumulative) over 5 seconds? This will have only 10 (or 20) lines per pgbench initialisation. And since we are showing the time taken for each block, if any slowness happens, one can easily find a block by looking at the timings and troubleshoot it. Though 10% or 5% is again a debatable number, keeping it constant will eliminate the requirement of an option.
That's what I originally proposed in September (see the messages from 17/9), and Alvaro was not really excited about this. Attached is a patch with fixed whitespace / indentation errors etc. Otherwise it's the same as the previous version. OK. Looks good now. Any other views / suggestions are welcome. Thanks
Re: [HACKERS] system administration functions with hardcoded superuser checks
On 18.12.2012 18:38, Pavel Stehule wrote: 2012/12/18 Peter Eisentraut pete...@gmx.net: There are some system administration functions that have hardcoded superuser checks, specifically: pg_reload_conf pg_rotate_logfile Some of these are useful in monitoring or maintenance tools, and the hardcoded superuser checks require that these tools run with maximum privileges. Couldn't we just install these functions without default privileges and allow users to grant privileges as necessary? Isn't it too strong a gun for some people? I believe someone could decrease the necessary rights and it opens doors to the system. No one was speaking about making them executable by a wider group of users by default (i.e. decreasing the necessary rights). Today, when you need to provide the EXECUTE privilege on those functions, you have three options: (a) make the user a superuser - obviously not a good choice (b) create a SECURITY DEFINER wrapper **for each function separately** (c) deny to do that Being able to do a plain GRANT on the function is merely a simpler way to do (b). It has advantages (fewer objects/functions to care about) and disadvantages (e.g. you can't do additional parameter value checks). pg_read_file pg_read_file_all pg_read_binary_file pg_read_binary_file_all pg_stat_file pg_ls_dir are relatively dangerous and I am not for opening these functions. A power user can simply write an extension, but he knows what he does. I see only dangers that are already present. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] too much pgbench init output
On 19.12.2012 06:30, Jeevan Chalke wrote: On Mon, Dec 17, 2012 at 5:37 AM, Tomas Vondra t...@fuzzy.cz wrote: Hi, attached is a new version of the patch that (a) converts the 'log_step_seconds' variable to a constant (and does not allow changing it using a command-line option etc.) (b) keeps the current logging as a default (c) adds a -q switch that enables the new logging with a 5-second interval I'm still not convinced there should be yet another knob for tuning the log interval - opinions? It seems that you have generated a patch over your earlier version and due to that it is not cleanly applying on fresh sources. Please generate the patch on fresh sources. Seems you're right - I've attached the proper patch against current master. However, I have absolutely no issues with the design. Also code review is already done and it looks good to me. I think to move forward on this we need someone from the core team. So I am planning to change its status to ready-for-committer. Before that please provide an updated patch for final code review.
thanks Tomas diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index e376452..f3953a7 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -39,6 +39,7 @@ #include "portability/instr_time.h" #include <ctype.h> +#include <math.h> #ifndef WIN32 #include <sys/time.h> @@ -102,6 +103,7 @@ extern int optind; #define MAXCLIENTS 1024 #endif +#define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ int nxacts = 0; /* number of transactions per client */ @@ -150,6 +152,7 @@ char *index_tablespace = NULL; #define naccounts 100000 bool use_log; /* log transaction latencies to a file */ +bool use_quiet; /* quiet logging onto stderr */ bool is_connect; /* establish connection for each transaction */ bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ @@ -389,6 +392,7 @@ usage(void) -v vacuum all four standard tables before tests\n \nCommon options:\n -d print debugging output\n +-q quiet logging (a message each 5 seconds)\n -h HOSTNAME database server host or socket directory\n -p PORT database server port number\n -U USERNAME connect as specified database user\n @@ -1362,6 +1366,11 @@ init(bool is_no_vacuum) char sql[256]; int i; + /* used to track elapsed time and estimate of the remaining time */ + instr_time start, diff; + double elapsed_sec, remaining_sec; + int log_interval = 1; + if ((con = doConnect()) == NULL) exit(1); @@ -1430,6 +1439,8 @@ init(bool is_no_vacuum) } PQclear(res); + INSTR_TIME_SET_CURRENT(start); + for (i = 0; i < naccounts * scale; i++) { int j = i + 1; @@ -1441,10 +1452,33 @@ init(bool is_no_vacuum) exit(1); } - if (j % 100000 == 0) + /* If we want to stick with the original logging, print a message each +* 100k inserted rows. */ + if ((!
use_quiet) && (j % 100000 == 0)) fprintf(stderr, "%d of %d tuples (%d%%) done.\n", + j, naccounts * scale, + (int) (((int64) j * 100) / (naccounts * scale))); + /* let's not call the timing for each row, but only each 100 rows */ + else if (use_quiet && (j % 100 == 0)) + { + INSTR_TIME_SET_CURRENT(diff); + INSTR_TIME_SUBTRACT(diff, start); + + elapsed_sec = INSTR_TIME_GET_DOUBLE(diff); + remaining_sec = (scale * naccounts - j) * elapsed_sec / j; + + /* have we reached the next interval (or end)? */ + if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS)) { + + fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n", + j, naccounts * scale
Re: [HACKERS] system administration functions with hardcoded superuser checks
On 19.12.2012 07:34, Magnus Hagander wrote: On Wed, Dec 19, 2012 at 1:58 AM, Tomas Vondra t...@fuzzy.cz wrote: On 18.12.2012 18:38, Pavel Stehule wrote: 2012/12/18 Peter Eisentraut pete...@gmx.net: There are some system administration functions that have hardcoded superuser checks, specifically: pg_reload_conf pg_rotate_logfile Some of these are useful in monitoring or maintenance tools, and the hardcoded superuser checks require that these tools run with maximum privileges. Couldn't we just install these functions without default privileges and allow users to grant privileges as necessary? Isn't it too strong a gun for some people? I believe someone could decrease the necessary rights and it opens doors to the system. No one was speaking about making them executable by a wider group of users by default (i.e. decreasing the necessary rights). Today, when you need to provide the EXECUTE privilege on those functions, you have three options Given how limited these functions are in scope, I don't see a problem here. pg_read_file pg_read_file_all pg_read_binary_file pg_read_binary_file_all pg_stat_file pg_ls_dir are relatively dangerous and I am not for opening these functions. A power user can simply write an extension, but he knows what he does. I see only dangers that are already present. Granting execute on pg_read_xyz is pretty darn close to granting superuser, without explicitly asking for it. Well, you get a read-only superuser. If we want to make that step as easy as just GRANT, we really need to write some *very* strong warnings in the documentation so that people realize this. I doubt most people will realize it unless we do that (and those who don't read the docs, which is probably a majority, never will). Yup, that's what I meant by the possibility to perform additional parameter value checks ;-) Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] strange OOM errors with EXECUTE in PL/pgSQL
Hi, one of our local users reported he's getting OOM errors on 9.2, although on 9.1 the code worked fine. Attached is a simple test case that should give you an OOM error almost immediately. What it does: 1) creates a simple table called test with one text column. 2) creates a plpgsql function with one parameter, and all that function does is passing the parameter to EXECUTE 3) calls the function with a string containing many INSERTs into the test table The way the EXECUTE is used is a bit awkward, but the failures seem a bit strange to me. The whole script is ~500kB and most of that is about 11k of very simple INSERT statements: insert into test(value) values (''aa''); all of them are exactly the same. Yet when it fails with OOM, the log contains memory context stats like these: TopMemoryContext: 5303376 total in 649 blocks; 2648 free ... PL/pgSQL function context: 8192 total in 1 blocks; 3160 free ... TopTransactionContext: 8192 total in 1 blocks; 6304 free ... ExecutorState: 8192 total in 1 blocks; 7616 free ... ExprContext: 8192 total in 1 blocks; 8160 free ... SPI Exec: 33554432 total in 14 blocks; 6005416 free ... CachedPlanSource: 3072 total in 2 blocks; 1856 free ... CachedPlanSource: 538688 total in 3 blocks; 1744 free ... CachedPlanQuery: 3072 total in 2 blocks; 1648 free ... CachedPlanSource: 538688 total in 3 blocks; 1744 free ... CachedPlanQuery: 3072 total in 2 blocks; 1648 free ... CachedPlanSource: 538688 total in 3 blocks; 1744 free ... CachedPlanQuery: 3072 total in 2 blocks; 1648 free ... CachedPlanSource: 538688 total in 3 blocks; 1744 free ... CachedPlanQuery: 3072 total in 2 blocks; 1648 free ... CachedPlanSource: 538688 total in 3 blocks; 1744 free ... ... There are ~9500 of these CachedPlanSource + CachedPlanQuery pairs (see the attached log). That seems a bit strange to me, because all the queries are exactly the same in this test case.
The number of queries needed to get an OOM is inversely proportional to the query length - by using a longer text (instead of 'aaa') you can trigger it with far fewer queries. I am no expert in this area, but it seems to me that the code does not expect that many INSERTs in a single EXECUTE and does not release the memory for some reason (e.g. because the plans are allocated in the SPI Exec memory context, etc.). regards Tomas pg-oom.log.gz Description: application/gzip test2.sql.gz Description: application/gzip
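The scenario described above can be sketched like this. The names (exec_string) and the generation of the batch via string_agg are illustrative stand-ins, not the actual attached script:

```sql
-- Minimal sketch of the reported scenario.
CREATE TABLE test (value text);

CREATE OR REPLACE FUNCTION exec_string(s text) RETURNS void AS $$
BEGIN
    -- the whole half-megabyte batch of statements goes through one EXECUTE
    EXECUTE s;
END;
$$ LANGUAGE plpgsql;

-- Build ~11k identical INSERTs into one big string and run it at once.
SELECT exec_string(
    string_agg('insert into test(value) values (''aa'');', ' ')
) FROM generate_series(1, 11000);
```

On an affected build, each statement in the batch ends up with its own CachedPlanSource carrying a private copy of the full source string, which matches the memory context dump shown above.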
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 19.12.2012 02:18, Andres Freund wrote: On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote: I think except for the temp buffer issue mentioned below it's ready. -DropRelFileNodeAllBuffers(RelFileNodeBackend rnode) +DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes) { -int i; +int i, j; + +/* sort the list of rnodes */ +pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator); /* If it's a local relation, it's localbuf.c's problem. */ -if (RelFileNodeBackendIsTemp(rnode)) +for (i = 0; i < nnodes; i++) { -if (rnode.backend == MyBackendId) -DropRelFileNodeAllLocalBuffers(rnode.node); -return; +if (RelFileNodeBackendIsTemp(rnodes[i])) +{ +if (rnodes[i].backend == MyBackendId) +DropRelFileNodeAllLocalBuffers(rnodes[i].node); +} } While you deal with local buffers here you don't anymore in the big loop over shared buffers. That wasn't needed earlier since we just returned after noticing we have a local relation, but that's not the case anymore. Hmm, but that would require us to handle the temp relations explicitly wherever we call DropRelFileNodeAllBuffers. Currently there are two such places - smgrdounlink() and smgrdounlinkall(). By placing it into DropRelFileNodeAllBuffers() this code is shared, and I think that's a good thing. But that does not mean the code is perfect - it was based on the assumption that if there's a mix of temp and regular relations, the temp relations will be handled in the first part and the rest in the second one. Maybe it'd be better to improve it so that the temp relations are removed from the array after the first part (and the second one is skipped if there are no remaining relations). for (i = 0; i < NBuffers; i++) { +RelFileNodeBackend *rnode = NULL; volatile BufferDesc *bufHdr = BufferDescriptors[i]; - + /* * As in DropRelFileNodeBuffers, an unlocked precheck should be safe * and saves some cycles.
*/ -if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + +/* + * For a low number of relations to drop just use a simple walk through, + * to save the bsearch overhead. The BSEARCH_LIMIT is rather a guess + * than an exactly determined value, as it depends on many factors (CPU + * and RAM speeds, amount of shared buffers etc.). + */ +if (nnodes <= BSEARCH_LIMIT) I think that's a sensible plan. It makes sense that for a small number of relations a sequential scan of the rnodes array is faster than a bsearch, and 10 sounds like a good value, although I would guess the optimal value is slightly higher on most machines. But if it works fine without regressions that's pretty good... I think it's pointless to look for the optimal value in this case, given how many factors it depends on. We could use 20 instead of 10, but I wouldn't go higher, probably. + +/* + * Used to sort the relfilenode array (ordered by [relnode, dbnode, spcnode]), so + * that it's suitable for bsearch. + */ +static int +rnode_comparator(const void * p1, const void * p2) +{ +RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1; +RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2; + +if (n1.node.relNode < n2.node.relNode) +return -1; +else if (n1.node.relNode > n2.node.relNode) +return 1; + +if (n1.node.dbNode < n2.node.dbNode) +return -1; +else if (n1.node.dbNode > n2.node.dbNode) +return 1; + +if (n1.node.spcNode < n2.node.spcNode) +return -1; +else if (n1.node.spcNode > n2.node.spcNode) +return 1; +else +return 0; +} Still surprised this is supposed to be faster than a memcmp, but as you seem to have measured it earlier.. It surprised me too. These are the numbers with the current patch: 1) one by one = 0 2 4 6 8 10 12 14 16 18 20 -- current 15 22 28 34 41 75 77 82 92 99 106 memcmp 16 23 29 36 44 122 125 128 153 154 158 Until the number of indexes reaches ~10, the numbers are almost exactly the same. Then the bsearch branch kicks in and it's clear how much slower the memcmp comparator is.
2) batches of 100 = 0 2 4 6 8 10 12 14 16 18 20 -- current 3 5 8 10 12 15 17 21 23 27 31 memcmp 4 7 10 13 16 19 22 28 30 32 36 Here the difference is much smaller, but even here the memcmp is consistently a bit slower. My
Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL
On 20.12.2012 02:29, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: What it does: 1) creates a simple table called test with one text column. 2) creates a plpgsql function with one parameter, and all that function does is passing the parameter to EXECUTE 3) calls the function with a string containing many INSERTs into the test table The way the EXECUTE is used is a bit awkward, but the failures seem a bit strange to me. The whole script is ~500kB and most of that is about 11k of very simple INSERT statements: insert into test(value) values (''aa''); The reason this fails is that you've got a half-megabyte source string, and each of the 11000 plans that are due to be created from it saves its own copy of the source string. Hence, 5500 megabytes needed just for source strings. We could possibly fix this by inventing some sort of reference-sharing arrangement (which'd be complicated and fragile) or by not storing the source strings with the plans (which'd deal a serious blow to our ability to provide helpful error messages). Neither answer seems appealing. Thanks for the explanation, it didn't occur to me that each plan keeps a copy of the whole source string. I think it would be a better idea to adopt a less brain-dead way of processing the data. Can't you convert this to a single INSERT with a lot of VALUES rows? Or split it into multiple EXECUTE chunks? Well, it's not my app but I'll recommend it to them. Actually I already did, but I didn't have an explanation of why it behaves like this. The really annoying bit is that on 9.1 this works fine (it's just as crazy an approach, but it does not end with an OOM error). Tomas
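Tom's two suggested workarounds can be sketched like this (a sketch of the approach, not the user's actual application code):

```sql
-- Option 1: one statement, one plan, one source string,
-- instead of ~11k separately cached plans.
INSERT INTO test(value) VALUES
    ('aa'),
    ('aa'),
    ('aa');  -- ... one row per value, up to the full batch

-- Option 2: if dynamic SQL is unavoidable, run each statement
-- through its own small EXECUTE so no plan references a
-- half-megabyte source string (the loop is illustrative).
DO $$
BEGIN
    FOR i IN 1..11000 LOOP
        EXECUTE 'insert into test(value) values (''aa'')';
    END LOOP;
END;
$$;
```

Either way, the per-plan copy of the source string stays tiny, which sidesteps the 5500 MB blowup Tom describes above.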