Re: [HACKERS] PATCH: tracking temp files in pg_stat_database

2011-12-20 Thread Tomas Vondra
On 20.12.2011 11:20, Magnus Hagander wrote:
 2011/12/20 Tomas Vondra t...@fuzzy.cz:

 I haven't updated the docs yet - let's see if the patch is acceptable at
 all first.
 
 Again, without having reviewed the code, this looks like a feature
 we'd want, so please add some docs, and then submit it for the next
 commitfest!

I've added the docs (see the attachment) and rebased to current head.

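For reference, a quick sketch of how the new counters are meant to be read
once the patch is applied - temp_files and temp_bytes are the two columns the
patch adds to pg_stat_database:

    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
      FROM pg_stat_database
     ORDER BY temp_bytes DESC;
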
Tomas
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a12a9a2..3635c3f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -691,6 +691,26 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> re
  </row>
 
  <row>
+  <entry><literal><function>pg_stat_get_db_temp_files</function>(<type>oid</type>)</literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Number of temporary files written for the database. All temporary files are
+   counted, regardless of why the temporary file was created (sorting or hash
+   join) or file size (log_temp_files does not affect this).
+  </entry>
+ </row>
+
+ <row>
+  <entry><literal><function>pg_stat_get_db_temp_bytes</function>(<type>oid</type>)</literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Amount of data written to temporary files for the database. All temporary
+   files are counted, regardless of why the temporary file was created (sorting
+   or hash join) or file size (log_temp_files does not affect this).
+  </entry>
+ </row>
+
+ <row>
   <entry><literal><function>pg_stat_get_numscans</function>(<type>oid</type>)</literal></entry>
   <entry><type>bigint</type></entry>
   <entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2253ca8..55d20dc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -574,6 +574,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
 pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
 pg_stat_get_db_conflict_all(D.oid) AS conflicts,
+pg_stat_get_db_temp_files(D.oid) AS temp_files,
+pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes,
 pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
 FROM pg_database D;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 24f4cde..97c7004 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -286,7 +286,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-
+static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -1339,6 +1339,29 @@ pgstat_report_recovery_conflict(int reason)
 	pgstat_send(&msg, sizeof(msg));
 }
 
+
+/* ----------
+ * pgstat_report_tempfile() -
+ *
+ *	Tell the collector about a temporary file.
+ * ----------
+ */
+void
+pgstat_report_tempfile(size_t filesize)
+{
+	PgStat_MsgTempFile msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
+	msg.m_databaseid = MyDatabaseId;
+	msg.m_filesize = filesize;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+;
+
 /* ----------
  * pgstat_ping() -
  *
@@ -3185,6 +3208,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len);
 					break;
 
+				case PGSTAT_MTYPE_TEMPFILE:
+					pgstat_recv_tempfile((PgStat_MsgTempFile *) msg, len);
+					break;
+
 				default:
 					break;
 			}
@@ -3266,6 +3293,8 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 		result->n_conflict_snapshot = 0;
 		result->n_conflict_bufferpin = 0;
 		result->n_conflict_startup_deadlock = 0;
+		result->n_temp_files = 0;
+		result->n_temp_bytes = 0;
 
 		result->stat_reset_timestamp = GetCurrentTimestamp();
 
@@ -4177,6 +4206,8 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
 	dbentry->n_tuples_updated = 0;
 	dbentry->n_tuples_deleted = 0;
 	dbentry->last_autovac_time = 0;
+	dbentry->n_temp_bytes = 0;
+	dbentry->n_temp_files = 0;
 
 	dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
@@ -4403,6 +4434,24 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len

[HACKERS] WIP: explain analyze with 'rows' but not timing

2011-12-22 Thread Tomas Vondra
Hi all,

most of the time I use auto_explain, all I need is the duration of the query
and the plan with estimates and actual row counts. And it would be handy
to be able to catch long-running queries with estimates that are
significantly off (say 100x lower or higher compared to the actual row counts).

The gettimeofday() calls are not exactly cheap in some cases, so why
pay that price when all you need is the number of rows?

The patch attached does this:

1) adds INSTRUMENT_ROWS, a new InstrumentOption

   - counts rows without timing (no gettimeofday() calls)
   - if you want timing info, use INSTRUMENT_TIMER

2) adds new option TIMING to EXPLAIN, i.e.

EXPLAIN (ANALYZE ON, TIMING ON) SELECT ...

3) adds auto_explain.log_rows_only (false by default)

   - if you set this to 'true', then the instrumentation will just
 count rows, without calling gettimeofday()

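A minimal sketch of how (3) is meant to be used in a session - everything
below is an existing auto_explain GUC except log_rows_only, which is the one
the patch adds:

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;   -- log every statement
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_rows_only = on;     -- row counts only, no gettimeofday()
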

It works quite well, except for one tiny issue - when log_rows_only is
set to false (so that auto_explain requires timing), it silently
overrides the EXPLAIN option. Even when the user explicitly
disables timing (TIMING OFF), the setting is overridden and EXPLAIN
collects the timing data.

I could probably hide the timing info, but that'd make the issue even
worse (the user would not notice that the timing was actually enabled).

Maybe the right thing would be to explicitly disable timing for queries
executed with EXPLAIN (TIMING OFF). Any other ideas on how to make this
work reasonably?

The patch does not implement any checks (how far the estimate is from
reality) yet - that'll be round two.

regards
Tomas
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index b320698..34015b8 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -23,6 +23,7 @@ static int	auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_rows_only = false;
 static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
 
@@ -133,6 +134,17 @@ _PG_init(void)
 NULL,
 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_rows_only",
+							 "Do not collect timing data, just row number.",
+							 NULL,
+							 &auto_explain_log_rows_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
/* Install hooks. */
@@ -170,7 +182,11 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		/* Enable per-node instrumentation iff log_analyze is required. */
 		if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
 		{
-			queryDesc->instrument_options |= INSTRUMENT_TIMER;
+			if (auto_explain_log_rows_only)
+				queryDesc->instrument_options |= INSTRUMENT_ROWS;
+			else
+				queryDesc->instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS);
+
 			if (auto_explain_log_buffers)
 				queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
 		}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e38de5c..4e3f343 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -133,6 +133,8 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString,
 			es.costs = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "buffers") == 0)
 			es.buffers = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "timing") == 0)
+			es.timing = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "format") == 0)
 		{
 			char	   *p = defGetString(opt);
@@ -229,6 +231,7 @@ ExplainInitState(ExplainState *es)
 	/* Set default options. */
 	memset(es, 0, sizeof(ExplainState));
 	es->costs = true;
+	es->timing = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 }
@@ -355,8 +358,11 @@ ExplainOnePlan(PlannedStmt *plannedstmt, ExplainState *es,
int eflags;
int   

Re: [HACKERS] WIP: explain analyze with 'rows' but not timing

2011-12-23 Thread Tomas Vondra
On 23.12.2011 14:57, Robert Haas wrote:
 2011/12/22 Tomas Vondra t...@fuzzy.cz:
 The gettimeofday() calls are not exactly cheap in some cases, so why
 pay that price when all you need is the number of rows?
 
 Fair point.
 
 The patch attached does this:

 1) adds INSTRUMENT_ROWS, a new InstrumentOption

   - counts rows without timing (no gettimeofday() calls)
   - if you want timing info, use INSTRUMENT_TIMER

 2) adds new option TIMING to EXPLAIN, i.e.

EXPLAIN (ANALYZE ON, TIMING ON) SELECT ...

 3) adds auto_explain.log_rows_only (false by default)

   - if you set this to 'true', then the instrumentation will just
 count rows, without calling gettimeofday()
 
 This seems like an unnecessarily confusing interface, because you've
 named the auto_explain option differently from the EXPLAIN option and
 given it (almost) the opposite sense: timing=off means the same thing
 as log_rows_only=on.
 
 I think the EXPLAIN (TIMING) option is good the way you have it, but
 then just have auto_explain.log_timing, with a default value of on.

Makes sense. I've updated the patch to reflect this, so the option is
called auto_explain.log_timing and is true by default. I'll add the
patch to the next commit fest.

One thing I'm wondering about is that the InstrumentOptions are not
exclusive - INSTRUMENT_TIMER means 'collect timing and row counts' while
INSTRUMENT_ROWS means 'collect row counts'. Wouldn't it be better to
redefine INSTRUMENT_TIMER so that it collects just timing info? I.e.
to get the current behaviour, you'd have to do this

   instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS)

It's a quite trivial change in explain.c; the problem I have with it is
that it might break extensions.

Tomas
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index b320698..4b52b26 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -23,6 +23,7 @@ static int	auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_timing  = false;
 static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
 
@@ -133,6 +134,17 @@ _PG_init(void)
 NULL,
 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_timing",
+							 "Collect timing data, not just row counts.",
+							 NULL,
+							 &auto_explain_log_timing,
+							 true,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
/* Install hooks. */
@@ -170,7 +182,12 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		/* Enable per-node instrumentation iff log_analyze is required. */
 		if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
 		{
-			queryDesc->instrument_options |= INSTRUMENT_TIMER;
+			if (auto_explain_log_timing)
+				queryDesc->instrument_options |= INSTRUMENT_TIMER;
+			else
+				queryDesc->instrument_options |= INSTRUMENT_ROWS;
+
+
 			if (auto_explain_log_buffers)
 				queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
 		}
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 8a9c9de..4488956 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -40,6 +40,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replac
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     COSTS [ <replaceable class="parameter">boolean</replaceable> ]
     BUFFERS [ <replaceable class="parameter">boolean</replaceable> ]
+    TIMING [ <replaceable class="parameter">boolean</replaceable> ]
     FORMAT { TEXT | XML | JSON | YAML }
 </synopsis>
  </refsynopsisdiv>
@@ -172,6 +173,20 @@ ROLLBACK;
    </varlistentry>
 
    <varlistentry>
+    <term><literal>TIMING</literal></term>
+    <listitem>
+     <para>
+      Include information on timing for each node. Specifically, include the
+      actual startup time and time spent in the node. This may involve a lot
+      of gettimeofday calls, and on some systems this means significant
+      slowdown

Re: [HACKERS] WIP: explain analyze with 'rows' but not timing

2011-12-23 Thread Tomas Vondra
On 23.12.2011 16:14, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 One thing I'm wondering about is that the InstrumentOptions are not
 exclusive - INSTRUMENT_TIMER means 'collect timing and row counts' while
 INSTRUMENT_ROWS means 'collect row counts'. Wouldn't it be better to
 redefine the INSTRUMENT_TIMER so that it collects just timing info. I.e.
 to get the current behaviour, you'd have to do this
 
instrument_options |= (INSTRUMENT_TIMER | INSTRUMENT_ROWS)
 
 It's quite trivial change in explain.c, the problem I have with that is
 that it might break extensions.
 
 I'm not especially concerned by that angle --- we make bigger API
 changes all the time.  But you could change the name, eg
 
   instrument_options |= (INSTRUMENT_TIMING | INSTRUMENT_ROWS)
 
 and then #define INSTRUMENT_TIMER as the OR of the two real bits
 for backward compatibility.

OK, that seems like a good solution. But is it worth the additional
complexity in explain.c?

The motivation for this patch was that collecting timing data often
causes performance issues and in some cases it's not needed. But is this
true for row counts?

Are there machines where collecting row counts is above noise level?
I've never seen that, but that's not a proof of nonexistence.

If the overhead of this is negligible, then I could just hide the row
counts from the output.

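One crude way to check that on a given machine (just a sketch - 't' stands for
whatever large table is at hand, and TIMING is the option from this patch):

    \timing on
    SELECT count(*) FROM t;                                 -- no instrumentation
    EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t;   -- row counts only
    EXPLAIN ANALYZE SELECT count(*) FROM t;                 -- row counts + timing
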
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: explain analyze with 'rows' but not timing

2011-12-23 Thread Tomas Vondra

Dne 23.12.2011 22:37, Pavel Stehule napsal(a):

2011/12/23 Tom Lanet...@sss.pgh.pa.us:

Tomas Vondrat...@fuzzy.cz  writes:

The motivation for this patch was that collecting timing data often
causes performance issues and in some cases it's not needed. But is this
true for row counts?


Perhaps more to the point, is there a use case for collecting timing
data without row counts?  I find it hard to visualize a valid reason.


yes - searching for bad predictions


But that's the purpose of collecting row counts without timing data.

TL is asking about the opposite case - collecting timing data without
row counts. I can't imagine such a use case ...


Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: ALTER TABLE IF EXISTS

2012-01-02 Thread Tomas Vondra
On 2 Leden 2012, 14:11, Pavel Stehule wrote:
 Hello

 this is a relatively simple patch that adds the possibility to skip
 nonexistent tables. It is necessary for silent cleaning when a dump is loaded.

Just a curious question - what use case does this solve? Under what
circumstances do you get an ALTER TABLE without a preceding CREATE TABLE?
I can't think of such a scenario ...

Or is this meant for manually written scripts, so that it's possible to alter
the table if it already exists and create it if it does not?

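For illustration, this is the kind of statement the patch enables (the table
name is made up):

    ALTER TABLE IF EXISTS my_table ADD COLUMN note text;
    -- with IF EXISTS the statement is skipped (with a notice) when my_table
    -- is missing, instead of aborting the whole restore script
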
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstat wait timeout

2012-01-04 Thread Tomas Vondra
On 4 Leden 2012, 13:17, pratikchirania wrote:
 I have installed RAMdisk and pointed the parameter:

 #stats_temp_directory = 'B:\pg_stat_tmp'
 I also tried #stats_temp_directory = 'B:/pg_stat_tmp'

 But, still there is no file created in the RAM disk.
 The previous stat file is touched even after the change is made. (I have
 restarted the service after effecting the change)

You have to remove the '#' at the beginning - with it there, the line is
commented out.

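A quick sanity check after fixing the config and restarting (it just reads
back the active setting):

    SHOW stats_temp_directory;
    -- should now report B:/pg_stat_tmp instead of the default pg_stat_tmp
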
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] easy way of copying regex_t ?

2012-01-05 Thread Tomas Vondra
Hi all,

I've been working on an extension that allows moving the ispell dictionaries
to a shared memory segment. It's almost complete; the last FIXME is about
copying the regex_t structure (stored in AFFIX).

According to regex.h the structure is fairly complex and not exactly easy
to understand, so I'd like to know whether anyone here has already implemented
that, or something that may serve the same purpose. Any ideas?

kind regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-01-13 Thread Tomas Vondra
On 13.1.2012 18:07, Josh Berkus wrote:
 Eric's review follows:
 
 Compiling on Ubuntu 10.04 LTS AMD64 on a GoGrid virtual machine from
 2012-01-12 git checkout.
 
 Patch applied fine.
 
 'make check' results in failures when this patch is put into place.
 
 
  6 of 128 tests failed.
 
 
 Here are the relevant failures.
 
 parallel group (2 tests):  create_view create_index
  create_index ... FAILED
 parallel group (9 tests):  create_cast create_aggregate drop_if_exists
 typed_table vacuum constraints create_table_like triggers inherit
  inherit  ... FAILED
 parallel group (20 tests):  select_distinct_on btree_index update random
 select_distinct select_having namespace delete case hash_index union
 select_implicit select_into transactions portals subselect arrays
 aggregates join prepared_xacts
  join ... FAILED
  aggregates   ... FAILED
 parallel group (15 tests):  combocid portals_p2 advisory_lock xmlmap
 tsdicts guc functional_deps dependency select_views cluster tsearch
 window foreign_data foreign_key bitmapops
  foreign_data ... FAILED
 parallel group (19 tests):  limit prepare copy2 conversion xml plancache
 returning temp sequence without_oid with rowtypes truncate polymorphism
 domain largeobject rangefuncs alter_table plpgsql
  alter_table  ... FAILED
 

Fixed. The default value of the TIMING option did not work as intended - it
was set to true even for plain EXPLAIN (without ANALYZE), in which case the
EXPLAIN failed.

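To make the fixed behaviour concrete, a small sketch of the cases involved
(not the actual regression queries):

    EXPLAIN SELECT 1;                         -- plain EXPLAIN must still work
    EXPLAIN (ANALYZE, TIMING OFF) SELECT 1;   -- row counts, no per-node times
    EXPLAIN (ANALYZE, TIMING ON) SELECT 1;    -- current EXPLAIN ANALYZE output
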
Tomas
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 61da6a2..9fc0ae1 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -23,6 +23,7 @@ static int	auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_timing  = false;
 static int auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
 
@@ -133,6 +134,17 @@ _PG_init(void)
 NULL,
 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_timing",
+							 "Collect timing data, not just row counts.",
+							 NULL,
+							 &auto_explain_log_timing,
+							 true,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
/* Install hooks. */
@@ -170,7 +182,12 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		/* Enable per-node instrumentation iff log_analyze is required. */
 		if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
 		{
-			queryDesc->instrument_options |= INSTRUMENT_TIMER;
+			if (auto_explain_log_timing)
+				queryDesc->instrument_options |= INSTRUMENT_TIMER;
+			else
+				queryDesc->instrument_options |= INSTRUMENT_ROWS;
+
+
 			if (auto_explain_log_buffers)
 				queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
 		}
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 8a9c9de..4488956 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -40,6 +40,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replac
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     COSTS [ <replaceable class="parameter">boolean</replaceable> ]
     BUFFERS [ <replaceable class="parameter">boolean</replaceable> ]
+    TIMING [ <replaceable class="parameter">boolean</replaceable> ]
     FORMAT { TEXT | XML | JSON | YAML }
 </synopsis>
  </refsynopsisdiv>
@@ -172,6 +173,20 @@ ROLLBACK;
    </varlistentry>
 
    <varlistentry>
+    <term><literal>TIMING</literal></term>
+    <listitem>
+     <para>
+      Include information on timing for each node. Specifically, include the
+      actual startup time and time spent in the node. This may involve a lot
+      of gettimeofday calls, and on some systems this means significant
+      slowdown of the query.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to 

Re: [HACKERS] PATCH: tracking temp files in pg_stat_database

2012-01-17 Thread Tomas Vondra
On 20.12.2011 19:59, Tomas Vondra wrote:
 On 20.12.2011 11:20, Magnus Hagander wrote:
 2011/12/20 Tomas Vondra t...@fuzzy.cz:

 I haven't updated the docs yet - let's see if the patch is acceptable at
 all first.

 Again, without having reviewed the code, this looks like a feature
 we'd want, so please add some docs, and then submit it for the next
 commitfest!
 
 I've added the docs (see the attachment) and rebased to current head.
 
 Tomas

Fixed a failing regression test (check of pg_stat_database structure).

Tomas
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a12a9a2..3635c3f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -691,6 +691,26 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> re
  </row>
 
  <row>
+  <entry><literal><function>pg_stat_get_db_temp_files</function>(<type>oid</type>)</literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Number of temporary files written for the database. All temporary files are
+   counted, regardless of why the temporary file was created (sorting or hash
+   join) or file size (log_temp_files does not affect this).
+  </entry>
+ </row>
+
+ <row>
+  <entry><literal><function>pg_stat_get_db_temp_bytes</function>(<type>oid</type>)</literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Amount of data written to temporary files for the database. All temporary
+   files are counted, regardless of why the temporary file was created (sorting
+   or hash join) or file size (log_temp_files does not affect this).
+  </entry>
+ </row>
+
+ <row>
   <entry><literal><function>pg_stat_get_numscans</function>(<type>oid</type>)</literal></entry>
   <entry><type>bigint</type></entry>
   <entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 50ba20c..3a0dca3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -574,6 +574,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
 pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
 pg_stat_get_db_conflict_all(D.oid) AS conflicts,
+pg_stat_get_db_temp_files(D.oid) AS temp_files,
+pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes,
 pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
 FROM pg_database D;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 323d42b..6a2ff1d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -286,7 +286,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-
+static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -1339,6 +1339,29 @@ pgstat_report_recovery_conflict(int reason)
 	pgstat_send(&msg, sizeof(msg));
 }
 
+
+/* ----------
+ * pgstat_report_tempfile() -
+ *
+ *	Tell the collector about a temporary file.
+ * ----------
+ */
+void
+pgstat_report_tempfile(size_t filesize)
+{
+	PgStat_MsgTempFile msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
+	msg.m_databaseid = MyDatabaseId;
+	msg.m_filesize = filesize;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+;
+
 /* ----------
  * pgstat_ping() -
  *
@@ -3185,6 +3208,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len);
 					break;
 
+				case PGSTAT_MTYPE_TEMPFILE:
+					pgstat_recv_tempfile((PgStat_MsgTempFile *) msg, len);
+					break;
+
 				default:
 					break;
 			}
@@ -3266,6 +3293,8 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 		result->n_conflict_snapshot = 0;
 		result->n_conflict_bufferpin = 0;
 		result->n_conflict_startup_deadlock = 0;
+		result->n_temp_files = 0;
+		result->n_temp_bytes = 0;
 
 		result->stat_reset_timestamp = GetCurrentTimestamp();
 
@@ -4177,6 +4206,8 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
 	dbentry->n_tuples_updated = 0;
 	dbentry->n_tuples_deleted = 0;
 	dbentry->last_autovac_time = 0;
+	dbentry->n_temp_bytes = 0;
+	dbentry->n_temp_files = 0;
 
 	dbentry->stat_reset_timestamp

Re: [HACKERS] PATCH: tracking temp files in pg_stat_database

2012-01-21 Thread Tomas Vondra
On 20 Leden 2012, 13:23, Magnus Hagander wrote:
 On Tue, Jan 17, 2012 at 21:39, Tomas Vondra t...@fuzzy.cz wrote:
 On 20.12.2011 19:59, Tomas Vondra wrote:
 On 20.12.2011 11:20, Magnus Hagander wrote:
 2011/12/20 Tomas Vondra t...@fuzzy.cz:

 I haven't updated the docs yet - let's see if the patch is acceptable
 at
 all first.

 Again, without having reviewed the code, this looks like a feature
 we'd want, so please add some docs, and then submit it for the next
 commitfest!

 I've added the docs (see the attachment) and rebased to current head.

 Tomas

 Fixed a failing regression test (check of pg_stat_database structure).

 I'm wondering if this (and also my deadlocks stats patch that's int he
 queue) should instead of inventing new pgstats messages, add fields to
 the tabstat message. It sounds like that one is just for tables, but
 it's already the one collecting info about commits and rollbacks, and
 it's already sent on every commit.

Hmmm, I'm not against that, but I'd recommend changing the message name to
something that reflects reality. If it's not just about table
statistics, it should not be named 'tabstats' IMHO. Or maybe split it
into two messages, both sent at commit time.

I do like the idea of not sending a message for each temp file, though.
One thing that worries me is long-running transactions (think of a
batch process that runs for several hours within a single transaction). By
sending the data only at commit, such transactions would not be
accounted properly. So I'd suggest sending the message either at commit
time or after collecting enough data (increment a counter whenever the
struct is updated and send a message once the counter reaches N, for a
reasonable value of N, say 20). But maybe it already works that way - I
haven't checked the current 'tabstat' implementation.

 Adding two fields to that one would add some extra bytes on every
 send, but I wonder if that would ever affect performance, given the
 total size of the packet? And it would certainly be lower overhead in
 the cases where there *have* been temp tables used.

It's not about temp tables, it's about temp files. Which IMHO implies that
there would be exactly 0.01% performance difference because temporary
files are quite expensive.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Finer Extension dependencies

2012-01-21 Thread Tomas Vondra
On 21 Leden 2012, 18:20, Dimitri Fontaine wrote:
 Hi,

 Thank you for reviewing this patch!

 Hitoshi Harada umi.tan...@gmail.com writes:

 Next, some questions:
 - Why is the finer dependency needed? Do you have tangible example
 that struggles with the dependency granularity? I feel so good about
 the existing dependency on extension as an extension developer of
 several ones.

 The problem is not yet very apparent only because extensions are very
 new. The main thing we address with this patch is depending on a feature
 that appeared while developing an extension or that gets removed down
 the line. It allows depending on features and avoids the need to compare
 version numbers and maintain a list of which version number provides
 which feature.

 This feature has been asked for by several extension users, beginning even
 before 9.1 got released.

It's also about several extensions providing the same sort of functionality
implemented in different ways. Think about extensions that allow you to
send e-mails right from the database. One could do that directly, another
could implement queuing, another could be optimized for a specific SMTP
server. With features, you could just say 'require = mail' and use whichever
extension is installed. Yes, this needs a common API.

I personally see this as a major step towards fully-fledged package
management, similar to what is available in modern Linux distributions (apt,
yum, portage, ...).

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: tracking temp files in pg_stat_database

2012-01-21 Thread Tomas Vondra
On 21 Leden 2012, 18:13, Magnus Hagander wrote:
 On Sat, Jan 21, 2012 at 18:02, Tomas Vondra t...@fuzzy.cz wrote:
 On 20 Leden 2012, 13:23, Magnus Hagander wrote:
 I'm wondering if this (and also my deadlocks stats patch that's int he
 queue) should instead of inventing new pgstats messages, add fields to
 the tabstat message. It sounds like that one is just for tables, but
 it's already the one collecting info about commits and rollbacks, and
 it's already sent on every commit.

 Hmmm, I'm not against that, but I'd recommend changing the message name
 to
 something that reflects the reality. If it's not just about table
 statistics, it should not be named 'tabstats' IMHO. Or maybe split that
 into two messages, both sent at the commit time.

 Yes, renaming it might be a good idea. If you split it into two
 messages that would defeat much of the point.

Not really, if combined with the 'send after every N updates or at
transaction end' approach, because then those parts might be sent
independently (thus not transmitting the whole message every time).

 Though I just looked at the tabstat code again, and we already split
 that message up at regular intervals. Which would make it quite weird
 to have global counters in it as well.

 But instead of there, perhaps we need a general non table, but more
 than one type of data message sent out at the same time. There is
 other stuff in the queue for it.

 I'm not convinced either way - I'm not against the original way in
 your patch either. I just wanted to throw the idea out there, and was
 hoping somebody else would have an opinion too :-)

Let's do that in a separate patch. It seems like a good thing to do, but
I don't think it should be mixed with this patch.

 I do like the idea of not sending the message for each temp file,
 though.
 One thing that worries me are long running transactions (think about a
 batch process that runs for several hours within a single transaction).
 By
 sending the data only at the commit, such transactions would not be
 accounted properly. So I'd suggest sending the message either at commit

 By that argument they are *already* not accounted properly, because
 the number of rows and those counters are wrong. By sending the temp
 file data in the middle of the transaction, you won't be able to
 correlate those numbers with the temp file usage.

 I'm not saying the other usecase isn't more common, but whichever way
 you do it, it's going to get inconsistent with *something*.

The question is whether the patch should be modified to send the message
only at commit (i.e. just like tabstats) or whether it can remain the way it
is now. And maybe we could change the way all those messages are sent (e.g.
to the periodic sends I've proposed in the previous post) to make them
more consistent.

I'm not asking for 100% consistency. What I don't like is a batch process
consuming resources but not being reflected in the stats until the final
commit, followed by a huge spike in the stats right after that, although the
activity was actually spread over several hours.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: tracking temp files in pg_stat_database

2012-01-26 Thread Tomas Vondra
On 26 Leden 2012, 14:44, Magnus Hagander wrote:
 On Sat, Jan 21, 2012 at 20:12, Tomas Vondra t...@fuzzy.cz wrote:
 On 21 Leden 2012, 18:13, Magnus Hagander wrote:
 On Sat, Jan 21, 2012 at 18:02, Tomas Vondra t...@fuzzy.cz wrote:
 Though I just looked at the tabstat code again, and we already split
 that message up at regular intervals. Which would make it quite weird
 to have global counters in it as well.

 But instead of there, perhaps we need a general non table, but more
 than one type of data message sent out at the same time. There is
 other stuff in the queue for it.

 I'm not convinced either way - I'm not against the original way in
 your patch either. I just wanted to throw the idea out there, and was
 hoping somebody else would have an opinion too :-)

 Let's make that in a separate patch. It seems like a good thing to do,
 but
 I don't think it should be mixed with this patch.

 Yeah, I'm not sure we even want to do that yet, when we go down this
 road. We might eventually, of course.

Yes, that's one of the reasons why I suggested doing that in a separate
patch (which might be rejected if we find it's a bad idea).

 I've cleaned up the patch a bit and applied it - thanks!

 Other than cosmetic changes, I changed the text in the description around a
 bit (so if that's worse now, that's purely my fault) and added docs
 to the section about pg_stat_database that you'd forgotten.

Great. Thanks for the fixes.

 Also, I'd like to point you in the direction of the
 src/include/catalog/find_unused_oids and duplicate_oids scripts. The
 oids you'd picked conflicted with the pg_extension catalog from a year
 ago. Easy enough to fix, and there are often conflicts with more
 recent oids that the committer has to deal with anyway, but cleaning
 those up before submission always helps a little bit. For the next one
 :-)

Aha! I've been wondering how the committers identify duplicate oids etc.,
but I was unaware of those scripts.

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3 Únor 2012, 17:12, Robert Haas wrote:
 On Sat, Jan 21, 2012 at 10:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jan 13, 2012 at 3:07 PM, Tomas Vondra t...@fuzzy.cz wrote:

 Fixed. The default value of TIMING option did not work as intended, it
 was set to true even for plain EXPLAIN (without ANALYZE). In that case
 the EXPLAIN failed.


 I've applied this over the "show Heap Fetches in EXPLAIN ANALYZE
 output for index-only scans" patch.  It applied and built and passes
 installcheck.

 I have not done a full review of this, but I want to say that I want
 this very much.  Not only can I get the row counts, but also the Heap
 Fetches and the output of BUFFERS, without the overhead of timing.  I
 haven't looked at contrib aspects of it at all.

 Thank you for making this.

 +1.

 After looking at this a bit, I think we can make the interface a bit
 more convenient.  I'd like to propose that we call the EXPLAIN option
 ROWS rather than TIMING, and that we provide the row-count
 information if *either* ROWS or ANALYZE is specified.

 Then, if you want rows without timing, you can say this:

 EXPLAIN (ROWS) query...

 Rather than, as in the approach taken by the current patch:

 EXPLAIN (ANALYZE, TIMING off) query...

 ...which is a little more long-winded.  This also has the nice
 advantage of making the name of the EXPLAIN option match the name of
 the INSTRUMENT flag.

 Revised patch along those lines attached.  Barring objections, I'll
 commit this version.

I don't think changing the EXPLAIN syntax this way is a good idea. I think
that one option should not silently enable/disable others, so ROWS should
not enable ANALYZE automatically. It should throw an error just like the
TIMING option does when invoked without ANALYZE (IIRC).

Which leads to syntax like this

  EXPLAIN (ANALYZE, ROWS)

and that's a bit strange because it mentions ROWS but actually manipulates
TIMING behind the scenes. You can't actually do

  EXPLAIN (ANALYZE, ROWS off)

because row counts are always collected. So I think the syntax I proposed
makes more sense.

kind regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3.2.2012 18:28, Robert Haas wrote:
 On Fri, Feb 3, 2012 at 12:08 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I don't think changing the EXPLAIN syntax this way is a good idea. I think
 that one option should not silently enable/disable others, so ROWS should
 not enable ANALYZE automatically.
 
 I didn't propose that.  The point is that the desired behavior
 (however we name it) is a SUBSET of what ANALYZE does.
 
 So we can either:
 
 1. Have ANALYZE enable all the behavior, and have another option
 (TIMING) that can be used to turn some of it back on again.
 
 2. Have ANALYZE enable all the behavior, and have another option
 (ROWS) that enables just the subset of it that we want.
 
 I prefer #2 to #1, because I think it's a bit confusing to have an
 option whose effect is to partially disable another option.  YMMV, of
 course.

OK, thanks for the explanation. I don't like the idea of subsets as it
IMHO makes it less obvious what options are enabled. For example this

   EXPLAIN (ROWS) query...

does not immediately show it's actually going to do ANALYZE.

I prefer to keep the current 'ANALYZE' definition, i.e. collecting both
row counts and timing data (which is what 99% of people wants anyway),
and an option to disable the timing.

And the BUFFERS option currently works exactly like that, so defining
ROWS the way you proposed would be inconsistent with the current behavior.

Sure, we could redefine BUFFERS as a subset, so you could do

  EXPLAIN (ROWS) ... instead of ... EXPLAIN (ANALYZE, TIMING off)
  EXPLAIN (BUFFERS) ... instead of ... EXPLAIN (ANALYZE, BUFFERS on)

but what if someone wants both at the same time? Maybe he could do

  EXPLAIN (ROWS, BUFFERS)

and treat that as a union of those subsets. I don't think it's worth it.

I surely can live with both solutions (mine or the one you proposed).


kind regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-03 Thread Tomas Vondra
On 3.2.2012 22:51, Robert Haas wrote:
 On Fri, Feb 3, 2012 at 2:56 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK, thanks for the explanation. I don't like the idea of subsets as it
 IMHO makes it less obvious what options are enabled. For example this

   EXPLAIN (ROWS) query...

 does not immediately show it's actually going to do ANALYZE.
 
 Well, it isn't, if ANALYZE means rows + timing...

Yeah, sure. It depends on how you define ANALYZE. I see that as 'stats
collected at runtime' (in contrast to a query plan, which is just a ...
well, plan).

 but what if someone wants both at the same time? Maybe he could do

  EXPLAIN (ROWS, BUFFERS)

 and treat that as a union of those subsets. I don't think it's worth it.
 
 Yeah, I forgot that we'd have to allow that, though I don't think it
 would be a big deal to fix that.
 
 I surely can live with both solutions (mine or the one you proposed).
 
 Let's wait and see if anyone else has an opinion.

Sure. And thanks for your feedback!

Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] performance-test farm

2012-03-05 Thread Tomas Vondra
On 12.5.2011 08:54, Greg Smith wrote:
 Tomas Vondra wrote:

 The idea is that buildfarm systems that are known to have a) reasonable
 hardware and b) no other concurrent work going on could also do
 performance tests.  The main benefit of this approach is it avoids
 duplicating all of the system management and source code building work
 needed for any sort of thing like this; just leverage the buildfarm
 parts when they solve similar enough problems.  Someone has actually
 done all that already; source code was last sync'd to the build farm
 master at the end of March:  https://github.com/greg2ndQuadrant/client-code
 
 By far the #1 thing needed to move this forward from where it's stuck at
 now is someone willing to dig into the web application side of this. 
 We're collecting useful data.  It needs to now be uploaded to the
 server, saved, and then reports of what happened generated.  Eventually
 graphs of performance results over time will be straightforward to
 generate.  But the whole idea requires someone else (not Andrew, who has
 enough to do) sits down and figures out how to extend the web UI with
 these new elements.

Hi,

I'd like to revive this thread. A few days ago we finally got our
buildfarm member working (it's called magpie) - it spends ~2h a
day chewing on the buildfarm tasks, so we can use the other 22h to do
some useful work.

I suppose most of you are busy with 9.2 features, but I'm not, so I'd
like to spend my time on this.

Now that I've had to set up a buildfarm member, I'm somewhat aware of how
the buildfarm works. I've checked the PGBuildFarm/server-code and
greg2ndQuadrant/client-code repositories, and while I certainly am not a
perl whiz, I believe I can tweak them to handle performance-related
results too.

What is the current state of this effort? Is there someone else working
on that? If not, I propose this (for starters):

  * add a new "Performance results" page to the menu, with a list of
    members that uploaded performance results

  * for each member, there will be a list of tests along with a running
    average for each test, the last result and an indicator of whether it
    improved, got worse or stayed the same

  * for each member/test, a history of runs will be displayed, along
    with a simple graph


I'm not quite sure how to define which members will run the performance
tests - I see two options:

  * for each member, add a "run performance tests" flag so that we can
    choose which members are supposed to be safe

  OR

  * run the tests on all members (if enabled in build-farm.conf) and
    then decide which results are relevant based on data describing the
    environment (collected when running the tests)


I'm also wondering whether

  * using the buildfarm infrastructure is the right thing to do, i.e.
    whether it can provide some 'advanced features' (see below)

  * we should use the current buildfarm members (although maybe not all
    of them)

  * it can handle one member running the tests with different settings
    (various shared_buffers/work_mem sizes, numbers of clients etc.) and
    various hw configurations (for example magpie contains a regular
    SATA drive as well as an SSD - it would be nice to run two sets of
    tests, one for the spinner, one for the SSD)

  * it can handle 'pushing' a list of commits to test (instead of
    testing only HEAD), so that we can ask the members to run the
    tests for particular commits in the past (I consider this to be a
    very handy feature)


regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)

2012-03-06 Thread Tomas Vondra
Hi,

I've noticed a locale-specific bug in the regression tests, discovered
thanks to the new magpie buildfarm member (testing the cs_CZ locale).
The problem is in foreign_data, where the output is sorted by a column,
and cs_CZ behaves differently from C and en_US.

More precisely, in C it's true that ('s4' < 'sc') but that's not true
in cs_CZ (and supposedly some other locales).

I've fixed this by replacing 'sc' with 't0', which seems to fix the
ordering (and should work with other locales too). See the patch attached.

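A quick way to see the difference from psql (the exact collation name may
differ between systems - "cs_CZ" is an assumption here):

    SELECT 's4' < 'sc';                    -- true under the C locale
    SELECT 's4' < 'sc' COLLATE "cs_CZ";    -- false under the Czech collation
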
kind regards
Tomas
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 2b3eddf..3f6bce2 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -631,25 +631,25 @@ DROP SERVER s7;
 
 -- CREATE FOREIGN TABLE
 CREATE SCHEMA foreign_schema;
-CREATE SERVER sc FOREIGN DATA WRAPPER dummy;
+CREATE SERVER t0 FOREIGN DATA WRAPPER dummy;
 CREATE FOREIGN TABLE ft1 ();-- ERROR
 ERROR:  syntax error at or near ";"
 LINE 1: CREATE FOREIGN TABLE ft1 ();
                                   ^
 CREATE FOREIGN TABLE ft1 () SERVER no_server;   -- ERROR
 ERROR:  server "no_server" does not exist
-CREATE FOREIGN TABLE ft1 (c1 serial) SERVER sc; -- ERROR
+CREATE FOREIGN TABLE ft1 (c1 serial) SERVER t0; -- ERROR
 NOTICE:  CREATE FOREIGN TABLE will create implicit sequence "ft1_c1_seq" for serial column "ft1.c1"
 ERROR:  default values on foreign tables are not supported
-CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;-- ERROR
+CREATE FOREIGN TABLE ft1 () SERVER t0 WITH OIDS;-- ERROR
 ERROR:  syntax error at or near "WITH OIDS"
-LINE 1: CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;
+LINE 1: CREATE FOREIGN TABLE ft1 () SERVER t0 WITH OIDS;
                                               ^
 CREATE FOREIGN TABLE ft1 (
 	c1 integer NOT NULL,
 	c2 text,
 	c3 date
-) SERVER sc OPTIONS (delimiter ',', quote '"');
+) SERVER t0 OPTIONS (delimiter ',', quote '"');
 COMMENT ON FOREIGN TABLE ft1 IS 'ft1';
 COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
 \d+ ft1
@@ -659,14 +659,14 @@ COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
  c1 | integer | not null  | plain| ft1.c1
  c2 | text|   | extended | 
  c3 | date|   | plain| 
-Server: sc
+Server: t0
 Has OIDs: no
 
 \det+
 List of foreign tables
  Schema | Table | Server |  Options   
 +---++
- public | ft1   | sc | {delimiter=,,quote=\}
+ public | ft1   | t0 | {delimiter=,,quote=\}
 (1 row)
 
 CREATE INDEX id_ft1_c2 ON ft1 (c2); -- ERROR
@@ -737,7 +737,7 @@ Foreign table foreign_schema.foreign_table_1
  c7   | integer | 
  c8   | text| 
  c10  | integer | 
-Server: sc
+Server: t0
 
 -- Information schema
 SELECT * FROM information_schema.foreign_data_wrappers ORDER BY 1, 2;
@@ -761,7 +761,7 @@ SELECT * FROM information_schema.foreign_servers ORDER BY 
1, 2;
  regression | s5  | regression   | 
foo   | | 15.0   | 
regress_test_role
  regression | s6  | regression   | 
foo   | | 16.0   | 
regress_test_indirect
  regression | s8  | regression   | 
postgresql| || 
foreign_data_user
- regression | sc  | regression   | 
dummy | || 
foreign_data_user
+ regression | t0  | regression   | 
dummy | || 
foreign_data_user
  regression | t1  | regression   | 
foo   | || 
regress_test_indirect
  regression | t2  | regression   | 
foo   | || 
regress_test_role
 (7 rows)
@@ -823,7 +823,7 @@ SELECT * FROM information_schema.role_usage_grants WHERE 
object_type LIKE 'FOREI
 SELECT * FROM information_schema.foreign_tables ORDER BY 1, 2, 3;
  foreign_table_catalog | foreign_table_schema | foreign_table_name | 
foreign_server_catalog | foreign_server_name 
 
---+--+++-
- regression| foreign_schema   | foreign_table_1| 
regression | sc
+ regression| foreign_schema   | foreign_table_1| 
regression | t0
 (1 row)
 
 

Re: [HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)

2012-03-07 Thread Tomas Vondra
On 7.3.2012 17:56, Robert Haas wrote:
 On Tue, Mar 6, 2012 at 1:59 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I've noticed a locale-specific bug in regression tests, I discovered
 thanks to the new magpie buildfarm member (testing cs_CZ locale).
 The problem is in foreign_data where the output is sorted by a column,
 and cs_CZ behaves differently from C and en_US.

 More precisely, in C it's true that ('s4' < 'sc') but that's not true
 in cs_CZ (and supposedly some other locales).

 I've fixed this by replacing 'sc' with 't0' which seems to fix the
 ordering (and should work with other locales too). See the patch attached.
 
 This was fixed on master in commit
 3e9a2672d25aed15ae6b4a09decbd8927d069868, but that picked the name s0
 rather than t0.  I suggest we make the same naming decision in the
 back-branch to avoid future confusion...

Yes, that's a better solution - I hadn't noticed that commit. Should I
prepare a modified patch, or is it possible to apply the fix from master
to this branch?

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for a locale-specific bug in regression tests (REL9_1_STABLE)

2012-03-07 Thread Tomas Vondra
On 7.3.2012 21:39, Robert Haas wrote:
 On Wed, Mar 7, 2012 at 3:08 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 7.3.2012 17:56, Robert Haas wrote:
 On Tue, Mar 6, 2012 at 1:59 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I've noticed a locale-specific bug in regression tests, I discovered
 thanks to the new magpie buildfarm member (testing cs_CZ locale).
 The problem is in foreign_data where the output is sorted by a column,
 and cs_CZ behaves differently from C and en_US.

 More precisely, in C it's true that ('s4' < 'sc') but that's not true
 in cs_CZ (and supposedly some other locales).

 I've fixed this by replacing 'sc' with 't0' which seems to fix the
 ordering (and should work with other locales too). See the patch attached.

 This was fixed on master in commit
 3e9a2672d25aed15ae6b4a09decbd8927d069868, but that picked the name s0
 rather than t0.  I suggest we make the same naming decision in the
 back-branch to avoid future confusion...

 Yes, that's a better solution - I haven't noticed that commit. Should I
 prepare a modified patch or is it possible to apply the fix from master
 to this branch?
 
 I tried to cherry-pick it, but there were conflicts, so I guess
 someone will need to go through and adjust.  It's probably only 10
 minutes work, but if you don't mind doing it, I'd be grateful.

Ok, so here's a fixed patch. I haven't used the 3e9a2672 commit
directly, because there seemed to be additional changes. I've simply
renamed 'sc' to 's0' and fixed the differences in the output.

Tomas
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 2b3eddf..2d7e884 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -631,25 +631,25 @@ DROP SERVER s7;
 
 -- CREATE FOREIGN TABLE
 CREATE SCHEMA foreign_schema;
-CREATE SERVER sc FOREIGN DATA WRAPPER dummy;
+CREATE SERVER s0 FOREIGN DATA WRAPPER dummy;
 CREATE FOREIGN TABLE ft1 ();-- ERROR
 ERROR:  syntax error at or near ";"
 LINE 1: CREATE FOREIGN TABLE ft1 ();
                                   ^
 CREATE FOREIGN TABLE ft1 () SERVER no_server;   -- ERROR
 ERROR:  server "no_server" does not exist
-CREATE FOREIGN TABLE ft1 (c1 serial) SERVER sc; -- ERROR
+CREATE FOREIGN TABLE ft1 (c1 serial) SERVER s0; -- ERROR
 NOTICE:  CREATE FOREIGN TABLE will create implicit sequence "ft1_c1_seq" for serial column "ft1.c1"
 ERROR:  default values on foreign tables are not supported
-CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;-- ERROR
+CREATE FOREIGN TABLE ft1 () SERVER s0 WITH OIDS;-- ERROR
 ERROR:  syntax error at or near "WITH OIDS"
-LINE 1: CREATE FOREIGN TABLE ft1 () SERVER sc WITH OIDS;
+LINE 1: CREATE FOREIGN TABLE ft1 () SERVER s0 WITH OIDS;
                                               ^
 CREATE FOREIGN TABLE ft1 (
 	c1 integer NOT NULL,
 	c2 text,
 	c3 date
-) SERVER sc OPTIONS (delimiter ',', quote '"');
+) SERVER s0 OPTIONS (delimiter ',', quote '"');
 COMMENT ON FOREIGN TABLE ft1 IS 'ft1';
 COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
 \d+ ft1
@@ -659,14 +659,14 @@ COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
  c1 | integer | not null  | plain| ft1.c1
  c2 | text|   | extended | 
  c3 | date|   | plain| 
-Server: sc
+Server: s0
 Has OIDs: no
 
 \det+
 List of foreign tables
  Schema | Table | Server |  Options   
 +---++
- public | ft1   | sc | {delimiter=,,quote=\}
+ public | ft1   | s0 | {delimiter=,,quote=\}
 (1 row)
 
 CREATE INDEX id_ft1_c2 ON ft1 (c2); -- ERROR
@@ -737,7 +737,7 @@ Foreign table foreign_schema.foreign_table_1
  c7   | integer | 
  c8   | text| 
  c10  | integer | 
-Server: sc
+Server: s0
 
 -- Information schema
 SELECT * FROM information_schema.foreign_data_wrappers ORDER BY 1, 2;
@@ -757,11 +757,11 @@ SELECT * FROM 
information_schema.foreign_data_wrapper_options ORDER BY 1, 2, 3;
 SELECT * FROM information_schema.foreign_servers ORDER BY 1, 2;
  foreign_server_catalog | foreign_server_name | foreign_data_wrapper_catalog | 
foreign_data_wrapper_name | foreign_server_type | foreign_server_version | 
authorization_identifier 
 
+-+--+---+-++--
+ regression | s0  | regression   | 
dummy | || 
foreign_data_user
  regression | s4  | regression   | 
foo   | oracle  || 
foreign_data_user
  regression | s5  | regression

[HACKERS] proposal : cross-column stats

2010-12-11 Thread Tomas Vondra
Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
thus suboptimal plans).

I had been thinking about this issue before, but that session was the last
impulse that pushed me to try to hack up a proof of concept. So here it
is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, then how to make it more general, what improvements are
possible and what issues there are. Anyway, this is by no means a perfect
or complete solution - it's just a starting point.

--- Short introduction ---

Say we have a table with two INT columns - col_a and col_b - and we want
to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplying the individual selectivities. When the columns are dependent,
the estimate may be way off.

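A tiny example of the problem (a made-up table with perfectly correlated
columns, so the independence assumption is maximally wrong):

    CREATE TABLE corr_test AS
        SELECT i AS col_a, i AS col_b FROM generate_series(1,100000) s(i);
    ANALYZE corr_test;

    EXPLAIN ANALYZE SELECT * FROM corr_test
     WHERE col_a BETWEEN 1 AND 1000 AND col_b BETWEEN 1 AND 1000;

    -- each condition alone matches ~1% of the table, so the planner expects
    -- roughly 0.01 * 0.01 * 100000 = 10 rows, while ~1000 rows actually match
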
Let's assume there are histograms with 5 bins for both columns. What I
propose is basically building a 2D histogram. It kind of resembles a
contingency table.

So we have a table like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===||==
  20%  ||  4% |  4% |  4% |  4% |  4% |
  20%  ||  4% |  4% |  4% |  4% |  4% |
  20%  ||  4% |  4% |  4% |  4% |  4% |
  20%  ||  4% |  4% |  4% |  4% |  4% |
  20%  ||  4% |  4% |  4% |  4% |  4% |
===||==

where each column / row represents a bin in the original histograms, and
each cell represents an expected number of rows in it (for really
independent columns, it's 4%).

For dependent columns the actual values may be actually very different,
of course - e.g. for strongly dependent columns it might be like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===||==
  20%  || 20% |  0% |  0% |  0% |  0% |
  20%  ||  0% | 20% |  0% |  0% |  0% |
  20%  ||  0% |  0% | 20% |  0% |  0% |
  20%  ||  0% |  0% |  0% | 20% |  0% |
  20%  ||  0% |  0% |  0% |  0% | 20% |
===||==

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

The current optimizer estimates this to be 4%, as it believes the columns
are independent.

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.

--- Proof of concept ---

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
  - this one is used to collect the stats, the parameters represent
number of bins for the columns (sides of the contingency table)
  - collect_stats(10,10) will build contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
  - computes estimate for the condition listed above (ranges on both
columns)
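
For illustration, a minimal sketch of what the cross_stats table and the
estimate query could look like (the layout sketched here is just an
illustration, not necessarily the actual PoC schema):

   -- hypothetical layout of the cross-column stats (one row per cell)
   CREATE TABLE cross_stats (
       lo_a INT, hi_a INT,       -- col_a range covered by the cell
       lo_b INT, hi_b INT,       -- col_b range covered by the cell
       fraction NUMERIC          -- fraction of rows falling into the cell
   );

   -- estimate for (col_a BETWEEN 30 AND 129) AND (col_b BETWEEN 234 AND 484)
   -- = sum of the fractions of all cells overlapping both ranges (ignoring
   -- the error caused by partially overlapping cells)
   SELECT sum(fraction) FROM cross_stats
    WHERE hi_a >= 30 AND lo_a <= 129
      AND hi_b >= 234 AND lo_b <= 484;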

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000),
   round(random()*1000) FROM generate_series(1,10);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see current estimated and actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table
   WHERE (col_a BETWEEN 30 AND 129)
 AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

   SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - col_a/col_b contain the ranges from
the query, followed by the actual number of matching rows, the current
estimate, and finally the new estimates based on contingency tables with
various numbers of bins.

A) independent columns (proof that it produces just as good estimates
   as the current code)

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 |
  [50,70]  |  [50,70]  |41  | 40   |   41  |47 |
  [50,250] |  [50,250] |  4023  |   4024   | 4436  |  3944 |
  [50,250] | [750,950] |  4023  |   3955   | 4509  |  

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
 Lets talk about one special case - I'll explain how the proposed
 solution works, and then I'll explain how to make it more general, what
 improvements are possible, what issues are there. Anyway this is by no
 means a perfect or complete solution - it's just a starting point.
 
 It looks like you handled most of the issues. Just a few points:
 
 - This is obviously applicable to more than just integers, probably
   anything with a b-tree operator class. What you've coded seems rely
   on calculations on the values. Have you thought about how it could
   work for, for example, strings?

Yes, I know, I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column). And this can be done whenever you can sort the values - you
really don't need any calculations. So it should work with strings.

And another thing I somehow forgot is handling the case when there is no
histogram, just an MCV list. That's mostly the same - each of the values
might be a separate group, or the values might be grouped to form fewer
groups, etc.

 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 Not that I suggest you fix this, but it's food for though. Though
 strictly speaking this is a different kind of correlation than what
 you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get to a cell-level of the contingency
table, there is no additional (more detailed) information. So we're
stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more
info - a correlation coefficient for each bin, or something like that.
But that was not part of my proposal, and I'm not sure how to do that.

 2) I really don't think we should collect stats for all combinations of
columns of a table - I do like the Oracle-like approach where a DBA
has to enable cross-column stats using an ALTER TABLE (for a
particular list of columns).

The only exception might be columns from a multi-column index. It
might be quite efficient I guess?
 
 In the past it has been suggested to only do it for multi-column
 indexes, but I find these days I find in some situations I prefer to
 make individual indexes and let the bitmap scan code combine them. So
 perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

 3) There are independence tests for contingency tables (e.g. Pearson's
Chi-squared test), so that it's easy to find out whether the columns
are independent. In that case we can just throw away these stats and
use the simple estimation.

http://mathworld.wolfram.com/Chi-SquaredTest.html
 
 I think this would be good to include, if possible. 
 
 Actually, I wonder if the existing stats collection code could be
 altered to attempt to calculate the correlation between columns as part
 of its other work.

I guess that would be rather expensive - to compute correlation you need
two passes, and you need to do that for each pair or columns. So I'd be
surprised if it is possible (and effective).

Another thing is that you can compute correlation only for numeric
columns, so it's not possible to do that for city/ZIP code mentioned
above. More precisely - it's possible to do that (if you map strings to
numbers somehow), but I doubt you'll get useful results as the
assignment is rather random.

Well, you could ask the governments to assign the ZIP codes to cities in
strictly alphabetical order, but I guess they'll say no.

 4) Or we could store just those cells where expected and observed values
differ significantly (may help if most of the values are independent,
but there's a small glitch somewhere).
 
 Compressing that grid would be useful, given that for many dimensions
 most of the grid will be not interesting. In fact, storing the 20
 largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather values that are
significantly different from the expected values. Generally there are
two interesting cases:

expected << observed - The optimizer may choose index scan, although
   the seq scan would be better.

expected >> observed - The optimizer may choose seq scan, although
   the index scan would be better.

regards
Tomas


Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 12.12.2010 15:43, Heikki Linnakangas wrote:
 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 
 Yeah, and that's actually analogous to the example I used in my
 presentation.
 
 The way I think of that problem is that once you know the postcode,
 knowing the city name doesn't add any information. The postcode implies
 the city name. So the selectivity for postcode = ? AND city = ? should
 be the selectivity of postcode = ? alone. The measurement we need is
 implicativeness: How strongly does column A imply a certain value for
 column B. Perhaps that could be measured by counting the number of
 distinct values of column B for each value of column A, or something
 like that. I don't know what the statisticians call that property, or if
 there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?
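
Just to illustrate the kind of query I have in mind (the table and column
names are made up):

   -- number of distinct cities for each of the most common ZIP codes
   -- (the same could be done the other way around)
   SELECT zip_code, count(DISTINCT city) AS cities
     FROM addresses
    WHERE zip_code IN ('12345', '23456', '34567')  -- MCV list for zip_code
    GROUP BY zip_code;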

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 12.12.2010 17:33, Florian Pflug wrote:
 On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
 The way I think of that problem is that once you know the postcode, knowing 
 the city name doesn't add any information. The postcode implies the city 
 name. So the selectivity for postcode = ? AND city = ? should be the 
 selectivity of postcode = ? alone. The measurement we need is 
 implicativeness: How strongly does column A imply a certain value for 
 column B. Perhaps that could be measured by counting the number of distinct 
 values of column B for each value of column A, or something like that. I 
 don't know what the statisticians call that property, or if there's some 
 existing theory on how to measure that from a sample.
 
 The statistical term for this is conditional probability, written P(A|B), 
 meaning the probability of A under the assumption or knowledge of B. The 
 basic tool for working with conditional probabilities is bayes' theorem which 
 states that
 
 P(A|B) = P(A and B) / P(B).
 
 Currently, we assume that P(A|B) = P(A), meaning the probability (or 
 selectivity as we call it) of an event (like a=3) does not change under 
 additional assumptions like b=4. Bayes' theorem thus becomes
 
 P(A) = P(A and B) / P(B)=
 P(A and B) = P(A)*P(B)
 
 which is how we currently compute the selectivity of a clause such as WHERE 
 a=3 AND b=4.
 
 I believe that measuring this by counting the number of distinct values of 
 column B for each A is basically the right idea. Maybe we could count the 
 number of distinct values of b for every one of the most common values of 
 a, and compare that to the overall number of distinct values of b...

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCV as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency table weirdo, who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristic to estimate the
probability from the MCVs, or something like that? Could we use some
average probability instead?

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 13.12.2010 01:05, Robert Haas wrote:
 This is a good idea, but I guess the question is what you do next.  If
 you know that the applicability is 100%, you can disregard the
 restriction clause on the implied column.  And if it has no
 implicatory power, then you just do what we do now.  But what if it
 has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the AND probability we are looking for (i.e. the selectivity).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is

   P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may
be estimated from the contingency (see the previous mails). And P(A and
B) is actually the value we're looking for.

Anyway there really is no intermediate degree of applicability - it just
gives you the right estimate.

And AFAIR this is easily extensible to more than two columns, as

  P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.

Well, I hope my statements are really correct - it's been a few years
since I gained my degree in statistics ;-)

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
 P(A|B) = P(A and B) / P(B).

Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

Sometimes that may return nearly 2x the actual selectivity, but in
general it's a reasonable estimate. Are there any severe failure cases
that produce much worse estimates?

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
On 13.12.2010 03:00, Robert Haas wrote:
 Well, the question is what data you are actually storing.  It's
 appealing to store a measure of the extent to which a constraint on
 column X constrains column Y, because you'd only need to store
 O(ncolumns^2) values, which would be reasonably compact and would
 potentially handle the zip code problem - a classic hard case rather
 neatly.  But that wouldn't be sufficient to use the above equation,
 because there A and B need to be things like column X has value x,
 and it's not going to be practical to store a complete set of MCVs for
 column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 16:34, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 Well, until this point we've discussed failure cases involving 'AND'
 conditions. What about 'OR' conditions? I think the current optimizer
 computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
 found in backend/optimizer/path/clausesel.c:630).
 
 If you can solve the AND case, the OR case falls out of that.  Just
 replace s1*s2 with a more accurate AND calculation.

Oh yeah, now I see - it's just the usual equation

   P(A or B) = P(A) + P(B) - P(A and B)

and we're estimating P(A and B) as P(A)*P(B).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 16:38, Tom Lane wrote:
 The reason that this wasn't done years ago is precisely that nobody's
 figured out how to do it with a tolerable amount of stats data and a
 tolerable amount of processing time (both at ANALYZE time and during
 query planning).  It's not hard to see what we'd ideally like to do;
 it's getting from there to something useful in production that's hard.

OK, I fully realize that. My plan is to simply

  (a) find out what statistics we need to collect and how to use them
  (b) implement a really stupid inefficient solution
  (c) optimize in iterations, i.e. making it faster, consuming less
  space etc.

It will take time, it won't be perfect for a long time, but every
journey starts somewhere.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 18:59, Joshua Tolley wrote:
 On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
 Another quick note: I think that storing the full contingency table is
 wasteful since the marginals are already stored in the single column
 statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
 looking at this a couple years back ).
 
 Josh Tolley still looks at it occasionally, though time hasn't permitted any
 sort of significant work for quite some time. The multicolstat branch on my
 git.postgresql.org repository will create an empirical copula for each
 multi-column index, and stick it in pg_statistic. It doesn't yet do anything
 useful with that information, nor am I convinced it's remotely bug-free. In a
 brief PGCon discussion with Tom a while back, it was suggested a good place
 for the planner to use these stats would be clausesel.c, which is responsible
 for handling code such as ...WHERE foo > 4 AND foo < 5.

Well, that's good news ;-)

I've done a bit of research today, and I've found some quite interesting
papers on this topic (though I did not have time to read them properly -
in most cases I've read just the title and abstract).

[1] Selectivity estimators for multidimensional range queries over real
attributes
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914

This seems like a very good starting point. AFAIK it precisely
describes what data need to be collected, how to do the math etc.

[2] Selectivity Estimation Without the Attribute Value Independence
Assumption
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126

This obviously deals with the independence problem. Haven't
investigated it further, but it seems worth reading.

[3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions
and Sort Operations (A Cost Function for Uniformly Partitioned
UB-Trees)
http://mistral.in.tum.de/results/publications/MB00_ideas.pdf

Describes something called UB-Tree, and shows how it may be used to
do estimates. Might be interesting as an alternative to the
traditional histograms.

There are more details about UB-Trees at

http://mistral.in.tum.de/results/publications/

[4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/

A rather nice collection of papers related to estimation (including
some of the papers listed above).

Hm, I planned to finally read "Understanding MySQL Internals" over
Xmas ... that obviously won't happen.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
On 13.12.2010 22:50, Josh Berkus wrote:
 Tomas,
 
   (a) find out what statistics do we need to collect and how to use it
   (b) implement a really stupid inefficient solution
   (c) optimize in iterations, i.e. making it faster, consuming less
   space etc.
 
 I'll suggest again how to decide *which* columns to cross: whichever
 columns are combined in composite indexes.  In version 2, allow the DBA
 to specify combinations.
 
 In the unlikely event that correlation could be reduced to a single
 float number, it would be conceivable for each column to have an array
 of correlation stats for every other column where correlation was
 non-random; on most tables (i.e. ones with less than 100 columns) we're
 not talking about that much storage space.
 
 The main cost would be the time spent collecting that info ...

I think it is a bit early to discuss this, given the fact that we
don't have a working solution yet. But OK, let's discuss these options
anyway:

1) collecting them automatically for composite indexes

   I don't think this is a wise idea. The first versions definitely won't
   be very efficient, and collecting the data for each composite
   index means everyone will be hit by this inefficiency, even if they
   do not actually need it (e.g. the columns are independent so the
   current estimates are quite accurate, or they're not using those
   columns very often in the same WHERE clause).

   Another reason against this is that many DBAs don't actually use
   composite indexes - they simply create indexes on each column and let
   the bitmap index scans work it out. And this would not cover
   that case.

   And actually it's not very complicated to allow the DBA to do this -
   it can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

   Again, you're effectively forcing everyone to pay the price even
   though they may not need the feature. Maybe we'll get there one day,
   but it's not a good idea to do that from the beginning.

   And the correlation itself has very limited use in real life, as
   it's not possible to compute it for character columns and it is not
   very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 12.12.2010 15:43, Heikki Linnakangas wrote:
 On 12.12.2010 15:17, Martijn van Oosterhout wrote:
 On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
 Very cool that you're working on this.
 
 +1
 
 Lets talk about one special case - I'll explain how the proposed
 solution works, and then I'll explain how to make it more general, what
 improvements are possible, what issues are there. Anyway this is by no
 means a perfect or complete solution - it's just a starting point.

 It looks like you handled most of the issues. Just a few points:

 - This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

 The classic failure case has always been: postcodes and city names.
 Strongly correlated, but in a way that the computer can't easily see.
 
 Yeah, and that's actually analogous to the example I used in my
 presentation.
 
 The way I think of that problem is that once you know the postcode,
 knowing the city name doesn't add any information. The postcode implies
 the city name. So the selectivity for postcode = ? AND city = ? should
 be the selectivity of postcode = ? alone. The measurement we need is
 implicativeness: How strongly does column A imply a certain value for
 column B. Perhaps that could be measured by counting the number of
 distinct values of column B for each value of column A, or something
 like that. I don't know what the statisticians call that property, or if
 there's some existing theory on how to measure that from a sample.
 
 That's assuming the combination has any matches. It's possible that the
 user chooses a postcode and city combination that doesn't exist, but
 that's no different from a user doing city = 'fsdfsdfsd' on a single
 column, returning no matches. We should assume that the combination
 makes sense.

Well, I've read several papers this week, in an attempt to find out what
possibilities are there when implementing cross-column statistics. One
of the papers seems like a reasonable solution to this particular
problem - discrete data + equality queries.

The article is "A Bayesian Approach to Estimating The Selectivity of
Conjunctive Predicates" (written by Heimel, Markl and Murthy). It's
freely available as a PDF

http://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf

I do have a PDF with my own notes in it (as annotations), let me know if
you're interested ...

The basic principle is that instead of 'attribute value independence'
(which is the problem with correlated columns) they assume 'uniform
correlation' which is a much weaker assumption.

In the end, all they need to compute an estimate is the number of distinct
values for each of the columns (we already have that in pg_stats) and the
number of distinct values for the group of columns used in the query. They
really don't need any multidimensional histogram or anything like that.
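
The per-column part is already there - something like this (just an
illustration) shows the ndistinct estimates ANALYZE has collected:

   SELECT attname, n_distinct
     FROM pg_stats
    WHERE tablename = 'my_table'
      AND attname IN ('a', 'b');

The missing piece is the number of distinct values for the group of
columns as a whole.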

The funny thing is I've implemented most of the PoC before reading the
article - the only difference was the way to combine the estimates (

---------------- pros and cons ----------------

So let's see the pros:

* it's reasonably simple, the overhead should be minimal I guess (we're
  already estimating distinct values for the columns, so I guess we can
  do that for multiple columns with reasonable impact)

* there are no 'advanced data structures' as multi-dimensional
   histograms that are expensive to build and maintain

* it consistently produces better estimates than the current estimator
  based on attribute value independence assumption, and it's reasonably
  accurate (I've been playing with it for some time, so we'll need
  more tests of course)

* it's easily extensible to more columns

* I guess we could improve the estimates by our own heuristics, to
  catch the special case when one column is perfectly implied by
  another one (e.g. when state is implied by ZIP code) - we can catch
  that by 'distinct for combination = distinct for column' and use just
  the estimate for one of the columns

but there are some cons too:

* this really is not designed to work with real-valued data, it's a
  very nice solution for discrete data (namely 'labels' as for example
  city/state names, ZIP codes etc.)

* you need to know the list of columns in advance (there's nothing like
  'let's build' the stats for columns (a,b,c) and then query just (a,b))
  as you need to know the count of distinct values for the queried
  columns (well, maybe it's possible - I guess we could 'integrate'
  over all possible values of the omitted column)

* it's built for 'equality' queries, although maybe we could modify it
  for inequalities too (but it won't be as elegant), and I guess the
  equality queries are much more common in case of discrete data (OK,
  you can run something like

WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Hi,

I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics

  http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a
lot of discussion in this thread, and it's difficult to follow it, so
I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from
those papers.

-- instances of the problem --

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by
'discrete' I mean the value is rather a label than a value. It does not
make sense to add or subtract them, etc. So for example city names or
zip codes are discrete values (although zip codes are numbers etc).

Second, the queries consist of equality or inequality (range)
conditions.

So actually there are four instances of the problem:

                   | equality conditions | inequality conditions |
  -----------------+---------------------+-----------------------+
   discrete values |          A          |           D           |
   numeric values  |          C          |           B           |
  -----------------+---------------------+-----------------------+

I think those four problems should be treated separately, although some
of the techniques may be common.

A) discrete values and equality conditions

   One of the papers (A Bayesian Approach to Estimating The Selectivity
   of Conjunctive Predicates) describes a quite interesting solution to
   this problem - I've already posted a description on how to apply it
   to the ZIP code / city name problem - see this

   http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions

   Most of the papers dealing with this problem are based on
   discretization and multi-dimensional histograms to approximate the
   joint distribution. So I believe my initial proposal was not a
   complete nonsense.

   Once we have a working histogram-based solution, we can work on
   precision and efficiency (how much space is needed to store it, how
   long does it take to compute an estimate, etc.). There are two
   ways to do that.

   First, most of the papers offer an enhanced type of histogram
   (although the histogram proposed in the paper is always evaluated as
   the most efficient one, which is a bit suspicious). Anyway there are
   advanced quite-promising ways to build multi-dimensional histograms.

   Second, the paper "Selectivity estimation using probabilistic
   models" describes a solution based on Bayesian networks. That seems
   really promising (actually it addresses join selectivity
   estimation too).

   So yes, I'm quite confident we can solve this issue and improve the
   efficiency - either by an advanced histogram or using Bayesian
   networks.

C) numeric values and equality conditions

   OK, I'm not sure how to solve this case. But the queries against
   numeric columns are range queries in most cases I guess, so maybe
   that's not that big a deal.

D) discrete values and inequality conditions

   Basically, this can be handled just like numeric values after
   discretization, i.e. using a histogram. But this is usually

E) a combination of discrete / numeric columns

   I'm not sure how to deal with this. Obviously it's possible to
   build a multi-dimensional histogram and estimate as many queries as
   possible.

---

The papers describe some interesting alternative approaches, e.g. based
on SVD (singular value decomposition) or random sampling (or kernel
estimators, which are an enhanced version of sampling).

But there are various issues associated with those solutions: SVD is
applicable to 2D only, so we'd be stuck with 2 columns, and random
sampling sounds a bit suspicious to me.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 19:58, Robert Haas wrote:
 I haven't read the paper yet (sorry) but just off the top of my head,
 one possible problem here is that our n_distinct estimates aren't
 always very accurate, especially for large tables.  As we've discussed
 before, making them accurate requires sampling a significant
 percentage of the table, whereas all of our other statistics can be
 computed reasonably accurately by sampling a fixed amount of an
 arbitrarily large table.  So it's possible that relying more heavily
 on n_distinct could turn out worse overall even if the algorithm is
 better.  Not sure if that's an issue here, just throwing it out
 there...

Yes, you're right - the paper really is based on (estimates of) number
of distinct values for each of the columns as well as for the group of
columns.

AFAIK it will work with reasonably precise estimates, but the point is
you need an estimate of distinct values of the whole group of columns.
So when you want to get an estimate for queries on columns (a,b), you
need the number of distinct value combinations of these two columns.

And I think we're not collecting this right now, so this solution
requires scanning the table (or some part of it).

I know this is a weak point of the whole solution, but the truth is
every cross-column stats solution will have to do something like this. I
don't think we'll find a solution with zero performance impact, without
the need to scan a sufficient part of the table.

That's why I want to make this optional so that the users will use it
only when really needed.

Anyway one possible solution might be to allow the user to set these
values manually (as in case when ndistinct estimates are not precise).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 22:24, Tom Lane wrote:
 That seems likely to be even more unreliable than our existing
 single-column estimates :-(
 
   regards, tom lane

Well, yes :-(

I guess this is a place where we could use a multi-column index, if it
contains all the interesting columns. We could scan the index, not the
whole table.

That could save us tremendous amount of I/O and should be quite precise
(unless it's severely bloated).

Another thing is those 'discrete' columns are usually quite stable, i.e.
there's usually a limited list of values and it does not change very
often. Think about ZIP codes, states, etc. And the combinations are
quite stable too (counties do not move to other states, etc.).

So I think scanning a reasonable part of a table should be enough in
these cases.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
On 17.12.2010 22:41, Heikki Linnakangas wrote:
 On 17.12.2010 23:13, Tomas Vondra wrote:
 On 17.12.2010 19:58, Robert Haas wrote:
 I haven't read the paper yet (sorry) but just off the top of my head,
 one possible problem here is that our n_distinct estimates aren't
 always very accurate, especially for large tables.  As we've discussed
 before, making them accurate requires sampling a significant
 percentage of the table, whereas all of our other statistics can be
 computed reasonably accurately by sampling a fixed amount of an
 arbitrarily large table.  So it's possible that relying more heavily
 on n_distinct could turn out worse overall even if the algorithm is
 better.  Not sure if that's an issue here, just throwing it out
 there...

 Yes, you're right - the paper really is based on (estimates of) number
 of distinct values for each of the columns as well as for the group of
 columns.

 AFAIK it will work with reasonably precise estimates, but the point is
 you need an estimate of distinct values of the whole group of columns.
 So when you want to get an estimate for queries on columns (a,b), you
 need the number of distinct value combinations of these two columns.

 And I think we're not collecting this right now, so this solution
 requires scanning the table (or some part of it).
 
 Any idea how sensitive it is to the accuracy of that estimate on
 distinct value combinations? If we get that off by a factor of ten or a
 hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two
columns the selectivity equation is

  (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.
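
To make the formula concrete, a tiny sketch (the function is purely
illustrative, not part of any patch):

   CREATE FUNCTION uniform_corr_sel(float8, float8, bigint, bigint, bigint)
   RETURNS float8 LANGUAGE sql IMMUTABLE AS
   $$ SELECT ($3 * $1 + $4 * $2) / (2 * $5)::float8 $$;
   -- arguments: sel(A), sel(B), dist(A), dist(B), dist(A,B)

   -- e.g. dist(A)=10, dist(B)=100, dist(A,B)=100, sel(A)=1/10, sel(B)=1/100
   SELECT uniform_corr_sel(0.1, 0.01, 10, 100, 100);  -- 0.01, i.e. 1/100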

dependency on dist(A), dist(B) and dist(A,B)


So it seems to be proportional to dist(A) and dist(B), and inverse
proportional to dist(A,B). So when increasing the dist(A) and dist(B)
10x, you'll get a 10x the original estimate. Similarly, by increasing
the dist(A,B) 10x, you'll get 10x lower estimate.

upper bound
---

Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

  (sel(A) + sel(B)) / 2

I guess we can say that (sel(A)  sel(A,B)) is generally the same as

  sel(A) = sel(A,B)

  i.e. we can use the heuristic I've already offered in the PoC.

lower bound
---

Not sure what the lower bound is, but it might be 0 if the dist(A) and
dist(B) are very small and/or dist(A,B) is huge.

regards
Tomas




Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Tomas Vondra
On 18.12.2010 16:59, Florian Pflug wrote:
 On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
 
 On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is

 (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.

 Huh? This is the selectivity estimate for A = x AND B = y? Surely,
 if A and B are independent, the formula must reduce to sel(A) * sel(B),
 and I cannot see how that'd work with the formula above.

 Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
 conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

 and uniform correlation assumption so that it's possible to replace the
 conditional probabilities with constants. And those constants are then
 estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
 
 Ok, I think I understand this now. The selectivity equation actually
 *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
 for sel(A).
 
 Take the clause A = x AND B = y for example. Without knowing anything
 about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
 
   sel(A=x) = 1 / dist(A)
   sel(B=y) = 1 / dist(B).
 
 This is also what we do currently, according to var_eq_non_const() in
 src/backend/utils/adt/selfuncs.c, if we don't have any additional
 knowledge about x (Actually, we also factor the probability of A being
 NULL into this).
 
 With these estimates, your formula becomes
 
   sel(A=x,B=y) = 1 / dist(A,B).
 
 and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
 
   sel(A=x,B=y) = sel(A=x) * sel(B=y).
 
 If, however, y is a constant, then we use the MKVs to estimate sel(B=y)
 (var_eq_const() in src/backend/utils/adt/selfuncs.c). If
 
   sel(B=y) ~= 0,
 
 we'd currently also conclude that
 
   sel(A=x,B=y) ~= 0.
 
 With the uniform correlation approach, we'd instead estimate
 
   sel(A=x,B=y) ~= sel(A=x) / dist(B)
 
 assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
 If dist(B) is small, this estimate is much worse than what we'd currently
 get, since we've effectively ignored the information that the restriction
 B=y alone guarantees that only very few rows will match.

Well, I guess you're right. Let's see two examples - uncorrelated and
highly correlated data (and then a few comments on what I think you're
missing).

uncorrelated

Let there be 10 distinct values in A, 100 distinct values in B, and
there are all possible combinations (10x100 = 1000). All the values (and
combinations) are uniformly distributed.

The current estimate is correct, i.e.

  sel(A=a) = 1/10 = 1/dist(A)
  sel(B=b) = 1/100 = 1/dist(B)
  sel(A=a, B=b) = sel(A) * sel(B) = 1/1000  /* due to AVI */

the proposed estimate is

  sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
= (10 * 1/10 + 100 * 1/100) / 2000 = 1/1000

so actually it produces the same estimate, but that's due to the uniform
distribution assumption.

Let's say the selectivities for A and B are very different - A is 10x
less common, B is 10x more common (let's say we know this). Then the
current estimate is still correct, i.e. it gives

  sel(A=a) = 1/100
  sel(B=b) = 1/10

  = sel(A=a,B=b) = 1/1000

but the new estimate is

  (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200

So yes, in case of uncorrelated data this overestimates the result.


highly correlated
-
Again, let dist(A)=10, dist(B)=100. But in this case let A be determined
by B, i.e. for each value in B there is a unique value in A. Say
A=mod(B,10) or something like that. So now dist(A,B)=100.

Assuming uniform distribution, the correct estimate is

  sel(A=a,B=b) = sel(B=b) = 1/100

and the current estimate is

  sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is

  (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct.

Without the uniform distribution (again, sel(A)=1/100, sel(B)=1/10), we
currently get (just as before)

  sel(A=a,B=b) = 1/10 * 1/100 = 1/1000

which is 100x less than the correct value (sel(B)=1/10). The new
estimate yields

  (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's
considerably more accurate than the current estimate (1/1000).


comments

I really don't think this is a perfect solution giving 100% accurate
results in all cases. But the examples above illustrate that it gives
much better estimates in case of correlated data.

It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option
when the user (DBA, developer, ...) realizes the current solution gives
really bad estimates (due to correlation). In that case he could create
'cross-column' statistics on those columns

Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-18 Thread Tomas Vondra
On 12.12.2010 03:47, Robert Haas wrote:
 On Sat, Dec 11, 2010 at 4:40 PM,  t...@fuzzy.cz wrote:
 Hello

 you have to respect pg coding style:

 a) not too long lines
 b) not C++ line comments

 OK, thanks for the notice. I've fixed those two problems.
 
 Please add this to the currently-open commitfest.
 
 https://commitfest.postgresql.org/action/commitfest_view/open

I've done several small changes to the patch, namely

- added docs for the functions (in SGML)
- added the same thing for background writer

So I think now it's 'complete' and I'll add it to the commit fest in a
few minutes.

regards
Tomas
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5fd0213..40c6c6b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -600,6 +600,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
  /row
 
  row
+  
entryliteralfunctionpg_stat_get_db_last_stat_reset_time/function(typeoid/type)/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+   Time of the last reset of statistics for this database (as a result of 
executing
+  functionpg_stat_reset/function function)
+  /entry
+ /row
+
+ row
   
entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry
   entrytypebigint/type/entry
   entry
@@ -725,6 +734,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
  /row
 
  row
+  
entryliteralfunctionpg_stat_get_last_stat_reset_time/function(typeoid/type)/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+   Time of the last reset of statistics on this table (as a result of 
executing
+  functionpg_stat_reset/function or 
functionpg_stat_reset_single_table_counters/function)
+  /entry
+ /row
+
+ row
   
entryliteralfunctionpg_stat_get_vacuum_count/function(typeoid/type)/literal/entry
   entrytypebigint/type/entry
   entry
@@ -879,6 +897,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
  /row
 
  row
+  
entryliteralfunctionpg_stat_get_function_last_stat_reset_time/function(typeoid/type)/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+   Time of the last reset of statistics for this function (as a result of 
executing
+  functionpg_stat_reset/function or 
functionpg_stat_reset_single_function_counters/function)
+  /entry
+ /row
+
+ row
   
entryliteralfunctionpg_stat_get_xact_function_calls/function(typeoid/type)/literal/entry
   entrytypebigint/type/entry
   entry
@@ -1064,6 +1091,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
  /row
 
  row
+  
entryliteralfunctionpg_stat_get_bgwriter_last_stat_reset_time()/function/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+   Time of the last reset of statistics for the background writer (as a 
result of executing
+  functionpg_stat_reset_shared('bgwriter')/function)
+  /entry
+ /row
+
+ row
   
entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry
   entrytypebigint/type/entry
   entry
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 346eaaf..3e1cca3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -310,6 +310,7 @@ CREATE VIEW pg_stat_all_tables AS
 pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
 pg_stat_get_last_analyze_time(C.oid) as last_analyze,
 pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
+pg_stat_get_last_stat_reset_time(C.oid) as last_stat_reset,
 pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
 pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
 pg_stat_get_analyze_count(C.oid) AS analyze_count,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
 pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
-pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
+pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
+pg_stat_get_db_last_stat_reset_time(D.oid) AS last_stat_reset
 FROM pg_database D;
 
 CREATE VIEW pg_stat_user_functions AS
@@ -512,7 +514,8 @@ CREATE VIEW pg_stat_user_functions AS
 P.proname AS funcname,
 pg_stat_get_function_calls(P.oid) AS calls,
 pg_stat_get_function_time(P.oid) / 1000 AS total_time,
-pg_stat_get_function_self_time(P.oid) / 1000 AS self_time
+pg_stat_get_function_self_time(P.oid) / 1000 AS self_time,
+pg_stat_get_function_last_stat_reset_time(P.oid) AS last_stat_reset
 FROM pg_proc P LEFT JOIN pg_namespace N ON (N.oid = 

Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-19 Thread Tomas Vondra
On 19.12.2010 17:26, Tom Lane wrote:
 That seems like quite a bizarre definition.  What I was envisioning was
 that we'd track only the time of the last whole-database stats reset.

Well, but that does not quite work. What I need to know is whether the
snapshot is 'consistent', i.e. whether I can compare it to the previous
snapshot and get meaningful results (so that I can perform some analysis
on the difference).

And by 'consistent' I mean that none of the counters was reset since the
previous snapshot. The most obvious and most flexible way to do this is
keeping track of the last reset for each of the counters, which is
exactly what I've done in the patch.
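
To show the intended use (the column name is the one added by the patch
to pg_stat_database), checking whether a snapshot can be compared to the
previous one is a single query:

   -- databases whose counters were reset since the previous snapshot
   -- (taken here at 2010-12-19 12:00:00) - their deltas are meaningless
   SELECT datname, last_stat_reset
     FROM pg_stat_database
    WHERE last_stat_reset > '2010-12-19 12:00:00';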

The other possibility I've offered in my previous post is to keep just a
per-database timestamp, and set it whenever some stats in the database
are reset (table/index/function counters or all stats). It definitely is
not as flexible as the previous solution, but it gives me at least some
info that something was reset. But I'd have to throw away the whole
snapshot - in the previous case I could do analysis at least on the
counters that were not reset.

The solution you've offered - keeping only the per-database timestamp,
and not updating it when table/index/function stats are reset - is
completely useless for me. It simply does not provide an answer to the
question "Is this snapshot consistent?"

Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-19 Thread Tomas Vondra
On 19.12.2010 19:13, Jim Nasby wrote:
 Is there a reason this info needs to be tracked in the stats table?
 I know it's the most obvious place, but since we're worried about the
 size of it, what about tracking it in pg_class or somewhere else?

I guess this is the best place for this kind of info, as it is directly
related to the stats items. Well, maybe we could place it in pg_class,
but would that be a wise idea?

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Tomas Vondra
On 19.12.2010 21:21, Simon Riggs wrote:
 On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.

 The reason that this wasn't done years ago is precisely that nobody's
 figured out how to do it with a tolerable amount of stats data and a
 tolerable amount of processing time (both at ANALYZE time and during
 query planning).  It's not hard to see what we'd ideally like to do;
 it's getting from there to something useful in production that's hard.
 
 I think we have to face up to the fact that attempting to derive
 meaningful cross-column stats will require larger sample sizes.

Amen.

 If we collect everything we're going to have ~10^9 stats slots with
 default stats_target 100 and a 100 column table.

 We should disconnect sample size from histogram size, and we need to
 make the initial column pairings vastly fewer than all combinations.
 Manual specification seems like it will be required for the cases where
 we decide not to include it automatically, so it seems we'll need manual
 specification anyway. In that case, we should do manual specification
 first.

Well, not really. The more bins you have, the larger the sample you need
to get a representative picture of the distribution. So the histogram
size and the sample size are naturally connected.

And there are some (mostly heuristic) rules to determine how large the
sample should be. E.g. when building a contingency table for a
chi-squared test, a common rule is that each bin should contain at least
5 values. So the more bins you have, the larger the sample you need.

I like the way Oracle does this - you can either let it decide what the
proper sample size is, or you can specify how large the sample should
be (what portion of the table).

So the tricky part here is determining the number of bins in the
histogram. In the one-dimensional case, stats_target=100 actually means
each bin contains about 1% of data. So I guess we should use a similar
approach in the multi-dimensional case too, i.e. let the user determine
a desired precision and then derive the number of bins automatically
(which is a bit tricky because of the multiple dimensions).

But yeah, in general you're right - this will require larger samples,
more complex stats to collect, we'll need some space to store the
collected stats ...

regards
Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-19 Thread Tomas Vondra
On 19.12.2010 20:28, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 19.12.2010 17:26, Tom Lane wrote:
 That seems like quite a bizarre definition.  What I was envisioning was
 that we'd track only the time of the last whole-database stats reset.
 
 Well, but that does not quite work. I need is to know whether the
 snapshot is 'consistent' i.e. whether I can compare it to the previous
 snapshot and get meaningful results (so that I can perform some analysis
 on the difference).
 
 Oh, I see.  Yeah, if that's what you want it for then partial resets
 have to change the timestamp too.  I guess if we are careful to document
 it properly, this won't be too horrid.
 
   regards, tom lane
 

Well, maybe. I'd prefer the timestamp for each item, as that allows me
to determine which stats were not reset and analyze at least those data.

Plus I won't have time to write the other patch for at least a week, so
let's see whether there are serious objections against the current patch.

I've never had problems with the pgstat.dat file, but I understand
others might, although this adds just 8 bytes to each item.

regards
Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-19 Thread Tomas Vondra
On 19.12.2010 23:58, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
  Plus I won't have time to write the other patch for at least a week, so
  let's see whether there are serious objections against the current patch.
 If you think this objection is not serious, you're mistaken.

I know there were problems with pgstats.stat and I/O (for example this
thread discussing an I/O storm
http://archives.postgresql.org/pgsql-hackers/2008-01/msg01095.php).

But I thought those issues were already resolved and I have not noticed
any such issue recently. Am I missing something?

 What is bothering me about that is this: how many of our users ever
 reset the stats at all?  Of those, how many reset the stats for just
 some tables and not all of them?  And of those, how many care about
 having the database remember when they did it, at a per-table level?
 I think the answer to each of those questions is not many, and so
 the fraction of our users who will get a benefit from that added
 overhead is epsilon cubed.  But approximately 100% of our users will
 be paying the overhead.

Sure, I understand that only a fraction of the users will use the patch
I've proposed. That's why I'm saying that the per-database timestamp
would be sufficient (although I'd prefer the per-record timestamp).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote:
 Hmmm, maybe we could give this possibility (to identify two separate
 groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
 user would say 'build stats for (A,B) and (C)' - this actually represents
 apriori knowledge of dependencies supplied by the user.

 In that case we could search for 'implicativeness' between those two
 groups (and not within the groups), and we could restrict ourselves to 2D
 (and thus use a more sophisticated formula).
 
 Hm, I hated this idea at first, but I'm starting to like it more and more.
 It *does* seem rather unrealistic that a user would know that a bunch of
 columns are correlated, but have no idea in what way... 

Yes, that's true. Although sometimes the dependency may be very
complicated - but let's restrict to 2D for now, build something that
solves this simplified case and then we can discuss higher dimensions.

 Any examples when this's be the case would be very much appreciated - Maybe
 we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users
know about the dependency between ZIP codes and cities. A natural
workaround would be to omit the dependent column from the query, but
that's not always possible (e.g. when an ORM is involved, building the
queries automatically).

 But we should be able to do some basic analysis even when the user
 supplies a list of columns without such apriori knowledge.
 
 That, I think, overcomplicates things, at least for a first cut.
 
 To summarize, I think you've shown quite nicely that the uniform bayesian
 approach is a very sensible first step towards better estimates in the case
 of correlated columns. It's statistically sound, and the dist(A,B) estimates
 it requires are probably a necessary ingredient of any solution to the 
 problem.
 If we can make it degrade more gracefully if the columns are uncorrelated we
 should do that, but if we can't thats still no reason to drop the whole idea.

Agreed. IMHO the uncorrelated case is not a big concern, as the users
usually know when something's wrong with the columns. We should introduce
some 'autodetect' eventually, but let's leave that for the future.

 So I guess we should turn our attention to how we'd obtain reasonably good 
 estimates
 of dist(A,B), and return to the current discussion once the other pieces are 
 in place.
 
 I think that maybe it'd be acceptable to scan a large portion of the
 table to estimate dist(A,B), *if* we must only do so only once in a while. 
 But even with
 a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all 
 values in memory,
 and spilling them into, say, an on-disk hash table adds even more overhead to 
 the already
 expensive full scan. Maybe using a bloom filter instead of a hash table could 
 avoid
 the spilling to disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking
about how to collect the stats, I was interested primarily in what data
we actually need. And my knowledge of the algorithms currently used
is very limited :-(

But I agree we should at least discuss the possible solutions. Until now
I've done something like this

   SELECT COUNT(DISTINCT a) AS dist_a,
  COUNT(DISTINCT b) AS dist_b,
  COUNT(DISTINCT a || ':' || b) AS dist_ab FROM my_table;

but that's not very efficient.

My plan for the near future (a few weeks) is to build a simple 'module'
with the ability to estimate the number of rows for a given condition.
This could actually be quite useful as a stand-alone contrib module, as
the users often ask how to get a number of rows fast (usually for paging).

Counting the rows exactly may be quite slow when the query matches too
many rows, even when there is an index - it may even be much slower than
the actual query (which usually contains a small LIMIT).

An estimate is often sufficient, but 'pg_class.reltuples' does not
really work with conditions. So this might be an improvement ...

regards
Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-23 Thread Tomas Vondra
On 20.12.2010 00:03, Tom Lane wrote:
 I wrote:
 That is not the number of interest.  The number of interest is that it's
 8 bytes added onto a struct that currently contains 11 of 'em; in other
 words a 9% increase in the size of the stats file, and consequently
 about a 9% increase in the cost of updating it.
 
 Wups, sorry, I was looking at the wrong struct.  It's actually an
 addition of 1 doubleword to a struct of 21 of 'em, or about 5%.
 That's still an awful lot in comparison to the prospective usefulness
 of the information.
 
   regards, tom lane
 

OK, so here goes the simplified patch - it tracks one reset timestamp
for a background writer and for each database.

regards
Tomas

PS: I've noticed Magnus posted a patch to track recovery conflicts,
adding a new view pg_stat_database_conflicts. I have not checked it yet
but it should not influence this patch.



Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 16:54, Florian Pflug wrote:
 I think that maybe it'd be acceptable to scan a large portion of the
 table to estimate dist(A,B), *if* we must only do so only once in a while. 
 But even with
 a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all 
 values in memory,
 and spilling them into, say, an on-disk hash table adds even more overhead to 
 the already
 expensive full scan. Maybe using a bloom filter instead of a hash table could 
 avoid
 the spilling to disk, in exchange for a slightly less precise result...

I've read some basics about Bloom filters, and I'm not sure we can use
them in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance

   To build a Bloom filter with a limited error rate, you need to size
   it properly (number of hash functions and size of the bit array) -
   see the sizing sketch below. But that's exactly the information
   we're looking for.

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

   That's not much but this needs to be done for each column separately
   and for the combination - for N columns you need (at least) N+1
   filters.

   Sure - increasing the error rate and using a different estimate to
   build the bloom filter would significantly decrease the memory
   requirements.

2) sampling the whole table

   A naive approach would be to sample the whole table each time, but
   for large tables that might be a bit of a problem. So I'm thinking
   about what alternatives there are ...

   One possibility is to build the Bloom filter incrementally and store
   it on disk, or something like that - this could perhaps be done
   during VACUUM. Anyway, there's still the issue of how to set the
   filter size initially (estimating the number of elements), just as
   in the previous section.

   Another possibility is to collect the data from just a small portion
   of a table and then use the result to estimate the number of distinct
   values for the whole table. But I'm not sure we can do this reliably,
   I see many traps in this.

But maybe I'm just missing something important.
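
For reference, this is how the textbook Bloom filter sizing works out; a
standalone C illustration (not part of any patch), which also shows where
the "about 10 bits per element at 1% error" figure comes from:

#include <math.h>
#include <stdio.h>

/*
 * Textbook Bloom filter sizing: for n expected elements and a target
 * false positive rate p, the optimal bit array has
 * m = -n * ln(p) / (ln 2)^2 bits and uses k = (m / n) * ln(2) hash
 * functions.  Standalone illustration only, not part of any patch.
 */
int
main(void)
{
    double  n = 1000000.0;      /* expected number of distinct elements */
    double  p = 0.01;           /* target false positive rate (1%) */
    double  ln2 = log(2.0);
    double  m = ceil(-n * log(p) / (ln2 * ln2));
    int     k = (int) (m / n * ln2 + 0.5);

    /* prints ~9.6 bits per element and 7 hash functions for these inputs */
    printf("%.0f bits total, %.1f bits per element, %d hash functions\n",
           m, m / n, k);
    return 0;
}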

regards
Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-23 Thread Tomas Vondra
On 23.12.2010 20:09, Robert Haas wrote:
 2010/12/23 Tomas Vondra t...@fuzzy.cz:
 On 20.12.2010 00:03, Tom Lane wrote:
 I wrote:
 That is not the number of interest.  The number of interest is that it's
 8 bytes added onto a struct that currently contains 11 of 'em; in other
 words a 9% increase in the size of the stats file, and consequently
 about a 9% increase in the cost of updating it.

 Wups, sorry, I was looking at the wrong struct.  It's actually an
 addition of 1 doubleword to a struct of 21 of 'em, or about 5%.
 That's still an awful lot in comparison to the prospective usefulness
 of the information.

   regards, tom lane


 OK, so here goes the simplified patch - it tracks one reset timestamp
 for a background writer and for each database.
 
 I think you forgot the attachment.

Yes, I did. Thanks!

Tomas

From da2154954a547c143cd0d130597411911f339c07 Mon Sep 17 00:00:00 2001
From: vampire vamp...@rimmer.reddwarf
Date: Thu, 23 Dec 2010 18:40:48 +0100
Subject: [PATCH] Track timestamp of the last stats reset (for each database and 
a background writer).

This is a simplified version of the previous patch, which kept a timestamp for
each database, table, index and function, and for the background writer. This
version does not keep the timestamp for individual objects; if stats for a
table/index/function are reset, the timestamp for the whole database is updated.
---
 doc/src/sgml/monitoring.sgml |   18 ++
 src/backend/catalog/system_views.sql |6 --
 src/backend/postmaster/pgstat.c  |   13 +
 src/backend/utils/adt/pgstatfuncs.c  |   26 ++
 src/include/catalog/pg_proc.h|4 
 src/include/pgstat.h |5 +
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5fd0213..99e091d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -600,6 +600,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
  /row
 
  row
+  
entryliteralfunctionpg_stat_get_db_stat_reset_time/function(typeoid/type)/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+   Time of the last reset of statistics for this database (as a result of 
executing
+   functionpg_stat_reset/function function) or any object within it 
(table or index).
+  /entry
+ /row
+
+ row
   
entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry
   entrytypebigint/type/entry
   entry
@@ -1062,6 +1071,15 @@ postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
varnamebgwriter_lru_maxpages/varname parameter
   /entry
  /row
+ 
+ row
+  
entryliteralfunctionpg_stat_get_bgwriter_stat_reset_time()/function/literal/entry
+  entrytypetimestamptz/type/entry
+  entry
+Time of the last reset of statistics for the background writer (as a 
result of executing
+functionpg_stat_reset_shared('bgwriter')/function)
+  /entry
+ /row
 
  row
   
entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 346eaaf..da92438 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,7 +502,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
 pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
-pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
+pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
+   pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
 FROM pg_database D;
 
 CREATE VIEW pg_stat_user_functions AS
@@ -538,7 +539,8 @@ CREATE VIEW pg_stat_bgwriter AS
 pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
 pg_stat_get_buf_written_backend() AS buffers_backend,
 pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-pg_stat_get_buf_alloc() AS buffers_alloc;
+pg_stat_get_buf_alloc() AS buffers_alloc,
+   pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
 SELECT
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 856daa7..66e0b69 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3130,6 +3130,8 @@ pgstat_get_db_entry(Oid databaseid, bool create)
result-n_tuples_deleted = 0;
result-last_autovac_time = 0;
 
+   result-stat_reset_timestamp = GetCurrentTimestamp();
+
memset(hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
@@ -3408,6

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
On 21.12.2010 19:03, Tomas Vondra wrote:
 My plan for the near future (a few weeks) is to build a simple 'module'
 with the ability to estimate the number of rows for a given condition.
 This could actually be quite useful as a stand-alone contrib module, as
 the users often ask how to get a number of rows fast (usually for paging).

Hm, I've been thinking about where to place this new module - I thought
pgFoundry might be the right place, but I've noticed it supports just
CVS. Well, that's a rather antiquated SCM, I think - I'm very happy with
the comfort provided by Git or SVN.

Is there some other place commonly used for pg-related projects? I've
been thinking about github or something like that ...

And I've finally set up the wiki-page about this effort - it's just a
summary of this whole thread.

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:15, t...@fuzzy.cz wrote:
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these number to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per milion elements.
 
 We are aware of that, but we really needed to do some very rough estimates
 and it's much easier to do the calculations with 10. Actually according to
 wikipedia it's not 10bits per element but 9.6, etc. But it really does not
 matter if there is 10MB or 20MB of data, it's still a lot of data ...

Oooops, now I see what the problem is. I thought you were pointing
something else out, but I had actually used 1B = 1b (which is obviously
wrong). But Florian already noticed that and fixed the estimates.

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote:
 On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
 
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these number to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per milion elements.
 
 Uh, of course. So in the real universe, the numbers are
 
 ~1.2MB per 1e6 elements for a false positive rate of 1%
 ~0.5MB per 1e6 elements for a false positive rate of 10%
 
 Hm. So for a table with a billion distinct elements, we'd need half
 a gigabyte per column for the filter. A tuple with two int columns
 takes at least 24+2*4 = 32bytes to store I think, making such a table
 at least 32GB in size. The filter size would thus be 1/64 of the table
 size in the worst case. 

Yes, but in reality you need three such filters - one for each column and
one for the combination. So that is 1.5GB (with a 10% error rate) or 3.6GB
(with a 1% error rate).

But this is severely excessive compared to the real needs, as there are
usually far fewer distinct values than the number of tuples we assume in
these computations.

I was thinking about a simple heuristic to scale the filter properly,
something like this:

1) sample a small portion of the table and count the distinct values
2) compute (number of distinct values / number of sampled tuples)
3) scale this to the whole table and size the filter accordingly

Say 1,000 rows are sampled and 50 distinct values are found in the sample
(even though some of the table's distinct values are missing from it).
That gives 5% in step (2), and if the table has 1,000,000 tuples you get
50,000 in step (3). So the filter needs just about 60kB, which is a huge
improvement compared to the previous approach (1.2MB).

Obviously this will still lead to overestimates in most cases, and there
are probably some other fail cases, but I think it's a reasonable
solution. I don't think this can result in an underestimate (which is
the case where you lose precision).

And in case we want to build this incrementally (from a VACUUM) we
really need to use a bit larger filter, because rescaling the filter is
not possible AFAIK (without rebuilding it from scratch).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
On 24.12.2010 04:41, Florian Pflug wrote:
 The filter size could be derived from the table's statistics target, or
 be otherwise user-definable. We could also auto-resize once it gets too
 full. But still, that all seems awfully complex :-(

Using the statistics target is a good idea, I think. We could use it to
determine the error rate of the filter. Something like

   error rate = 10 - (statistics_target - 100) / 100

which gives

   1%  for statistics target 1000
   10% for statistics target 100

or maybe something like this (where the error rate grows faster for
smaller statistic target values)

   error rate = 11 - 91000 / (statistics_target^2)

which gives about

   1%  for statistics target 1000
   10% for statistics target 100
   36% for statistics target 50

But I guess 10% error rate is the minimum we need so it does not make
much sense to use lower values.

 Another possibility is to collect the data from just a small portion
 of a table and then use the result to estimate the number of distinct
 values for the whole table. But I'm not sure we can do this reliably,
 I see many traps in this.
 This is how it works currently. The problem with this approach is that
 it gives you very little guarantees about how precise the result will be.
 Extrapolating works very well for things like MCVs and histograms, because
 there you're by definition interested mostly in values which occur often -
 and thus with a high probability in the relative few rows you sample. For
 the number of distinct values, however, this isn't true - if ndistinct
 is an order of magnitude smaller than the number of rows, relatively few
 rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

 Another idea would be to obtain the ndistinct values from an index somehow.
 Postgres cannot currently scan an index in physical order, only in logical
 order, due to locking considerations. But since we'd only be interested in
 an estimate, maybe a scan in physical block order would work for ndistinc
 estimates? Just a wild idea, mind you, I haven't checked at all if that'd
 be even remotely feasible.

I was thinking about that too, and I think we could do this using
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.

And relying on this actually means it's required to have a multi-column
index on all the columns. Individual indexes are not enough as we need
to get the number of distinct combinations too.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-27 Thread Tomas Vondra
On 24.12.2010 13:37, Florian Pflug wrote:
 On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
 
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these number to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per milion elements.
 
 Uh, of course. So in the real universe, the numbers are
 
 ~1.2MB per 1e6 elements for a false positive rate of 1%
 ~0.5MB per 1e6 elements for a false positive rate of 10%
 
 Hm. So for a table with a billion distinct elements, we'd need half
 a gigabyte per column for the filter. A tuple with two int columns
 takes at least 24+2*4 = 32bytes to store I think, making such a table
 at least 32GB in size. The filter size would thus be 1/64 of the table
 size in the worst case. 

I've read several papers about estimating the number of distinct values,
and I have some bad as well as good news. I don't want this thread to
grow infinitely, and it's a rather separate topic so I'll start a new
thread for ndistinct estimates.

regards
Tomas



[HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Hi everyone,

about two weeks ago I started a thread about cross-column stats. One
of the proposed solutions is based on number of distinct values, and the
truth is the current distinct estimator has some problems.

I've done some small research - I have read about 20 papers on this, and
I'd like to present a short summary here, possible solutions etc.

The simple truth is

1) sampling-based estimators are a dead-end
2) there are very interesting stream-based estimators
3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki
(http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the
most interesting papers etc. so just a very short explanation.

1) sampling-based estimators are a dead-end

   The paper from Charikar & Chaudhuri states (and proves) that for
   each sampling-based estimator, there's a data set where the estimator
   produces an arbitrarily large error (with non-negligible probability).
   So replacing one sampling-based estimator with another generally does
   not improve the situation - it fixes one dataset and breaks another.

   The error is directly related to the size of the sample, so the
   truth is that to get a good distinct estimate you need to scan a
   significant portion of the table (almost all of it).

   So the estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators

   A very interesting idea comes from the field of data streams. These
   estimators are based on a single pass through the data, followed by
   incremental updates. A very nice thing is that these algorithms
   don't need that much space, as they don't store the actual values.

   The idea behind this is similar to the Bloom filter, i.e. set bits
   of a bitmap using a hash of the value. But in this case it's not
   required to be able to answer questions like 'is this an element
   of the set X?' so it's actually even more space efficient. For an
   introduction see the first paper published in 1985 by Flajolet
   (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).

   One of the recent articles (published in June 2010) actually presents
   an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1)
   update complexity. The e means precision, i.e. that the estimate
   will be within [(1-e)F, (1+e)F] where F is the real value.

   The paper on self-learning bitmap states that to track 10^9 distinct
   values with 1% relative error you need about 61 kilobits of space
   (which is absolutely awesome). The optimal algorithm needs a bit more
   space, I think.

   A very interesting solution is distinct sampling, which somehow
   extends Wegman's adaptive sampling approach.

3) the stream-based estimators have some disadvantages

   Not everything is perfect, though - the most serious disadvantage is
   that those algorithms (usually) don't handle removal of elements.
   Inserts are easy to handle, but deletes are hard (just as in case of
   Bloom filters).

   So using this stream-based approach might lead to degradation in
   case of heavily updated tables :-(

   I see two possible solutions here:

   (a) use counters instead of bits, and track actual counts - but this
   is tricky, especially with 'probabilistic' algorithms and needs
   much more space (but still, 32x 61kb is just 250kB)

   (b) count how many deletes/updates were performed, and if needed
   rebuild the whole bitmap

   But even with these disadvantages, there really is no other
   way to enhance the estimates. I don't think this should be the
   default behavior - just as with cross-column stats, this should
   be optional, for when the current estimator does not work well.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
On 27.12.2010 22:46, Robert Haas wrote:
 2010/12/27 Tomas Vondra t...@fuzzy.cz:
   But even though these disadvantages, there really is no other
   way to enhance the estimates. I don't think this should be a
   default behavior - just as in case of cross-column stats this should
   be optional when the current estimator does not work well.
 
 This is going to be a lot of work to implement, so before you do it,
 we should try to reach a consensus that (a) it's part of an overall
 strategy that the community generally supports and (b) we have
 consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

 With respect to (a), I have to admit I've found the discussion on
 cross-column stats to be quite difficult to follow.  I'd like to see a
 very simple description of exactly what information we're going to
 store, under what circumstances we'll store it, and how we'll use it
 to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki

   http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

  (a) multi-dimensional histogram - this can be done about the same way
  we create single-dimensional histograms, that's not a big problem
  (although more data might be necessary to get accurate histograms)

  (b) # of distinct values - this is the tricky part, as described in
  this thread

 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my head
 there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My big plan is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
of stats and computing of estimates (not integrated with the
optimizer in any way).

This should help us better understand what stats do we need etc.

(b) Minimal integration with the core - still manual collection of
stats, the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
On 28.12.2010 00:04, Kevin Grittner wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
  
 Well, first, those scans occur only once every few hundred million
 transactions, which is not likely a suitable timescale for
 maintaining statistics.
  
 I was assuming that the pass of the entire table was priming for the
 incremental updates described at the start of this thread.  I'm not
 clear on how often the base needs to be updated for the incremental
 updates to keep the numbers close enough.

Well, that really depends on the workload. If you never remove all
occurrences of a given value (e.g. a ZIP code), you don't need to rescan
the table at all.

All you need is to build the stats once and then update them
incrementally - and I believe this could be handled by autovacuum.

 And second, we keep on having discussions about rejiggering
 the whole tuple-freezing strategy.  Even if piggybacking on those
 scans looked useful, it'd be unwise to assume it'll continue to
 work the same way it does now.
  
 Sure, it might need to trigger its own scan in the face of heavy
 deletes anyway, since the original post points out that the
 algorithm handles inserts better than deletes, but as long as we
 currently have some sequential pass of the data, it seemed sane to
 piggyback on it when possible.  And maybe we should be considering
 things like this when we weigh the pros and cons of rejiggering. 
 This issue of correlated values comes up pretty often

Again, there are two types of stats - one of them needs to scan the
whole table (estimate of distinct values), the other one does not
(multi-dimensional histograms).

These two cases are independent - you don't necessarily need both.

Better ndistinct estimates would actually solve some of the current
issues, it's not just a matter of cross-column stats.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tomas Vondra
On 30.12.2010 15:43, Florian Pflug wrote:
 On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:

 With respect to (b), I think I'd need to see a much more detailed
 design for how you intend to make this work.  Off the top of my
 head there seems to be some pretty serious feasibility problems.

 I had one random thought on that -- it seemed like a large concern
 was that there would need to be at least an occasional scan of the
 entire table to rebuild the distinct value information.
 
 I believe we could actually avoid that.
 
 First, the paper An Optimal Algorithm for the Distinct Elements Problem
 actually contains an algorithm with *does* handle deletes - it's called
 L_0 estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation,
but that's about estimating Hamming norm of a vector - so I've ignored
it as I thought we can't use it to estimate number of distinct values.
But if it really handles deletions and if we can use it, then it's
really interesting.

 Second, as Tomas pointed out, the stream-based estimator is essentially a
 simplified version of a bloom filter. It starts out with a field of
 N zero bits, and sets K of them to 1 for each value v in the stream.
 Which bits are set to 1 depends on some hash function(s) H_i(v). It's
 then easy to compute how many 1-bits you'd expect to find in the bit
 field after seeing D distinct values, and by reversing that you can
 estimate D from the number of 1-bits in the bit field.

No, I haven't said the stream-based estimators are simplified versions
of a Bloom filter. I said the approach is very similar - all the
algorithms use bitmaps and hash functions, but the algorithms (Bloom
filter vs. probabilistic counting and adaptive sampling) are very different.

The Bloom filter is much more straightforward. The other algorithms are
much more sophisticated, which allows them to use less space.
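
To make the difference concrete, here is a minimal, hypothetical C
illustration of Flajolet-Martin style probabilistic counting with a
single 32-bit bitmap (the real algorithms average many bitmaps to reduce
the variance; this is not code from any patch or paper). Unlike a Bloom
filter, it sets just one bit per value, chosen by the position of the
lowest 1-bit of the hash, which is why it needs so little space:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define FM_PHI 0.77351          /* Flajolet-Martin correction factor */

/* stand-in integer mixer; any reasonably uniform hash function works */
static uint32_t
fm_hash(uint32_t v)
{
    v ^= v >> 16;
    v *= 0x85ebca6bu;
    v ^= v >> 13;
    v *= 0xc2b2ae35u;
    v ^= v >> 16;
    return v;
}

static void
fm_add(uint32_t *bitmap, uint32_t value)
{
    uint32_t    h = fm_hash(value);
    int         r = 0;

    /* r = position of the least significant 1-bit of the hash */
    while (r < 31 && (h & (1u << r)) == 0)
        r++;
    *bitmap |= (1u << r);
}

static double
fm_estimate(uint32_t bitmap)
{
    int         R = 0;

    /* R = position of the lowest 0-bit in the bitmap */
    while (R < 32 && (bitmap & (1u << R)))
        R++;
    return ldexp(1.0, R) / FM_PHI;      /* ndistinct ~ 2^R / 0.77351 */
}

int
main(void)
{
    uint32_t    bitmap = 0;
    uint32_t    i;

    /* a stream with many repetitions but only 1000 distinct values */
    for (i = 0; i < 100000; i++)
        fm_add(&bitmap, i % 1000);
    printf("estimated ndistinct: %.0f\n", fm_estimate(&bitmap));
    return 0;
}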

 To avoid having to rescan large tables, instead of storing one such
 bit field, we'd store one per B pages of data. We'd then only need
 to scan a range of B pages around every updated or deleted tuple,
 and could afterwards compute a new global estimate of D by combining
 the individual bit fields with bitwise and.

I don't think this could help.

1) This works just with the Bloom filters, not with the other
   algorithms (you can't combine the segments using bitmap OR).

2) With heavily modified tables the updates are usually 'spread'
   through the whole table, so you'll have to rebuild all the
   segments anyway.

 Since the need to regularly VACUUM tables hit by updated or deleted
 won't go away any time soon, we could piggy-back the bit field
 rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to
an actual seq scan if the amount of changes (since the last rebuild)
reaches some threshold. Only if the amount of changes reaches a higher
threshold would we rebuild the stats on our own.

Something like

IF (# of updates + deletes > 5%) THEN wait for a seq scan
IF (# of updates + deletes > 10%) THEN rebuild the stats
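
In code, that decision could look something like this hypothetical helper
(the names and thresholds are placeholders, not from any patch):

typedef enum
{
    NDISTINCT_STATS_OK,             /* estimator still fresh enough */
    NDISTINCT_PIGGYBACK_ON_SEQSCAN, /* rebuild during the next seq scan */
    NDISTINCT_FORCE_REBUILD         /* too stale, rebuild on our own */
} NdistinctRebuildAction;

static NdistinctRebuildAction
ndistinct_rebuild_action(double changed_tuples, double total_tuples)
{
    double ratio = changed_tuples / total_tuples;

    if (ratio > 0.10)
        return NDISTINCT_FORCE_REBUILD;
    if (ratio > 0.05)
        return NDISTINCT_PIGGYBACK_ON_SEQSCAN;
    return NDISTINCT_STATS_OK;
}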

I've found a nice ppt describing how Oracle does this:

   http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt

and there's PDF version

   http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach
(see slide 26). And once they find out there were too many changes in the
table, they rebuild the whole thing.

I'm not saying we should do exactly the same, just that this might be a
good direction.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Tomas Vondra
On 7.1.2011 20:56, Jim Nasby wrote:
 On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote:
 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally). I was
 thinking about pg_stats but I'm not sure it's the right place - depending
 on the algorithm, this may be a fet kilobytes up to several megabytes. And
 it's not needed except when updating it. Any ideas?
 
 If you're using it essentially as a queue, maybe a resource fork makes sense?

A resource fork? Not sure what you mean, could you describe it in more
detail?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-15 Thread Tomas Vondra
Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I've mentioned in my
previous posts. If you want to try it, the sources are available at github,
at http://tvondra.github.com/pg_estimator (ignore the page, I have to
update it, just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with some
two simple benchmarks

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it with different data, and if you notice
something strange just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

   - quite good performance, especially with respect to the memory
   requirements (95 and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

   - very different algorithms, almost the same performance (especially
   with large number of distinct values)

   - my personal favourites, as the memory requirements are still very
   reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

   - this algorithm is described as an optimal algorithm (not
   exactly, it's a building block of the optimal algorithm), but the
   results are not as good as with (1) and (2)

   - one of the problems is it needs k-wise independent hash functions,
   and the MD5 hashing scheme used with the other algorithms probably
   does not conform to this condition :-(

   - I've tried to implement such hash functions on my own (using a prime
   field and polynomials in modular arithmetic), but the results were
   quite bad - if you know how to get such hash functions, let me know
   (one standard construction is sketched after this list).

   - this algorithm is much more complex than the other algorithms, so
   if someone could do a review, that would be nice (maybe there is a
   bug resulting in the big estimate error)

4) Bloom Counter

   - basically a Bloom filter + a counter (incremented when a new
   element is added to the filter)

   - due to the design (no false negatives, possible false positives)
   it's significantly biased (gives underestimates), but I think that
   can be easily fixed with a coefficient (depending on the number of
   items / false positive error rate)

   - needs significantly more memory to get performance similar to the
   other algorithms (a Bloom counter that can track 1 million distinct
   values needs about 1.2MB of memory, while an S-Bitmap that tracks 10
   million items needs just 65kB)

   - so unless we can use the special feature of a Bloom filter (i.e. the
   ability to test whether an item is present in the data), this is a
   waste of memory

   I know Google uses Bloom filters in BigTable to limit I/O, i.e.
   before doing the I/O they ask the Bloom filter if there are such
   data - if the answer is no, they don't have to do the I/O (false
   negatives not allowed), if the answer is yes they do the I/O.

   Not sure if we can / want to do something like this in PostgreSQL,
   as it significantly depends on the workload (and in most cases we
   know the data are there, so we have to do the I/O anyway). But I
   guess it might be a good idea for a contrib module (to maintain the
   Bloom filter on your own and do the decision on application). The
   tricky part here is to maintain the bloom filter up to date.
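
As mentioned under (3), one standard construction of a k-wise independent
family (due to Wegman & Carter) is to evaluate a random polynomial of
degree k-1 over a prime field. A small, hypothetical C sketch - not the
code used in pg_estimator, and note that reducing the result to a smaller
range afterwards adds a small bias:

#include <stdint.h>
#include <stdlib.h>

#define KWISE_P 2147483647u     /* the Mersenne prime 2^31 - 1 */

typedef struct KWiseHash
{
    int         k;              /* degree of independence */
    uint32_t   *coef;           /* k random coefficients in [0, p) */
} KWiseHash;

static KWiseHash *
kwise_init(int k)
{
    KWiseHash  *h = malloc(sizeof(KWiseHash));
    int         i;

    h->k = k;
    h->coef = malloc(k * sizeof(uint32_t));
    for (i = 0; i < k; i++)
        h->coef[i] = ((uint32_t) random()) % KWISE_P;
    return h;
}

static uint32_t
kwise_hash(const KWiseHash *h, uint32_t x)
{
    uint64_t    acc = 0;
    int         i;

    x %= KWISE_P;               /* treat the input as an element of Z_p */
    for (i = 0; i < h->k; i++)  /* Horner evaluation of the polynomial */
        acc = (acc * x + h->coef[i]) % KWISE_P;
    return (uint32_t) acc;
}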

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation fork' approach proposed as a
way to store the estimator.

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Tomas Vondra
On 9.1.2011 13:58, Jim Nasby wrote:
 A resource fork? Not sure what you mean, could you describe it in more
 detail?
 
 Ooops, resource forks are a filesystem thing; we call them relation forks. 
 From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done
some basic research and I think these are the basic steps when adding a
new fork:

1) define a new item in the ForkNumber enum (relfilenode.h) - this should
   be something like DISTINCT_FORKNUM I guess

2) modify the forkNames array (catalog.c) and relpathbackend() so that
   the proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right
or is there something else I need to do?
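
For illustration, steps (1) and (2) might end up looking roughly like
this (a sketch only - the existing fork list is abridged, and the new
fork's name and file suffix are just assumptions):

/* src/include/storage/relfilenode.h */
typedef enum ForkNumber
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    DISTINCT_FORKNUM        /* new fork holding the ndistinct estimators */
} ForkNumber;

#define MAX_FORKNUM DISTINCT_FORKNUM

/* src/backend/catalog/catalog.c */
const char *forkNames[] = {
    "main",                 /* MAIN_FORKNUM */
    "fsm",                  /* FSM_FORKNUM */
    "vm",                   /* VISIBILITYMAP_FORKNUM */
    "dst"                   /* DISTINCT_FORKNUM (assumed suffix) */
};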

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

2) Where to keep the info that there is an estimator for a column? I
   guess we could put this into pg_attribute (it's one boolean). But
   what about the estimators for groups of columns? Because that's why
   I'm building this - to get distinct estimates for groups of columns.

   I guess we'll need a new system catalog to track this? (The same
   will be true for multi-dimensional histograms anyway).

3) I still am not sure how to manage the updates, i.e. how to track the
   new values.

   One possibility might be to do that synchronously - whenever a new
   item is inserted into the table, check if there's an estimator and
   update it. Updating the estimators is quite efficient (e.g. the
   bitmap manages to do 10.000.000 inserts in 9 seconds on my ancient
   workstation) although there might be issues with locking etc.

   The other possibility is to update the estimator asynchronously, i.e.
   store the new values somewhere (or just ctid of the row), and then
   process it periodically. I'm not sure how to intercept the new rows
   and where to store them. In another fork? Somewhere else?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tomas Vondra
On 18.1.2011 18:12, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 4:53 AM,  t...@fuzzy.cz wrote:
 So the most important question is how to intercept the new/updated rows,
 and where to store them. I think each backend should maintain it's own
 private list of new records and forward them only in case of commit. Does
 that sound reasonable?
 
 At the risk of sounding demoralizing, nothing about this proposal
 sounds very promising to me, and that sounds like a particularly bad
 idea.  What happens if the transaction updates a billion records?  Or
 even a million records?  Are you going to store all of those in
 backend-local memory until commit time?  Or spool them in a
 potentially-gigantic disk file somewhere?  That memory allocation - or
 file - could grow to be larger than the size of the entire database in
 the worst case.  And COMMIT could take an awfully long time if it has
 to spool megabytes or gigabytes of data off to some other process.
 And what happens if there's a crash after the COMMIT but before all
 that data is sent?  The estimates become permanently wrong?

Yes, I was aware of this problem (the amount of memory consumed with lots
of updates), and I kind of hoped someone would come up with a reasonable
solution.

I was thinking about it and I believe at least one of the algorithms
(adaptive sampling) might handle merging, i.e. the backends would not
need to store the list of values, just update a private copy of the
estimator. At the end (after commit), that copy would be merged back
into the actual estimator.

So the algorithm would be something like this:

1. create copy for all distinct estimators influenced by the INSERT
2. update the local copy
3. commit and merge the local copies back into the original estimator

The amount of copied data is very small, depending on the number of
distinct items it can handle and precision (see the adaptive.c for
details). Some examples

2^48 items + 1% error = 86 kB
2^48 items + 5% error = 3.4 kB
2^64 items + 1% error = 115 kB
2^64 items + 5% error = 4.6 kB

so it's much better than storing all the data.
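
To illustrate what the merge in step 3 could look like, here is a
hypothetical sketch for a PCSA/FM-style bitmap estimator, where merging
two summaries is just a bitwise OR (the union of the two streams);
whether the adaptive-sampling structure can be merged as cheaply is
exactly what still needs to be verified:

#include <stdint.h>

#define EST_NBITMAPS 64         /* number of bitmaps, an assumption */

typedef struct BitmapEstimator
{
    uint32_t    bitmap[EST_NBITMAPS];
} BitmapEstimator;

/* fold a backend-local copy into the shared estimator at commit time */
static void
estimator_merge(BitmapEstimator *shared, const BitmapEstimator *local)
{
    int         i;

    for (i = 0; i < EST_NBITMAPS; i++)
        shared->bitmap[i] |= local->bitmap[i];
}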

But I really need to check this is possible (right now I'm quite busy
with organizing a local conference, so maybe next week).

Regarding the crash scenario - if the commit fails, just throw away the
local estimator copy, it's not needed. I'm not sure how to take care of
the case when commit succeeds and the write of the merged estimator
fails, but I think it might be possible to write the estimator to xlog
or something like that. So it would be replayed during recovery etc. Or
is it a stupid idea?

 And are we doing all of this just to get a more accurate estimate of
 ndistinct?  For the amount of effort that it will probably take to get
 this working at all, you could probably implement index-only scans and
 have enough bandwidth left over to tackle global temporary tables.
 And unless I'm missing something, the chances that the performance
 consequences will be tolerable are pretty much zero.  And it would
 only benefit the tiny fraction of users for whom bad n_distinct
 estimates cause bad plans, and then the even tinier percentage of
 those who can't conveniently fix it by using the manual override that
 we already have in place - which presumably means people who have
 gigantic tables that are regularly rewritten with massive changes in
 the data distribution that affect plan choice.  Is that more than the
 empty set?

First of all, I've never proposed to use this as a default behaviour.
Actually I've strongly argued against that idea on several occasions. So
the user would have to enable that explicitly for some columns, I really
am not an idiot to force everyone to pay for this.

So the goal is to help the small fraction of users and keep the
current solution for the rest of them.

And yes, the main reason why I am working on this are the multi-column
stats (in case of discrete columns). You can fix the ndistinct estimate
by manually overriding the estimate, but for the multi-column stats you
need an estimate of ndistinct for the combination of columns, and that's
a bit difficult to do manually.

The other reason is I've studied statistics (and I enjoy it, which seems
weird to most people). And it's a good way to learn the internals.

 Maybe the idea here is that this wouldn't fix just ndistinct estimates
 but would also help with multi-column statistics.  Even if that's the
 case, I think it's almost certainly a dead end from a performance
 standpoint.  Some kind of manual ANALYZE process that can be invoked
 when needed to scan the entire table would be painful but maybe
 acceptable for certain people with a problem they can't fix any other
 way, but doing this automatically for everyone seems like a really bad
 idea.

Yes, as I've mentioned above, the multi-column stats are the reason why
I started working on this. And yes, it really should work like this:

1. user realizes there's something wrong as the plans are bad
2. after analysis, the user realizes the reason is an imprecise
   estimate due to dependence between columns
3. the user enables cross-column stats on the columns and checks
   if it actually solved the issue
4. if the cost outweighs the gains, the user drops the stats

Does that sound reasonable?

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 20:25, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Yes, I was aware of this problem (amount of memory consumed with lots of
 updates), and I kind of hoped someone will come up with a reasonable
 solution.
 
 As far as I can see, periodically sampling some or all of the table is
 really the only thing that has a chance of working OK.  You could try
 to track changes only when they're not too big, but I think that's
 still going to have nasty performance consequences.

Yes, sampling the whole table is the only way to get reliable ndistinct
estimates. I'm not sure what you mean by 'tracking changes' - as I've
mentioned in the previous post, this might be solved by updating a local
copy, which requires constant space (a few kB, see the previous post).
Is that acceptable? I don't know - that's up to the user, if he wants to
pay this price.

 So the algorithm would be something like this:

 1. create copy for all distinct estimators influenced by the INSERT
 2. update the local copy
 3. commit and merge the local copies back into the original estimator
 
 Well, maybe.  But DELETEs seem like a problem.  And it's still adding
 foreground work to every query, which I just have a hard time
 believing is ever going to work out

Yes, deletes are difficult to handle. My idea was to compute something
like ((tuples changed + tuples deleted) / tuples total), and indicate
that a rebuild of the estimator is needed if this coefficient changes
too much - e.g. log a message or something like that.

 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?
 
 It's not stupid, in the sense that that is what you'd need to do if
 you want to avoid ever having to rescan the table.  But it is another
 thing that I think is going to be way too expensive.

Way too expensive? All you need to put into the logfile is a copy of the
estimator, which is a few kBs. How is that 'way too expensive'? Sure, it
might be expensive when the user does a lot of small changes in separate
transactions, that's true, but I guess we could minimize the amount of
data written to the xlog by doing a diff of the estimators or something
like that.

 Yes, as I've mentioned above, the multi-column stats are the reason why
 I started working on this. And yes, it really should work like this:

 1. user realizes there's something wrong as the plans are bad
 2. after analysis, the user realizes the reason is an imprecise
   estimate due to dependence between columns
 3. the user enables cross-column stats on the columns and checks
   if it actually solved the issue
 4. if the cost outweighs the gains, the user drops the stats

 Does that sound reasonable?
 
 Yes.  The only caveat I'm trying to insert is that it's hard for me to
 see how the proposed approach could ever be cheap enough to be a win.

IMHO the question is not 'How expensive is that?' but rather 'How
expensive is it compared to the gains?' If the user gets much better
estimates and then a much better plan, this may be a completely
acceptable price.

 If we need some kind of more expensive kind of ANALYZE that scans the
 whole table, and it's optional, sure, why not?  But that's a one-time
 hit.  You run it, it sucks, it's over, and now your queries work.
 Odds are good you may never need to touch it again.  Now, if we can
 convince ourselves that multi-column stats are likely to require
 constant updating, then maybe there would be more to recommend the
 stream-based approach, although even then it seems dreadfully complex
 and expensive to me.  But I bet these things are pretty static.  If

No, the multi-column statistics do not require constant updating. There
are cases where sampling is perfectly fine, although you may need a
bit larger sample. Generally if you can use a multi-dimensional
histogram, you don't need to scan the whole table.

But the multi-dimensional histograms are not applicable to some cases.
Especially to the ZIP code fail case, that was repeatedly discussed. So
in case of discrete data, we need a different approach - and the
solution I've proposed is based on using ndistinct estimates to get the
estimates (actually it's based on one of the papers listed on wiki).

 and expensive to me.  But I bet these things are pretty static.  If
 the city and state tend to imply the zip code when there are 10K rows
 in the table, they probably still will when there are 100K rows in the
 table.  If users with org_id = 1 have a higher karma score on average

OK, how will you measure the implicativeness?

We have discussed this in another thread and there is a lot of gotchas
although it seems like a really

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 20:46, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... I guess I'm just saying I'd think really, really hard
 before abandoning the idea of periodic sampling.  You have to get an
 awful lot of benefit out of those cross-column stats to make it worth
 paying a run-time cost for them.
 
 I've been trying to not discourage Tomas from blue-sky speculation,
 but I have to say I agree with Robert that the chance of any usable
 result from this approach is very very small.  Feel free to do some
 experimentation to prove us wrong --- but don't put a lot of effort
 into it before that.
 
   regards, tom lane

Discourage? Not really - I have not expected this to be a simple thing
to implement. And yes, it might turn out to be a dead end eventually.

OTOH the "it seems really expensive, so let's not even try to build it"
approach is a certain dead end, so I'm not going to surrender due to such
speculations (although those are valid concerns and I'll have to address
them in the future).

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
On 19.1.2011 23:44, Nathan Boley wrote:
 1) The distribution of values in a table is rarely pathological, and
 usually follows one of several common patterns. ( IOW, we have good
 heuristics )

Not true. You're missing the goal of this effort - to get an ndistinct
estimate for a combination of multiple columns, which is usually
pathological in the case of dependent columns. And that is exactly the
case where the user would explicitly enable this feature to get better
estimates (and thus better plans).

 2) The choice of plan is fairly robust to mis-estimation of ndistinct,
 because there are only a few things the executer can choose. It
 doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
 the table, we still usually need to scan the entire table.

Again, don't think about a single column (although even in that case
there are known fail cases). Think about multiple columns and the number
of distinct combinations. With the attribute value independence
assumption, the estimated number of matching rows is usually
significantly underestimated (e.g. for city = 'X' AND zip = 'Y' the
planner multiplies the two selectivities, even though the ZIP code alone
already determines the city, so the row estimate ends up far too low).
And that's a problem as it often leads to an index scan instead of a
sequential scan (or something like that).

 Again, in my *very* humble opinion, If the ultimate goal is to improve
 the planner, we should try to cut down on the number of cases in which
 a poor estimate of ndistinct gives a really bad plan instead of
 chasing after marginal gains from a better ndistinct estimator. Maybe

What is a marginal gain? The ultimate goal is to build cross-column
stats, because in the case of dependent columns the current estimates
usually result in poor plans. How is fixing that a marginal gain?

Yes, it may turn out it's not worth it, but it's a bit early to say that
without an implementation and some hard data.

 ( as I've advocated for before ) this means parameterizing the
 distribution of values ( ie, find that the values are roughly uniform
 ), maybe this means estimating the error of our statistics and using
 the most robust rather than the best plan, or maybe it means something
 totally different. But: ( and the literature seems to support me )

Which literature? I've read an awful lot of papers on this topic lately,
and I don't remember anything like that. If there's an interesting
paper, I'd like to read it. Yes, all the papers state that estimating
ndistinct is a particularly tricky task, but that's to be expected, no?

I've even checked how other databases do this - e.g. Oracle does it very
similarly to what I proposed (the user has to enable the feature, it has
to be rebuilt from time to time, etc.). I'm not saying we should do
everything like Oracle, but they are not clueless idiots.

 pounding away at the ndistinct estimator seems like a dead end. If you
 think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

No, I'm not trying to do this just to improve the ndistinct estimate.
The goal is to improve the cardinality estimate in case of dependent
columns, and one of the papers depends on good ndistinct estimates.

And actually it does not depend on ndistinct for the columns only, it
depends on ndistinct estimates for the combination of columns. So
improving the ndistinct estimates for columns is just a necessary first
step (and only if it works reasonably well, we can do the next step).

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 03:06, Nathan Boley wrote:
 And actually it does not depend on ndistinct for the columns only, it
 depends on ndistinct estimates for the combination of columns. So
 improving the ndistinct estimates for columns is just a necessary first
 step (and only if it works reasonably well, we can do the next step).
 
 I think that any approach which depends on precise estimates of
 ndistinct is not practical.

I'm not aware of any other approach to the 'discrete fail case' (where
the multi-dimensional histograms are not applicable). If someone finds a
better solution, I'll be the first one to throw away this stuff.

 I am very happy that you've spent so much time on this, and I'm sorry
 if my previous email came off as combative. My point was only that
 simple heuristics have served us well in the past and, before we go to
 the effort of new, complicated schemes, we should see how well similar
 heuristics work in the multiple column case. I am worried that if the
 initial plan is too complicated then nothing will happen and, even if
 something does happen, it will be tough to get it committed ( check
 the archives for cross column stat threads - there are a lot ).

If I've learned one thing over the years in IT, it's not to take criticism
personally. All the problems mentioned in this thread are valid
concerns, pointing out weak points of the approach. And I'm quite happy
to receive this feedback - that's why I started it.

On the other hand - Jara Cimrman (a famous Czech fictional character,
depicted as the best scientist/poet/teacher/traveller/... - see [1])
once said that you can't be really sure you don't get gold by blowing
cigarette smoke into a basin drain, until you actually try it. So I'm
blowing cigarette smoke into the drain ...

It may very well happen that this will be a dead end, but I'll do my best to
fix all the issues or to prove that the pros outweigh the cons. And
even if it will be eventually rejected, I hope to get -1 from TL to be
eligible for that t-shirt ...

[1] http://en.wikipedia.org/wiki/Jara_Cimrman

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 03:36, Robert Haas wrote:
 On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Regarding the crash scenario - if the commit fails, just throw away the
 local estimator copy, it's not needed. I'm not sure how to take care of
 the case when commit succeeds and the write of the merged estimator
 fails, but I think it might be possible to write the estimator to xlog
 or something like that. So it would be replayed during recovery etc. Or
 is it a stupid idea?

 It's not stupid, in the sense that that is what you'd need to do if
 you want to avoid ever having to rescan the table.  But it is another
 thing that I think is going to be way too expensive.

 Way too expensive? All you need to put into the logfile is a copy of the
 estimator, which is a few kBs. How is that 'way too expensive'?
 
 At this point, this is all a matter of religion, right?  Neither of us
 has a working implementation we've benchmarked.  But yes, I believe
 you're going to find that implementing some kind of streaming
 estimator is going to impose a...  pulls number out of rear end 6%
 performance penalty, even after you've optimized the living daylights
 out of it.  Now you might say... big deal, it improves my problem
 queries by 100x.  OK, but if you could get the same benefit by doing
 an occasional full table scan during off hours, you could have the
 same performance with a *0%* performance penalty.  Even better, the
 code changes would be confined to ANALYZE rather than spread out all
 over the system, which has positive implications for robustness and
 likelihood of commit.

Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind the collecting
of new values (which might lead to problems with memory etc. as you've
pointed out) and updating a local copy of the estimator (which is a good
idea I think).

But this might be another option - let the user decide if he wants to
continuously update the estimates (and pay the price) or do that during off
hours (and pay almost nothing). That sounds like a very good solution to me.

 I'm not trying to argue you out of working on this.  It's obviously
 your time to spend, and if works better than I think it will, great!
 I'm merely offering you an opinion on what will probably happen if you
 go this route - namely, it'll carry an unpalatable run-time penalty.
 That opinion may be worth no more than what you paid for it, but there
 you have it.

Yes, and I appreciate all feedback. But I still believe this can be done
so that users that don't need the feature don't pay for it.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 09:10, Heikki Linnakangas wrote:
 It seems that the suggested multi-column selectivity estimator would be
 more sensitive to ndistinct of the individual columns. Is that correct?
 How is it biased? If we routinely under-estimate ndistinct of individual
 columns, for example, does the bias accumulate or cancel itself in the
 multi-column estimate?
 
 I'd like to see some testing of the suggested selectivity estimator with
 the ndistinct estimates we have. Who knows, maybe it works fine in
 practice.

The estimator for two columns and query 'A=a AND B=b' is about

 0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b))

so it's quite simple. It's not that sensitive to errors in the ndistinct
estimates for individual columns, but the problem is in the multi-column
ndistinct estimates. It's very likely that with dependent columns (e.g.
with the ZIP codes / cities) the distribution is so pathological that
the sampling-based estimate will be very off.
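
Just to make that concrete, a rough sketch of the estimator in Python, with
completely made-up numbers for the ZIP code / city example:

# Sketch of the two-column estimator described above:
#   P(A=a AND B=b) ~= 0.5 * ( dist(A)/dist(A,B) * P(A=a)
#                           + dist(B)/dist(A,B) * P(B=b) )
# All inputs come from (possibly imprecise) statistics.

def estimate_eq_pair(dist_a, dist_b, dist_ab, prob_a, prob_b):
    """Selectivity estimate for 'A = a AND B = b'."""
    return 0.5 * (float(dist_a) / dist_ab * prob_a +
                  float(dist_b) / dist_ab * prob_b)

# Say dist(zip)=10000, dist(city)=2000, dist(zip,city)=10000 (each ZIP
# belongs to exactly one city), P(zip='12345')=1/10000, P(city='Prague')=1/2000.
print(estimate_eq_pair(10000, 2000, 10000, 1.0/10000, 1.0/2000))
# ~0.0001, i.e. close to P(zip='12345') alone - which is what you want for
# dependent columns, instead of multiplying the two probabilities together.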

I guess this was a way too short analysis, but if you can provide more
details of the expected tests etc. I'll be happy to provide that.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
On 20.1.2011 11:05, Csaba Nagy wrote:
 Hi Tomas,
 
 On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
 No, the multi-column statistics do not require constant updating. There
 are cases where a sampling is perfectly fine, although you may need a
 bit larger sample. Generally if you can use a multi-dimensional
 histogram, you don't need to scan the whole table.
 
 In the cases where sampling is enough, you can do that to the updates
 too: do a sampling on the changes, in that you only process every Nth
 change to make it to the estimator. If you can also dynamically tune the
 N to grow it as the statistics stabilize, and lower it if you detect
 high variance, even better.
 
 If the analyze process could be decoupled from the backends, and maybe
 just get the data passed over to be processed asynchronously, then that
 could be a feasible way to have always up to date statistics when the
 bottleneck is IO and CPU power is in excess. If that then leads to
 better plans, it could really be a win exceeding the overhead.

OK, this sounds interesting. I'm not sure how to do that but it might be
a good solution. What about transactions? If the client inserts data
(and it will be sent asynchronously to update the estimator) and then
rolls back, is the estimator 'rolled back' or what happens?

This was exactly the reason why I initially wanted to collect all the
data at the backend (and send them to the estimator at commit time).
Which was then replaced by the idea to keep a local estimator copy and
merge it back to the original estimator at commit time.
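
To illustrate the 'local copy merged at commit' idea, here is a toy sketch in
Python - a plain set stands in for the real probabilistic estimator, so the
merge is just a union (the real estimator would provide its own merge
operation):

# Toy model of the per-backend flow: work on a local copy, merge into the
# shared estimator only when the transaction commits, discard it on abort.

class ToyDistinctEstimator:
    """Stand-in for the real streaming ndistinct estimator."""
    def __init__(self):
        self.values = set()

    def add(self, value):
        self.values.add(value)

    def merge(self, other):
        self.values |= other.values

    def ndistinct(self):
        return len(self.values)

shared = ToyDistinctEstimator()            # what the stats collector keeps

def run_transaction(rows, commit=True):
    local = ToyDistinctEstimator()         # backend-local copy
    for value in rows:
        local.add(value)                   # track values added by this xact
    if commit:
        shared.merge(local)                # merge back at commit time
    # on rollback the local copy is simply thrown away

run_transaction(['a', 'b', 'c'])
run_transaction(['c', 'd'], commit=False)  # rolled back, leaves no trace
print(shared.ndistinct())                  # -> 3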

 If this analyze process (or more of them) could also just get the data
 from the modified buffers in a cyclic way, so that backends need nothing
 extra to do, then I don't see any performance disadvantage other than
 possible extra locking contention on the buffers and non-determinism of
 the actual time when a change makes it to the statistics. Then you just
 need to get more CPU power and higher memory bandwidth to pay for the
 accurate statistics.

Well, the possible locking contention sounds like a quite significant
problem to me :-(

The lag between an update and a change to the stats is not that big a deal
I guess - we have the same behaviour with the rest of the stats (updated
by autovacuum every once in a while).

Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2011-02-03 Thread Tomas Vondra
On 30.1.2011 23:22, Robert Haas wrote:
 On Thu, Dec 23, 2010 at 2:41 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK, so here goes the simplified patch - it tracks one reset timestamp
 for a background writer and for each database.

 I think you forgot the attachment.

 Yes, I did. Thanks!
 
 This patch no longer applies.  Please update.

I've updated the patch - rebased to the current HEAD.

regards
Tomas
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
***
*** 663,668  postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
--- 663,677 
   /row
  
   row
+   
entryliteralfunctionpg_stat_get_db_stat_reset_time/function(typeoid/type)/literal/entry
+   entrytypetimestamptz/type/entry
+   entry
+Time of the last reset of statistics for this database (as a result of 
executing
+functionpg_stat_reset/function function) or any object within it 
(table or index).
+   /entry
+  /row
+ 
+  row

entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry
entrytypebigint/type/entry
entry
***
*** 1125,1130  postgres: replaceableuser/ replaceabledatabase/ 
replaceablehost/ re
--- 1134,1148 
 varnamebgwriter_lru_maxpages/varname parameter
/entry
   /row
+  
+  row
+   
entryliteralfunctionpg_stat_get_bgwriter_stat_reset_time()/function/literal/entry
+   entrytypetimestamptz/type/entry
+   entry
+ Time of the last reset of statistics for the background writer (as a 
result of executing
+ functionpg_stat_reset_shared('bgwriter')/function)
+   /entry
+  /row
  
   row

entryliteralfunctionpg_stat_get_buf_written_backend()/function/literal/entry
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***
*** 523,529  CREATE VIEW pg_stat_database AS
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
  pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
! pg_stat_get_db_conflict_all(D.oid) AS conflicts
  FROM pg_database D;
  
  CREATE VIEW pg_stat_database_conflicts AS
--- 523,530 
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
  pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
! pg_stat_get_db_conflict_all(D.oid) AS conflicts,
! pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
  FROM pg_database D;
  
  CREATE VIEW pg_stat_database_conflicts AS
***
*** 570,576  CREATE VIEW pg_stat_bgwriter AS
  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
  pg_stat_get_buf_written_backend() AS buffers_backend,
  pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
! pg_stat_get_buf_alloc() AS buffers_alloc;
  
  CREATE VIEW pg_user_mappings AS
  SELECT
--- 571,578 
  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
  pg_stat_get_buf_written_backend() AS buffers_backend,
  pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
! pg_stat_get_buf_alloc() AS buffers_alloc,
!   pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
  
  CREATE VIEW pg_user_mappings AS
  SELECT
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***
*** 3160,3165  pgstat_get_db_entry(Oid databaseid, bool create)
--- 3160,3167 
result-n_conflict_bufferpin = 0;
result-n_conflict_startup_deadlock = 0;
  
+   result-stat_reset_timestamp = GetCurrentTimestamp();
+ 
memset(hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
***
*** 3438,3443  pgstat_read_statsfile(Oid onlydb, bool permanent)
--- 3440,3451 
 * load an existing statsfile.
 */
memset(globalStats, 0, sizeof(globalStats));
+   
+   /*
+* Set the current timestamp (will be kept only in case we can't load an
+* existing statsfile.
+*/
+   globalStats.stat_reset_timestamp = GetCurrentTimestamp();
  
/*
 * Try to open the status file. If it doesn't exist, the backends simply
***
*** 4052,4057  pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int 
len)
--- 4060,4067 
dbentry-n_tuples_deleted = 0;
dbentry-last_autovac_time = 0;
  
+   dbentry-stat_reset_timestamp = GetCurrentTimestamp();
+ 
memset(hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
***
*** 4083,4088  pgstat_recv_resetsharedcounter

Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2011-02-04 Thread Tomas Vondra
On 4.2.2011 03:37, Greg Smith wrote:
 Thinking I should start with why I think this patch is neat...most of
 the servers I deal with are up 24x7 minus small amounts of downtime,
 presuming everyone does their job right that is.  In that environment,
 having a starting timestamp for when the last stats reset happened lets
 you quickly compute some figures in per-second terms that are pretty
 close to actual average activity on the server.  Some examples of how I
 would use this:
 
 psql -c 
 SELECT
  CAST(buffers_backend * block_size AS numeric) / seconds_uptime /
 (1024*1024)
AS backend_mb_per_sec
 FROM
  (SELECT *,extract(epoch from now()-stats_reset) AS seconds_uptime,
  (SELECT cast(current_setting('block_size') AS int8)) AS block_size
   FROM pg_stat_bgwriter) AS raw WHERE raw.seconds_uptime > 0
 
 backend_mb_per_sec
 
   4.27150807681618
 
 psql -c 
 SELECT
  datname,CAST(xact_commit AS numeric) / seconds_uptime
AS commits_per_sec
 FROM
  (SELECT *,extract(epoch from now()-stats_reset) AS seconds_uptime
   FROM pg_stat_database) AS raw WHERE raw.seconds_uptime > 0
 
 
  datname  |  commits_per_sec
 -----------+--------------------
  template1 | 0.0338722604313051
  postgres  | 0.0363144438470267
  gsmith    | 0.0820573653236174
  pgbench   |  0.059147072347085
 
 Now I reset, put some load on the system and check the same stats
 afterward; watch how close these match up:
 
 $ psql -d pgbench -c select pg_stat_reset()
 $ pgbench -j 4 -c 32 -T 30 pgbench
 transaction type: TPC-B (sort of)
 scaling factor: 100
 query mode: simple
 number of clients: 32
 number of threads: 4
 duration: 30 s
 number of transactions actually processed: 6604
 tps = 207.185627 (including connections establishing)
 tps = 207.315043 (excluding connections establishing)
 
  datname  |  commits_per_sec
 -----------+--------------------
  pgbench   |   183.906308135572
 
 Both these examples work as I expected, and some playing around with the
 patch didn't find any serious problems with the logic it implements. 
 One issue though, an oversight I think can be improved upon; watch what
 happens when I create a new database:
 
 $ createdb blank
 $ psql -c select datname,stats_reset from pg_stat_database where
 datname='blank'
 datname | stats_reset
 -+-
 blank   |
 
 That's not really what I would hope for here.  One major sort of
 situation I'd like this feature to work against is the one where someone
 asks for help but has never touched their database stats before, which
 is exactly what I'm simulating here.  In this case that person would be
 out of luck, the opposite of the experience I'd like a newbie to have at
 this point.

Are you sure about it? Because when I create a database, the field is
NULL - that's true. But once I connect to the database, the stats are
updated and the field is set (thanks to the logic in pgstat.c).

In that case 'stat_reset=NULL' would mean 'no-one ever touched this db'
which seems quite reasonable to me.



$ createdb testdb1
$ createdb testdb2

$ psql -d testdb1 -c select stats_reset from pg_stat_database where
  datname = 'testdb2'
 stats_reset
-

(1 row)

$ psql -d testdb2 -c \q

$ psql -d testdb1 -c select stats_reset from pg_stat_database where
  datname = 'testdb2'
  stats_reset
---
 2011-02-04 19:03:23.938929+01
(1 row)



But maybe I've missed something and it does not work the way I think it
does.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2011-02-06 Thread Tomas Vondra
On 6.2.2011 08:17, Greg Smith wrote:
 Below is what changed since the last posted version, mainly as feedback
 for Tomas:
 
 -Explained more clearly that pg_stat_reset and
 pg_stat_reset_single_counters will both touch the database reset time,
 and that it's initialized upon first connection to the database.
 
 -Added the reset time to the list of fields in pg_stat_database and
 pg_stat_bgwriter.
 
 -Fixed some tab/whitespace issues.  It looks like you had tab stops set
 at 8 characters during some points when you were editing non-code
 files.  Also, there were a couple of spot where you used a tab while
 text in the area used spaces.  You can normally see both types of errors
 if you read a patch, they showed up as misaligned things in the context
 diff.
 
 -Removed some extra blank lines that didn't fit the style of the
 surrounding code.
 
 Basically, all the formatting bits I'm nitpicking about I found just by
 reading the patch itself; they all stuck right out.  I'd recommend a
 pass of that before submitting things if you want to try and avoid those
 in the future.

Thanks for fixing the issues and for the review in general. I have to
tune the IDE a bit to produce the right format and be a bit more careful
when creating the patches.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum stress-testing our system

2012-11-18 Thread Tomas Vondra
Hi!

On 26.9.2012 19:18, Jeff Janes wrote:
 On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
 On 26-09-2012 09:43, Tomas Vondra wrote:
 5) splitting the single stat file into multiple pieces - e.g. per 
 database,
 written separately, so that the autovacuum workers don't need to read all
 the data even for databases that don't need to be vacuumed. This might be
 combined with (4).

 IMHO that's the definitive solution. It would be one file per database 
 plus a
 global one. That way, the check would only read the global.stat and process
 those database that were modified. Also, an in-memory map could store that
 information to speed up the checks.

 +1

 That would help for the case of hundreds of databases, but how much
 does it help for lots of tables in a single database?
 
 It doesn't help that case, but that case doesn't need much help.  If
 you have N statistics-kept objects in total spread over M databases,
 of which T objects need vacuuming per naptime, the stats file traffic
 is proportional to N*(M+T).  If T is low, then there is generally is
 no problem if M is also low.  Or at least, the problem is much smaller
 than when M is high for a fixed value of N.

I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having).

See the patch attached, but be aware that this is a very early WIP (or
rather a proof of concept), so it has many rough edges (read sloppy
coding). I haven't even added it to the commitfest yet ...

The two main changes are these:

(1) The stats file is split into a common db file, containing all the
DB Entries, and per-database files with tables/functions. The common
file is still called pgstat.stat, the per-db files have the
database OID appended, so for example pgstat.stat.12345 etc.

This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile
functions, introducing two new functions:

   pgstat_read_db_statsfile
   pgstat_write_db_statsfile

that do the trick of reading/writing stat file for one database.

(2) The pgstat_read_statsfile has an additional parameter onlydbs that
says that you don't need table/func stats - just the list of db
entries. This is used for autovacuum launcher, which does not need
to read the table/stats (if I'm reading the code in autovacuum.c
correctly - it seems to be working as expected).
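
In pseudocode, the split looks roughly like this (a Python sketch with stub
helpers, not the actual C implementation - only the file naming and the
onlydbs flag follow the description above):

# Per-database stat files: 'pgstat.stat' holds only the DB entries, while
# table/function stats for database OID 12345 live in 'pgstat.stat.12345'.

import os

STAT_DIR = 'pg_stat_tmp'                      # assumed location

def db_statfile_name(dboid):
    return os.path.join(STAT_DIR, 'pgstat.stat.%d' % dboid)

def read_db_entries():
    # stub: would parse the common file with one entry per database
    return {12345: {'name': 'db_a'}, 12346: {'name': 'db_b'}}

def read_table_func_stats(path):
    # stub: would parse the per-database file given by 'path'
    return {'tables': {}, 'functions': {}, 'source': path}

def read_statsfile(dboid=None, onlydbs=False):
    """Toy version of the split read path."""
    dbs = read_db_entries()
    if onlydbs:
        return dbs, None                      # launcher: DB list only
    return dbs, read_table_func_stats(db_statfile_name(dboid))

print(read_statsfile(onlydbs=True))           # launcher-style call
print(read_statsfile(dboid=12345))            # worker-style call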

So what are the benefits?

(a) When a launcher asks for info about databases, something like this
is called in the end:

   pgstat_read_db_statsfile(InvalidOid, false, true)

which means all databases (InvalidOid) and only db info (true). So
it reads only the one common file with db entries, not the
table/func stats.

(b) When a worker asks for stats for a given DB, something like this is
called in the end:

   pgstat_read_db_statsfile(MyDatabaseId, false, false)

which reads only the common stats file (with db entries) and only
one file for the one database.

With the current implementation (a single pgstat.stat file), all
the data had to be read (and silently skipped) in both cases.
That's a lot of CPU time, and we're seeing ~60% of CPU spent on
doing just this (writing/reading the huge stats file).

So with a lot of databases/objects, this pgstat.stat split saves
us a lot of CPU ...

(c) This should lower the space requirements too - with a single file,
you actually need at least 2x the disk space (or RAM, if you're
using tmpfs as we are), because you need to keep two versions of
the file at the same time (pgstat.stat and pgstat.tmp).

Thanks to this split you only need additional space for a copy of
the largest piece (with some reasonable safety reserve).


Well, it's very early patch, so there are rough edges too

(a) It does not solve the many-schema scenario at all - that'll need
a completely new approach I guess :-(

(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to
be written.

That clearly is not enough for multiple files - there should be one
timestamp for each database/file. I'm thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.

One way might be adding the timestamp(s) into PgStat_StatDBEntry
and the other one is using an array of inquiries for each database.

And yet another one I'm thinking about is using a fixed-length
array of timestamps (e.g. 256), indexed by mod(dboid,256). That
would mean stats for all databases with the same mod(oid,256) would
be written at the same time. Seems like an over-engineering though.

(c) I'm a bit worried about the number of files - right now there's one

Re: [HACKERS] autovacuum stress-testing our system

2012-11-18 Thread Tomas Vondra
On 26.9.2012 18:29, Tom Lane wrote:
 What seems to me like it could help more is fixing things so that the
 autovac launcher needn't even launch a child process for databases that
 haven't had any updates lately.  I'm not sure how to do that, but it
 probably involves getting the stats collector to produce some kind of
 summary file.

Couldn't we use the PgStat_StatDBEntry for this? By splitting the
pgstat.stat file into multiple pieces (see my other post in this thread)
there's a file with StatDBEntry items only, so maybe it could be used as
the summary file ...

I've been thinking about this:

 (a) add needs_autovacuuming flag to PgStat_(TableEntry|StatDBEntry)

 (b) when table stats are updated, run a quick check to decide whether
 the table needs to be processed by autovacuum (vacuumed or
 analyzed), and if so set needs_autovacuuming=true both
 for the table and the database

The worker may read the DB entries from the file and act only on those
that need to be processed (those with needs_autovacuuming=true).

Maybe the DB-level field might be a counter of tables that need to be
processed, and the autovacuum daemon might act on those first? Although
the simpler the better I guess.
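
For illustration, the 'quick check' could be as cheap as the usual autovacuum
threshold test - a rough Python sketch (the thresholds are the standard
autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor defaults, the
dict fields are made up):

# Decide whether a table (and thus its database) should be flagged for
# autovacuum, using the standard threshold formula:
#   vacuum when  n_dead_tuples > threshold + scale_factor * reltuples

VACUUM_THRESHOLD = 50        # autovacuum_vacuum_threshold default
VACUUM_SCALE_FACTOR = 0.2    # autovacuum_vacuum_scale_factor default

def needs_vacuum(reltuples, n_dead_tuples):
    return n_dead_tuples > VACUUM_THRESHOLD + VACUUM_SCALE_FACTOR * reltuples

def update_table_stats(db_entry, table_entry):
    """Called when table stats are updated - sets the proposed flags."""
    if needs_vacuum(table_entry['reltuples'], table_entry['n_dead_tuples']):
        table_entry['needs_autovacuuming'] = True
        db_entry['needs_autovacuuming'] = True   # or bump a per-DB counter

db = {'needs_autovacuuming': False}
tab = {'reltuples': 1000, 'n_dead_tuples': 300, 'needs_autovacuuming': False}
update_table_stats(db, tab)
print(db['needs_autovacuuming'])    # True: 300 > 50 + 0.2 * 1000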

Or did you mean something else?

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm partial-match

2012-11-18 Thread Tomas Vondra
On 15.11.2012 20:39, Fujii Masao wrote:
 Hi,
 
 I'd like to propose to extend pg_trgm so that it can compare a partial-match
 query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
 GIN method for pg_trgm.
 
 Currently, when the query key is less than three characters, we cannot use
 a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
 partial-match method. In this case, seq scan or index full scan would be
 executed, and its response time would be very slow. I'd like to alleviate this
 problem.
 
 Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
 i.e., if query key contains multibyte characters. In this case, byte length of
 the trigram string might be larger than three, and its CRC is used as a
 trigram key instead of the trigram string itself. Because of using CRC, we
 cannot do a partial-match. Attached patch extends pg_trgm so that it
 compares a partial-match query key only when KEEPONLYALNUM is
 enabled.
 
 Attached patch is WIP yet. What I should do next is:
 
 * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
 * write the regression test
 
 Comments? Review? Objection?

Hi,

I've done a quick review of the current patch:

(1) It applies cleanly on the current master and builds fine.

(2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented
differently (using tabs instead of spaces).

(3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef
KEEPONLYALNUM, although trgm_pmatch is not used at all.

(4) The patch removes some commented-out variables, but there still
remain various commented-out variables. Will this be cleaned too?

(5) I've done basic functionality testing of the patch, and it really seems to work:

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
   SELECT md5(i::text) FROM generate_series(1,100) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%';
  QUERY PLAN

 Bitmap Heap Scan on test  (cost=1655.96..11757.63 rows=141414 width=33)
   Recheck Cond: (val ~~ '%aa%'::text)
   ->  Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414
width=0)
 Index Cond: (val ~~ '%aa%'::text)
(4 rows)

Without the patch, this gives a seq scan (as expected).

Do you expect to update the docs too? IMHO it's worth mentioning that
the pg_trgm can handle even patterns shorter than 2 chars ...


regards
Tomas Vondra


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] too much pgbench init output

2012-11-19 Thread Tomas Vondra
On 19.11.2012 11:59, Jeevan Chalke wrote:
 Hi,
 
 I gone through the discussion for this patch and here is my review:
 
 The main aim of this patch is to reduce the number of log lines. It is
 also suggested to use an options to provide the interval but few of us
 are not much agree on it. So final discussion ended at keeping 5 sec
 interval between each log line.
 
 However, I see, there are two types of users here:
 1. Who likes these log lines, so that they can troubleshoot some
 slowness and all
 2. Who do not like these log lines.

Who likes these lines / needs them for something useful?

 So keeping these in mind, I rather go for an option which will control
 this. People falling in category one can set this option to very low
 where as users falling under second category can keep it high.

So what option(s) would you expect? Something that tunes the interval
length or something else?

A switch that'd choose between the old and new behavior might be a good
idea, but I'd strongly vote against automagic heuristics. It makes the
behavior very difficult to predict and I really don't want to force the
users to wonder whether the long delay is due to general slowness of the
machine or some clever logic that causes long delays between log messages.

That's why I chose a very simple approach with a constant time interval.
It does what I was aiming for (less logs) and it's easy to predict.
Sure, we could choose different interval (or make it an option).

 However, assuming we settled on 5 sec delay, here are few comments on
 that patch attached:
 
 Comments:
 =
 
 Patch gets applied cleanly with some whitespace errors. make and make
 install too went smooth.
 make check was smooth. Rather it should be smooth since there are NO
 changes in other part of the code rather than just pgbench.c and we do
 not have any test-case as well.
 
 However, here are few comments on changes in pgbench.c
 
 1.
 Since the final discussion ended at keeping a 5 seconds interval will be
 good enough, Author used a global int variable for that.
 Given that it's just a constant, #define would be a better choice.

Good point. Although if we add an option to supply different values, a
variable is a better match.

 2.
 +/* let's not call the timing for each row, but only each 100
 rows */
 Why only 100 rows ? Have you done any testing to come up with number 100
 ? To me it seems very low. It will be good to test with 1K or even 10K.
 On my machine (2.4 GHz Intel core 2 duo Macbook PRO, running Ubuntu in
 VM with 4GB RAM, 1067 DDR3), in 5 Sec, approx 1M rows were inserted. So
 checking every 100 rows looks overkill.

Well, the 100 is clearly a magical constant. The goal was to choose a
value large enough to amortize the getlocaltime() cost, but small enough
to print info even in cases when the performance sucks for some reason.

I've seen issues where the speed suddenly dropped to a fraction of the
expected value, e.g. 100 rows/second, and in those cases you'd have to
wait for a very long time to actually get the next log message (with the
suggested 10k step).

So 100 seems like a good compromise to me ...
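
In pseudocode the loop then boils down to something like this (a Python
sketch, not the actual pgbench C code; the 5 s interval and the 100-row step
are the values discussed above, the message format is made up):

# Print a progress line at most every LOG_INTERVAL seconds, but check the
# clock only every CHECK_EVERY rows to keep the timing overhead negligible.

import time

LOG_INTERVAL = 5          # seconds between log lines
CHECK_EVERY = 100         # rows between clock checks

def insert_row(i):
    pass                  # placeholder for the actual INSERT

def load_rows(total_rows):
    start = time.time()
    last_log = start
    for i in range(1, total_rows + 1):
        insert_row(i)
        # don't call the clock for every row, only every CHECK_EVERY rows
        if i % CHECK_EVERY == 0:
            now = time.time()
            if now - last_log >= LOG_INTERVAL:
                elapsed = now - start
                remaining = elapsed * (total_rows - i) / i
                print('%d of %d tuples done, elapsed %.1f s, remaining %.1f s'
                      % (i, total_rows, elapsed, remaining))
                last_log = now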

 3.
 Please indent following block as per the indentation just above that
 
 /* used to track elapsed time and estimate of the remaining time */
 instr_timestart, diff;
 double elapsed_sec, remaining_sec;
 int log_interval = 1;

OK

 4.
 +/* have ve reached the next interval? */
 Do you mean have WE reached...

OK

 
 5.
 While applying a patch, I got few white-space errors. But I think every
 patch goes through pgindent which might take care of this.

OK

 
 Thanks

Thanks for the review. I'll wait for a bit more discussion about the
choices before submitting another version of the patch.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Tomas Vondra
On 19.11.2012 22:58, Alexander Korotkov wrote:
 Hi!
 
 New version of patch is attached. Changes are following:
 1) Right way to convert from pg_wchar to multibyte.
 2) Optimization of producing CFNA-like graph on trigrams (produce
 smaller, but equivalent, graphs in less time).
 3) Comments and refactoring.

Hi,

thanks for the updated patch. I've done the initial review:

1) Patch applies fine to the master.

2) It's common to use upper-case names for macros, but trgm.h defines
   macro iswordchr - I see it's moved from trgm_op.c but maybe we
   could make it a bit more correct?

3) I see there are two '#ifdef KEEPONLYALNUM' blocks right next to each
   other in trgm.h - maybe it'd be a good idea to join them?

4) The two new method prototypes at the end of trgm.h use different
   indentation than the rest (spaces only instead of tabs).

5) There are no regression tests / updated docs (yet).

6) It does not compile - I do get a bunch of errors like this

trgm_regexp.c:73:2: error: expected specifier-qualifier-list before
‘TrgmStateKey’
trgm_regexp.c: In function ‘addKeys’:
trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’
trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’
...

It seems this is caused by the order of typedefs in trgm_regexp.c, namely
TrgmState referencing TrgmStateKey before it's defined. Moving the
TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4 but
I think it's not compiler-dependent.)

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
   SELECT md5(i::text) FROM generate_series(1,100) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';


   QUERY PLAN
-
 Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
   Recheck Cond: (val ~ '.*qqq.*'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
   width=0)
 Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-01 Thread Tomas Vondra

On 18.11.2012 22:54, Alexander Korotkov wrote:
 Hackers,

 Patch completely changes storage in posting lists and leaf pages of
 posting trees. It uses varbyte encoding for BlockNumber and
 OffsetNumber. BlockNumber are stored incremental in page. Additionally
 one bit of OffsetNumber is reserved for additional information NULL
 flag. To be able to find position in leaf data page quickly patch
 introduces small index in the end of page.

Hi,

I've tried to apply the patch with the current HEAD, but I'm getting
segfaults whenever VACUUM runs (either called directly or from autovac
workers).

The patch applied cleanly against 9b3ac49e and needed a minor fix when
applied on HEAD (because of an assert added to ginRedoCreatePTree), but
that shouldn't be a problem.

The backtrace always looks like this:

Program received signal SIGSEGV, Segmentation fault.
0x004dea3b in processPendingPage (accum=0x7fff15ab8aa0,
ka=0x7fff15ab8a70, page=0x7f88774a7ea0 , startoff=1) at ginfast.c:785
785 addInfo = index_getattr(itup, 2,
accum->ginstate->tupdesc[curattnum - 1], &addInfoIsNull);
(gdb) bt
#0  0x004dea3b in processPendingPage (accum=0x7fff15ab8aa0,
ka=0x7fff15ab8a70, page=0x7f88774a7ea0 , startoff=1) at ginfast.c:785
#1  0x004df3c6 in ginInsertCleanup (ginstate=0x7fff15ab97c0,
vac_delay=1 '\001', stats=0xfb0050) at ginfast.c:909
#2  0x004dbe8c in ginbulkdelete (fcinfo=0x7fff15abbfb0) at
ginvacuum.c:747



Reproducing the issue is quite simple:

1) create table messages (id int, txt text, ts tsvector);
2) insert into messages select i, substr(md5(i::text), 0, 4),
  to_tsvector('english',  substr(md5(i::text), 0, 4))
  from generate_series(1,10) s(i);
3) vacuum messages
4) ... segfault :-(

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum stress-testing our system

2012-12-02 Thread Tomas Vondra
On 21.11.2012 19:02, Robert Haas wrote:
 On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra t...@fuzzy.cz wrote:
 The two main changes are these:

 (1) The stats file is split into a common db file, containing all the
 DB Entries, and per-database files with tables/functions. The common
 file is still called pgstat.stat, the per-db files have the
 database OID appended, so for example pgstat.stat.12345 etc.

 This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile
 functions, introducing two new functions:

pgstat_read_db_statsfile
pgstat_write_db_statsfile

 that do the trick of reading/writing stat file for one database.

 (2) The pgstat_read_statsfile has an additional parameter onlydbs that
 says that you don't need table/func stats - just the list of db
 entries. This is used for autovacuum launcher, which does not need
 to read the table/stats (if I'm reading the code in autovacuum.c
 correctly - it seems to be working as expected).
 
 I'm not an expert on the stats system, but this seems like a promising
 approach to me.
 
 (a) It does not solve the many-schema scenario at all - that'll need
 a completely new approach I guess :-(
 
 We don't need to solve every problem in the first patch.  I've got no
 problem kicking this one down the road.
 
 (b) It does not solve the writing part at all - the current code uses a
 single timestamp (last_statwrite) to decide if a new file needs to
 be written.

 That clearly is not enough for multiple files - there should be one
 timestamp for each database/file. I'm thinking about how to solve
 this and how to integrate it with pgstat_send_inquiry etc.
 
 Presumably you need a last_statwrite for each file, in a hash table or
 something, and requests need to specify which file is needed.
 
 And yet another one I'm thinking about is using a fixed-length
 array of timestamps (e.g. 256), indexed by mod(dboid,256). That
 would mean stats for all databases with the same mod(oid,256) would
 be written at the same time. Seems like an over-engineering though.
 
 That seems like an unnecessary kludge.
 
 (c) I'm a bit worried about the number of files - right now there's one
 for each database and I'm thinking about splitting them by type
 (one for tables, one for functions) which might make it even faster
 for some apps with a lot of stored procedures etc.

 But is the large number of files actually a problem? After all,
 we're using one file per relation fork in the base directory, so
 this seems like a minor issue.
 
 I don't see why one file per database would be a problem.  After all,
 we already have on directory per database inside base/.  If the user
 has so many databases that dirent lookups in a directory of that size
 are a problem, they're already hosed, and this will probably still
 work out to a net win.

Attached is a v2 of the patch, fixing some of the issues and unclear
points from the initial version.

The main improvement is that it implements writing only the stats for the
requested database (set when sending an inquiry). There's a dynamic array
of requests - for each DB only the last request is kept.

I've done a number of changes - most importantly:

- added a stats_timestamp field to PgStat_StatDBEntry, keeping the last
  write of the database (i.e. a per-database last_statwrite), which is
  used to decide whether the file is stale or not

- handling of the 'permanent' flag correctly (used when starting or
  stopping the cluster) for per-db files

- added a very simple header to the per-db files (basically just a
  format ID and a timestamp) - this is needed for checking of the
  timestamp of the last write from workers (although maybe we could
  just read the pgstat.stat, which is now rather small)

- a 'force' parameter (true - write all databases, even if they weren't
  specifically requested)


So with the exception of 'multi-schema' case (which was not the aim of
this effort), it should solve all the issues of the initial version.

There are two blocks of code dealing with clock glitches. I haven't
fixed those yet, but that can wait I guess. I've also left there some
logging I've used during development (printing inquiries and which file
is written and when).

The main unsolved problem I'm struggling with is what to do when a
database is dropped. Right now, the statfile remains in pg_stat_tmp
forever (or until the restart) - is there a good way to remove the
file? I'm thinking about adding a message to be sent to the collector
from the code that handles DROP DATABASE.


I've done some very simple performance testing - I've created 1000
databases with 1000 tables each, done ANALYZE on all of them. With only
autovacuum running, I've seen this:


Without the patch
-

%CPU  %MEM   TIME+   COMMAND
183.00:10.10 postgres: autovacuum launcher process
172.60:11.44 postgres: stats collector process

The I/O

Re: [HACKERS] WIP: store additional info in GIN index

2012-12-04 Thread Tomas Vondra
On 4.12.2012 20:12, Alexander Korotkov wrote:
 Hi!
 
 On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 I've tried to apply the patch with the current HEAD, but I'm getting
 segfaults whenever VACUUM runs (either called directly or from autovac
 workers).
 
 The patch applied cleanly against 9b3ac49e and needed a minor fix when
 applied on HEAD (because of an assert added to ginRedoCreatePTree), but
 that shouldn't be a problem.
 
 
 Thanks for testing! Patch is rebased with HEAD. The bug you reported was
 fixed.

Applies fine, but I get a segfault in dataPlaceToPage at gindatapage.c.
The whole backtrace is here: http://pastebin.com/YEPuWeuV

The messages written into PostgreSQL log are quite variable - usually it
looks like this:

2012-12-04 22:31:08 CET 31839 LOG:  database system was not properly
shut down; automatic recovery in progress
2012-12-04 22:31:08 CET 31839 LOG:  redo starts at 0/68A76E48
2012-12-04 22:31:08 CET 31839 LOG:  unexpected pageaddr 0/1BE64000 in
log segment 00010069, offset 15089664
2012-12-04 22:31:08 CET 31839 LOG:  redo done at 0/69E63638

but I've seen this message too

2012-12-04 22:20:29 CET 31709 LOG:  database system was not properly
shut down; automatic recovery in progress
2012-12-04 22:20:29 CET 31709 LOG:  redo starts at 0/AEAFAF8
2012-12-04 22:20:29 CET 31709 LOG:  record with zero length at 0/C7D5698
2012-12-04 22:20:29 CET 31709 LOG:  redo done at 0/C7D55E


I wasn't able to prepare a simple testcase to reproduce this, so I've
attached two files from my fun project where I noticed it. It's a
simple DB + a bit of Python for indexing mbox archives inside Pg.

- create.sql - a database structure with a bunch of GIN indexes on
   tsvector columns on messages table

- load.py - script for parsing mbox archives / loading them into the
messages table (warning: it's a bit messy)


Usage:

1) create the DB structure
$ createdb archives
$ psql archives  create.sql

2) fetch some archives (I consistently get SIGSEGV after first three)
$ wget
http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-01.gz
$ wget
http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-02.gz
$ wget
http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-03.gz

3) gunzip and load them using the python script
$ gunzip pgsql-hackers.*.gz
$ ./load.py --db archives pgsql-hackers.*

4) et voila - a SIGSEGV :-(


I suspect this might be related to the fact that the load.py script uses
savepoints quite heavily to handle UNIQUE_VIOLATION (duplicate messages).
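
For reference, the savepoint pattern used there is roughly the following - a
minimal psycopg2 sketch with a made-up messages table, not the actual
load.py code:

# Insert each message inside a savepoint, so a duplicate (unique_violation)
# only rolls back that one insert instead of aborting the whole transaction.

import psycopg2
import psycopg2.errorcodes

conn = psycopg2.connect('dbname=archives')     # assumed connection string
cur = conn.cursor()

def insert_message(message_id, body):
    cur.execute('SAVEPOINT msg')
    try:
        cur.execute('INSERT INTO messages (message_id, body) VALUES (%s, %s)',
                    (message_id, body))
        cur.execute('RELEASE SAVEPOINT msg')
    except psycopg2.IntegrityError as e:
        cur.execute('ROLLBACK TO SAVEPOINT msg')
        if e.pgcode != psycopg2.errorcodes.UNIQUE_VIOLATION:
            raise                              # only duplicates are expected

insert_message('id-1', 'hello')
insert_message('id-1', 'hello again')          # silently skipped as duplicate
conn.commit()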


Tomas
#!/bin/env python

import argparse
import datetime
import getpass
import multiprocessing
import os
import psycopg2
import psycopg2.extras
import psycopg2.errorcodes
import quopri
import random
import re
import sys
import traceback
import UserDict

from multiprocessing import Process, JoinableQueue

class Message(dict):
	
	def __init__(self, message):
		self.message = message
		self.body= self.body(message)
		self.headers = self.headers(message)
		self.parts   = self.parts(message)
	
	def __getitem__(self, key):
		if self.headers.has_key(key.lower()):
			return self.headers[key.lower()]
		else:
			return None
	
	def __setitem__(self, key, value):
		self.headers.update({key.lower() : value})
	
	def __delitem__(self, key):
		if self.headers.has_key(key.lower()):
			del self.headers[key.lower()]
	
	def keys(self):
		return self.headers.keys()
	
	def get_body(self):
		return self.body

	def get_raw(self):
		return self.message
	
	def get_parts(self):
		return self.parts
	
	def get_headers(self):
		return self.headers

	def get_content_type(self):
		
		if self.headers.has_key('content-type'):
			return self.headers['content-type'].split(';')[0]
		else:
			return None
		
	def __repr__(self):
		return '%s %s' % (type(self).__name__, self.headers)
	
	def is_multipart(self):
		ctype = self.get_content_type()
		if ctype != None and re.match('multipart/.*', ctype):
			return True
		else:
			return False
	
	def part_boundary(self):
	
		if not self.is_multipart():
			return None
		else:
			r = re.match('.*boundary="?([^"]*)"?', self.headers['content-type'], re.IGNORECASE)
			if r:
return '--' + r.group(1)

	# FIXME this keeps only the last value - needs to keep a list
	def headers(self, message):
		
		lines = message.split("\n")
		
		key = ''
		value = ''
		
		headers = {};
		
		for l in lines:
			if l == '':
if key != '':
	headers.update({key.lower() : value})
break
			
			r = re.match('([a-zA-Z0-9-]*):\s*(.*)', l)
			if r:
if key != '':
	headers.update({key.lower() : value})

key = r.group(1)
value = r.group(2)
			else:
value += ' ' + l.strip()
			
		r = re.match('^From .*@.*\s+([a-zA-Z]*\s+[a-zA-Z]*\s+[0-9]+ [0-9]+:[0-9]+:[0-9]+\s+[0-9]{4})$', lines[0])
		if r:
			headers.update({'message-date' : r.group(1)})
		
		r = re.match('^From bouncefilter\s+([a-zA-Z]*\s+[a-zA-Z]*\s+[0

Re: [HACKERS] WIP: store additional info in GIN index

2012-12-05 Thread Tomas Vondra
On 5.12.2012 09:10, Alexander Korotkov wrote:
 On Wed, Dec 5, 2012 at 1:56 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 Thanks for bug report. It is fixed in the attached patch.

Hi,

I gave it another try and this time it went fine - I didn't get any
segfault when loading the data, which is a good news.

Then I've run a simple benchmarking script, and the results are not as
good as I expected, actually I'm getting much worse performance than
with the original GIN index.

The following table contains the time to load the data (not a big
difference), and the number of queries per minute for various numbers of
words in the query.

The queries look like this

SELECT id FROM messages
 WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...')

so it's really the simplest form of FTS query possible.

            without patch | with patch
--------------------------------------
loading        750 sec    |  770 sec
1 word        1500        | 1100
2 words      23000        | 9800
3 words      24000        | 9700
4 words      16000        | 7200


I'm not saying this is a perfect benchmark, but the differences (in
query performance) are pretty huge. Not sure where this difference comes from,
but it seems to be quite consistent (I usually get +-10% results, which
is negligible considering the huge difference).

Is this an expected behaviour that will be fixed by another patch?

The database contains ~680k messages from the mailing list archives,
i.e. about 900 MB of data (in the table), and the GIN index on tsvector
is about 900MB too. So the whole dataset nicely fits into memory (8GB
RAM), and it seems to be completely CPU bound (no I/O activity at all).

The configuration was exactly the same in both cases

shared buffers = 1GB
work mem = 64 MB
maintenance work mem = 256 MB

I can either upload the database somewhere, or provide the benchmarking
script if needed.
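
A simplified sketch of what such a benchmarking script does (Python; the word
list and connection string are made up, the query shape is the one shown
above):

# Fire the FTS query with random words for a fixed time and report the
# number of queries per minute.

import random
import time
import psycopg2

WORDS = ['postgres', 'index', 'vacuum', 'planner', 'buffer', 'wal']

def queries_per_minute(nwords, duration=60):
    conn = psycopg2.connect('dbname=archives')
    cur = conn.cursor()
    done = 0
    start = time.time()
    while time.time() - start < duration:
        phrase = ' '.join(random.sample(WORDS, nwords))
        cur.execute("SELECT id FROM messages WHERE body_tsvector @@ "
                    "plainto_tsquery('english', %s)", (phrase,))
        cur.fetchall()
        done += 1
    conn.close()
    return done * 60.0 / (time.time() - start)

for n in (1, 2, 3, 4):
    print(n, 'word(s):', queries_per_minute(n))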

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-06 Thread Tomas Vondra
On 6.12.2012 05:47, Shigeru Hanada wrote:
 On Mon, Nov 12, 2012 at 4:36 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Hi,

 I've prepared a slightly updated patch, based on the previous review.
 See it attached.
 
 All changes in v3 patch seem good, however I found some places which requires
 cosmetic changes.  Please see attached v3.1 patch for those changes.

OK, thanks!

 Oops, you're right. I omitted 1-3 and 1-4 in actual measurement, but
 IMO loading data is necessary to fill the shared buffer up, because
 # of buffers which are deleted during COMMIT is major factor of this
 patch. And, yes, I verified that all shared buffers are used after
 all loading have been finished.

 I don't think it's an important factor, as the original code does this:

   for (i = 0; i < NBuffers; i++)
   {
 volatile BufferDesc *bufHdr = BufferDescriptors[i];
 ...
   }

 in the DropRelFileNodeAllBuffers. That loops through all shared buffers
 no matter what, so IMHO the performance in this case depends on the
 total size of the shared buffers. But maybe I'm missing something important.
 
 I worry the effect of continue in the loop to the performance.  If most of
 shared buffers are not used at the moment of DROP, the load of DROP doesn't
 contain the overhead of LockBufHdr and subsequent few lines.
 Do you expect such situation in your use cases?

I still fail to see the issue here - can you give an example of when
this would be a problem?

The code was like this before the patch, I only replaced the simple
comparison with a binary search ... Or do you think that this check tests
whether the buffer is empty?

/* buffer does not belong to any of the relations */
if (rnode == NULL)
continue;

Because it does not - notice that the 'rnode' is a result of the binary
search, not a direct check of the buffer.

So the following few lines (lock and recheck) will be skipped either for
unused buffers and buffers belonging to relations that are not being
dropped.

Maybe we could do a simple 'is the buffer empty' check before the
bsearch call, but I don't expect that to happen very often in the real world
(the cache tends to be used).
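
To restate the approach in pseudocode (a Python sketch of the idea only - the
real code works on RelFileNode triples and NBuffers buffer headers in C):

# For each buffer, decide whether it belongs to one of the dropped relations
# by binary-searching a sorted array of their relfilenodes, instead of
# comparing it against every dropped relation one by one.

import bisect

def drop_buffers(buffers, dropped_rnodes):
    """buffers: list of (rnode, payload) tuples; dropped_rnodes: list of rnodes."""
    dropped_sorted = sorted(dropped_rnodes)
    survivors = []
    for buf in buffers:
        rnode = buf[0]
        i = bisect.bisect_left(dropped_sorted, rnode)
        if i < len(dropped_sorted) and dropped_sorted[i] == rnode:
            continue        # buffer belongs to a dropped relation - invalidate
        survivors.append(buf)
    return survivors

buffers = [((1663, 16384, n), 'page %d' % n) for n in range(10)]
print(len(drop_buffers(buffers, [(1663, 16384, 3), (1663, 16384, 7)])))  # 8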

 I've done a simple benchmark on my laptop with 2GB shared buffers, it's
 attached in the drop-test.py (it's a bit messy, but it works).
 [snip]
 With those parameters, I got these numbers on the laptop:

   creating 1 tables
 all tables created in 3.298694 seconds
   dropping 1 tables one by one ...
 all tables dropped in 32.692478 seconds
   creating 1 tables
 all tables created in 3.458178 seconds
   dropping 1 tables in batches by 100...
 all tables dropped in 3.28268 seconds

 So it's 33 seconds vs. 3.3 seconds, i.e. 10x speedup. On AWS we usually
 get larger differences, as we use larger shared buffers and the memory
 is significantly slower there.
 
 Do you have performance numbers of same test on not-patched PG?
 This patch aims to improve performance of bulk DROP, so 4th number in
 the result above should be compared to 4th number of not-patched PG?

I've re-run the tests with the current patch on my home workstation, and
the results are these (again 10k tables, dropped either one-by-one or in
batches of 100).

1) unpatched

dropping one-by-one:15.5 seconds
dropping in batches of 100: 12.3 sec

2) patched (v3.1)

dropping one-by-one:32.8 seconds
dropping in batches of 100:  3.0 sec

The problem here is that when dropping the tables one-by-one, the
bsearch overhead is tremendous and significantly increases the runtime.
I've done a simple check (if dropping a single table, use the original
simple comparison) and I got this:

3) patched (v3.2)

dropping one-by-one:16.0 seconds
dropping in batches of 100:  3.3 sec

i.e. the best of both. But it seems like an unnecessary complexity to me
- if you need to drop a lot of tables you'll probably do that in a
transaction anyway.

Tomas
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2446282..9e92230 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -312,6 +312,10 @@ smgrDoPendingDeletes(bool isCommit)
PendingRelDelete *pending;
PendingRelDelete *prev;
PendingRelDelete *next;
+   SMgrRelation *srels = (SMgrRelation *) palloc(sizeof(SMgrRelation));
+   int nrels = 0,
+   i = 0,
+   maxrels = 1;
 
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -335,14 +339,33 @@ smgrDoPendingDeletes(bool isCommit)
SMgrRelation srel;
 
srel = smgropen(pending->relnode, 
pending->backend);
-   smgrdounlink(srel, false);
-   smgrclose(srel);
+   
+   /* extend the array if needed (double the size

Re: [HACKERS] Serious problem: media recovery fails after system or PostgreSQL crash

2012-12-06 Thread Tomas Vondra
Hi,

On 6.12.2012 23:45, MauMau wrote:
 From: Tom Lane t...@sss.pgh.pa.us
 Well, that's unfortunate, but it's not clear that automatic recovery is
 possible.  The only way out of it would be if an undamaged copy of the
 segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
 not even be trying to fetch from the archive if we had a local copy.
 
 No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
 from archive.  XLogFileReadAnyTLI() does that.  Also, PG manual contains
 the following description:
 
 http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL
 
 
 WAL segments that cannot be found in the archive will be sought in
 pg_xlog/; this allows use of recent un-archived segments. However,
 segments that are available from the archive will be used in preference
 to files in pg_xlog/.

So why don't you use an archive command that does not create such
incomplete files? I mean something like this:

archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'

Until the file is renamed, it's considered 'incomplete'.


Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-08 Thread Tomas Vondra
On 8.12.2012 15:26, Andres Freund wrote:
 On 2012-12-06 23:38:59 +0100, Tomas Vondra wrote:
 I've re-run the tests with the current patch on my home workstation, and
 the results are these (again 10k tables, dropped either one-by-one or in
 batches of 100).

 1) unpatched

 dropping one-by-one:15.5 seconds
 dropping in batches of 100: 12.3 sec

 2) patched (v3.1)

 dropping one-by-one:32.8 seconds
 dropping in batches of 100:  3.0 sec

 The problem here is that when dropping the tables one-by-one, the
 bsearch overhead is tremendous and significantly increases the runtime.
 I've done a simple check (if dropping a single table, use the original
 simple comparison) and I got this:

 3) patched (v3.2)

 dropping one-by-one:16.0 seconds
 dropping in batches of 100:  3.3 sec

 i.e. the best of both. But it seems like an unnecessary complexity to me
 - if you need to drop a lot of tables you'll probably do that in a
 transaction anyway.

 
 Imo that's still a pretty bad performance difference. And your
 single-table optimization will probably fall short as soon as the table
 has indexes and/or a toast table...

Why? I haven't checked the code but either those objects are dropped
one-by-one (which seems unlikely) or they're added to the pending list
and then the new code will kick in, making it actually faster.

Yes, the original code might be faster even for 2 or 3 objects, but I
can't think of a simple solution to fix this, given that it really
depends on CPU/RAM speed and shared_buffers size.

 +
 +/*
 + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), 
 so
 + * that it's suitable for bsearch.
 + */
 +static int
 +rnode_comparator(const void * p1, const void * p2)
 +{
 +RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
 +RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
 +
 +if (n1.node.relNode < n2.node.relNode)
 +return -1;
 +else if (n1.node.relNode > n2.node.relNode)
 +return 1;
 +
 +if (n1.node.dbNode < n2.node.dbNode)
 +return -1;
 +else if (n1.node.dbNode > n2.node.dbNode)
 +return 1;
 +
 +if (n1.node.spcNode < n2.node.spcNode)
 +return -1;
 +else if (n1.node.spcNode > n2.node.spcNode)
 +return 1;
 +else
 +return 0;
 +}
 
 ISTM that this whole comparator could be replaced by memcmp(). That
 could quite possibly lower the overhead of the bsearch() in simple
 cases. We already rely on the fact that RelFileNode's have no padding
 atm (c.f. buf_internal.h), so a memcmp() seems fine to me.

Good point, I'll try that. The original code used a macro that simply
compared the fields and I copied that logic, but if we can remove the
code ...

Thanks for the review!

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-08 Thread Tomas Vondra
On 8.12.2012 15:49, Tomas Vondra wrote:
 On 8.12.2012 15:26, Andres Freund wrote:
 On 2012-12-06 23:38:59 +0100, Tomas Vondra wrote:
 I've re-run the tests with the current patch on my home workstation, and
 the results are these (again 10k tables, dropped either one-by-one or in
 batches of 100).

 1) unpatched

 dropping one-by-one:        15.5 seconds
 dropping in batches of 100: 12.3 sec

 2) patched (v3.1)

 dropping one-by-one:        32.8 seconds
 dropping in batches of 100:  3.0 sec

 The problem here is that when dropping the tables one-by-one, the
 bsearch overhead is tremendous and significantly increases the runtime.
 I've done a simple check (if dropping a single table, use the original
 simple comparison) and I got this:

 3) patched (v3.2)

 dropping one-by-one:        16.0 seconds
 dropping in batches of 100:  3.3 sec

 i.e. the best of both. But it seems like an unnecessary complexity to me
 - if you need to drop a lot of tables you'll probably do that in a
 transaction anyway.


 Imo that's still a pretty bad performance difference. And your
 single-table optimization will probably fall short as soon as the table
 has indexes and/or a toast table...
 
 Why? I haven't checked the code but either those objects are dropped
 one-by-one (which seems unlikely) or they're added to the pending list
 and then the new code will kick in, making it actually faster.
 
 Yes, the original code might be faster even for 2 or 3 objects, but I
 can't think of a simple solution to fix this, given that it really
 depends on CPU/RAM speed and shared_buffers size.

I've done some tests and yes - once there are other objects the
optimization falls short. For example for tables with one index, it
looks like this:

  1) unpatched

  one by one:  28.9 s
  100 batches: 23.9 s

  2) patched

  one by one:  44.1 s
  100 batches:  4.7 s

So the patched code is about 50% slower, but this difference quickly
disappears as the number of indexes / toast tables grows.

I see this as an argument AGAINST such special-case optimization. My
reasoning is this:

* This difference is significant only if you're dropping a table with
  a low number of indexes / toast tables. In reality this is not going to
  be very frequent.

* If you're dropping a single table, it really does not matter - the
  difference will be like 100 ms vs. 200 ms or something like that.

* If you're dropping a lot of tables, you should do that in a
  transaction anyway, or you should be aware that doing that in a
  transaction will be faster (we can add this info into the DROP TABLE
  docs).

IMHO this is similar to doing a lot of INSERTs - doing all of them in a
single transaction is faster. The possibility that someone needs to drop
a lot of tables in separate transactions exists in theory, but I don't
know of a realistic real-world usage.

So I'd vote to ditch that special case optimization.

 +
 +/*
 + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so
 + * that it's suitable for bsearch.
 + */
 +static int
 +rnode_comparator(const void * p1, const void * p2)
 +{
 +   RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
 +   RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
 +
 +   if (n1.node.relNode < n2.node.relNode)
 +   return -1;
 +   else if (n1.node.relNode > n2.node.relNode)
 +   return 1;
 +
 +   if (n1.node.dbNode < n2.node.dbNode)
 +   return -1;
 +   else if (n1.node.dbNode > n2.node.dbNode)
 +   return 1;
 +
 +   if (n1.node.spcNode < n2.node.spcNode)
 +   return -1;
 +   else if (n1.node.spcNode > n2.node.spcNode)
 +   return 1;
 +   else
 +   return 0;
 +}

 ISTM that this whole comparator could be replaced by memcmp(). That
 could quite possibly lower the overhead of the bsearch() in simple
 cases. We already rely on the fact that RelFileNode's have no padding
 atm (c.f. buf_internal.h), so a memcmp() seems fine to me.
 
 Good point, I'll try that. The original code used a macro that simply
 compared the fields and I copied that logic, but if we can remove the
 code ...

Hmmm, I've replaced the comparator with this one:

static int
rnode_comparator(const void * p1, const void * p2)
{
return memcmp(p1, p2, sizeof(RelFileNode));
}

and while it's not significantly faster (actually it's often a bit
slower than the original comparator), it's much simpler code.
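
(For the record, a standalone sketch of the idea - RelFileNode is mocked up
here as three unsigned ints and all the names are invented for the example,
not taken from the patch. The static assert documents the no-padding
assumption the memcmp() relies on; note the resulting order is byte-wise
rather than field-wise, but any consistent total order is enough for
qsort/bsearch.)

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;                 /* stand-in for the real Oid */

typedef struct MockRelFileNode            /* mock of RelFileNode */
{
    Oid         spcNode;
    Oid         dbNode;
    Oid         relNode;
} MockRelFileNode;

/* the memcmp() shortcut is only safe while the struct has no padding */
static_assert(sizeof(MockRelFileNode) == 3 * sizeof(Oid),
              "MockRelFileNode must not contain padding bytes");

static int
mock_rnode_comparator(const void *p1, const void *p2)
{
    return memcmp(p1, p2, sizeof(MockRelFileNode));
}

int
main(void)
{
    MockRelFileNode nodes[] = {{1663, 1, 24576}, {1663, 1, 16384}, {1663, 2, 16384}};
    int         n = sizeof(nodes) / sizeof(nodes[0]);
    int         i;

    qsort(nodes, n, sizeof(MockRelFileNode), mock_rnode_comparator);

    for (i = 0; i < n; i++)
        printf("%u %u %u\n", nodes[i].spcNode, nodes[i].dbNode, nodes[i].relNode);
    return 0;
}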

Tomas




Re: [HACKERS] review: pgbench - aggregation of info written into log

2012-12-08 Thread Tomas Vondra
On 8.12.2012 16:33, Andres Freund wrote:
 Hi Tomas,
 
 On 2012-11-27 14:55:59 +0100, Pavel Stehule wrote:
 attached is a v4 of the patch. There are not many changes, mostly some
 simple tidying up, except for handling the Windows.
 
 After a quick look I am not sure what all the talk about windows is
 about? instr_time.h seems to provide all you need, even for windows? The
 only issue of gettimeofday() for windows seems to be that it's
 not all that fast and not too high precision, but that shouldn't be a
 problem in this case?
 
 Could you expand a bit on the problems?

Well, in the current pgbench.c, there's this piece of code

#ifndef WIN32
/* This is more than we really ought to know about instr_time */
fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
        st->id, st->cnt, usec, st->use_file,
        (long) now.tv_sec, (long) now.tv_usec);
#else
/* On Windows, instr_time doesn't provide a timestamp anyway */
fprintf(logfile, "%d %d %.0f %d 0 0\n",
        st->id, st->cnt, usec, st->use_file);
#endif

and the Windows-related discussion in this patch is about the same issue.

As I understand it the problem seems to be that INSTR_TIME_SET_CURRENT
does not provide the timestamp at all.

I was thinking about improving this by combining gettimeofday and
INSTR_TIME, but I have no Windows box to test it on :-(

 
 * I had a problem with doc
 
 The current patch has conflict markers in the sgml source, there seems
 to have been some unresolved merge. Maybe that's all that causes the
 errors?

Ah, I see. Probably my fault, I'll fix that.

 Whats your problem with setting up the doc toolchain?
 
 issues:

 * empty lines with invisible chars (tabs) + and sometimes empty lines
 after and before {}

 * adjustment of start_time

 +   * the desired interval */
 +   while (agg->start_time + agg_interval < INSTR_TIME_GET_DOUBLE(now))
 +   agg->start_time = agg->start_time + agg_interval;

 can skip one interval - so when transaction time will be larger or
 similar to agg_interval - then results can be strange. We have to know
 real length of interval
 
 Could you post a patch that adresses these issues?

Yes, I'll work on it today/tomorrow and I'll send an improved patch.

Tomas




Re: [HACKERS] review: pgbench - aggregation of info written into log

2012-12-08 Thread Tomas Vondra
Hi,

attached is a v5 of this patch. Details below:

On 8.12.2012 16:33, Andres Freund wrote:
 Hi Tomas,
 
 On 2012-11-27 14:55:59 +0100, Pavel Stehule wrote:
 attached is a v4 of the patch. There are not many changes, mostly some
 simple tidying up, except for handling the Windows.
 
 After a quick look I am not sure what all the talk about windows is
 about? instr_time.h seems to provide all you need, even for windows? The
 only issue of gettimeofday() for windows seems to be that it's
 not all that fast and not too high precision, but that shouldn't be a
 problem in this case?
 
 Could you expand a bit on the problems?

As explained in the previous message, this is an existing problem with
the timestamp being unavailable on Windows. I'm not very fond of adding
Linux-only features, but fixing that was not the goal of this patch.

 * I had a problem with doc
 
 The current patch has conflict markers in the sgml source, there seems
 to have been some unresolved merge. Maybe that's all that causes the
 errors?
 
 Whats your problem with setting up the doc toolchain?

Yeah, my fault because of incomplete merge. But the real culprit was a
missing refsect2. Fixed.

 
 issues:

 * empty lines with invisible chars (tabs) + and sometimes empty lines
 after and before {}

Fixed (I've removed the lines etc.)


 * adjustment of start_time

 +   * the desired interval */
 +   while (agg->start_time + agg_interval < INSTR_TIME_GET_DOUBLE(now))
 +   agg->start_time = agg->start_time + agg_interval;

 can skip one interval - so when transaction time will be larger or
 similar to agg_interval - then results can be strange. We have to know
 real length of interval
 
 Could you post a patch that adresses these issues?

So, in the end I've rewritten the section advancing the start_time. Now
it works so that when skipping an interval (because of a very long
transaction), it will print lines even for those empty intervals.

So for example with a transaction file containing a single query

   SELECT pg_sleep(1.5);

and an interval length of 1 second, you'll get something like this:

1355009677 0 0 0 0 0
1355009678 1 1501028 2253085056784 1501028 1501028
1355009679 0 0 0 0 0
1355009680 1 1500790 2252370624100 1500790 1500790
1355009681 1 1500723 2252169522729 1500723 1500723
1355009682 0 0 0 0 0
1355009683 1 1500719 2252157516961 1500719 1500719
1355009684 1 1500703 2252109494209 1500703 1500703
1355009685 0 0 0 0 0

which is IMHO the best way to deal with this.
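
Just to illustrate how these lines might be post-processed - assuming, based
on the AggVals struct in the patch, that the six columns are interval start
(unix timestamp), transaction count, sum of latencies, sum of squared
latencies and min/max latency in microseconds - here is a trivial standalone
filter (my own sketch, not part of the patch; compile with -lm and feed it
the aggregated log lines on stdin):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    long        start, cnt;
    double      sum, sum2, min, max;

    /* one aggregated log line per iteration, read from stdin */
    while (scanf("%ld %ld %lf %lf %lf %lf",
                 &start, &cnt, &sum, &sum2, &min, &max) == 6)
    {
        if (cnt > 0)
        {
            double      mean = sum / cnt;
            /* variance = E[x^2] - (E[x])^2, computable from sum and sum of squares */
            double      var = sum2 / cnt - mean * mean;

            printf("%ld: %ld xacts, avg %.0f us, stddev %.0f us (min %.0f, max %.0f)\n",
                   start, cnt, mean, sqrt(var > 0 ? var : 0), min, max);
        }
        else
            printf("%ld: no transactions\n", start);
    }
    return 0;
}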

I've fixed several minor issues, added a few comments.

regards
Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..5a03796 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -150,6 +150,7 @@ char   *index_tablespace = NULL;
 #define naccounts  100000
 
 bool   use_log;        /* log transaction latencies to a file */
+int    agg_interval;   /* log aggregates instead of individual transactions */
 bool   is_connect;     /* establish connection for each transaction */
 bool   is_latencies;   /* report per-command latencies */
 int    main_pid;       /* main process id used in log filename */
@@ -245,6 +246,18 @@ typedef struct
char   *argv[MAX_ARGS]; /* command word list */
 } Command;
 
+typedef struct
+{
+
+   long    start_time;     /* when does the interval start */
+   int     cnt;            /* number of transactions */
+   double  min_duration;   /* min/max durations */
+   double  max_duration;
+   double  sum;            /* sum(duration), sum(duration^2) - for estimates */
+   double  sum2;
+   
+} AggVals;
+
 static Command **sql_files[MAX_FILES]; /* SQL script files */
 static int num_files;  /* number of script files */
 static int num_commands = 0;   /* total number of Command structs */
@@ -377,6 +390,10 @@ usage(void)
 "  -l           write transaction times to log file\n"
 "  --sampling-rate NUM\n"
 "               fraction of transactions to log (e.g. 0.01 for 1%% sample)\n"
+#ifndef WIN32
+"  --aggregate-interval NUM\n"
+"               aggregate data over NUM seconds\n"
+#endif
 "  -M simple|extended|prepared\n"
 "               protocol for submitting queries to server (default: simple)\n"
 -n   do not run VACUUM before tests\n
@@ -830,9 +847,25 @@ clientDone(CState *st, bool ok)
return false;   /* always false */
 }
 
+static
+void agg_vals_init(AggVals * aggs, instr_time start)
+{
+   /* basic counters */
+   aggs->cnt = 0;  /* number of transactions */
+   aggs->sum = 

Re: [HACKERS] too much pgbench init output

2012-12-08 Thread Tomas Vondra
On 20.11.2012 08:22, Jeevan Chalke wrote:
 Hi,
 
 
 On Tue, Nov 20, 2012 at 12:08 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 On 19.11.2012 11:59, Jeevan Chalke wrote:
  Hi,
 
  I gone through the discussion for this patch and here is my review:
 
  The main aim of this patch is to reduce the number of log lines. It is
  also suggested to use an options to provide the interval but few of us
  are not much agree on it. So final discussion ended at keeping 5 sec
  interval between each log line.
 
  However, I see, there are two types of users here:
  1. Who likes these log lines, so that they can troubleshoot some
  slowness and all
  2. Who do not like these log lines.
 
 Who likes these lines / needs them for something useful?
 
 
 No idea. I fall in second category.
 
 But from the discussion, I believe some people may need detailed (or lot
 more) output.

I've read the thread again and my impression is that no one really needs
or likes those lines, but

  (1) it's rather pointless to print a message every 100k rows, as it
  usually fills the console with garbage

  (2) it's handy to have regular updates of the progress

I don't think there are people (in the thread) who require keeping the
current amount of log messages.

But there might be users who actually use the current logs to do
something (although I can't imagine what). If we want to do this in a
backwards compatible way, we should probably use a new option (e.g.
-q) to enable the new (less verbose) logging.

Do we want to allow both types of logging, or shall we keep only the new
one? If both, which one should be the default one?

  So keeping these in mind, I rather go for an option which will control
  this. People falling in category one can set this option to very low
  where as users falling under second category can keep it high.
 
 So what option(s) would you expect? Something that tunes the interval
 length or something else?
 
 
 Interval length.

Well, I can surely imagine something like --interval N.

 A switch that'd choose between the old and new behavior might be a good
 idea, but I'd strongly vote against automagic heuristics. It makes the
 behavior very difficult to predict and I really don't want to force the
 users to wonder whether the long delay is due to general slowness of the
 machine or some clever logic that causes long delays between log
 messages.
 
 That's why I choose a very simple approach with constant time interval.
 It does what I was aiming for (less logs) and it's easy to predict.
 Sure, we could choose different interval (or make it an option).
 
 
 I am preferring an option for choosing an interval, say from 1 second to
 10 seconds.

Um, why not allow an arbitrary integer? Why say 1 to 10 seconds?

 BTW, what if, we put one log message every 10% (or 5%) with time taken
 (time taken for last 10% (or 5%) and cumulative) over 5 seconds ?
 This will have only 10 (or 20) lines per pgbench initialisation.
 And since we are showing time taken for each block, if any slowness
 happens, one can easily find a block by looking at the timings and
 troubleshoot it.
 Though 10% or 5% is again a debatable number, but keeping it constant
 will eliminate the requirement of an option.

That's what I originally proposed in September (see the messages from
17/9), and Alvaro was not really excited about this.

Attached is a patch with fixed whitespace / indentation errors etc.
Otherwise it's the same as the previous version.

Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..334ce4c 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -135,6 +135,11 @@ int    unlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * logging steps (seconds between log messages)
+ */
+int    log_step_seconds = 5;
+
+/*
  * tablespace selection
  */
 char  *tablespace = NULL;
@@ -1362,6 +1367,11 @@ init(bool is_no_vacuum)
 char        sql[256];
int i;
 
+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double  elapsed_sec, remaining_sec;
+   int log_interval = 1;
+
if ((con = doConnect()) == NULL)
exit(1);
 
@@ -1430,6 +1440,8 @@ init(bool is_no_vacuum)
}
PQclear(res);
 
+   INSTR_TIME_SET_CURRENT(start);
+
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
@@ -1441,10 +1453,27 @@ init(bool is_no_vacuum)
exit(1);
}
 
-   if (j % 100000 == 0)
-   fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-   j, naccounts * scale

Re: [HACKERS] New statistics for WAL buffer dirty writes

2012-12-09 Thread Tomas Vondra
On 29.10.2012 04:58, Satoshi Nagayasu wrote:
 2012/10/24 1:12, Alvaro Herrera wrote:
 Satoshi Nagayasu escribi�:

 With this patch, walwriter process and each backend process
 would sum up dirty writes, and send it to the stat collector.
 So, the value could be saved in the stat file, and could be
 kept on restarting.

 The statistics could be retreive with using
 pg_stat_get_xlog_dirty_writes() function, and could be reset
 with calling pg_stat_reset_shared('walwriter').

 Now, I have one concern.

 The reset time could be captured in globalStats.stat_reset_timestamp,
 but this value is the same with the bgwriter one.

 So, once pg_stat_reset_shared('walwriter') is called,
 stats_reset column in pg_stat_bgwriter does represent
 the reset time for walwriter, not for bgwriter.

 How should we handle this?  Should we split this value?
 And should we have new system view for walwriter?

 I think the answer to the two last questions is yes.  It doesn't seem to
 make sense, to me, to have a single reset timings for what are
 effectively two separate things.

 Please submit an updated patch to next CF.  I'm marking this one
 returned with feedback.  Thanks.

 
 I attached the latest one, which splits the reset_time
 for bgwriter and walwriter, and provides new system view,
 called pg_stat_walwriter, to show the dirty write counter
 and the reset time.

I've done a quick review of the v4 patch:

1) applies fine on HEAD, compiles fine

2) make installcheck fails because of a difference in the 'rules'
   test suite (there's a new view pg_stat_walwriter - see the
   attached patch for a fixed version of expected/rules.out)

3) I do agree with Alvaro that using the same struct for two separate
   components (bgwriter and walwriter) seems a bit awkward. For example
   you need to have two separate stat_reset fields, the reset code
   becomes much more verbose (because you need to list individual
   fields) etc.

   So I'd vote to either split this into two structures or keeping it
   as a single structure (although with two views on top of it).

4) Are there any other fields that might be interesting? Right now
   there's just dirty_writes but I guess there are other values. E.g.
   how much data was actually written etc.?

Tomas

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..334ce4c 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -135,6 +135,11 @@ int    unlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * logging steps (seconds between log messages)
+ */
+int    log_step_seconds = 5;
+
+/*
  * tablespace selection
  */
 char  *tablespace = NULL;
@@ -1362,6 +1367,11 @@ init(bool is_no_vacuum)
 char        sql[256];
int i;
 
+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double  elapsed_sec, remaining_sec;
+   int log_interval = 1;
+
if ((con = doConnect()) == NULL)
exit(1);
 
@@ -1430,6 +1440,8 @@ init(bool is_no_vacuum)
}
PQclear(res);
 
+   INSTR_TIME_SET_CURRENT(start);
+
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
@@ -1441,10 +1453,27 @@ init(bool is_no_vacuum)
exit(1);
}
 
-   if (j % 100000 == 0)
-   fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-   j, naccounts * scale,
-   (int) (((int64) j * 100) / (naccounts * scale)));
+   /* let's not call the timing for each row, but only each 100 rows */
+   if (j % 100 == 0 || j == scale * naccounts)
+   {
+   INSTR_TIME_SET_CURRENT(diff);
+   INSTR_TIME_SUBTRACT(diff, start);
+
+   elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+   remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+   /* have we reached the next interval? */
+   if (elapsed_sec >= log_interval * log_step_seconds) {
+
+   fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n",
+   j, naccounts * scale,
+   (int) (((int64) j * 100) / (naccounts * scale)), elapsed_sec, remaining_sec);
+
+   /* skip to the next interval */
+   log_interval = (int)ceil(elapsed_sec/log_step_seconds);
+   }
+   }
+
}
if (PQputline(con, "\\.\n"))
{
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b7df8ce..bf9acc5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ 

Re: [HACKERS] CommitFest #3 and upcoming schedule

2012-12-09 Thread Tomas Vondra
On 9.12.2012 16:56, Simon Riggs wrote:
 On 16 November 2012 07:20, Greg Smith g...@2ndquadrant.com wrote:
 
 Project guidelines now ask each patch submitter to review patches of the
 same number and approximate complexity as they submit.  If you've submitted
 some items to the CommitFest, please look at the open list and try to find
 something you can review.
 
 The deadline for 9.3 is looming and many patches have not yet been reviewed.
 
 I'm sending a public reminder to all patch authors that they should
 review other people's patches if they expect their own to be reviewed.
 
 Simply put, if you don't help each other by reviewing other patches
 then the inevitable result will be your patch will not be neither
 reviewed nor committed.

IMHO many of the patches that are currently marked as needs review and
have no reviewers, were actually reviewed or are being discussed
thoroughly on the list, but this information was not propagated to the
CF page.

Not sure how to fix this except for updating patches that I've reviewed
and urging the others to do the same.


Tomas




Re: [HACKERS] CommitFest #3 and upcoming schedule

2012-12-09 Thread Tomas Vondra
On 9.12.2012 22:41, Jeff Janes wrote:
 On Sun, Dec 9, 2012 at 10:47 AM, Tomas Vondra t...@fuzzy.cz wrote:

 IMHO many of the patches that are currently marked as needs review and
 have no reviewers, were actually reviewed or are being discussed
 thoroughly on the list, but this information was not propagated to the
 CF page.
 
 Should active discussion on the hackers list prevent someone from
 doing a review?  I know I am reluctant to review a patch when it seems
 it is still being actively redesigned/debated by others.

 Maybe a new status of needs design consensus would be useful.

IMHO introducing new statuses won't improve the state. Moreover reaching
a design consensus is a natural part of the review process.

I see those discussions as a part of the review process, so it's not
that an active discussion means 'no review' (although the CF page states
needs review or no reviewer for such patches).

There's nothing wrong with doing yet another review for a patch, but in
most cases I tend to agree with the points already raised in the
discussion so it's not really productive. Thus I share the same
reluctance to do more reviews for those actively discussed patches.

My point is that some of the idle patches are actually quite active in
the background, no one just updated the CF page. And I see many such
patches moved forward over the last few days.

Tomas




Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-10 Thread Tomas Vondra

Dne 10.12.2012 16:38, Andres Freund napsal:

On 2012-12-08 17:07:38 +0100, Tomas Vondra wrote:

I've done some test and yes - once there are other objects the
optimization falls short. For example for tables with one index, it
looks like this:

  1) unpatched

  one by one:  28.9 s
  100 batches: 23.9 s

  2) patched

  one by one:  44.1 s
  100 batches:  4.7 s

So the patched code is by about 50% slower, but this difference quickly
disappears with the number of indexes / toast tables etc.

I see this as an argument AGAINST such special-case optimization. My
reasoning is this:

* This difference is significant only if you're dropping a table with
  low number of indexes / toast tables. In reality this is not going to
  be very frequent.

* If you're dropping a single table, it really does not matter - the
  difference will be like 100 ms vs. 200 ms or something like that.


I don't particularly buy that argument. There are good reasons (like
avoiding deadlocks, long transactions) to drop multiple tables
in individual transactions.
Not that I have a good plan to how to work around that though :(


Yeah, if you need to drop the tables one by one for some reason, you
can't get rid of the overhead this way :-(

OTOH in the example above the overhead is ~50%, i.e. 1.5ms / table with a
single index. Each such associated relation (index, TOAST table, ...) means
a relation that needs to be dropped and on my machine, once I reach ~5
relations there's almost no difference as the overhead is balanced by the
gains.

Not sure how to fix that in an elegant way, though :-(

Tomas




Re: [HACKERS] Serious problem: media recovery fails after system or PostgreSQL crash

2012-12-16 Thread Tomas Vondra
On 8.12.2012 03:08, Jeff Janes wrote:
 On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Hi,

 On 6.12.2012 23:45, MauMau wrote:
 From: Tom Lane t...@sss.pgh.pa.us
 Well, that's unfortunate, but it's not clear that automatic recovery is
 possible.  The only way out of it would be if an undamaged copy of the
 segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
 not even be trying to fetch from the archive if we had a local copy.

 No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
 from archive.  XLogFileReadAnyTLI() does that.  Also, PG manual contains
 the following description:

 http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL


 WAL segments that cannot be found in the archive will be sought in
 pg_xlog/; this allows use of recent un-archived segments. However,
 segments that are available from the archive will be used in preference
 to files in pg_xlog/.

 So why don't you use an archive command that does not create such
 incomplete files? I mean something like this:

 archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'

 Until the file is renamed, it's considered 'incomplete'.
 
 Wouldn't having the incomplete file be preferable over having none of it at 
 all?
 
 It seems to me you need considerable expertise to figure out how to do
 optimal recovery (i.e. losing the least transactions) in this
 situation, and that that expertise cannot be automated.  Do you trust
 a partial file from a good hard drive, or a complete file from a
 partially melted pg_xlog?

It clearly is a rather complex issue, no doubt about that. And yes,
reliability of the devices with pg_xlog on them is an important detail.
Although if the WAL is not written in a reliable way, you're hosed
anyway I guess.

The recommended archive command is based on the assumption that the
local pg_xlog is intact (e.g. because it's located on a reliable RAID1
array), which seems to be the assumption of the OP too.

In my opinion it's more likely to meet an incomplete copy of WAL in the
archive than a corrupted local WAL. And if it really is corrupted, it
would be identified during replay.

Tomas




Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-16 Thread Tomas Vondra
Hi,

I've updated the patch to include the optimization described in the
previous post, i.e. if the number of relations is below a certain
threshold, use a simple for loop; for larger numbers of relations use
bsearch calls.

This is done by a new constant BSEARCH_LIMIT, which is set to 10 in the
patch. Then I've modified the 'drop-test' script to take yet another
argument - number of indexes for each table. So for example this

./drop-test.py 10000 100 3 'dbname=test'

means create 10000 tables, 3 indexes for each of them, and then drop
them in batches of 100 tables.

Then I've run the test with 0, 1, 2, ... 11 indexes for each table for

(a) unpatched HEAD
(b) patch v3.1 (without the optimization)
(c) patch v3.3 (with BSEARCH_LIMIT=10)

and I got these results:


1) dropping one-by-one
--

This is the really interesting part - the issue with the v3.1 is that
for a single table, it's ~2x slower than unpatched PostgreSQL.

            0    1    2    3    4    5    6    7    8    9   10   11
---------------------------------------------------------------------
unpatched  16   28   40   52   63   75   87   99  110  121  135  147
 v3.1      33   43   46   56   58   60   63   72   75   76   79   80
 v3.3      16   20   23   25   29   33   36   40   44   47   79   82

The values are durations in seconds, rounded to integer values. I've run
the test repeatedly and there's very small variance in the numbers.

The v3.3 improves that and it's actually even faster than unpatched
PostgreSQL. How can this happen? I believe it's because the code is
rewritten from

   for each relation (r) in the drop list
  DropRelFileNodeAllBuffers (r)
 for each shared buffer
check and invalidate

to

   copy the relations from drop list to array (a)
   DropRelFileNodeAllBuffers(a)
  for each shared buffer
  for each relation (r) in the array
  check and invalidate

At least that's the only explanation I was able to come up with.
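
To make that concrete, here is a tiny standalone toy - not PostgreSQL code,
buffers and relations are reduced to plain integers and all names are made
up - showing the patched structure: sort the dropped relations once, then do
a single pass over the buffers with a bsearch per buffer, instead of a full
pass over the buffers for each dropped relation:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

static int
oid_cmp(const void *a, const void *b)
{
    Oid         x = *(const Oid *) a;
    Oid         y = *(const Oid *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    /* mock shared buffers: each buffer is tagged with the relation it caches */
    Oid         buffers[] = {10, 42, 7, 42, 99, 10, 7, 13, 42, 99};
    /* mock list of relations being dropped */
    Oid         drop[] = {42, 7, 99};
    int         nbuffers = sizeof(buffers) / sizeof(buffers[0]);
    int         ndrop = sizeof(drop) / sizeof(drop[0]);
    int         i;

    /* sort once, so each buffer costs only a log(ndrop) lookup */
    qsort(drop, ndrop, sizeof(Oid), oid_cmp);

    /* one pass over all buffers, instead of one pass per dropped relation */
    for (i = 0; i < nbuffers; i++)
    {
        if (bsearch(&buffers[i], drop, ndrop, sizeof(Oid), oid_cmp) != NULL)
            printf("buffer %d (rel %u) would be invalidated\n", i, buffers[i]);
    }
    return 0;
}

So the work drops from roughly (relations x buffers) comparisons to a single
buffer scan with a cheap membership test, which is where the one-by-one case
gets its time back.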

Yet another interesting observation is that even the v3.1 is about as
fast as the unpatched code once there are 3 or more indexes (or TOAST
tables).

So my opinion is that the optimizated patch works as expected, and that
even without the optimization the performance would be acceptable for
most real-world cases.

2) dropping in transaction
--

This is mostly to verify that the code did not break anything, because
the optimization should not kick-in in this case at all. And that seems
to be the case:

            0    1    2    3    4    5    6    7    8    9   10   11
---------------------------------------------------------------------
unpatched  13   24   35   46   58   69   81   92  103  115  127  139
 v3.1       3    5    7    8   10   12   14   15   16   18   20   21
 v3.3       3    4    6    7    8   10   11   13   14   15   18   20

The differences between v3.1 and v3.3 are mostly due to rounding etc.


Attached is the v3.3 patch and the testing script I've been using for
the tests above. Feel free to run the tests on your hardware, with your
hardware, shared buffers size etc. I've run that on a 4-core i5-2500 CPU
with 2GB shared buffers.

Tomas


#!/bin/env python

import datetime
import psycopg2
import sys

if __name__ == '__main__':
	
	if len(sys.argv) < 5:
		print "ERROR: not enough parameters"
		print "HINT:  drop-test.py num-of-tables drop-num num-of-indexes 'connection string'"
		sys.exit(1)
	
	ntables = int(sys.argv[1])
	nlimit  = int(sys.argv[2])
	nindex  = int(sys.argv[3])
	connstr = str(sys.argv[4])
	debug   = False
	
	conn = psycopg2.connect(connstr)
	cur  = conn.cursor()
	
	# print 'creating %s tables' % (ntables,)
	start = datetime.datetime.now()
	for i in range(ntables):
		cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
		for j in range(nindex):
			cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i))
		conn.commit();
		if (i % 1000 == 0) and debug:
			print '  tables created: %s' % (i,)
	conn.commit()
	end = datetime.datetime.now()
	# print '  all tables created in %s seconds' % ((end-start).total_seconds(),)

	# set to autocommit mode
	conn.autocommit = True
	start = datetime.datetime.now()
	# print 'dropping %s tables one by one ...' % (ntables,)
	for i in range(ntables):
		cur.execute('DROP TABLE tab_%s' % (i,))
		if (i % 1000 == 0) and debug:
			print '  tables dropped: %s' % (i,)

	end = datetime.datetime.now()
	print 'dropped one-by-one in %s seconds' % ((end-start).total_seconds(),)

	# cancel the autocommit mode
	conn.autocommit = False
	
	# recreate tables
	# print 'creating %s tables' % (ntables,)
	start = datetime.datetime.now()
	for i in range(ntables):
		cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
		for j in range(nindex):
			cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i))
		conn.commit();
		if (i % 1000 == 0) and debug:
			print '  tables created: %s' % (i,)

	conn.commit()
	end = datetime.datetime.now()
	# print '  all tables created in %s seconds' % ((end-start).total_seconds(),)
	
	# drop the tables in batches
	

Re: [HACKERS] too much pgbench init output

2012-12-16 Thread Tomas Vondra
Hi,

attached is a new version of the patch that

(a) converts the 'log_step_seconds' variable to a constant (and does
not allow changing it using a command-line option etc.)

(b) keeps the current logging as a default

(c) adds a -q switch that enables the new logging with a 5-second
interval

I'm still not convinced there should be yet another knob for tuning the
log interval - opinions?


Tomas

On 11.12.2012 10:23, Jeevan Chalke wrote:
 
 
 
 On Sun, Dec 9, 2012 at 8:11 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 On 20.11.2012 08:22, Jeevan Chalke wrote:
  Hi,
 
 
  On Tue, Nov 20, 2012 at 12:08 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
  On 19.11.2012 11:59, Jeevan Chalke wrote:
   Hi,
  
   I gone through the discussion for this patch and here is my
 review:
  
   The main aim of this patch is to reduce the number of log
 lines. It is
   also suggested to use an options to provide the interval but
 few of us
   are not much agree on it. So final discussion ended at
 keeping 5 sec
   interval between each log line.
  
   However, I see, there are two types of users here:
   1. Who likes these log lines, so that they can troubleshoot some
   slowness and all
   2. Who do not like these log lines.
 
  Who likes these lines / needs them for something useful?
 
 
  No idea. I fall in second category.
 
  But from the discussion, I believe some people may need detailed
 (or lot
  more) output.
 
 I've read the thread again and my impression is that no one really needs
 or likes those lines, but
 
   (1) it's rather pointless to print a message every 100k rows, as it
   usually fills the console with garbage
 
   (2) it's handy to have regular updates of the progress
 
 I don't think there're people (in the thread) that require to keep the
 current amount of log messages.
 
 But there might be users who actually use the current logs to do
 something (although I can't imagine what). If we want to do this in a
 backwards compatible way, we should probably use a new option (e.g.
 -q) to enable the new (less verbose) logging.
 
 Do we want to allow both types of logging, or shall we keep only the new
 one? If both, which one should be the default one?
 
 
 Both the options are fine with me, but the default should be the current
 behaviour.
 
 
   So keeping these in mind, I rather go for an option which
 will control
   this. People falling in category one can set this option to
 very low
   where as users falling under second category can keep it high.
 
  So what option(s) would you expect? Something that tunes the
 interval
  length or something else?
 
 
  Interval length.
 
 Well, I can surely imagine something like --interval N.
 
 
 +1
  
 
 
  A switch that'd choose between the old and new behavior might
 be a good
  idea, but I'd strongly vote against automagic heuristics. It
 makes the
  behavior very difficult to predict and I really don't want to
 force the
  users to wonder whether the long delay is due to general
 slowness of the
  machine or some clever logic that causes long delays between log
  messages.
 
  That's why I choose a very simple approach with constant time
 interval.
  It does what I was aiming for (less logs) and it's easy to
 predict.
  Sure, we could choose different interval (or make it an option).
 
 
  I am preferring an option for choosing an interval, say from 1
 second to
  10 seconds.
 
 Um, why not allow an arbitrary integer? Why say 1 to 10 seconds?
 
 
 Hmm.. actually, I have no issues with any number there. Just put 1..10
 as we hard-coded it 5. No particular reason as such.
  
 
 
  BTW, what if, we put one log message every 10% (or 5%) with time taken
  (time taken for last 10% (or 5%) and cumulative) over 5 seconds ?
  This will have only 10 (or 20) lines per pgbench initialisation.
  And since we are showing time taken for each block, if any slowness
  happens, one can easily find a block by looking at the timings and
  troubleshoot it.
  Though 10% or 5% is again a debatable number, but keeping it constant
  will eliminate the requirement of an option.
 
 That's what I originally proposed in September (see the messages from
  17/9), and Alvaro was not really excited about this.
 
 Attached is a patch with fixed whitespace / indentation errors etc.
 Otherwise it's the same as the previous version.
 
 
 OK. Looks good now.
 
 Any other views / suggestions are welcome.
 
 Thanks

Re: [HACKERS] system administration functions with hardcoded superuser checks

2012-12-18 Thread Tomas Vondra
On 18.12.2012 18:38, Pavel Stehule wrote:
 2012/12/18 Peter Eisentraut pete...@gmx.net:
 There are some system administration functions that have hardcoded
 superuser checks, specifically:

 pg_reload_conf
 pg_rotate_logfile

 Some of these are useful in monitoring or maintenance tools, and the
 hardcoded superuser checks require that these tools run with maximum
 privileges.  Couldn't we just install these functions without default
 privileges and allow users to grant privileges as necessary?
 
 isn't it too strong gun for some people ???
 
 I believe so some one can decrease necessary rights and it opens doors
 to system.

No one was speaking about making them executable by a wider group of
users by default (i.e. decreasing necessary rights). Today, when you
need to provide the EXECUTE privilege on those functions, you have three
options

(a) make him a superuser - obviously not a good choice

(b) create a SECURITY DEFINER wrapper **for each function separately**

(c) deny to do that


Being able to do a plain GRANT on the function is merely a simpler way
to do (b). It has advantages (fewer objects/functions to care about) and
disadvantages (e.g. you can't do additional checks on parameter values).

 pg_read_file
 pg_read_file_all
 pg_read_binary_file
 pg_read_binary_file_all
 pg_stat_file
 pg_ls_dir
 
 is relative dangerous and I am not for opening these functions.
 
 power user can simply to write extension, but he knows what he does/

I see only dangers that are already present.

Tomas




Re: [HACKERS] too much pgbench init output

2012-12-19 Thread Tomas Vondra
On 19.12.2012 06:30, Jeevan Chalke wrote:
 
 
 
  On Mon, Dec 17, 2012 at 5:37 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 Hi,
 
 attached is a new version of the patch that
 
 (a) converts the 'log_step_seconds' variable to a constant (and does
 not allow changing it using a command-line option etc.)
 
 (b) keeps the current logging as a default
 
 (c) adds a -q switch that enables the new logging with a 5-second
 interval
 
 I'm still not convinced there should be yet another knob for tuning the
 log interval - opinions?
 
 
 It seems that you have generated a patch over your earlier version and
 due to that it is not cleanly applying on fresh sources.
 Please generate patch on fresh sources.

Seems you're right - I've attached the proper patch against current master.

 However, I have absolutely no issues with the design. Also code review is
 already done and looks good to me.
 I think to move forward on this we need someone from the core team. So I am
 planning to change its status to ready-for-committer.
 
 Before that please provide updated patch for final code review.

thanks
Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..f3953a7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -39,6 +39,7 @@
 #include portability/instr_time.h
 
 #include ctype.h
+#include math.h
 
 #ifndef WIN32
 #include sys/time.h
@@ -102,6 +103,7 @@ extern int  optind;
 #define MAXCLIENTS 1024
 #endif
 
+#define LOG_STEP_SECONDS   5   /* seconds between log messages */
 #define DEFAULT_NXACTS 10  /* default nxacts */
 
 int    nxacts = 0;     /* number of transactions per client */
@@ -150,6 +152,7 @@ char   *index_tablespace = NULL;
 #define naccounts  100000
 
 bool   use_log;        /* log transaction latencies to a file */
+bool   use_quiet;      /* quiet logging onto stderr */
 bool   is_connect;     /* establish connection for each transaction */
 bool   is_latencies;   /* report per-command latencies */
 int    main_pid;       /* main process id used in log filename */
@@ -389,6 +392,7 @@ usage(void)
 "  -v           vacuum all four standard tables before tests\n"
 "\nCommon options:\n"
 "  -d           print debugging output\n"
+"  -q           quiet logging (a message each 5 seconds)\n"
 "  -h HOSTNAME  database server host or socket directory\n"
 "  -p PORT      database server port number\n"
 "  -U USERNAME  connect as specified database user\n"
@@ -1362,6 +1366,11 @@ init(bool is_no_vacuum)
 char        sql[256];
int i;
 
+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double  elapsed_sec, remaining_sec;
+   int log_interval = 1;
+
if ((con = doConnect()) == NULL)
exit(1);
 
@@ -1430,6 +1439,8 @@ init(bool is_no_vacuum)
}
PQclear(res);
 
+   INSTR_TIME_SET_CURRENT(start);
+
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
@@ -1441,10 +1452,33 @@ init(bool is_no_vacuum)
exit(1);
}
 
-   if (j % 100000 == 0)
+   /* If we want to stick with the original logging, print a message each
+    * 100k inserted rows. */
+   if ((! use_quiet) && (j % 100000 == 0))
fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-   j, naccounts * scale,
-   (int) (((int64) j * 100) / (naccounts * scale)));
+   j, naccounts * scale,
+   (int) (((int64) j * 100) / (naccounts * scale)));
+   /* let's not call the timing for each row, but only each 100 rows */
+   else if (use_quiet && (j % 100 == 0))
+   {
+   INSTR_TIME_SET_CURRENT(diff);
+   INSTR_TIME_SUBTRACT(diff, start);
+
+   elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+   remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+   /* have we reached the next interval (or end)? */
+   if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS)) {
+
+   fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n",
+   j, naccounts * scale

Re: [HACKERS] system administration functions with hardcoded superuser checks

2012-12-19 Thread Tomas Vondra
On 19.12.2012 07:34, Magnus Hagander wrote:
 On Wed, Dec 19, 2012 at 1:58 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 18.12.2012 18:38, Pavel Stehule wrote:
 2012/12/18 Peter Eisentraut pete...@gmx.net:
 There are some system administration functions that have hardcoded
 superuser checks, specifically:

 pg_reload_conf
 pg_rotate_logfile

 Some of these are useful in monitoring or maintenance tools, and the
 hardcoded superuser checks require that these tools run with maximum
 privileges.  Couldn't we just install these functions without default
 privileges and allow users to grant privileges as necessary?

 isn't it too strong gun for some people ???

 I believe so some one can decrease necessary rights and it opens doors
 to system.

 No one was speaking about making them executable by a wider group of
 users by default (i.e. decreasing necessary rights). Today, when you
 need to provide the EXECUTE privilege on those functions, you have three
 options
 
 Given how limited these functions are in scope, I don't see a problem here.
 
 pg_read_file
 pg_read_file_all
 pg_read_binary_file
 pg_read_binary_file_all
 pg_stat_file
 pg_ls_dir

 is relative dangerous and I am not for opening these functions.

 power user can simply to write extension, but he knows what he does/

 I see only dangers that are already present.
 
 Granting executability on pg_read_xyz is pretty darn close to granting
 superuser, without explicitly asking for it. Well, you get read only
 superuser. If we want to make that step as easy as just GRANT, we
 really need to write some *very* strong warnings in the documentation
 so that people realize this. I doubt most people will realize it
 unless we do that (and those who don't read the docs, which is probably
 a majority, never will).

Yup, that's what I meant by the possibility to perform additional
checks on parameter values ;-)

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] strange OOM errors with EXECUTE in PL/pgSQL

2012-12-19 Thread Tomas Vondra
Hi,

one of our local users reported he's getting OOM errors on 9.2, although
on 9.1 the code worked fine. Attached is a simple test-case that should
give you an OOM error almost immediately.

What it does:

1) creates a simple table called test with one text column.

2) creates a plpgsql function with one parameter, and all that function
   does is passing the parameter to EXECUTE

3) calls the function with a string containing many INSERTs into the
   test table

The way the EXECUTE is used is a bit awkward, but the failures seem a
bit strange to me. The whole script is ~500kB and most of that is about
11k of very simple INSERT statements:

   insert into test(value) values (''aa'');

all of them are exactly the same. Yet when it fails with OOM, the log
contains memory context stats like these:

TopMemoryContext: 5303376 total in 649 blocks; 2648 free ...
  PL/pgSQL function context: 8192 total in 1 blocks; 3160 free ...
  TopTransactionContext: 8192 total in 1 blocks; 6304 free ...
ExecutorState: 8192 total in 1 blocks; 7616 free ...
  ExprContext: 8192 total in 1 blocks; 8160 free ...
SPI Exec: 33554432 total in 14 blocks; 6005416 free ...
  CachedPlanSource: 3072 total in 2 blocks; 1856 free ...
  CachedPlanSource: 538688 total in 3 blocks; 1744 free ...
CachedPlanQuery: 3072 total in 2 blocks; 1648 free ...
  CachedPlanSource: 538688 total in 3 blocks; 1744 free ...
CachedPlanQuery: 3072 total in 2 blocks; 1648 free ...
  CachedPlanSource: 538688 total in 3 blocks; 1744 free ...
CachedPlanQuery: 3072 total in 2 blocks; 1648 free ...
  CachedPlanSource: 538688 total in 3 blocks; 1744 free ...
CachedPlanQuery: 3072 total in 2 blocks; 1648 free ...
  CachedPlanSource: 538688 total in 3 blocks; 1744 free ...
  ...

There are ~9500 of these CachedPlanSource + CachedPlanQuery row pairs
(see the attached log). That seems a bit strange to me, because all the
queries are exactly the same in this test case.

The number of queries needed to get OOM is inversely proportional to the
query length - by using a longer text (instead of 'aaa') you may
use far fewer queries.

I am no expert in this area, but it seems to me that the code does not
expect that many INSERTs in EXECUTE and does not release the memory for
some reason (e.g. because the plans are allocated in SPI Exec memory
context, etc.).

regards
Tomas


pg-oom.log.gz
Description: application/gzip


test2.sql.gz
Description: application/gzip



Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-19 Thread Tomas Vondra
On 19.12.2012 02:18, Andres Freund wrote:
 On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote:
 
 I think except for the temp buffer issue mentioned below it's ready.
 
 -DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
 +DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
  {
 -int i;
 +int i, j;
 +
 +/* sort the list of rnodes */
 +pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);

  /* If it's a local relation, it's localbuf.c's problem. */
 -if (RelFileNodeBackendIsTemp(rnode))
 +for (i = 0; i < nnodes; i++)
  {
 -if (rnode.backend == MyBackendId)
 -DropRelFileNodeAllLocalBuffers(rnode.node);
 -return;
 +if (RelFileNodeBackendIsTemp(rnodes[i]))
 +{
 +if (rnodes[i].backend == MyBackendId)
 +DropRelFileNodeAllLocalBuffers(rnodes[i].node);
 +}
  }
 
 While you deal with local buffers here you don't anymore in the big loop
 over shared buffers. That wasn't needed earlier since we just returned
 after noticing we have a local relation, but that's not the case anymore.

Hmm, but that would require us to handle the temp relations explicitly
wherever we call DropRelFileNodeAllBuffers. Currently there are two such
places - smgrdounlink() and smgrdounlinkall().

By placing it into DropRelFileNodeAllBuffers() this code is shared and I
think it's a good thing.

But that does not mean the code is perfect - it was based on the
assumption that if there's a mix of temp and regular relations, the temp
relations will be handled in the first part and the rest in the second one.

Maybe it'd be better to improve it so that the temp relations are
removed from the array after the first part (and skip the second one if
there are no remaining relations).
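
A minimal standalone sketch of what I mean (mock types and a plain flag
instead of the real RelFileNodeBackendIsTemp() check - not the actual patch
code): handle the temp entries in the first part, compact the array in
place, and run the shared-buffer part only on whatever is left:

#include <stdio.h>

typedef struct MockRNode
{
    int         relid;          /* stand-in for the relfilenode */
    int         is_temp;        /* stand-in for RelFileNodeBackendIsTemp() */
} MockRNode;

int
main(void)
{
    MockRNode   nodes[] = {{16384, 0}, {16390, 1}, {16401, 0}, {16407, 1}};
    int         nnodes = sizeof(nodes) / sizeof(nodes[0]);
    int         i;
    int         n = 0;

    /* first part: deal with temp relations, keep only the rest */
    for (i = 0; i < nnodes; i++)
    {
        if (nodes[i].is_temp)
            printf("rel %d: handled by the local-buffer code\n", nodes[i].relid);
        else
            nodes[n++] = nodes[i];      /* compact the array in place */
    }

    /* second part: scan shared buffers only if anything is left */
    for (i = 0; i < n; i++)
        printf("rel %d: goes into the shared-buffer scan\n", nodes[i].relid);

    return 0;
}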

 
  for (i = 0; i < NBuffers; i++)
  {
 +RelFileNodeBackend *rnode = NULL;
  volatile BufferDesc *bufHdr = BufferDescriptors[i];
 -
 +
  /*
   * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
   * and saves some cycles.
   */
 -if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
 +
 +/*
 + * For low number of relations to drop just use a simple walk through,
 + * to save the bsearch overhead. The BSEARCH_LIMIT is rather a guess
 + * than a exactly determined value, as it depends on many factors (CPU
 + * and RAM speeds, amount of shared buffers etc.).
 + */
 +if (nnodes <= BSEARCH_LIMIT)
 
 I think that's a sensible plan. It makes sense that for a small number of
 relations a sequential scan of the rnodes array is faster than a bsearch
 and 10 sounds like a good value although I would guess the optimal value
 is slightly higher on most machines. But if it works fine without
 regressions that's pretty good...

I think it's pointless to look for the optimal value in this case, given
how many factors it depends on. We could use 20 instead of 10, but I
wouldn't go higher probably.

 +
 +/*
 + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so
 + * that it's suitable for bsearch.
 + */
 +static int
 +rnode_comparator(const void * p1, const void * p2)
 +{
 +RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
 +RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
 +
 +if (n1.node.relNode < n2.node.relNode)
 +return -1;
 +else if (n1.node.relNode > n2.node.relNode)
 +return 1;
 +
 +if (n1.node.dbNode < n2.node.dbNode)
 +return -1;
 +else if (n1.node.dbNode > n2.node.dbNode)
 +return 1;
 +
 +if (n1.node.spcNode < n2.node.spcNode)
 +return -1;
 +else if (n1.node.spcNode > n2.node.spcNode)
 +return 1;
 +else
 +return 0;
 +}
 
 Still surprised this is supposed to be faster than a memcmp, but as you
 seem to have measured it earlier..

It surprised me too. These are the numbers with the current patch:

1) one by one
=
          0    2    4    6    8   10   12   14   16   18   20
--------------------------------------------------------------
current  15   22   28   34   41   75   77   82   92   99  106
memcmp   16   23   29   36   44  122  125  128  153  154  158

Until the number of indexes reaches ~10, the numbers are almost exactly
the same. Then the bsearch branch kicks in and it's clear how much
slower the memcmp comparator is.

2) batches of 100
=
          0    2    4    6    8   10   12   14   16   18   20
--------------------------------------------------------------
current   3    5    8   10   12   15   17   21   23   27   31
memcmp    4    7   10   13   16   19   22   28   30   32   36

Here the difference is much smaller, but even here the memcmp is
consistently a bit slower.


My

Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL

2012-12-19 Thread Tomas Vondra
On 20.12.2012 02:29, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 What it does:
 
 1) creates a simple table called test with one text column.
 
 2) creates a plpgsql function with one parameter, and all that function
does is passing the parameter to EXECUTE
 
 3) calls the function with a string containing many INSERTs into the
test table
 
 The way the EXECUTE is used is a bit awkward, but the failures seem a
 bit strange to me. The whole script is ~500kB and most of that is about
 11k of very simple INSERT statements:
 
insert into test(value) values (''aa'');
 
 The reason this fails is that you've got a half-megabyte source string,
 and each of the 11000 plans that are due to be created from it saves
 its own copy of the source string.  Hence, 5500 megabytes needed just
 for source strings.
 
 We could possibly fix this by inventing some sort of reference-sharing
 arrangement (which'd be complicated and fragile) or by not storing the
 source strings with the plans (which'd deal a serious blow to our
 ability to provide helpful error messages).  Neither answer seems
 appealing.

Thanks for the explanation, I didn't occur to me that each plan keeps a
copy of the whole source string.

 I think it would be a better idea to adopt a less brain-dead way of
 processing the data.  Can't you convert this to a single INSERT with a
 lot of VALUES rows?  Or split it into multiple EXECUTE chunks?

Well, it's not my app but I'll recommend it to them. Actually I already
did but I didn't have an explanation of why it behaves like this.

The really annoying bit is that in 9.1 this works fine (it's just as
crazy an approach, but it does not end with an OOM error).

Tomas



