Re: pg_ctl failover Re: [HACKERS] Latches, signals, and waiting

2011-01-18 Thread Fujii Masao
On Thu, Jan 13, 2011 at 9:08 PM, Fujii Masao  wrote:
> I did s/failover/promote. Here is the updated patch.

I rebased the patch to current git master.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pg_ctl_promote_v3.patch
Description: Binary data



Re: [HACKERS] Add support for logging the current role

2011-01-18 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Point being, until we get Andrew's jagged-csv-import magic committed to
> > core, we won't be able to import a log file that a user has changed the
> > field list for mid-stream (following the add-a-new-column use-case we've
> > been discussing).
> 
> Now I think you're reaching the mission-to-mars stage that Robert was
> complaining about.  Solving that sort of problem is well outside the
> scope of this patch.  I don't care if people have to shut down and
> restart their servers in order to make a change to the log format.
> Even if I did, the other patch sounds like a better approach.

Alright, here's the latest on this patch.  I've added a log_csv_fields
GUC along with the associated magic to make it work (at least for me).
I've also added 'role_name' and '%U' options.  It requires a postmaster
restart to change, and I didn't include any 'short-cut' field options,
though I don't think they'd be hard to add if we can decide on them.
The default remains the same as what was in 9.0.
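
For anyone skimming the thread, a small hypothetical psql sketch of the
distinction between %u (session user) and the new %U (current role); the
role names here are made up for illustration:

    SET SESSION AUTHORIZATION alice;   -- %u / user_name now logs as "alice"
    SET ROLE reporting;                -- %U / role_name now logs as "reporting"
    RESET ROLE;                        -- %U goes back to 'none' (current role matches the session user)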

commit ff249aeac7216da623bf77840380d5e767f681fc
Author: Stephen Frost 
Date:   Wed Jan 19 00:26:52 2011 -0500

Add log_csv_fields GUC for CSV output & curr_role

This patch adds a new GUC called 'log_csv_fields'.  This GUC allows
the user to control the set of fields written to the CSV output as
well as the order in which they are written.  The default set of
fields remains those that were included in 9.0, to avoid breaking
existing user configurations.

In passing, update 'user_name' for log_line_prefix and log_csv_fields
to mean 'session user' (which could be reset by a superuser with
set session authorization), and add a 'role_name' option (%U) to
log_line_prefix and log_csv_fields, to allow users to log the
current role (as set by SET ROLE- not impacted by SECURITY DEFINER
functions).

Thanks,

Stephen
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 3504,3510  local0.*/var/log/postgresql
  
  
   %u
!  User name
   yes
  
  
--- 3504,3523 
  
  
   %u
!  Session user name, typically the user name that was used
!  to authenticate to PostgreSQL,
!  but it can be changed by a superuser; see SET SESSION
!  AUTHORIZATION
!  yes
! 
! 
!  %U
!  Current role name, when set with SET ROLE;
!  the current role identifier is relevant for permission checking;
!  Returns 'none' if the current role matches the session user.
!  Note: Log messages from inside SECURITY DEFINER
!  functions will show the calling role, not the effective role
!  inside the SECURITY DEFINER function
   yes
  
  
***
*** 3621,3626  FROM pg_stat_activity;
--- 3634,3662 

   
  
+  
+   log_csv_fields (string)
+   
+log_csv_fields configuration parameter
+   
+   
+
+ Controls the set and order of the fields which are written out in
+ the CSV-format log file.
+ 
+ The default is: log_time, user_name, database_name, process_id,
+ connection_from, session_id, session_line_num, command_tag,
+ session_start_time, virtual_transaction_id, transaction_id,
+ error_severity, sql_state_code, message, detail, hint,
+ internal_query, internal_query_pos, context, query, query_pos,
+ location, application_name
+ 
+ For details on what these fields are, refer to the log_line_prefix
+ and CSV logging documentation.
+
+   
+  
+ 
   
log_lock_waits (boolean)

***
*** 3728,3761  FROM pg_stat_activity;
  Including csvlog in the log_destination list
  provides a convenient way to import log files into a database table.
  This option emits log lines in comma-separated-values
! (CSV) format,
! with these columns:
! timestamp with milliseconds,
! user name,
! database name,
! process ID,
! client host:port number,
! session ID,
! per-session line number,
! command tag,
! session start time,
! virtual transaction ID,
! regular transaction ID,
! error severity,
! SQLSTATE code,
! error message,
! error message detail,
! hint,
! internal query that led to the error (if any),
! character count of the error position therein,
! error context,
! user query that led to the error (if any and enabled by
! log_min_error_statement),
! character count of the error position therein,
! location of

Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Fujii Masao
On Wed, Jan 19, 2011 at 4:12 AM, Magnus Hagander  wrote:
> Ah, ok. I've added the errorcode now, PFA. I also fixed an error in
> the change for result codes I broke in the last patch. github branch
> updated as usual.

Great. Thanks for the quick update!

Here are another comments:

+ * IDENTIFICATION
+ *   src/bin/pg_basebackup.c

Typo: s/"src/bin/pg_basebackup.c"/"src/bin/pg_basebackup/pg_basebackup.c"


+   printf(_("  -c, --checkpoint=fast|slow\n"
+            "set fast or slow checkpoinging\n"));

Typo: s/checkpoinging/checkpointing

The "fast or slow" seems to lead users to always choose "fast". Instead,
what about "fast or smooth", "fast or spread" or "immediate or delayed"?

You seem to have forgotten to support the "--checkpoint" long option.
The long_options struct needs to be updated.

What if pg_basebackup receives a signal while doing a backup?
For example, users might press Ctrl-C to cancel a long-running backup.
Should we define a signal handler and send a Terminate message
to the server to cancel the backup?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Snapshot synchronization, again...

2011-01-18 Thread Joachim Wieland
Noah, thank you for this excellent review. I have adopted most
(almost all) of your suggestions (except for the documentation, which
is still missing). Please check the updated and attached patch for
details.

On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch  wrote:
> "" is a valid md5 message digest.  To avoid always 
> accepting
> a snapshot with that digest, we would need to separately store a flag 
> indicating
> whether a given syncSnapshotChksums member is currently valid.  Maybe that 
> hole
> is acceptable, though.

In the end I decided to store md5 checksums as printable characters in
shmem. We now need a few more bytes for each checksum but as soon as a
byte is NUL we know that it is not a valid checksum.

>> +
>> +     if (!XactReadOnly)
>> +             elog(WARNING, "A snapshot exporting function should be 
>> readonly.");
>
> There are legitimate use cases for copying the snapshot of a read-write
> transaction.  Consider a task like calculating some summaries for insert into 
> a
> table.  To speed this up, you might notionally partition the source data and
> have each of several workers calculate each partition.  I'd vote for removing
> this warning entirely.

Warning removed, adding this fact to the documentation only is fine with me.


> InvalidBackendId is -1.  Is InvalidOid (0) in fact also invalid as a backend 
> ID?

Yes, there is a check in utils/init/postinit.c that makes sure that 0
is never a valid backendId. But anyway, I have now made this more
explicit by adding a dedicated parse routine for ints.


>> +     while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
>> +     {
>> +             if (xid == GetTopTransactionIdIfAny())
>> +                     continue;
>> +             snapshot->xip[i++] = xid;
>
> This works on a virgin GetSnapshotData() snapshot, which always has enough 
> space
> (sized according to max_connections).  A snapshot from CopySnapshot(), 
> however,
> has just enough space for the original usage.  I get a SIGSEGV at this line 
> with
> the attached test script.  If I change things around so xip starts non-NULL, I
> get messages like these as the code scribbles past the end of the allocation:

This has been fixed. xip/subxip are now allocated separately if necessary.


> Though unlikely, the other backend may not even exist by now.  Indeed, that
> could have happened and RecentGlobalXmin advanced since we compared the
> checksum.  Not sure what the right locking is here, but something is needed.

Good catch. What I have done now is a second check at the end of
ImportSnapshot(). At that point we have set up the new snapshot and
adjusted MyProc->xmin. If we succeed, then we are fine. If not we
throw an error and invalidate the whole transaction.


>> +      * Instead we must check to not go forward (we might have opened a 
>> cursor
>> +      * in this transaction and still have its snapshot registered)
>> +      */
>
> I'm thinking this case should yield an ERROR.  Otherwise, our net result would
> be to silently adopt a snapshot that is neither our old self nor the target.
> Maybe there's a use case for that, but none comes to my mind.

This can happen when you do:

DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;


> I guess the use case for pg_import_snapshot from READ COMMITTED would be
> something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
> First off, would that work ("stuff" use the imported snapshot)?  If it does
> work, we should take the precedent of SET LOCAL and issue no warning.  If it
> doesn't work, then we should document that the snapshot does not take effect until
> the next command and probably keep this warning or something similar.

No, this will also give you a new snapshot for every query, no matter
if it is executed within or outside of a DO-Block.
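
For readers of just this excerpt, a minimal sketch of the intended
SERIALIZABLE usage.  pg_import_snapshot() is the function added by the
patch; the exporting counterpart pg_export_snapshot() and the '...'
placeholder argument are assumptions made purely for illustration:

    -- exporting session
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT pg_export_snapshot();        -- hand the result to the other session

    -- importing session
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT pg_import_snapshot('...');   -- value obtained from the exporting session
    SELECT count(*) FROM some_table;    -- sees the exporter's snapshot
    COMMIT;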


>> + Datum
>> + pg_import_snapshot(PG_FUNCTION_ARGS)
>> + {
>> +     bytea  *snapshotData = PG_GETARG_BYTEA_P(0);
>> +     bool    ret = true;
>> +
>> +     if (ActiveSnapshotSet())
>> +             ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
>> +
>> +     ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
>
> Is it valid to scribble directly on snapshots like this?

I figured that previously executed code still has references pointing
to the snapshots, so we cannot just push a new snapshot on top but
really need to change the memory they are pointing to.


I am also adding a test script that shows the difference between READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block.
It's based on the script you sent.


So thanks again and I'm looking forward to your next review... :-)

Joachim
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2dbac56..c96942a 100644
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
*** CreateSharedMemoryAndSemaphores(bool mak
*** 124,129 
--- 124,130 
  		size 

Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Fujii Masao
On Tue, Jan 18, 2011 at 8:40 PM, Magnus Hagander  wrote:
>> When I untar the tar file taken by pg_basebackup, I got the following
>> messages:
>>
>>    $ tar xf base.tar
>>    tar: Skipping to next header
>>    tar: Archive contains obsolescent base-64 headers
>>    tar: Error exit delayed from previous errors
>>
>> Is this a bug? This happens only when I create $PGDATA by using
>> initdb -X (i.e., I relocated the pg_xlog directory elsewhere than
>> $PGDATA).
>
> Interesting. What version of tar and what platform? I can't reproduce
> that here...

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

$ uname -a
Linux hermes 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux

$ tar --version
tar (GNU tar) 1.15.1

>> +               r = PQgetCopyData(conn, &copybuf, 0);
>> +               if (r == -1)
>>
>> Since -1 from PQgetCopyData might indicate an error, wouldn't we
>> need to call PQgetResult in this case?
>
> Uh, -1 means end of data, no? -2 means error?

The comment in pqGetCopyData3 says

/*
 * On end-of-copy, exit COPY_OUT or COPY_BOTH mode and let caller
 * read status with PQgetResult().  The normal case is that it's
 * Copy Done, but we let parseInput read that.  If error, we expect
 * the state was already changed.
 */

Also the comment in getCopyDataMessage says

/*
 * If it's a legitimate async message type, process it.  (NOTIFY
 * messages are not currently possible here, but we handle them for
 * completeness.)  Otherwise, if it's anything except Copy Data,
 * report end-of-copy.
 */

So I thought that. BTW, walreceiver has already done that.

>> ReceiveTarFile seems refactorable by using GZWRITE and GZCLOSE
>> macros.
>
> You mean the ones from pg_dump? I don't think so. We can't use
> gzwrite() with compression level 0 on the tar output, because it will
> still write a gz header. With pg_dump, that is ok because it's our
> format, but with a .tar (without .gz) I don't think it is.

Right. I withdraw the comment.

>> +       /*
>> +        * Make sure we're unpacking into an empty directory
>> +        */
>> +       verify_dir_is_empty_or_create(current_path);
>>
>> Can pg_basebackup take a backup of $PGDATA including a tablespace
>> directory, without an error? The above code seems to prevent that
>
> Uh, how do you mean it would prevent that? It requires that the
> directory you're writing the tablespace to is empty or nonexistent,
> but that shouldn't prevent a backup, no? It will prevent you from
> overwriting things with your backup, but that's intentional - if you
> don't need the old dir, just remove it.

What I'm worried about is the case where a tablespace is created
under the $PGDATA directory. In this case, ISTM that pg_basebackup
takes the backup of $PGDATA, including the tablespace directory, first,
and then takes the backup of the tablespace directory again. But
since the tablespace directory is no longer empty at that point, the
backup of the tablespace seems likely to fail.
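
To make the scenario concrete, a hedged sketch of the setup being described
(the location path is purely illustrative):

    -- a tablespace whose directory lives inside the data directory
    CREATE TABLESPACE ts_inside LOCATION '/path/to/pgdata/ts_inside';
    -- pg_basebackup would first copy $PGDATA (which already contains
    -- ts_inside/), and then try to unpack the tablespace into the now
    -- non-empty ts_inside/ directory, presumably tripping over the
    -- "directory must be empty" check discussed above.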

> Was easy, done with "-c ".

Thanks a lot!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pl/python refactoring

2011-01-18 Thread Alvaro Herrera
Excerpts from Peter Eisentraut's message of mar ene 18 19:22:50 -0300 2011:

> #16: This is probably pointless because pgindent will reformat this.

pgindent used to remove useless braces around single-statement blocks,
but this behavior was removed years ago because it'd break formatting
around PG_TRY blocks.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] auto-sizing wal_buffers

2011-01-18 Thread Fujii Masao
On Tue, Jan 18, 2011 at 8:50 PM, Greg Smith  wrote:
>> Why is the minimum value 64kB only when wal_buffers is set to
>> -1? This seems confusing for users.
>>
>
> That's because the current default on older versions is 64kB.  Since the
> automatic selection is going to be the new default, I hope, I don't want it
> to be possible it will pick a number smaller than the default of older
> versions.  So the automatic lower limit is 64kB, while the actual manually
> set lower limit remains 32kB, as before.

It would be helpful to explain that in a source code comment, and also
in the documentation.
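
For anyone following along, the knob in question, sketched (-1 is the
proposed auto-select default):

    -- in postgresql.conf:  wal_buffers = -1   -- auto-select; 64kB floor per the above
    SHOW wal_buffers;                          -- inspect the setting from a session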

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] REVIEW: PL/Python validator function

2011-01-18 Thread Hitoshi Harada
2011/1/18 Jan Urbański :
> On 17/01/11 09:26, Jan Urbański wrote:
>> On 17/01/11 01:02, Hitoshi Harada wrote:
>>> This is a review for the patch sent as
>>> https://commitfest.postgresql.org/action/patch_view?id=456
>>> It includes adequate amount of test. I found regression test failure
>>> in plpython_error.
>>
>>> My environment is CentOS release 5.4 (Final) with python 2.4.3
>>> installed default.
>
> Seems that somewhere between Python 2.4 and Python 2.6 the whole module
> that was providing SyntaxError got rewritten and the way a syntax error
> from Py_CompileString is reported changed :( I tried some tricks but in
> the end I don't think it's worth it: I just added an alternative
> regression output file for older Pythons.
>
>>> It looks fine overall. The only thing that I came up with is trigger
>>> check logic in PLy_procedure_is_trigger. Although it seems following
>>> plperl's corresponding function, the check of whether the prorettype
>>> is pseudo type looks redundant since it checks prorettype is
>>> TRIGGEROID or OPAQUEOID later. But it is not critical.
>>
>> Yes, you're right, a check for prorettype only should be sufficient. Will
>> fix.
>
> I removed the test for TYPTYPE_PSEUDO in the is_trigger function.
>
> Updated patch attached.

Thanks. I tested the new version and it looks OK. I'll mark it "Ready for
Committer".


Regards,

-- 
Hitoshi Harada



Re: [HACKERS] log_hostname and pg_stat_activity

2011-01-18 Thread Alvaro Herrera
Excerpts from Steve Singer's message of mar ene 18 21:24:25 -0300 2011:

> Coding Review
> -
> 
> As Alvaro pointed out BackendStatusShmemSize should be updated.
> 
> To answer his question about why clientaddr works:  clientaddr is a
> SockAddr, which is a structure, not a pointer, so the data gets copied by
> value to beentry.  That won't work for strings.  I have no issues with
> how you're allocating space in beentry and then strlcpy'ing the values.

Doh, of course.  Thanks.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] pl/python refactoring

2011-01-18 Thread Hitoshi Harada
2011/1/19 Peter Eisentraut :
> On mån, 2011-01-17 at 21:49 +0200, Peter Eisentraut wrote:
>> On sön, 2011-01-02 at 12:41 +0100, Jan Urbański wrote:
>> > Here they are. There are 16 patches in total. They amount to what's
>> > currently in my refactor branch (and almost to what I've sent as the
>> > complete refactor patch, while doing the splitting I discovered a few
>> > useless hunks that I've discarded).
>>
>> Committed 0001, rest later.
>
> Today committed: 3, 5, 10, 11, 12, 13

I have reviewed this original patches for myself as the fundamental of
subsequent work, and have comments.

- This is not in the patch, but around line 184 "vis versa" in a comment
looks like a typo.
- -1 is used as the initial value of PLyTypeInfo.is_rowtype. Why not 0?
- PLy_(input|output)_tuple_funcs() in PLy_trigger_handler() is added.
The comment says it should check for the possibility that the
relation's tupdesc changed since the last call. Is it true that a tupdesc
may change even within one statement? The comment also says the two
functions are responsible for not doing repetitive work, but ISTM they
do nothing to avoid redundant work. The functions don't seem
lightweight enough to be called for each row.
- elog() and PLy_elog() are both used throughout, but their usage looks
inconsistent to me. Moreover, shouldn't elog(ERROR, "PLyTypeInfo
struct is initialized for a Datum"); in
PLy_(input|output)_tuple_funcs() be an Assert() rather than an elog()?
- A line break should be added before PLy_add_exception() after "static void"
- This is also not in the patch, but the comment
/* these should only be called once at the first call
 * of plpython_call_handler.  initialize the python interpreter
 * and global data.
 */
is bogus. PLy_init_interp() is called in _PG_init().

That's all for now. Some of them must be my misunderstanding or
trivial, but I post them for the record.


Regards,

-- 
Hitoshi Harada



Re: [HACKERS] psql: Add \dL to show languages

2011-01-18 Thread Josh Kupershmidt
On Tue, Jan 18, 2011 at 1:35 PM, Andreas Karlsson  wrote:
> Hi Josh,
>
> Nope, I do not have any better ideas than "DO Blocks?".
>
> Everything looks good with the exception one bug now.
>
> \dL foo
> * QUERY **
> SELECT l.lanname AS "Name",
>       pg_catalog.pg_get_userbyid(l.lanowner) as "\Owner",
>       l.lanpltrusted AS "Trusted"
> FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'
>
> ORDER BY 1;
> **
>
> ERROR:  syntax error at or near "l"
> LINE 4: FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'
>
>
> I believe the fix is to move \n from before the WHERE clause to after
> the FROM, and from before ORDER BY to after WHERE.

Whoops, good you caught that. Should be fixed now.

> Fix this bug and I believe this patch is ready for a committer.
>
> PS. You added some trailing whitespace after printACLColumn. A
> recommendation if you want to avoid it is to either have a git commit
> hook which checks for that and/or have colouring of git diffs so you can
> see it marked in red. I use both. :)

Got that now too. I lost my ~/.emacs file recently, which is mostly
why I'm making whitespace mistakes. Rebuilding slowly though;
(setq-default show-trailing-whitespace t) is what I needed.

I left the "Call Handler" and "Validator" columns in the verbose
output since I haven't heard otherwise.

Josh
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 5f61561..30d4507 100644
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
*** testdb=>
*** 1249,1254 
--- 1249,1269 
  

  
+   
+ \dL[S+] [ pattern ]
+ 
+ 
+ Lists all procedural languages. If pattern
+ is specified, only languages whose names match the pattern are listed.
+ By default, only user-created languages
+ are shown; supply the S modifier to include system
+ objects. If + is appended to the command name, each
+ language is listed with its call handler, validator, access privileges,
+ and whether it is a system object.
+ 
+ 
+   
  

  \dn[S+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 962c13c..301dc11 100644
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
*** exec_command(const char *cmd,
*** 416,421 
--- 416,424 
  			case 'l':
  success = do_lo_list();
  break;
+ 			case 'L':
+ success = listLanguages(pattern, show_verbose, show_system);
+ break;
  			case 'n':
  success = listSchemas(pattern, show_verbose, show_system);
  break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 205190f..abfd854 100644
*** a/src/bin/psql/describe.c
--- b/src/bin/psql/describe.c
*** listTables(const char *tabtypes, const c
*** 2566,2571 
--- 2566,2638 
  }
  
  
+ /* \dL
+ *
+ * Describes Languages.
+ */
+ bool
+ listLanguages(const char *pattern, bool verbose, bool showSystem)
+ {
+ 	PQExpBufferData buf;
+ 	PGresult *res;
+ 	printQueryOpt myopt = pset.popt;
+ 
+ 	initPQExpBuffer(&buf);
+ 
+ 	printfPQExpBuffer(&buf,
+ 	  "SELECT l.lanname AS \"%s\",\n",
+ 	  gettext_noop("Name"));
+ 	if (pset.sversion >= 80300)
+ 			appendPQExpBuffer(&buf,
+ 			  "   pg_catalog.pg_get_userbyid(l.lanowner) as \"%s\",\n",
+ 			  gettext_noop("Owner"));
+ 
+ 	appendPQExpBuffer(&buf,
+ 	  "   l.lanpltrusted AS \"%s\"",
+ 	  gettext_noop("Trusted"));
+ 
+ 	if (verbose)
+ 	{
+ 			appendPQExpBuffer(&buf,
+ 			  ",\n   l.lanplcallfoid::regprocedure AS \"%s\",\n"
+ 			  "   l.lanvalidator::regprocedure AS \"%s\",\n"
+ 			  "   NOT l.lanispl AS \"%s\",\n   ",
+ 			  gettext_noop("Call Handler"),
+ 			  gettext_noop("Validator"),
+ 			  gettext_noop("System Language"));
			if (pset.sversion >= 90000)
+ appendPQExpBuffer(&buf, "l.laninline != 0 AS \"%s\",\n   ",
+   gettext_noop("DO Blocks?"));
+ 			printACLColumn(&buf, "l.lanacl");
+ 	}
+ 
+ 	appendPQExpBuffer(&buf,
+ 	  "\nFROM pg_catalog.pg_language l\n");
+ 
+ 	processSQLNamePattern(pset.db, &buf, pattern, false, false,
+ 		  NULL, "l.lanname", NULL, NULL);
+ 
+ 	if (!showSystem && !pattern)
+ 		appendPQExpBuffer(&buf, "WHERE lanplcallfoid != 0\n");
+ 
+ 	appendPQExpBuffer(&buf, "ORDER BY 1;");
+ 
+ 	res = PSQLexec(buf.data, false);
+ 	termPQExpBuffer(&buf);
+ 	if (!res)
+ 		return false;
+ 
+ 	myopt.nullPrint = NULL;
+ 	myopt.title = _("List of languages");
+ 	myopt.translate_header = true;
+ 
+ 	printQuery(res, &myopt, pset.queryFout, pset.logfile);
+ 
+ 	PQclear(res);
+ 	return true;
+ }
+ 
+ 
  /*
   * \dD
   *
diff --git a/src/bin/psql/describe.h b/src/bin/psql/describe.h
index 2029ef8..4e80bcf 100644
*** a/src/bin/psql/describe.h
--- b/src/bin/psql/describe.h
*** extern bool listUserMappings(const char
*** 84,88 
--- 84,90 
  /* \det */
 

Re: [HACKERS] Extending opfamilies for GIN indexes

2011-01-18 Thread Tom Lane
I wrote:
> So here's what I'm thinking: we could redefine a GIN opclass, per se, as
> needing only compare() and extractValue() procs to be bound into it.
> The other three procs, as well as the query operators, could be "loose"
> in the containing opfamily.  The index AM would choose which set of the
> other support procedures to use for a specific query by matching their
> amproclefttype/amprocrighttype to the declared input types of the query
> operator, much as btree does.

> Having done that, contrib/intarray could work by adding "loose"
> operators and support procs to the core opfamily for integer[].

Oh, wait a minute: there's a bad restriction there, namely that a
contrib module could only add "loose" operators that had different
declared input types from the ones known to the core opclass.  Otherwise
there'd be a conflict with the contrib module and core needing to insert
similarly-keyed support functions.  This would actually be enough for
contrib/intarray (because the core operator entries are for "anyarray"
not for "integer[]") but it is easy to foresee cases where that wouldn't
be good enough.  Seems like we'd need an additional key column in
pg_amproc to really make this cover all cases.

regards, tom lane



Re: [HACKERS] log_hostname and pg_stat_activity

2011-01-18 Thread Steve Singer


Here is my review for this patch


Submission Review

-Patch applies cleanly

-Patch does not include documentation changes.  At a minimum: update the
table in monitoring.sgml that lists what pg_stat_activity and
pg_stat_replication include, but I propose more below.


-No tests are included, but writing unit tests that depend on producing the
same output involving the client's hostname is not possible.



Usability review


See my comments below in the testing section. The patch does do what it
says, but the log_hostname interaction is a usability issue (it is not
obvious why you get only NULL when log_hostname is turned off).
Documenting it might be fine.  If log_hostname were new in 9.1 I would
encourage renaming it to something that implies it does more than just
control logging output, but I don't see this as a good enough reason to
rename a parameter.


I think being able to see the hostnames connections in pg_stat_activity 
come from is useful and it is a feature we don't already have.




Feature Test


If my connection is authorized through a line in pg_hba that uses 
client_hostname then the column shows what I expect even with 
log_hostname set to off.


However if I connect with a line in pg_hba that matches on an IP network 
then my client_hostname is always null unless log_hostname is set to 
true.  This is consistent with the behavior you describe but I think the 
average user will find it a bit confusing.  Having a column that is 
always null unless a GUC is set is less than ideal but I understand why 
log_hostname isn't on by default.
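
A quick way to reproduce the behaviour described above (client_hostname is
the column added by the patch; the other columns already exist):

    SHOW log_hostname;
    SELECT datname, client_addr, client_hostname FROM pg_stat_activity;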


I would be inclined to add an entry for these views in catalogs.sgml,
where we can then give users a pointer that they need to set
log_hostname to get anything out of this column.


If I connect through unix sockets (say psql -h /tmp --port 5433)
I was also expecting to see "[local]" in client_hostname, but I am just
getting NULL.  Since this lookup is static I don't see why it should be
dependent on log_hostname (even with log_hostname set I'm not seeing
[local]).


I have not tested this with ipv6

Performance Review
--
The lookup is done when the connection is established, not each time the
view is queried.  I don't see any performance issues.



Coding Review
-

As Alvaro pointed out BackendStatusShmemSize should be updated.

To answer his question about why clientaddr works:  clientaddr is a
SockAddr, which is a structure, not a pointer, so the data gets copied by
value to beentry.  That won't work for strings.  I have no issues with
how you're allocating space in beentry and then strlcpy'ing the values.


I see no issues with the implementation

I'm marking this as 'Waiting for Author' pending documentation changes
and maybe a fix for the behaviour with unix sockets.


Steve





Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Simone Aiken

-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com] 
Sent: Tuesday, January 18, 2011 2:53 PM
To: Simone Aiken
Cc: Alvaro Herrera; pgsql-hackers
Subject: Re: [HACKERS] ToDo List Item - System Table Index Clustering


>Sure - my point is just that we usually have as a criteria for any
>performance related patch that it actually does improve performance.


Sorry, I wasn't arguing your point.   Conceding it, actually. =)
I wasn't explaining why I chose it anyway to contest your statements,
but as an invitation for you to point me towards something more useful
that fits what I was looking for in a task.


>
> Uh... I don't know what this means.
>

Pages like this one have column comments for the system tables:

http://www.psql.it/manuale/8.3/catalog-pg-attribute.html

But in my database when I look for comments they aren't there:

qcc=> \d+ pg_attribute
  Table "pg_catalog.pg_attribute"
Column |   Type   | Modifiers | Description
---+--+---+-
 attrelid  | oid  | not null  |
 attname   | name | not null  |
 atttypid  | oid  | not null  |
 attstattarget | integer  | not null  |
 attlen| smallint | not null  |
 attnum| smallint | not null  |
 attndims  | integer  | not null  |
 attcacheoff   | integer  | not null  |
 atttypmod | integer  | not null  |
 attbyval  | boolean  | not null  |
 attstorage| "char"   | not null  |
 attalign  | "char"   | not null  |
 attnotnull| boolean  | not null  |
 atthasdef | boolean  | not null  |
 attisdropped  | boolean  | not null  |
 attislocal| boolean  | not null  |
 attinhcount   | integer  | not null  |


So I have to fire up a web browser and start googling to learn 
about the columns.  Putting them in pg_description would be 
more handy, no?
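
For what it's worth, the suggestion boils down to something like this at the
SQL level (the comment text is only illustrative, and it would have to be
done as a superuser):

    COMMENT ON COLUMN pg_catalog.pg_attribute.attrelid
        IS 'The table this column belongs to';
    -- after which \d+ pg_attribute shows it in the Description column,
    -- and it can be queried directly:
    SELECT col_description('pg_catalog.pg_attribute'::regclass, 1);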


-Simone Aiken



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Extending opfamilies for GIN indexes

2011-01-18 Thread Tom Lane
I just got annoyed by the fact that contrib/intarray has support for
queries on GIN indexes on integer[] columns, but they only work if you
use the intarray-provided opclass, not the core-provided GIN opclass for
integer[] columns.  In general, of course, two different GIN opclasses
aren't compatible, but here there is precious little reason why not:
the contents of the index are the same both ways, ie, all the individual
integer keys in the arrays.  It would be a real usability improvement,
and would eliminate a foot-gun, if contrib/intarray could somehow be an
extension to the core opclass instead of an independent thing.

It seems to me that this should be possible within the opfamily/opclass
data structure.  Right now, there isn't any real application for
opfamilies for GIN (or GiST) indexes, because both of those AMs pay
attention only to the "default" support procs that are bound into the
opclass for an index.  But that could change.

In particular, only two of the five support procs used by GIN are
actually associated with "the index", in the sense of having some impact
on what's stored in the index: the compare() and extractValue() procs.
The other three are more associated with queries, though they do depend
on having knowledge about the behavior of the compare and extractValue
procs.

So here's what I'm thinking: we could redefine a GIN opclass, per se, as
needing only compare() and extractValue() procs to be bound into it.
The other three procs, as well as the query operators, could be "loose"
in the containing opfamily.  The index AM would choose which set of the
other support procedures to use for a specific query by matching their
amproclefttype/amprocrighttype to the declared input types of the query
operator, much as btree does.

Having done that, contrib/intarray could work by adding "loose"
operators and support procs to the core opfamily for integer[].
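
To illustrate the shape of that at the SQL level, roughly (the strategy
number, support number, and function name below are hypothetical
placeholders, not the real intarray objects):

    ALTER OPERATOR FAMILY array_ops USING gin ADD
        OPERATOR 20 @@ (integer[], query_int),
        FUNCTION 4 (integer[], query_int)
            my_queryint_consistent(internal, smallint, query_int, integer, internal, internal);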

It's possible that this scheme would also make it really useful to have
multiple opclasses within one GIN opfamily; though offhand I'm not sure
of an application for that.  (Right now, the only reason to do that is
if you want to give opclasses for different types the same name, as we
do with the core "array_ops".)

Perhaps the same could be done with GiST, although I'm less sure about
the possible usefulness there.

Comments?

BTW, this idea means that amproc entries would no longer be tightly
associated with specific GIN opclasses, so the contentious patch for
getObjectDescription should indeed get applied.

regards, tom lane



Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 3:46 PM, Tom Lane wrote:

> The above is "using" the index, but only as a guide to where the rows
> satisfying the partial-index predicate are --- note the lack of any
> index condition in the indexscan node.  That's because the query_int
> query is not in fact compatible with the core-provided index opclass.
> We get much better results using intarray's gin__int_ops opclass:

Ah-hah! Confirmed, thank you. I now get 23.2 ms and an index condition. Much, 
much better.

Thank you! And I think we can use the improved GIN indexes for this app, once 
9.1 drops.

Best,

David




Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
I wrote:
> No, I see no reason to think that has much to do with it.  I'm wondering
> if your table is itself a bit bloated ...

Actually ... I notice you did not show EXPLAIN ANALYZE output for your
tests.  Now I'm wondering whether you tested the right thing at all.
I got burnt that way too.  Observe:

regression=# create index idx_gin_features on listings using gin(features) 
WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual time=153.633..153.634 rows=1 loops=1)
   ->  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual time=0.048..153.493 rows=772 loops=1)
         Filter: ((deleted_at IS NULL) AND (features @@ '1368799 & 1368800 & 1369043'::query_int) AND (status = 1))
 Total runtime: 153.713 ms
(4 rows)

regression=# set enable_seqscan TO 0;
SET
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13253.42..13253.43 rows=1 width=0) (actual time=331.990..331.990 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=4095.18..13252.40 rows=406 width=0) (actual time=164.785..331.858 rows=772 loops=1)
         Recheck Cond: ((deleted_at IS NULL) AND (status = 1))
         Filter: (features @@ '1368799 & 1368800 & 1369043'::query_int)
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..4095.07 rows=406215 width=0) (actual time=164.045..164.045 rows=406215 loops=1)
 Total runtime: 332.169 ms
(6 rows)

The above is "using" the index, but only as a guide to where the rows
satisfying the partial-index predicate are --- note the lack of any
index condition in the indexscan node.  That's because the query_int
query is not in fact compatible with the core-provided index opclass.
We get much better results using intarray's gin__int_ops opclass:

regression=# drop index idx_gin_features;
DROP INDEX
regression=# create index idx_gin_features on listings using gin(features 
gin__int_ops) WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.896..23.896 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) (actual time=22.912..23.813 rows=772 loops=1)
         Recheck Cond: ((features @@ '1368799 & 1368800 & 1369043'::query_int) AND (deleted_at IS NULL) AND (status = 1))
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 width=0) (actual time=22.811..22.811 rows=772 loops=1)
               Index Cond: (features @@ '1368799 & 1368800 & 1369043'::query_int)
 Total runtime: 24.036 ms
(6 rows)


regards, tom lane



Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 3:08 PM, Tom Lane wrote:

>> Shall I send you data with the other two columns?:
> 
> No, I see no reason to think that has much to do with it.  I'm wondering
> if your table is itself a bit bloated ...

Nope. Just ran the bloat query from check_postgres.pl. Bloat is 0. Not 
surprising since the table was created by inserting from another table (the 
original with the gist indexes) and has not been updated since.

Any other ideas?

David




Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
A.M. wrote:
> 
> On Jan 18, 2011, at 5:41 PM, Bruce Momjian wrote:
> 
> > A.M. wrote:
>  Because the fastest option may not be syncing to disk. For example,
>  the only option that makes sense on OS X is fsync_writethrough- it
>  would be helpful if the tool pointed that out (on OS X only, obviously).
> >>> 
> >>> Yes, that would be a serious problem.  :-(
> >>> 
> >>> I am not sure how we would address this --- your point is a good one.
> >> 
> >> One general idea I had would be to offer some heuristics such as "this
> >> sync rate is comparable to that of one SATA drive" or "comparable to
> >> RAID 10 with X drives" or "this rate is likely too fast to be actually
> >> be syncing". But then you are stuck with making sure that the heuristics
> >> are kept up-to-date, which would be annoying.
> > 
> > That fails for RAID BBUs.
> 
> Well, it's nothing more than a heuristic- it is still nice to know whether or
> not the fancy hardware RAID I just set up is similar to Josh Berkus' RAID
> setup or a single SATA drive (which would hint at a misconfiguration). As you
> said, perhaps a wiki is better for this. But a wiki won't integrate with this
> tool, which is why I would hesitate to point novices to this tool... should
> the tool point to the wiki?
> 
> > 
> >> Otherwise, the only option I see is to detect the kernel and compare
> >> against a list of known problematic methods. Perhaps it would be easier
> >> to compare against a whitelist. Also, the tool would likely need to
> >> parse "mount" output to account for problems with specific filesystems.
> >> 
> >> I am just throwing around some ideas...
> > 
> > That sounds pretty complicated.  One idea would be the creation of a
> > wiki where people could post their results, or ideally a tool that could
> > read the output and load it into a database for analysis with other
> > results.
> 
> The OS X example is pretty cut-and-dry- it would be nice if there were
> some kind of hints in the tool pointing in the right direction, or at
> least a few words of warning: "the fastest option may not be the safest-
> read the docs".

We have a wal reliability section in the docs that attempts to address
this:

http://developer.postgresql.org/pgdocs/postgres/wal-reliability.html

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread A.M.

On Jan 18, 2011, at 5:41 PM, Bruce Momjian wrote:

> A.M. wrote:
 Because the fastest option may not be syncing to disk. For example,
 the only option that makes sense on OS X is fsync_writethrough- it
 would be helpful if the tool pointed that out (on OS X only, obviously).
>>> 
>>> Yes, that would be a serious problem.  :-(
>>> 
>>> I am not sure how we would address this --- your point is a good one.
>> 
>> One general idea I had would be to offer some heuristics such as "this
>> sync rate is comparable to that of one SATA drive" or "comparable to
>> RAID 10 with X drives" or "this rate is likely too fast to be actually
>> be syncing". But then you are stuck with making sure that the heuristics
>> are kept up-to-date, which would be annoying.
> 
> That fails for RAID BBUs.

Well, it's nothing more than a heuristic- it is still nice to know whether or
not the fancy hardware RAID I just set up is similar to Josh Berkus' RAID setup
or a single SATA drive (which would hint at a misconfiguration). As you said,
perhaps a wiki is better for this. But a wiki won't integrate with this tool,
which is why I would hesitate to point novices to this tool... should the tool
point to the wiki?

> 
>> Otherwise, the only option I see is to detect the kernel and compare
>> against a list of known problematic methods. Perhaps it would be easier
>> to compare against a whitelist. Also, the tool would likely need to
>> parse "mount" output to account for problems with specific filesystems.
>> 
>> I am just throwing around some ideas...
> 
> That sounds pretty complicated.  One idea would be the creation of a
> wiki where people could post their results, or ideally a tool that could
> read the output and load it into a database for analysis with other
> results.

The OS X example is pretty cut-and-dry- it would be nice if there were some 
kind of hints in the tool pointing in the right direction, or at least a few 
words of warning: "the fastest option may not be the safest- read the docs".

Cheers,
M


Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian  writes:
> > Tom Lane wrote:
> >> Bruce Momjian  writes:
> >>> I have modified test_fsync to use test labels that match wal_sync_method
> >>> values, and added more tests for open_sync with different sizes. 
> 
> >> Given that it was unclear whether the first such test was of any value,
> >> why are you slowing down the program by adding more?
> 
> > Greg Smith indicated it has value, so I made it more complete.  No?
> 
> My recollection of that discussion is a bit different: there wasn't a
> clear-cut reason to rip it out.  But the more tests you add to
> test_fsync, the less useful it becomes.

Well, this is Greg Smith's text:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg01717.php

> Might be some value for determining things like what the optimal WAL 
> block size to use is.  All these tests are kind of hard to use 
> effectively still, I'm not sure if it's time to start trimming tests 
yet 
> until we've made more progress on interpreting results first.

so I figured the test should be complete;  a partial test is pretty
useless.  What I am thinking is that the program should just run the
first test by default (to choose wal_sync_method), and add a -v option
to run the additional tests.  Yes?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
"David E. Wheeler"  writes:
> On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:
>> I'm noticing also that I get different rowcounts than you do, although
>> possibly that has something to do with the partial-index conditions,
>> which I'm not trying to duplicate here (all rows in my table pass those
>> two tests).

> Shall I send you data with the other two columns?:

No, I see no reason to think that has much to do with it.  I'm wondering
if your table is itself a bit bloated ...

regards, tom lane



Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Tom Lane
Bruce Momjian  writes:
> Tom Lane wrote:
>> Bruce Momjian  writes:
>>> I have modified test_fsync to use test labels that match wal_sync_method
>>> values, and added more tests for open_sync with different sizes. 

>> Given that it was unclear whether the first such test was of any value,
>> why are you slowing down the program by adding more?

> Greg Smith indicated it has value, so I made it more complete.  No?

My recollection of that discussion is a bit different: there wasn't a
clear-cut reason to rip it out.  But the more tests you add to
test_fsync, the less useful it becomes.

regards, tom lane



Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:

> I'm noticing also that I get different rowcounts than you do, although
> possibly that has something to do with the partial-index conditions,
> which I'm not trying to duplicate here (all rows in my table pass those
> two tests).

Shall I send you data with the other two columns?:

>> * Why does it take 3-4x longer to create the GIN than the GiST index
>> on tsvector?
> 
> Perhaps more maintenance_work_mem would help with that; although the
> fine manual says specifically that GIN text search indexes take about
> three times longer to build than equivalent GiST indexes, so maybe that
> behavior is as designed.

Okay then, thanks.

David




Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian  writes:
> > I have modified test_fsync to use test labels that match wal_sync_method
> > values, and added more tests for open_sync with different sizes. 
> 
> Given that it was unclear whether the first such test was of any value,
> why are you slowing down the program by adding more?

Greg Smith indicated it has value, so I made it more complete.  No?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Tom Lane
Bruce Momjian  writes:
> I have modified test_fsync to use test labels that match wal_sync_method
> values, and added more tests for open_sync with different sizes. 

Given that it was unclear whether the first such test was of any value,
why are you slowing down the program by adding more?

regards, tom lane



Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
A.M. wrote:
> >> Because the fastest option may not be syncing to disk. For example,
> >> the only option that makes sense on OS X is fsync_writethrough- it
> >> would be helpful if the tool pointed that out (on OS X only, obviously).
> > 
> > Yes, that would be a serious problem.  :-(
> > 
> > I am not sure how we would address this --- your point is a good one.
> 
> One general idea I had would be to offer some heuristics such as "this
> sync rate is comparable to that of one SATA drive" or "comparable to
> RAID 10 with X drives" or "this rate is likely too fast to be actually
> be syncing". But then you are stuck with making sure that the heuristics
> are kept up-to-date, which would be annoying.

That fails for RAID BBUs.

> Otherwise, the only option I see is to detect the kernel and compare
> against a list of known problematic methods. Perhaps it would be easier
> to compare against a whitelist. Also, the tool would likely need to
> parse "mount" output to account for problems with specific filesystems.
> 
> I am just throwing around some ideas...

That sounds pretty complicated.  One idea would be the creation of a
wiki where people could post their results, or ideally a tool that could
read the output and load it into a database for analysis with other
results.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] ALTER TABLE ... ADD FOREIGN KEY ... NOT ENFORCED

2011-01-18 Thread Simon Riggs
On Tue, 2011-01-18 at 14:26 -0600, Jim Nasby wrote:
> > 
> > 2 sub-command changes:
> > 
> > ALTER TABLE foo ADD FOREIGN KEY fkoo ... NOT VALID;
> > 
> > ALTER TABLE foo VALIDATE CONSTRAINT fkoo;
> 
> Sorry for the late reply; I just saw this...
> 
> Is there any way to be able to get the bad records out of the ALTER ... 
> VALIDATE? I know it's pretty unusual, but for a set of large tables, it could 
> take hours to run a separate query that gives you the results.
> 
> BTW, I agree that this should be a DDL command, it would be very odd if it 
> wasn't. But I also see it being very useful to be able to get the set of bad 
> rows at the same time. Maybe if there was an SRF that did the real work and 
> the ALTER just ignored the resultset?

I agree. Unfortunately that wasn't the consensus when I proposed that
earlier, so it's just a simple validation true/false.

I could add an SRF that runs the validation query but brings back the
violating rows, and if there are zero of them it marks the constraint
valid. We could have both...
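
For the archives, the validation query a user can already run by hand to
find the offending rows looks roughly like this (table and column names are
illustrative):

    SELECT f.*
      FROM foo f
      LEFT JOIN parent p ON p.id = f.parent_id
     WHERE f.parent_id IS NOT NULL
       AND p.id IS NULL;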

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread A.M.

On Jan 18, 2011, at 5:21 PM, Bruce Momjian wrote:

> A.M. wrote:
>> 
>> On Jan 18, 2011, at 5:16 PM, Bruce Momjian wrote:
>> 
>>> A.M. wrote:
 
 On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:
 
> I have modified test_fsync to use test labels that match wal_sync_method
> values, and added more tests for open_sync with different sizes. 
> This should make the program easier for novices to understand.  Here is
> a test run for Ubuntu 11.04:
> 
>   $ ./test_fsync
>   2000 operations per test
>   
>   Compare file sync methods using one 8k write:
>   (in wal_sync_method preference order, except fdatasync
>   is Linux's default)
>   open_datasync (non-direct I/O)*85.127 ops/sec
>   open_datasync (direct I/O) 87.119 ops/sec
>   fdatasync  81.006 ops/sec
>   fsync  82.621 ops/sec
>   fsync_writethroughn/a
>   open_sync (non-direct I/O)*84.412 ops/sec
>   open_sync (direct I/O) 91.006 ops/sec
>   * This non-direct I/O mode is not used by Postgres.
 
 I am curious how this is targeted at novices. A naive user might enable
 the "fastest" option which could be exactly wrong. For this to be useful
 to novices, I suspect the tool will need to generate platform-specific
 suggestions, no?
>>> 
>>> Uh, why isn't the fastest option right for them?  It is hardware/kernel
>>> specific when you run it --- how could it be better?
>> 
>> Because the fastest option may not be syncing to disk. For example,
>> the only option that makes sense on OS X is fsync_writethrough- it
>> would be helpful if the tool pointed that out (on OS X only, obviously).
> 
> Yes, that would be a serious problem.  :-(
> 
> I am not sure how we would address this --- your point is a good one.

One general idea I had would be to offer some heuristics such as "this sync 
rate is comparable to that of one SATA drive" or "comparable to RAID 10 with X 
drives" or "this rate is likely too fast to be actually be syncing". But then 
you are stuck with making sure that the heuristics are kept up-to-date, which 
would be annoying.

Otherwise, the only option I see is to detect the kernel and compare against a 
list of known problematic methods. Perhaps it would be easier to compare 
against a whitelist. Also, the tool would likely need to parse "mount" output 
to account for problems with specific filesystems.

I am just throwing around some ideas...

Cheers,
M


Re: [HACKERS] ALTER TABLE ... ADD FOREIGN KEY ... NOT ENFORCED

2011-01-18 Thread Jim Nasby
On Jan 14, 2011, at 5:15 AM, Simon Riggs wrote:
> On Mon, 2010-12-13 at 17:15 +, Peter Geoghegan wrote:
>> On 13 December 2010 16:08, Robert Haas  wrote:
>>> On Mon, Dec 13, 2010 at 10:49 AM, Simon Riggs  wrote:
 2. pg_validate_foreign_key('constraint name');
 Returns immediately if FK is valid
 Returns SETOF rows that violate the constraint, or if no rows are
 returned it updates constraint to show it is now valid.
 Lock held: AccessShareLock
>>> 
>>> I'm less sure about this part.  I think there should be a DDL
>>> statement to validate the foreign key.  The "return the problem" rows
>>> behavior could be done some other way, or just left to the user to
>>> write their own query.
>> 
>> +1. I think that a DDL statement is more appropriate, because it makes
>> the process sort of symmetrical.
> 
> Patch to implement the proposed feature attached, for CFJan2011.
> 
> 2 sub-command changes:
> 
> ALTER TABLE foo ADD FOREIGN KEY fkoo ... NOT VALID;
> 
> ALTER TABLE foo VALIDATE CONSTRAINT fkoo;

Sorry for the late reply; I just saw this...

Is there any way to be able to get the bad records out of the ALTER ... 
VALIDATE? I know it's pretty unusual, but for a set of large tables, it could 
take hours to run a separate query that gives you the results.

BTW, I agree that this should be a DDL command, it would be very odd if it 
wasn't. But I also see it being very useful to be able to get the set of bad 
rows at the same time. Maybe if there was an SRF that did the real work and the 
ALTER just ignored the resultset?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Pavel Stehule
2011/1/18 Tom Lane :
> Pavel Stehule  writes:
>> 2011/1/18 Tom Lane :
>>> I looked at this patch and found it fairly awkward.  What is the point
>>> of adding an additional flag to every variable, as opposed to just
>>> forcibly detoasting during assignment?
>
>> But detoasting on assignment isn't enough:
>
>> for i in array_lower(a,1) .. array_upper(a,1)
>> loop
>>   if x < a[i] then
>>     x = a[i];
>>   end if;
>> end loop;
>
>> in this cycle the variable a wasn't modified. Any access to this
>> variable means a detoast and decompress.
>
> How so?  In what I'm envisioning, a would have been decompressed when it
> was originally assigned to.
>

Oh, I understand now. I'm afraid it could add overhead, because there can
be code paths where some variables from the parameter list are never used.

There are lots of user procedures where not all parameters are used,
so I think it is better to wait until first usage. Such procedures could
probably be written in SQL or C, but forced detoasting could decrease the
performance of some currently trivial plpgsql functions.  So my strategy is
simple - wait with detoasting until the last moment, but don't repeat the
detoasting. My opinion isn't strong on this topic. One or twenty useless
detoastings aren't really significant in most use cases (the problem is
thousands of detoastings).
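
For illustration, a minimal sketch of this "detoast on first use, remember
the result" idea. The LazyVar struct and function names below are invented
for illustration only; they are not the patch's actual plpgsql datum
structures.

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical variable holder; the real patch adds a similar flag to the
 * plpgsql variable struct instead. */
typedef struct LazyVar
{
    Datum       value;
    bool        isnull;
    bool        detoasted;      /* already detoasted by an earlier access? */
} LazyVar;

/* Return the value, detoasting it at most once, in a long-lived context so
 * the flat copy survives the expression that asked for it. */
static Datum
lazyvar_getvalue(LazyVar *var, MemoryContext target_cxt)
{
    if (!var->isnull && !var->detoasted &&
        VARATT_IS_EXTENDED(DatumGetPointer(var->value)))
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(target_cxt);

        var->value = PointerGetDatum(PG_DETOAST_DATUM(var->value));
        MemoryContextSwitchTo(oldcxt);
        var->detoasted = true;  /* later reads reuse the detoasted copy */
    }
    return var->value;
}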

Regards

Pavel Stehule



>                        regards, tom lane
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/python refactoring

2011-01-18 Thread Peter Eisentraut
On mån, 2011-01-17 at 21:49 +0200, Peter Eisentraut wrote:
> On sön, 2011-01-02 at 12:41 +0100, Jan Urbański wrote:
> > Here they are. There are 16 patches in total. They amount to what's
> > currently in my refactor branch (and almost to what I've sent as the
> > complete refactor patch, while doing the splitting I discovered a few
> > useless hunks that I've discarded).
> 
> Committed 0001, rest later.

Today committed: 3, 5, 10, 11, 12, 13

Discussion:

#2: It looks like this loses some information/formatting in the error
message.  Should we keep the pointing arrow there?

--- a/src/pl/plpython/expected/plpython_error.out
+++ b/src/pl/plpython/expected/plpython_error.out
@@ -10,10 +10,7 @@ CREATE FUNCTION sql_syntax_error() RETURNS text
 SELECT sql_syntax_error();
 WARNING:  PL/Python: plpy.SPIError: unrecognized error in PLy_spi_execute_query
 CONTEXT:  PL/Python function "sql_syntax_error"
-ERROR:  syntax error at or near "syntax"
-LINE 1: syntax error
-^
-QUERY:  syntax error
+ERROR:  PL/Python: plpy.SPIError: syntax error at or near "syntax"
 CONTEXT:  PL/Python function "sql_syntax_error"
 /* check the handling of uncaught python exceptions
  */

#7: This is unnecessary because the remaining fields are automatically
initialized with NULL.  The Python documentation uses initializations
like that.  The way I understand it, newer Python versions might add
more fields at the end, and they will rely on the fact that they get
automatically initialized even if the source code was for an older
version.  So I would rather not touch this, as it doesn't buy anything.
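
(A generic C illustration of that rule: when an initializer supplies fewer
values than the struct has members, the trailing members are implicitly
zero/NULL-initialized. The struct below is invented for illustration, not
CPython's actual type structures.)

typedef struct
{
    const char *name;
    int         flags;
    void      (*handler)(void);
    void       *reserved;       /* imagine a field added by a newer version */
} ExampleHooks;

/* Only the first two members are spelled out; handler and reserved end up
 * NULL automatically, exactly as if they had been written explicitly. */
static ExampleHooks hooks = {
    "example",      /* name */
    0               /* flags */
};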

#16: This is probably pointless because pgindent will reformat this.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
A.M. wrote:
> 
> On Jan 18, 2011, at 5:16 PM, Bruce Momjian wrote:
> 
> > A.M. wrote:
> >> 
> >> On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:
> >> 
> >>> I have modified test_fsync to use test labels that match wal_sync_method
> >>> values, and added more tests for open_sync with different sizes. 
> >>> This should make the program easier for novices to understand.  Here is
> >>> a test run for Ubuntu 11.04:
> >>> 
> >>>   $ ./test_fsync
> >>>   2000 operations per test
> >>>   
> >>>   Compare file sync methods using one 8k write:
> >>>   (in wal_sync_method preference order, except fdatasync
> >>>   is Linux's default)
> >>>   open_datasync (non-direct I/O)*    85.127 ops/sec
> >>>   open_datasync (direct I/O)         87.119 ops/sec
> >>>   fdatasync                          81.006 ops/sec
> >>>   fsync                              82.621 ops/sec
> >>>   fsync_writethrough                    n/a
> >>>   open_sync (non-direct I/O)*        84.412 ops/sec
> >>>   open_sync (direct I/O)             91.006 ops/sec
> >>>   * This non-direct I/O mode is not used by Postgres.
> >> 
> >> I am curious how this is targeted at novices. A naive user might enable
> >> the "fastest" option which could be exactly wrong. For this to be useful
> >> to novices, I suspect the tool will need to generate platform-specific
> >> suggestions, no?
> > 
> > Uh, why isn't the fastest option right for them?  It is hardware/kernel
> > specific when you run it --- how could it be better?
> 
> Because the fastest option may not be syncing to disk. For example,
> the only option that makes sense on OS X is fsync_writethrough- it
> would be helpful if the tool pointed that out (on OS X only, obviously).

Yes, that would be a serious problem.  :-(

I am not sure how we would address this --- your point is a good one.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread A.M.

On Jan 18, 2011, at 5:16 PM, Bruce Momjian wrote:

> A.M. wrote:
>> 
>> On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:
>> 
>>> I have modified test_fsync to use test labels that match wal_sync_method
>>> values, and added more tests for open_sync with different sizes. 
>>> This should make the program easier for novices to understand.  Here is
>>> a test run for Ubuntu 11.04:
>>> 
>>> $ ./test_fsync
>>> 2000 operations per test
>>> 
>>> Compare file sync methods using one 8k write:
>>> (in wal_sync_method preference order, except fdatasync
>>> is Linux's default)
>>> open_datasync (non-direct I/O)*    85.127 ops/sec
>>> open_datasync (direct I/O)         87.119 ops/sec
>>> fdatasync                          81.006 ops/sec
>>> fsync                              82.621 ops/sec
>>> fsync_writethrough                    n/a
>>> open_sync (non-direct I/O)*        84.412 ops/sec
>>> open_sync (direct I/O)             91.006 ops/sec
>>> * This non-direct I/O mode is not used by Postgres.
>> 
>> I am curious how this is targeted at novices. A naive user might enable
>> the "fastest" option which could be exactly wrong. For this to be useful
>> to novices, I suspect the tool will need to generate platform-specific
>> suggestions, no?
> 
> Uh, why isn't the fastest option right for them?  It is hardware/kernel
> specific when you run it --- how could it be better?

Because the fastest option may not be syncing to disk. For example, the only 
option that makes sense on OS X is fsync_writethrough- it would be helpful if 
the tool pointed that out (on OS X only, obviously).

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tomas Vondra
Dne 18.1.2011 18:12, Robert Haas napsal(a):
> On Tue, Jan 18, 2011 at 4:53 AM,   wrote:
>> So the most important question is how to intercept the new/updated rows,
>> and where to store them. I think each backend should maintain it's own
>> private list of new records and forward them only in case of commit. Does
>> that sound reasonable?
> 
> At the risk of sounding demoralizing, nothing about this proposal
> sounds very promising to me, and that sounds like a particularly bad
> idea.  What happens if the transaction updates a billion records?  Or
> even a million records?  Are you going to store all of those in
> backend-local memory until commit time?  Or spool them in a
> potentially-gigantic disk file somewhere?  That memory allocation - or
> file - could grow to be larger than the size of the entire database in
> the worst case.  And COMMIT could take an awfully long time if it has
> to spool megabytes or gigabytes of data off to some other process.
> And what happens if there's a crash after the COMMIT but before all
> that data is sent?  The estimates become permanently wrong?

Yes, I was aware of this problem (the amount of memory consumed with lots of
updates), and I kind of hoped someone would come up with a reasonable
solution.

I was thinking about it, and I believe at least one of the algorithms
(the "adaptive sampling") might handle "merging", i.e. the backends would
not need to store the list of values, just keep a private copy of the
estimator and update it. And at the end (after commit), the estimator
would be merged back into the actual one.

So the algorithm would be something like this:

1. create copy for all distinct estimators influenced by the INSERT
2. update the local copy
3. commit and merge the local copies back into the original estimator

The amount of copied data is very small, depending on the number of
distinct items it can handle and the precision (see adaptive.c for
details). Some examples:

2^48 items + 1% error => 86 kB
2^48 items + 5% error => 3.4 kB
2^64 items + 1% error => 115 kB
2^64 items + 5% error => 4.6 kB

so it's much better than storing all the data.
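
To make the three steps above concrete, here is a rough C sketch. The
AdaptiveEstimator type and the estimator_* functions are hypothetical names
used only for illustration; they are not the actual adaptive.c API.

#include "postgres.h"

typedef struct AdaptiveEstimator AdaptiveEstimator;     /* opaque, hypothetical */

extern AdaptiveEstimator *estimator_copy(const AdaptiveEstimator *shared);
extern void estimator_add(AdaptiveEstimator *est, Datum value);
extern void estimator_merge(AdaptiveEstimator *shared,
                            const AdaptiveEstimator *local);

static AdaptiveEstimator *local_est = NULL;     /* per-backend private copy */

/* steps 1 and 2: on each INSERT/UPDATE, update only the private copy */
static void
estimator_note_value(AdaptiveEstimator *shared, Datum value)
{
    if (local_est == NULL)
        local_est = estimator_copy(shared);     /* 1. create the copy */
    estimator_add(local_est, value);            /* 2. update the copy */
}

/* step 3: at commit, fold the private copy back into the shared estimator */
static void
estimator_at_commit(AdaptiveEstimator *shared)
{
    if (local_est != NULL)
    {
        estimator_merge(shared, local_est);
        local_est = NULL;       /* on abort we would simply discard it */
    }
}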

But I really need to check this is possible (right now I'm quite busy
with organizing a local conference, so maybe next week).

Regarding the crash scenario - if the commit fails, just throw away the
local estimator copy, it's not needed. I'm not sure how to take care of
the case when commit succeeds and the write of the merged estimator
fails, but I think it might be possible to write the estimator to xlog
or something like that. So it would be replayed during recovery etc. Or
is it a stupid idea?

> And are we doing all of this just to get a more accurate estimate of
> ndistinct?  For the amount of effort that it will probably take to get
> this working at all, you could probably implement index-only scans and
> have enough bandwidth left over to tackle global temporary tables.
> And unless I'm missing something, the chances that the performance
> consequences will be tolerable are pretty much zero.  And it would
> only benefit the tiny fraction of users for whom bad n_distinct
> estimates cause bad plans, and then the even tinier percentage of
> those who can't conveniently fix it by using the manual override that
> we already have in place - which presumably means people who have
> gigantic tables that are regularly rewritten with massive changes in
> the data distribution that affect plan choice.  Is that more than the
> empty set?

First of all, I've never proposed to use this as a default behaviour.
Actually I've strongly argued against that idea on several occasions. So
the user would have to enable it explicitly for some columns; I'm really
not such an idiot as to force everyone to pay for this.

So the goal is to help the "small fraction" of users and keep the
current solution for the rest of them.

And yes, the main reason why I am working on this are the multi-column
stats (in case of discrete columns). You can fix the ndistinct estimate
by manually overriding the estimate, but for the multi-column stats you
need an estimate of ndistinct for the combination of columns, and that's
a bit difficult to do manually.

The other reason is I've studied statistics (and I enjoy it, which seems
weird to most people). And it's a good way to learn the internals.

> Maybe the idea here is that this wouldn't fix just ndistinct estimates
> but would also help with multi-column statistics.  Even if that's the
> case, I think it's almost certainly a dead end from a performance
> standpoint.  Some kind of manual ANALYZE process that can be invoked
> when needed to scan the entire table would be painful but maybe
> acceptable for certain people with a problem they can't fix any other
> way, but doing this automatically for everyone seems like a really bad
> idea.

Yes, as I've mentioned above, the multi-column stats are the reason why
I started working on this. And yes, it really should work like this:

1. user realizes there's something

Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
A.M. wrote:
> 
> On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:
> 
> > I have modified test_fsync to use test labels that match wal_sync_method
> > values, and added more tests for open_sync with different sizes. 
> > This should make the program easier for novices to understand.  Here is
> > a test run for Ubuntu 11.04:
> > 
> > $ ./test_fsync
> > 2000 operations per test
> > 
> > Compare file sync methods using one 8k write:
> > (in wal_sync_method preference order, except fdatasync
> > is Linux's default)
> > open_datasync (non-direct I/O)*    85.127 ops/sec
> > open_datasync (direct I/O)         87.119 ops/sec
> > fdatasync                          81.006 ops/sec
> > fsync                              82.621 ops/sec
> > fsync_writethrough                    n/a
> > open_sync (non-direct I/O)*        84.412 ops/sec
> > open_sync (direct I/O)             91.006 ops/sec
> > * This non-direct I/O mode is not used by Postgres.
> 
> I am curious how this is targeted at novices. A naive user might enable
> the "fastest" option which could be exactly wrong. For this to be useful
> to novices, I suspect the tool will need to generate platform-specific
> suggestions, no?

Uh, why isn't the fastest option right for them?  It is hardware/kernel
specific when you run it --- how could it be better?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
"David E. Wheeler"  writes:
> On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:
>> At the moment my opinion is that gist__int_ops is too broken to be
>> usable, and it's also too uncommented to be fixable by anyone other
>> than the original author.

> That seems to jibe with your comments from last year:
>   http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Well, based on what I saw just now, I believe there are one or more
actual bugs in there, not just that it's straining the design capacity
of the opclass.  I think the "lossy" aspect isn't working at all.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:

> At the moment my opinion is that gist__int_ops is too broken to be
> usable, and it's also too uncommented to be fixable by anyone other
> than the original author.

That seems to jibe with your comments from last year:

  http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Best,

David

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Tom Lane
Pavel Stehule  writes:
> 2011/1/18 Tom Lane :
>> I looked at this patch and found it fairly awkward.  What is the point
>> of adding an additional flag to every variable, as opposed to just
>> forcibly detoasting during assignment?

> But detoasting on assignment isn't enough:

> for i in array_lower(a,1) .. array_upper(a,1)
> loop
>   if x < a[i] then
> x = a[i];
>   end if;
> end loop;

> in this cycle the variable a wasn't modified. Any access to this
> variable means a detoast and decompress.

How so?  In what I'm envisioning, a would have been decompressed when it
was originally assigned to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
"David E. Wheeler"  writes:
> These numbers are a bit crazy-making, but the upshot is that Gist is
> slow out of the gate, but with data cached, it's pretty speedy. With
> indexscan and bitmapscan disabled, these queries all took 300-400
> ms. So GIN was never better performing than a table scan.

I could not replicate that here at all --- GIN indexscans were
consistently better than seqscans for me, eg

regression=# set enable_bitmapscan TO 1;
SET
Time: 0.673 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
  QUERY PLAN
  
--
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.964..23.964 
rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) 
(actual time=23.014..23.876 rows=772 loops=1)
 Recheck Cond: ((features @@ '1368799 & 1368800 & 1369043'::query_int) 
AND (deleted_at IS NULL) AND (status = 1))
 ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 
width=0) (actual time=22.913..22.913 rows=772 loops=1)
   Index Cond: (features @@ '1368799 & 1368800 & 
1369043'::query_int)
 Total runtime: 24.040 ms
(6 rows)

Time: 24.968 ms
regression=# set enable_bitmapscan TO 0;
SET
Time: 0.458 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
 QUERY PLAN 
  

 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual 
time=145.121..145.121 rows=1 loops=1)
   ->  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual 
time=0.025..144.982 rows=772 loops=1)
 Filter: ((deleted_at IS NULL) AND (features @@ '1368799 & 1368800 & 
1369043'::query_int) AND (status = 1))
 Total runtime: 145.177 ms
(4 rows)

Time: 146.228 ms

I'm noticing also that I get different rowcounts than you do, although
possibly that has something to do with the partial-index conditions,
which I'm not trying to duplicate here (all rows in my table pass those
two tests).

> * Why does it take 3-4x longer to create the GIN than the GiST index
> on tsvector?

Perhaps more maintenance_work_mem would help with that; although the
fine manual says specifically that GIN text search indexes take about
three times longer to build than equivalent GiST indexes, so maybe that
behavior is as designed.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread A.M.

On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:

> I have modified test_fsync to use test labels that match wal_sync_method
> values, and added more tests for open_sync with different sizes. 
> This should make the program easier for novices to understand.  Here is
> a test run for Ubuntu 11.04:
> 
>   $ ./test_fsync
>   2000 operations per test
>   
>   Compare file sync methods using one 8k write:
>   (in wal_sync_method preference order, except fdatasync
>   is Linux's default)
>   open_datasync (non-direct I/O)*    85.127 ops/sec
>   open_datasync (direct I/O)         87.119 ops/sec
>   fdatasync                          81.006 ops/sec
>   fsync                              82.621 ops/sec
>   fsync_writethrough                    n/a
>   open_sync (non-direct I/O)*        84.412 ops/sec
>   open_sync (direct I/O)             91.006 ops/sec
>   * This non-direct I/O mode is not used by Postgres.

I am curious how this is targeted at novices. A naive user might enable the 
"fastest" option which could be exactly wrong. For this to be useful to 
novices, I suspect the tool will need to generate platform-specific 
suggestions, no?

Cheers,
M


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:16 PM, Simone Aiken  wrote:
> When I'm learning a new system I like to first learn how to use it,
> second learn its data model, third start seriously looking at the code.
> So that Todo is ideal for my learning method.

Sure - my point is just that we usually have as a criteria for any
performance related patch that it actually does improve performance.
So, we'd need a test case.

> If there is something else that would also involve studying all the system
> tables it would also be great.  For example, I noticed we have column
> level comments on the web but not in the database itself.  This seems
> silly.  Why not have the comments in the database and have the web
> query the tables of template databases for the given versions?

Uh... I don't know what this means.

> I'm open to other suggestions as well.

Here are a few TODO items that look relatively easy to me (they may
not actually be easy when you dig in, of course):

Clear table counters on TRUNCATE
Allow the clearing of cluster-level statistics
Allow ALTER TABLE ... ALTER CONSTRAINT ... RENAME
Allow ALTER TABLE to change constraint deferrability and actions

Unfortunately we don't have a lot of easy TODOs.  People keep doing
the ones we think up...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Mark Kirkwood

On 19/01/11 05:51, Robert Haas wrote:


I'm not sure why they'd care, but it certainly doesn't seem worth
spending the amount of time arguing about it that we are.  David and
Mark are, of course, free to spend their time petitioning Red Hat for
relicensing if they are so inclined, but they aren't entitled to tell
you how to spend yours.  And even if they were, I would hope that
they'd want you to spend it committing patches rather than arguing
with your employer about relicensing of a utility that's freely
available anyway and of use to 0.1% of our user base.



Funny how people can read an email and derive a completely different 
meaning from what you intended - just to be clear: it certainly wasn't 
my intention to tell anyone how to spend their time.


Anyway, good to have pg_filedump on pgfoundry, thanks Tom and RH.

regards

Mark




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Pavel Stehule
2011/1/18 Tom Lane :
> Noah Misch  writes:
>> On Sun, Jan 16, 2011 at 06:49:27PM +0100, Pavel Stehule wrote:
>>> I am sending a updated version with little bit more comments. But I am
>>> sure, so somebody with good English have to edit my comments.
>>> Minimally I hope, so your questions will be answered.
>
>> Thanks.  I edited the comments and white space somewhat, as attached.  I'll 
>> go
>> ahead and mark it Ready for Committer.
>
> I looked at this patch and found it fairly awkward.  What is the point
> of adding an additional flag to every variable, as opposed to just
> forcibly detoasting during assignment?  If it's to avoid detoasting when
> the variable is read in a way that doesn't require detoasting, it fails
> rather completely IMO, since exec_eval_datum certainly doesn't know
> that.

I am not sure about the overhead of unnecessary detoasting. This is the safe
variant. I don't want to decrease performance. Not sure if it's
necessary.

But detoasting on assignment isn't enough:

some most simple example - searching a maximum

for i in array_lower(a,1) .. array_upper(a,1)
loop
  if x < a[i] then
x = a[i];
  end if;
end loop;

In this loop the variable a isn't modified, yet any access to this
variable means a detoast and decompress. So it is necessary to
modify the process: detoast not on assignment, but on usage.

Regards

Pavel Stehule

>
> The added memory context variable seems redundant as well.

I didn't find a pointer to the top execution context available from the
execution state. I am sure we have to switch to this context
explicitly.

Regards

Pavel Stehule

>
>                        regards, tom lane
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
"David E. Wheeler"  writes:
> One of the reasons our client wants GIN for the integer[] column so bad is 
> because recreating the GiST integer[] index is quite painful. Before I duped 
> the table, I was just dropping and recreating the index on the original 
> table. It was great to create the GIN index; for 400K rows, it took 1300 ms. 
> After my initial round of testing, I dropped it and created the GiST index. 
> That ran for…well, *hours*. Four at least, maybe six or seven (I forgot 
> \timing and was letting it run on screen while I did some iOS hacking). I 
> think something might be really wrong with GiST index creation for integer 
> arrays, because the difference was just appalling. 

I poked into this a bit, although it's really off-topic for what I was
doing in GIN, and I agree that there is something fundamentally rotten
in the gist__int_ops logic.  oprofile shows that the code is spending
all its time in g_int_compress and g_int_union during index build:

samples  %        app name          symbol name
52747    40.2486  _int.so           g_int_compress
27092    20.6726  libc-2.12.2.so    _wordcopy_fwd_aligned
18938    14.4506  _int.so           compASC
18256    13.9302  postgres          pg_qsort
5741      4.3807  no-vmlinux        /no-vmlinux
3297      2.5158  postgres          swapfunc
864       0.6593  _int.so           g_int_decompress
826       0.6303  _int.so           _int_unique
700       0.5341  _int.so           inner_int_union
617       0.4708  postgres          med3
588       0.4487  libc-2.12.2.so    memset
371       0.2831  _int.so           g_int_same
143       0.1091  libc-2.12.2.so    memcpy
123       0.0939  postgres          ArrayGetNItems
67        0.0511  _int.so           inner_int_inter
48        0.0366  postgres          AllocSetAlloc
47        0.0359  libc-2.12.2.so    memmove
40        0.0305  postgres          MemoryContextAllocZero

The time in g_int_compress is all on this loop:

while (len > MAXNUMRANGE * 2)
{
min = 0x7fff;
for (i = 2; i < len; i += 2)
if (min > (dr[i] - dr[i - 1]))
{
min = (dr[i] - dr[i - 1]);
cand = i;
}
memmove((void *) &dr[cand - 1], (void *) &dr[cand + 1], (len - cand 
- 1) * sizeof(int32));
len -= 2;
}

which is not too surprising since that's obviously O(N^2).  The other
high-percentage functions are qsort and subsidiaries, and those calls
are coming from g_int_union.  That part could probably be sped up, since
most of the calls are unioning two inputs, which could be done a lot
cheaper than this (the inputs are always presorted no?).  But there is
something really fundamentally wrong in the logic, I think, because
while poking at it with gdb I saw it union-ing fifty-thousand-element
arrays.  Surely it should get lossy before it gets to that.  I'm also
wondering how it tells the difference between lossy and non-lossy keys
--- although it's hard to tell what such spectacularly uncommented code
is supposed to be doing, it looks like the lossy case is supposed to be
pairs of values representing ranges, in which case g_int_compress is
surely doing the Wrong Thing with already-lossy inputs, and union'ing
lossy inputs is an entirely insane operation as well.
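
For what it's worth, the cheaper two-input union would be an ordinary linear
merge of two presorted, de-duplicated arrays rather than concatenate-and-qsort.
A generic sketch, for illustration only (this is not the _int.so code):

#include "postgres.h"

/* Union of two sorted, duplicate-free int32 arrays a[na] and b[nb] into dst,
 * which must have room for na + nb elements; returns the result length. */
static int
sorted_int32_union(const int32 *a, int na, const int32 *b, int nb, int32 *dst)
{
    int     i = 0,
            j = 0,
            n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            dst[n++] = a[i++];
        else if (a[i] > b[j])
            dst[n++] = b[j++];
        else
        {
            dst[n++] = a[i++];  /* equal: emit once */
            j++;
        }
    }
    while (i < na)
        dst[n++] = a[i++];
    while (j < nb)
        dst[n++] = b[j++];

    return n;                   /* O(na + nb); no sort, no extra unique pass */
}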

At the moment my opinion is that gist__int_ops is too broken to be
usable, and it's also too uncommented to be fixable by anyone other
than the original author.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Bruce Momjian
Robert Haas wrote:
> On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
>  wrote:
> > Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
> >>
> >> Hello Postgres Hackers,
> >>
> >> In reference to this todo item about clustering system table indexes,
> >> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
> >> I have been studying the system tables to see which would benefit from
> >> clustering.  I have some index suggestions and a question if you have a
> >> moment.
> >
> > Wow, this is really old stuff.  I don't know if this is really of any
> > benefit, given that these catalogs are loaded into syscaches anyway.
> > Furthermore, if you cluster at initdb time, they will soon lose the
> > ordering, given that updates move tuples around and inserts put them
> > anywhere.  So you'd need the catalogs to be re-clustered once in a
> > while, and I don't see how you'd do that (except by asking the user to
> > do it, which doesn't sound so great).
> 
> The idea of the TODO seems to have been to set the default clustering
> to something reasonable.  That doesn't necessarily seem like a bad
> idea even if we can't automatically maintain the cluster order, but
> there's some question in my mind whether we'd get any measurable
> benefit from the clustering.  Even on a database with a gigantic
> number of tables, it seems likely that the relevant system catalogs
> will stay fully cached and, as you point out, the system caches will
> further blunt the impact of any work in this area.  I think the first
> thing to do would be to try to come up with a reproducible test case
> where clustering the tables improves performance.  If we can't, that
> might mean it's time to remove this TODO.

I think CLUSTER is a win when you are looking up multiple rows in the
same table, either using a non-unique index or a range search.  What
places do such lookups?  Having them all in adjacent pages would be a
win --- single-row lookups are usually not.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
I have modified test_fsync to use test labels that match wal_sync_method
values, and added more tests for open_sync with different sizes. 
This should make the program easier for novices to understand.  Here is
a test run for Ubuntu 11.04:

$ ./test_fsync
2000 operations per test

Compare file sync methods using one 8k write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync (non-direct I/O)*    85.127 ops/sec
open_datasync (direct I/O)         87.119 ops/sec
fdatasync                          81.006 ops/sec
fsync                              82.621 ops/sec
fsync_writethrough                    n/a
open_sync (non-direct I/O)*        84.412 ops/sec
open_sync (direct I/O)             91.006 ops/sec
* This non-direct I/O mode is not used by Postgres.

Compare file sync methods using two 8k writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync (non-direct I/O)*    42.721 ops/sec
open_datasync (direct I/O)         45.296 ops/sec
fdatasync                          76.665 ops/sec
fsync                              78.361 ops/sec
fsync_writethrough                    n/a
open_sync (non-direct I/O)*        42.311 ops/sec
open_sync (direct I/O)             45.247 ops/sec
* This non-direct I/O mode is not used by Postgres.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16k
in different write open_sync sizes.)
 1 16k open_sync write     86.740 ops/sec
 2  8k open_sync writes    44.709 ops/sec
 4  4k open_sync writes    22.096 ops/sec
 8  2k open_sync writes    10.856 ops/sec
16  1k open_sync writes     5.434 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close        86.802 ops/sec
write, close, fsync        85.766 ops/sec

Non-sync'ed 8k writes:
write  83.068 ops/sec



Applied patch attached.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index b1cec74..9c829ba 100644
*** /tmp/pgdiff.3331/yCQf2a_test_fsync.c	Tue Jan 18 15:06:39 2011
--- src/tools/fsync/test_fsync.c	Tue Jan 18 14:43:58 2011
*** void		test_open(void);
*** 47,52 
--- 47,53 
  void		test_non_sync(void);
  void		test_sync(int writes_per_op);
  void		test_open_syncs(void);
+ void		test_open_sync(const char *msg, int writes_size);
  void		test_file_descriptor_sync(void);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
  void		die(char *str);
*** main(int argc, char *argv[])
*** 61,68 
  
  	test_open();
  	
- 	test_non_sync();
- 	
  	/* Test using 1 8k write */
  	test_sync(1);
  
--- 62,67 
*** main(int argc, char *argv[])
*** 73,78 
--- 72,79 
  
  	test_file_descriptor_sync();
  	
+ 	test_non_sync();
+ 	
  	unlink(filename);
  
  	return 0;
*** handle_args(int argc, char *argv[])
*** 105,111 
  	}
  
  	while ((option = getopt_long(argc, argv, "f:o:",
!  long_options, &optindex)) != -1)
  	{
  		switch (option)
  		{
--- 106,112 
  	}
  
  	while ((option = getopt_long(argc, argv, "f:o:",
! 			long_options, &optindex)) != -1)
  	{
  		switch (option)
  		{
*** handle_args(int argc, char *argv[])
*** 126,132 
  		}
  	}
  
! 	printf("%d operations per test\n\n", ops_per_test);
  }
  
  void
--- 127,133 
  		}
  	}
  
! 	printf("%d operations per test\n", ops_per_test);
  }
  
  void
*** test_open(void)
*** 162,201 
  }
  
  void
- test_non_sync(void)
- {
- 	int			tmpfile, ops;
- 
- 	/*
- 	 * Test a simple write without fsync
- 	 */
- 	printf("Simple non-sync'ed write:\n");
- 	printf(LABEL_FORMAT, "8k write");
- 	fflush(stdout);
- 
- 	gettimeofday(&start_t, NULL);
- 	for (ops = 0; ops < ops_per_test; ops++)
- 	{
- 		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
- 			die("Cannot open output file.");
- 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
- 			die("write failed");
- 		close(tmpfile);
- 	}
- 	gettimeofday(&stop_t, NULL);
- 	print_elapse(start_t, stop_t);
- }
- 
- void
  test_sync(int writes_per_op)
  {
  	int			tmpfile, ops, writes;

Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Kevin Grittner
"Kevin Grittner"  wrote:
 
> If my back-of-the-envelope math is right, a carefully constructed
> pessimal load could need up to (max_connections / 2)^2 -- so 100
> connections could conceivably require 2500 structures, although
> such a scenario would be hard to create.  Current "picked from
> thin air" numbers would have space for 500.
 
Er, actually, we would have space for 5000, because it's five times
the number of SERIALIZABLEXACT structures which is ten times
max_connections. I guess that would explain why I've never seen a
report of a problem.
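
(Spelling out that arithmetic, purely as an illustration:)

#include <stdio.h>

int
main(void)
{
    int     max_connections = 100;
    long    worst_case = (long) (max_connections / 2) * (max_connections / 2);
    long    allocated = 5L * (10 * max_connections);   /* 5 x SERIALIZABLEXACTs */

    /* 48 bytes per structure at 64 bit, per the sizing given in this thread */
    printf("worst case: %ld structures (%ld bytes)\n", worst_case, worst_case * 48);
    printf("allocated:  %ld structures (%ld bytes)\n", allocated, allocated * 48);
    return 0;
}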
 
Still, someone who creates a very large number of connections and
pounds them heavily with SERIALIZABLE transactions might conceivably
run into it.  Since that's something the docs explicitly warn you
*not* to do with serializable transactions, I'm not sure we need to
do more than make sure the error message and hint are good. 
Thoughts?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:18 PM, Kevin Grittner
 wrote:
> That's all of them.

Our existing code has plenty of TODOs in it already, so I see no
problem with continuing to comment places where future enhancements
are possible, as long as they don't reflect deficiencies that are
crippling in the present.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:00 PM, Heikki Linnakangas
 wrote:
> On 18.01.2011 21:16, Robert Haas wrote:
>>
>> On Tue, Jan 18, 2011 at 1:32 PM, Tom Lane  wrote:
>>>
>>> While I was trying to performance-test the texteq patch, it occurred to
>>> me that this proposed hint-bit change has got a serious drawback.  To
>>> wit, that it will totally destroy reproducibility of any performance
>>> test that involves table scans.  Right now, you know that you can take
>>> hint bits out of the equation by doing a vacuum analyze and checkpoint;
>>> after that, all hint bits in the table are known to be set and written
>>> to disk.  Then you can get on with comparing the effects of some patch
>>> or other.  With the proposed patch, it will never be clear whether
>>> all the hint bits are set, because the patch specifically removes the
>>> deterministic ways to get a hint bit written out.  So you'll never be
>>> very sure whether a performance difference you think you see is real,
>>> or whether one case or the other got affected by extra clog lookups.
>>> It's hard enough already to be sure about performance changes on the
>>> order of 1%, but this will make it impossible.
>>
>> True.  You could perhaps fix that by adding a GUC, but that feels
>> awfully like making it the user's problem to fix our broken
>> implementation.  Maybe we could live with it if the GUC were only
>> something developers ever needed to use, but I expect different people
>> would have different ideas about the correct setting in production.
>
> VACUUM (SET HINT BITS) 

Something along those lines could work too, but I don't see much
problem with making VACUUM doing it unconditionally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Kevin Grittner
Heikki Linnakangas  wrote:
 
> There's a few remaining TODO comments in the code, which obviously
> need to be resolved one way or another
 
Not all of these are "must haves" for 9.1.  Here's how they break
down:
 
The two in predicate_internals.h mark places which would need to be
touched if we further modify the btree AM to do index-tuple level
predicate locking.  I know that Dan was eager to tackle that because
his group at MIT has predicate locking working at that granularity
for their transaction-aware caching which works with PostgreSQL.  At
this point, though, I'm very skeptical about starting that effort
for 9.1; it seems like something for 9.2.
 
There's one TODO related to whether we can drop the volatile
modifier from MySerializableXact.  I've started reviewing code
around this, but am not yet done.  It wouldn't be terrible to leave
it there and worry about this for 9.2, but it will be even better if
we can clear it up one way or the other now.
 
Three are marking the spots we want to touch if a patch becomes
available which lets us cancel the transaction on one connection
from another.  I'd love to take care of these, but we can't without
a separate patch that I know *I'm* not in a position to write for
9.1.  We may need to leave these for 9.2, as well.
 
One is about the desirability of "taking some action" (probably
reporting a warning) in the SLRU summarization code if we've burned
through over a billion transaction IDs while one read-write
transaction has remained active.  That puts us close to where we'd
need to start canceling attempts to start a new transaction due to
resource issues.  Seems like we should issue that warning for 9.1. 
Not big or dangerous.  I'll do it.
 
One is related to the heuristics for promoting the granularity of
predicate locks.  We picked numbers out of the air which seemed
reasonable at the time.  I'm sure we can make this more
sophisticated, but what we have seems to be working OK.  I'm tempted
to suggest that we leave this alone until 9.2 unless someone has a
suggestion for a particular improvement.
 
One is related to the number of structures to allocate for detailed
conflict checking.  We've never seen a problem with this limit, at
least that has reached me.  On the other hand, it's certainly at
least theoretically possible to hit the limit and cause confusing
errors.  Perhaps we can put this one on a GUC and make sure the
error message is clear with a hint to think about adjusting the GUC?
The GUC would be named something like
max_conflicts_per_serializable_transaction (or something to that
general effect).  We should probably do that or bump up the limit.
If my back-of-the-envelope math is right, a carefully constructed
pessimal load could need up to (max_connections / 2)^2 -- so 100
connections could conceivably require 2500 structures, although such
a scenario would be hard to create.  Current "picked from thin air"
numbers would have space for 500.  The structure is six times the
pointer size, so 24 bytes each at 32 bit and 48 bytes each at 64
bit.
 
Three TODOs have popped up since Heikki's post because I pushed
Dan's 2PC WIP to the repo -- it's at a point where behavior should
be right for cases where the server keeps running.  He's working on
the persistence aspect now, and there are TODOs in the new stubs
he's filling out.  Definitely 9.1.
 
Then there's one still lurking outside the new predicate* files, in
heapam.c.  It's about 475 lines into the heap_update function (25
lines from the bottom).  In reviewing where we needed to acquire
predicate locks, this struck me as a place where we might need to
duplicate the predicate lock from one tuple to another, but I wasn't
entirely sure.  I tried for a few hours one day to construct a
scenario which would show a serialization anomaly if I didn't do
this, and failed create one.  If someone who understands both the
patch and heapam.c wants to look at this and offer an opinion, I'd
be most grateful.  I'll take another look and see if I can get my
head around it better, but failing that, I'm disinclined to either
add locking I'm not sure we need or to remove the comment that says
we *might* need it there.
 
That's all of them.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Heikki Linnakangas

On 18.01.2011 21:16, Robert Haas wrote:

On Tue, Jan 18, 2011 at 1:32 PM, Tom Lane  wrote:

While I was trying to performance-test the texteq patch, it occurred to
me that this proposed hint-bit change has got a serious drawback.  To
wit, that it will totally destroy reproducibility of any performance
test that involves table scans.  Right now, you know that you can take
hint bits out of the equation by doing a vacuum analyze and checkpoint;
after that, all hint bits in the table are known to be set and written
to disk.  Then you can get on with comparing the effects of some patch
or other.  With the proposed patch, it will never be clear whether
all the hint bits are set, because the patch specifically removes the
deterministic ways to get a hint bit written out.  So you'll never be
very sure whether a performance difference you think you see is real,
or whether one case or the other got affected by extra clog lookups.
It's hard enough already to be sure about performance changes on the
order of 1%, but this will make it impossible.


True.  You could perhaps fix that by adding a GUC, but that feels
awfully like making it the user's problem to fix our broken
implementation.  Maybe we could live with it if the GUC were only
something developers ever needed to use, but I expect different people
would have different ideas about the correct setting in production.


VACUUM (SET HINT BITS) 

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jan 16, 2011 at 06:49:27PM +0100, Pavel Stehule wrote:
>> I am sending a updated version with little bit more comments. But I am
>> sure, so somebody with good English have to edit my comments.
>> Minimally I hope, so your questions will be answered.

> Thanks.  I edited the comments and white space somewhat, as attached.  I'll go
> ahead and mark it Ready for Committer.

I looked at this patch and found it fairly awkward.  What is the point
of adding an additional flag to every variable, as opposed to just
forcibly detoasting during assignment?  If it's to avoid detoasting when
the variable is read in a way that doesn't require detoasting, it fails
rather completely IMO, since exec_eval_datum certainly doesn't know
that.

The added memory context variable seems redundant as well.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Josh Berkus
All,

Just speaking as someone who does a lot of log-crunching for performance
review, I don't find having a separate log_connections option for
replication terribly useful.  It's easy enough for me to just log all
connections and filter for the type of connections I want.

Even in cases where there's a ton of connection activity (e.g. PHP
webapp without a connection pool) logging all connections still doesn't
generate that many MB of output, in my experience.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 1:32 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I think you may be confused about what the patch does - currently,
>> pages with hint bit changes are considered dirty, period.
>> Therefore, they are written whenever any other dirty page would be
>> written: by the background writer cleaning scan, at checkpoints,
>> and when a backend must write a dirty buffer before reallocating it
>> to hold a different page. The patch keeps the first of these and
>> changes the second two
>
> While I was trying to performance-test the texteq patch, it occurred to
> me that this proposed hint-bit change has got a serious drawback.  To
> wit, that it will totally destroy reproducibility of any performance
> test that involves table scans.  Right now, you know that you can take
> hint bits out of the equation by doing a vacuum analyze and checkpoint;
> after that, all hint bits in the table are known to be set and written
> to disk.  Then you can get on with comparing the effects of some patch
> or other.  With the proposed patch, it will never be clear whether
> all the hint bits are set, because the patch specifically removes the
> deterministic ways to get a hint bit written out.  So you'll never be
> very sure whether a performance difference you think you see is real,
> or whether one case or the other got affected by extra clog lookups.
> It's hard enough already to be sure about performance changes on the
> order of 1%, but this will make it impossible.

True.  You could perhaps fix that by adding a GUC, but that feels
awfully like making it the user's problem to fix our broken
implementation.  Maybe we could live with it if the GUC were only
something developers ever needed to use, but I expect different people
would have different ideas about the correct setting in production.

If I'm not failing to understand the situation, the problem with the
first sequential scan after a bulk load is that we're cycling through
a ring of buffers that all have hint-bit changes and therefore all
have to be written.  The first pass through the ring is OK, but after
that every new buffer we bring in requires evicting a buffer that we
first have to write.  Of course, with the patch, this bottleneck is
removed by skipping all those writes, but that now causes a second
problem: the pages only get written if the background writer happens
to notice them before the backend gets all the way around the ring,
and that's pretty hit-or-miss, so we basically dribble hint bits out
to disk here and there but the steady state never really converges to
"all hint bits on disk".

Maybe we could work around this by making the algorithm a little more
sophisticated.  Instead of the rather unilateral policy "backends
don't write pages that are only dirty due to hint bit changes!" we
could have some more nuanced rules.  For example, we might decree that
a backend will maintain a counter of the number of non-dirty pages
it's allocated.  Once it's allocated 20 pages that are either clean or
dirty-only-for-hint-bits, it writes that (or the next)
dirty-only-for-hint-bits it encounters.  That way, the effort of hint
bit setting would be spread out over the first 20 table scans, and
after that you converge to steady state.  We could also possibly
special-case vacuum to always write dirty-only-for-hint bits pages, on
the theory that the work is going to have to be done at some point,
and we're better off doing it during a maintenance task than
elsewhere.
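
A hedged sketch of that counting rule, with invented names (this is not
actual bufmgr code, just an illustration of the policy described above):

#include <stdbool.h>

#define HINT_WRITE_EVERY    20      /* write roughly 1 in 20 */

typedef enum
{
    ALLOC_CLEAN,                /* page was not dirty at all */
    ALLOC_HINT_ONLY,            /* dirty only because of hint-bit changes */
    ALLOC_DIRTY                 /* really dirty */
} AllocPageState;

static int  cheap_alloc_count = 0;  /* per-backend counter */

/* Decide, at buffer-eviction time, whether to write the evicted page. */
static bool
write_on_eviction(AllocPageState state)
{
    if (state == ALLOC_DIRTY)
        return true;            /* truly dirty pages are written, as today */

    cheap_alloc_count++;        /* count clean and hint-only allocations */

    if (state == ALLOC_HINT_ONLY && cheap_alloc_count >= HINT_WRITE_EVERY)
    {
        cheap_alloc_count = 0;
        return true;            /* pay for this one, so hint-bit writes are
                                 * spread over the first ~20 table scans */
    }
    return false;               /* otherwise skip the write, as the patch does */
}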

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast

2011-01-18 Thread Tom Lane
Noah Misch  writes:
> texteq, textne, byteaeq and byteane detoast their arguments, then check for
> equality of length.  Unequal lengths imply the answer trivially; given equal
> lengths, the functions proceed to compare the actual bytes.  We can skip
> detoasting entirely when the lengths are unequal.  The attached patch 
> implements
> this.
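
For illustration, a sketch of that idea (compare the logical lengths before
detoasting anything); this is a hedged reconstruction, not necessarily the
code as committed:

#include "postgres.h"
#include "fmgr.h"
#include "access/tuptoaster.h"

Datum
texteq_sketch(PG_FUNCTION_ARGS)
{
    Datum       arg1 = PG_GETARG_DATUM(0);
    Datum       arg2 = PG_GETARG_DATUM(1);
    bool        result;

    /* the raw size can be read from a toast pointer without detoasting */
    Size        len1 = toast_raw_datum_size(arg1);
    Size        len2 = toast_raw_datum_size(arg2);

    if (len1 != len2)
        result = false;         /* unequal lengths: no detoast needed */
    else
    {
        text       *t1 = DatumGetTextPP(arg1);  /* detoast only when needed */
        text       *t2 = DatumGetTextPP(arg2);

        result = (memcmp(VARDATA_ANY(t1), VARDATA_ANY(t2),
                         len1 - VARHDRSZ) == 0);

        PG_FREE_IF_COPY(t1, 0);
        PG_FREE_IF_COPY(t2, 1);
    }

    PG_RETURN_BOOL(result);
}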

Applied with stylistic changes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Josh Berkus

> To be frank, I really don't care about fixing this behavior on ext3,
> especially in the context of that sort of hack.  That filesystem is not
> the future, it's not possible to ever really make it work right, and
> every minute spent on pandering to its limitations would be better spent
> elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide
> some proper context for the worst-case behavior people can see right
> now, and to make sure refactoring here doesn't make things worse on it. 
> My target is same or slightly better on ext3, much better on XFS and ext4.

Please don't forget that we need to avoid performance regressions on
NTFS and ZFS as well.  They don't need to improve, but we can't let them
regress.  I think we can ignore BSD/UFS and Solaris/UFS, as well as
HFS+, though.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 17:33, Robert Haas  wrote:
> On Tue, Jan 18, 2011 at 2:15 AM, Heikki Linnakangas
>  wrote:
>> I also find it weird that incoming replication connections are logged by
>> default. In the standby, we also log "streaming replication successfully
>> connected to primary", which serves much of the same debugging purpose.
>
> Oh, good point.  All right, I withdraw my objection.  Let's just make
> it all controlled by log_connections and go home.

Done.

Oh, and the proper answer is yellow. Everybody knows that.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PgEast: 2011, 2nd call for papers

2011-01-18 Thread Joshua D. Drake
Hey folks,

PgEast is being held in NYC this year from 03/22 to 03/25. Get your papers
in, the deadline is soon!

http://www.postgresqlconference.org/

Joshua D. Drake

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync open_sync test

2011-01-18 Thread Bruce Momjian
Greg Smith wrote:
> Bruce Momjian wrote:
> > Is there a value to this test_fsync test?
> >
> > Compare open_sync with different sizes:
> > (This is designed to compare the cost of one large
> > sync'ed write and two smaller sync'ed writes.)
> > open_sync 16k write   242.563 ops/sec
> > 2 open_sync 8k writes 752.752 ops/sec
> >
> > It compares the cost of doing larger vs. two smaller open_sync writes.
> >   
> 
> Might be some value for determining things like what the optimal WAL 
> block size to use is.  All these tests are kind of hard to use 
> effectively still, I'm not sure if it's time to start trimming tests yet 
> until we've made more progress on interpreting results first.

FYI, I beefed up the testing for this in test_fsync:

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16k
in different write open_sync sizes.)
 1 16k open_sync write    723.824 ops/sec
 2  8k open_sync writes   347.979 ops/sec
 4  4k open_sync writes   215.114 ops/sec
 8  2k open_sync writes   174.613 ops/sec
16  1k open_sync writes    87.496 ops/sec

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Performance bug in DO blocks

2011-01-18 Thread Tom Lane
I just noticed that if you execute the same DO command over and over
within a session, it gets slower and slower.  And if you keep it up
you'll notice the backend's RAM consumption bloating too.  The cause
appears to be that we leak the cached plans created for any SQL
statements or expressions within the DO command --- the next iteration
won't reuse those, but rather create its own set.  Probably ought to
look into releasing those when the DO block is over.
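
A hedged sketch of that cleanup, with invented bookkeeping (plpgsql's real
plan management is more involved, but the idea is to free whatever SPI plans
the DO block's statements saved once the block finishes):

#include "postgres.h"
#include "executor/spi.h"
#include "nodes/pg_list.h"

/* Hypothetical: a list of SPIPlanPtr collected while executing the DO block. */
static void
release_do_block_plans(List *saved_plans)
{
    ListCell   *lc;

    foreach(lc, saved_plans)
    {
        SPIPlanPtr  plan = (SPIPlanPtr) lfirst(lc);

        SPI_freeplan(plan);     /* return the cached plan's memory */
    }
    list_free(saved_plans);
}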

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] pgbench to the MAXINT

2011-01-18 Thread Greg Smith

Euler Taveira de Oliveira wrote:
(i) If we want to support a scale factor greater than 21474 we have 
to convert some columns to bigint; it will change the test. From the 
portability point of view it is a pity, but as we have never supported it I'm 
not too worried about it. Why? Because it will use bigint columns only 
if the scale factor is greater than 21474. Is it a problem? I don't 
think so because generally people compare tests with the same scale 
factor.


(ii) From the performance perspective, we need to test that the 
modifications don't impact performance. I didn't create another code 
path for the 64-bit modifications (it is too ugly) and I'm afraid some 
modifications could affect the 32-bit performance. I'm in no position to test 
it, though, because I don't have a big machine ATM. Greg, could you lead 
these tests?


(iii) I decided to copy scanint8() (called strtoint64 there) from the 
backend (Robert's suggestion [1]) because Tom pointed out that strtoll() 
has portability issues. I replaced atoi() with strtoint64() but didn't 
do any performance tests.


(i):  Completely agreed.

(ii):  There is no such thing as a "big machine" that is 32 bits now; 
anything that's 32 is a tiny system here in 2011.  What I can do is 
check for degradation on the only 32-bit system I have left here, my 
laptop.  I'll pick a sensitive test case and take a look.


(iii) This is an important thing to test, particularly given it has the 
potential to impact 64-bit results too.


Thanks for picking this up again and finishing the thing off.  I'll add 
this into my queue of performance tests to run and we can see if this is 
worth applying.  Probably take a little longer than the usual CF review 
time.  But as this doesn't interfere with other code people are working 
on and is sort of a bug fix, I don't think it will be a problem if it 
takes a little longer to get this done.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Simon Riggs
On Tue, 2011-01-18 at 18:03 +0100, Magnus Hagander wrote:

> So it'd be pg_receive_wal and pg_receive_base_backup then?

OK for me.

Maybe even pg_receive_wal_stream

Don't see any reason why command names can't be long. We have many
function names already that long.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] psql: Add \dL to show languages

2011-01-18 Thread Andreas Karlsson
Hi Josh,

Nope, I do not have any better ideas than "DO Blocks?".

Everything looks good with the exception of one bug now.

\dL foo
* QUERY **
SELECT l.lanname AS "Name",
   pg_catalog.pg_get_userbyid(l.lanowner) as "Owner",
   l.lanpltrusted AS "Trusted"
FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'

ORDER BY 1;
**

ERROR:  syntax error at or near "l"
LINE 4: FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'


I believe the fix is to move \n from before the WHERE clause to after
the FROM, and from before ORDER BY to after WHERE.

Fix this bug and I believe this patch is ready for a committer.

PS. You added some trailing whitespace after printACLColumn.  A
recommendation if you want to avoid it is to either have a git commit
hook which checks for that and/or have colouring of git diffs so you can
see it marked in red. I use both. :)

Andreas





Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Tom Lane
Robert Haas  writes:
>>> I think you may be confused about what the patch does - currently,
>>> pages with hint bit changes are considered dirty, period.
>>> Therefore, they are written whenever any other dirty page would be
>>> written: by the background writer cleaning scan, at checkpoints,
>>> and when a backend must write a dirty buffer before reallocating it
>>> to hold a different page. The patch keeps the first of these and
>>> changes the second two

While I was trying to performance-test the texteq patch, it occurred to
me that this proposed hint-bit change has got a serious drawback.  To
wit, that it will totally destroy reproducibility of any performance
test that involves table scans.  Right now, you know that you can take
hint bits out of the equation by doing a vacuum analyze and checkpoint;
after that, all hint bits in the table are known to be set and written
to disk.  Then you can get on with comparing the effects of some patch
or other.  With the proposed patch, it will never be clear whether
all the hint bits are set, because the patch specifically removes the
deterministic ways to get a hint bit written out.  So you'll never be
very sure whether a performance difference you think you see is real,
or whether one case or the other got affected by extra clog lookups.
It's hard enough already to be sure about performance changes on the
order of 1%, but this will make it impossible.
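
(The normalization step referred to here is easy to script against today's
behavior; a minimal libpq sketch follows.  The table name is only an
example, CHECKPOINT requires appropriate privileges, and this of course
only helps for as long as hint-bit-dirty pages are written the way they are
now.)

/* Sketch: settle hint bits for a table before a scan benchmark. */
#include <stdio.h>
#include <libpq-fe.h>

static int
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);
    int         ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    if (!ok)
        fprintf(stderr, "\"%s\" failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* default connection settings */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    /* pgbench_accounts is just an example target table. */
    if (!run(conn, "VACUUM ANALYZE pgbench_accounts") ||
        !run(conn, "CHECKPOINT"))
    {
        PQfinish(conn);
        return 1;
    }
    PQfinish(conn);
    return 0;
}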

regards, tom lane



Re: [HACKERS] Re: new patch of MERGE (merge_204) & a question about duplicated ctid

2011-01-18 Thread Greg Smith

David Fetter wrote:

I think I haven't communicated clearly what I'm suggesting, which is
that we ship with both an UPSERT and a MERGE, the former being ugly,
crude and simple, and the latter festooned with dire warnings about
isolation levels and locking.
  


I don't know that I completely agree with the idea that the UPSERT 
version should always be crude and the MERGE one necessarily heavy from 
a performance perspective.  However, I do feel you raise a legitimate 
point that once the hard stuff is done, and the locking issues around 
SQL proper MERGE sorted, having an UPSERT/REPLACE synonym that maps to 
it under the hood may be a useful way to provide a better user 
experience.  The way I see this, the right goal is to first provide the 
full spec behavior with good performance, and get all that plumbing 
right.  There's nothing that says we can't provide an easier syntax than 
the admittedly complicated MERGE one to the users though.  (I am tempted 
to make a joke about how you could probably


So, as for this patch...there's about half a dozen significant open 
issues with the current code, along with a smattering of smaller ones.  
The problems remaining are deep enough that I think it would be 
challenging to work through them for this CF under the best conditions.  
And given that we haven't heard from Boxuan since early December, we're 
definitely understaffed to tackle major revisions.


My hope was that we'd get an updated patch from him before the CF 
deadline.  Even without it, maybe get some more internal help here.  
Given my focus on checkpoint issues, Simon on Sync Rep, and Dimitri 
still on his Extensions push, that second part isn't going to happen.


I am marking MERGE officially "Returned With Feedback" for 9.1.  Lots of 
progress made, much better community understanding of the unresolved 
issues now than when we started, but not in shape to go into this 
release.  Since we still have some significant interest here in getting 
this finished, I'll see if I can put together a better game-plan for 
how to get this actually done for 9.2 in time to discuss at release 
planning time.  The main thing that's become much more obvious to me 
just recently is how the remaining issues left here relate to the true 
serialization work, so worrying about that first is probably the right 
order anyway.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us





Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Alvaro Herrera
Excerpts from Magnus Hagander's message of mar ene 18 11:53:55 -0300 2011:
> On Tue, Jan 18, 2011 at 15:49, Alvaro Herrera
>  wrote:
> > Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:
> >
> >> Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
> >> time for another patch, PFA :-)
> >
> > Thanks ...  Message nitpick:
> > +   if (compresslevel > 0)
> > +   {
> > +       fprintf(stderr,
> > +               _("%s: this build does not support compression\n"),
> > +               progname);
> > +       exit(1);
> > +   }
> >
> > pg_dump uses the following wording:
> >
> > "WARNING: archive is compressed, but this installation does not support "
> > "compression -- no data will be available\n"
> >
> > So perhaps yours should s/build/installation/
> 
> That shows up when the *archive* is compressed, though? There are a
> number of other cases that use build in the backend, such as:
> src/backend/utils/misc/guc.c:   errmsg("assertion checking is not supported by this build")));
> src/backend/utils/misc/guc.c:   errmsg("Bonjour is not supported by this build")));
> src/backend/utils/misc/guc.c:   errmsg("SSL is not supported by this build")));

Hmm.  I think I'd s/build/installation/ on all those messages for
consistency.

> > Also, in messages of this kind,
> > +               if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
> > +               {
> > +                   fprintf(stderr, _("%s: could not set compression level %i\n"),
> > +                           progname, compresslevel);
> >
> > Shouldn't you also be emitting the gzerror()? ... oh I see you're
> > already doing it for most gz calls.
> 
> It's not clear from the zlib documentation I have that gzerror() works
> after a gzsetparams(). Do you have any docs that say differently by
> any chance?

Ah, no.  I was reading zlib.h, which is ambiguous as you say:

ZEXTERN int ZEXPORT gzsetparams OF((gzFile file, int level, int strategy));
/*
 Dynamically update the compression level or strategy. See the description
   of deflateInit2 for the meaning of these parameters.
 gzsetparams returns Z_OK if success, or Z_STREAM_ERROR if the file was not
   opened for writing.
*/

ZEXTERN const char * ZEXPORT gzerror OF((gzFile file, int *errnum));
/*
 Returns the error message for the last error which occurred on the
   given compressed file. errnum is set to zlib error number. If an
   error occurred in the file system and not in the compression library,
   errnum is set to Z_ERRNO and the application may consult errno
   to get the exact error code.


... but a quick look at the code says that it sets gz_stream->z_err
which is what gzerror returns:

int ZEXPORT gzsetparams (file, level, strategy)
gzFile file;
int level;
int strategy;
{
gz_stream *s = (gz_stream*)file;

if (s == NULL || s->mode != 'w') return Z_STREAM_ERROR;

/* Make room to allow flushing */
if (s->stream.avail_out == 0) {

s->stream.next_out = s->outbuf;
if (fwrite(s->outbuf, 1, Z_BUFSIZE, s->file) != Z_BUFSIZE) {
s->z_err = Z_ERRNO;
}
s->stream.avail_out = Z_BUFSIZE;
}

return deflateParams (&(s->stream), level, strategy);
}
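
(So, judging from the code above, gzerror() can apparently be consulted
after a failed gzsetparams() as well.  For illustration only, and not taken
from the pg_basebackup patch (the output file name is made up), a minimal
sketch of pairing the two calls:)

/* Sketch: report a gzsetparams() failure via gzerror(). */
#include <stdio.h>
#include <zlib.h>

int
main(void)
{
    gzFile      ztarfile = gzopen("base.tar.gz", "wb");
    int         compresslevel = 9;
    int         errnum;

    if (ztarfile == NULL)
    {
        fprintf(stderr, "could not open output file\n");
        return 1;
    }
    if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
    {
        fprintf(stderr, "could not set compression level %d: %s\n",
                compresslevel, gzerror(ztarfile, &errnum));
        gzclose(ztarfile);
        return 1;
    }
    if (gzwrite(ztarfile, "hello\n", 6) != 6)
    {
        fprintf(stderr, "could not write: %s\n", gzerror(ztarfile, &errnum));
        gzclose(ztarfile);
        return 1;
    }
    gzclose(ztarfile);
    return 0;
}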

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Simone Aiken


> On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera wrote:
>> Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
>>>
>>> Hello Postgres Hackers,
>>>
>>> In reference to this todo item about clustering system table indexes,
>>> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
>>
>> Wow, this is really old stuff.  I don't know if this is really of any
>
> If we can't, that might mean it's time to remove this TODO.

When I'm learning a new system I like to first learn how to use it,
second learn its data model, third start seriously looking at the code.
So that Todo is ideal for my learning method.  

If there is something else that would also involve studying all the system
tables it would also be great.  For example, I noticed we have column 
level comments on the web but not in the database itself.  This seems
silly.  Why not have the comments in the database and have the web
query the tables of template databases for the given versions?

That way \d+ pg_tablename would provide instant gratification for users.
And we all like our gratification to be instant.  They could be worked into
the .h files as inserts to pg_description, though they wouldn't provide an
excuse to learn bison.

I'm open to other suggestions as well.

-Simone Aiken





Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Simone Aiken

On Jan 18, 2011, at 6:35 AM, Alvaro Herrera wrote:
>> 
> 
> Wow, this is really old stuff.  I don't know if this is really of any
> benefit, given that these catalogs are loaded into syscaches anyway.


The benefit is educational primarily.  I was looking for a todo list item
that would expose me to the system tables.  Learning the data model
of a new system is always step 1 for me.  So that one was perfect as
it would have me study and consider each one to determine if there
was any benefit from clustering on its initial load into cache.  


> Furthermore, if you cluster at initdb time, they will soon lose the
> ordering, given that updates move tuples around and inserts put them
> anywhere.  So you'd need the catalogs to be re-clustered once in a
> while, and I don't see how you'd do that (except by asking the user to
> do it, which doesn't sound so great).


I did discover that last night.  I'm used to databases that keep up their
clustering.  One that falls apart over time is distinctly strange.  And the
way you guys do your re-clustering logic is overkill if just a few rows
are out of place.  On the upside, a call to mass re-clustering goes
and updates all the clustered indexes in the system and that includes
these tables.  Will have to study auto-vacuum as well to consider that.


>  (unless you just want to play with
> Bison for educational purposes, of course, which is always a good thing
> to do).

Pretty much, yeah.  


- Simone Aiken







Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:03 PM, Magnus Hagander  wrote:
> So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
> others? (it's easy to rename so far, so I'll keep plugging away under
> the name pg_basebackup based on Fujii-sans comments until such a time
> as we have a reasonable consensus :-)

I like pg_receive_wal.  pg_receive_base_backup I would be inclined to
shorten to pg_basebackup or pg_streambackup, but I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby  wrote:
>> How's that different from what vacuum does on a TOAST table now?
>
> TOAST vacuum is currently an entirely separate vacuum. It might run at the 
> same time as the main table vacuum, but it still has all the work that would 
> be associated with vacuuming a table with the definition of a toast table. In 
> fact, at one point vacuuming toast took two passes: the first deleted the 
> toast rows that were no longer needed, then you had to go back and vacuum 
> those deleted rows.

I don't know whether it still works that way or not, but if it does,
fixing it could perhaps be done with far less drastic surgery than
what you're proposing here.

>>> Possibly, but you'd be paying tuple overhead twice, which is what I was 
>>> looking to avoid with forks.
>>
>> What exactly do you mean by "tuple overhead"?
>
> typedef struct HeapTupleHeaderData. With only two tables it might not be that 
> bad, depending on the fields. Beyond two tables it's almost certainly a loser.

A struct definition by itself doesn't cause overhead.  Are you talking
about storage on disk?  CPU usage for MVCC visibility testing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure  wrote:
> a few weeks back I hacked an experimental patch that removed the hint
> bit action completely.  the results were very premature and/or
> incorrect, but my initial findings suggested that hint bits might not
> be worth the cost from performance standpoint.  i'd like to see some
> more investigation in this direction before going with a complex
> application mechanism (although that would be beneficial vs the status
> quo).

I think it's not very responsible to allege that hint bits aren't
providing a benefit without providing the patch that you used and the
tests that you ran.  This is a topic that needs careful analysis, and
I think that saying "hint bits don't provide a benefit... maybe..."
doesn't do anything but confuse the issue.  How about doing some tests
with the patch from my OP and posting the results?  If removing hint
bits entirely doesn't degrade performance, then surely the
less-drastic approach I've taken here ought to be OK too.  But in my
testing, it didn't look too good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:47 AM, Jim Nasby  wrote:
> On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
>> Robert Haas  wrote:
>>
>>> a quick-and-dirty attempt to limit the amount of I/O caused by hint
>>> bits. I'm still very interested in knowing what people think about
>>> that.
>>
>> I found the elimination of the response-time spike promising.  I
>> don't think I've seen enough data yet to feel comfortable endorsing
>> it, though.  I guess the question in my head is: how much of the
>> lingering performance hit was due to having to go to clog and how
>> much was due to competition with the deferred writes?  If much of it
>> is due to repeated recalculation of visibility based on clog info, I
>> think there would need to be some way to limit how many times that
>> happened before the hint bits were saved.
>
> What if we sped up the case where hint bits aren't set? Has anyone collected 
> data on the actual pain points of checking visibility when hint bits aren't 
> set?

I think that's worth looking into, but I don't have any present plan
to actually do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote:
> Robert Haas  writes:
>> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby  wrote:
>>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby  wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
>> 
>> That seems like an interesting idea, but I actually don't see why it
>> would be any more efficient, and it seems like you'd end up
>> reinventing things like vacuum and free space map management.
>>> 
>>> The FSM would take some effort, but I don't think vacuum would be that hard 
>>> to deal with; you'd just have to free up the space in any referenced toast 
>>> forks at the same time that you vacuumed the heap.
> 
>> How's that different from what vacuum does on a TOAST table now?
> 
> Even more to the point: Jim hasn't provided one single reason to suppose
> that this would be better-performing than the existing approach.  It
> looks to me like a large amount of work, and loss of on-disk
> compatibility, for nothing at all except the sake of change.

Yes, I was only pointing out that there were possible uses for allowing a 
variable number of forks per relation if Tomas felt the need to go that 
direction. Changing toast would certainly require some very convincing 
arguments as to the benefits.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote:
> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby  wrote:
>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby  wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
>>> 
>>> That seems like an interesting idea, but I actually don't see why it
>>> would be any more efficient, and it seems like you'd end up
>>> reinventing things like vacuum and free space map management.
>> 
>> The FSM would take some effort, but I don't think vacuum would be that hard 
>> to deal with; you'd just have to free up the space in any referenced toast 
>> forks at the same time that you vacuumed the heap.
> 
> How's that different from what vacuum does on a TOAST table now?

TOAST vacuum is currently an entirely separate vacuum. It might run at the same 
time as the main table vacuum, but it still has all the work that would be 
associated with vacuuming a table with the definition of a toast table. In 
fact, at one point vacuuming toast took two passes: the first deleted the toast 
rows that were no longer needed, then you had to go back and vacuum those 
deleted rows.

 - Dynamic forks would make it possible to do a column-store database, or 
 at least something approximating one.
>>> 
>>> I've been wondering whether we could do something like this by
>>> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
>>> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
>>> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
>>> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
>> 
>> Possibly, but you'd be paying tuple overhead twice, which is what I was 
>> looking to avoid with forks.
> 
> What exactly do you mean by "tuple overhead"?

typedef struct HeapTupleHeaderData. With only two tables it might not be that 
bad, depending on the fields. Beyond two tables it's almost certainly a loser.
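
(The overhead in question is the fixed header every heap tuple carries
before any user data, plus the line pointer in the page; splitting one
logical row across two tables pays that, and the duplicated pk column,
twice.  To see the numbers on a particular build, a tiny program like the
sketch below can be compiled against the server headers, assuming they are
installed, with something along the lines of
cc -I$(pg_config --includedir-server) tuplesize.c.)

/* Sketch: print the fixed per-tuple costs (file name is hypothetical). */
#include "postgres.h"

#include "access/htup.h"
#include "storage/itemid.h"

#include <stdio.h>

int
main(void)
{
    /* The header is MAXALIGN-padded on disk, so the real cost is a bit more. */
    printf("sizeof(HeapTupleHeaderData) = %d\n",
           (int) sizeof(HeapTupleHeaderData));
    printf("sizeof(ItemIdData)          = %d\n",
           (int) sizeof(ItemIdData));
    return 0;
}
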
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby  wrote:
>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby  wrote:
>>> - Forks are very possibly a more efficient way to deal with TOAST than 
>>> having separate tables. There's a fair amount of overhead we pay for the 
>>> current setup.
> 
> That seems like an interesting idea, but I actually don't see why it
> would be any more efficient, and it seems like you'd end up
> reinventing things like vacuum and free space map management.
>> 
>> The FSM would take some effort, but I don't think vacuum would be that hard 
>> to deal with; you'd just have to free up the space in any referenced toast 
>> forks at the same time that you vacuumed the heap.

> How's that different from what vacuum does on a TOAST table now?

Even more to the point: Jim hasn't provided one single reason to suppose
that this would be better-performing than the existing approach.  It
looks to me like a large amount of work, and loss of on-disk
compatibility, for nothing at all except the sake of change.

regards, tom lane



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby  wrote:
> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby  wrote:
>>> - Forks are very possibly a more efficient way to deal with TOAST than 
>>> having separate tables. There's a fair amount of overhead we pay for the 
>>> current setup.
>>
>> That seems like an interesting idea, but I actually don't see why it
>> would be any more efficient, and it seems like you'd end up
>> reinventing things like vacuum and free space map management.
>
> The FSM would take some effort, but I don't think vacuum would be that hard 
> to deal with; you'd just have to free up the space in any referenced toast 
> forks at the same time that you vacuumed the heap.

How's that different from what vacuum does on a TOAST table now?

>>> - Dynamic forks would make it possible to do a column-store database, or at 
>>> least something approximating one.
>>
>> I've been wondering whether we could do something like this by
>> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
>> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
>> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
>> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
>
> Possibly, but you'd be paying tuple overhead twice, which is what I was 
> looking to avoid with forks.

What exactly do you mean by "tuple overhead"?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby  wrote:
>> - Forks are very possibly a more efficient way to deal with TOAST than 
>> having separate tables. There's a fair amount of overhead we pay for the 
>> current setup.
> 
> That seems like an interesting idea, but I actually don't see why it
> would be any more efficient, and it seems like you'd end up
> reinventing things like vacuum and free space map management.

The FSM would take some effort, but I don't think vacuum would be that hard to 
deal with; you'd just have to free up the space in any referenced toast forks 
at the same time that you vacuumed the heap.

>> - Dynamic forks would make it possible to do a column-store database, or at 
>> least something approximating one.
> 
> I've been wondering whether we could do something like this by
> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

Possibly, but you'd be paying tuple overhead twice, which is what I was looking 
to avoid with forks.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 8:24 AM, Merlin Moncure wrote:
> a few weeks back I hacked an experimental patch that removed the hint
> bit action completely.  the results were very premature and/or
> incorrect, but my initial findings suggested that hint bits might not
> be worth the cost from performance standpoint.  i'd like to see some
> more investigation in this direction before going with a complex
> application mechanism (although that would be beneficial vs the status
> quo).

If you're not finding much benefit to hint bits, that's *very* interesting. 
Everything I outlined certainly looks like a pretty damn expensive code path; 
it's really surprising that hint bits don't help.

I think it would be very valuable to profile the cost of the different code 
paths involved in the HeapTupleSatisfies* functions, even if the workload is 
just pgBench.

> an ideal testing environment to compare would be a mature database
> (full clog) with some verifiable performance tests and a mixed
> olap/oltp workload.

We're working on setting such a framework up. Unfortunately it will only be 8.3 
to start, but we hope to be on 9.0 soon.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Cédric Villemain
2011/1/18 Robert Haas :
> On Tue, Jan 18, 2011 at 10:53 AM, Tom Lane  wrote:
>> David Fetter  writes:
>>> I'm guessing there's a Policy® at Red Hat that software made on its
>>> dime be GPL (v2, I'd guess), and that getting an exception would
>>> involve convening its board or similarly drastic action.
>>
>> It's company policy, and while it *might* be possible to get an
>> exception, the effort involved would far exceed the benefit we'd get out
>> of it.  Moreover, despite Mark's creative argument, I really doubt that
>> Red Hat would perceive any benefit to themselves in making an exception.
>
> I'm not sure why they'd care, but it certainly doesn't seem worth
> spending the amount of time arguing about it that we are.  David and
> Mark are, of course, free to spend their time petitioning Red Hat for
> relicensing if they are so inclined, but they aren't entitled to tell
> you how to spend yours.  And even if they were, I would hope that
> they'd want you to spend it committing patches rather than arguing
> with your employer about relicensing of a utility that's freely
> available anyway and of use to 0.1% of our user base.
>

Still good; thanks, Tom and RH, for having pushed it nearer to the other PostgreSQL tools.



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 4:53 AM,   wrote:
> So the most important question is how to intercept the new/updated rows,
> and where to store them. I think each backend should maintain it's own
> private list of new records and forward them only in case of commit. Does
> that sound reasonable?

At the risk of sounding demoralizing, nothing about this proposal
sounds very promising to me, and that sounds like a particularly bad
idea.  What happens if the transaction updates a billion records?  Or
even a million records?  Are you going to store all of those in
backend-local memory until commit time?  Or spool them in a
potentially-gigantic disk file somewhere?  That memory allocation - or
file - could grow to be larger than the size of the entire database in
the worst case.  And COMMIT could take an awfully long time if it has
to spool megabytes or gigabytes of data off to some other process.
And what happens if there's a crash after the COMMIT but before all
that data is sent?  The estimates become permanently wrong?

And are we doing all of this just to get a more accurate estimate of
ndistinct?  For the amount of effort that it will probably take to get
this working at all, you could probably implement index-only scans and
have enough bandwidth left over to tackle global temporary tables.
And unless I'm missing something, the chances that the performance
consequences will be tolerable are pretty much zero.  And it would
only benefit the tiny fraction of users for whom bad n_distinct
estimates cause bad plans, and then the even tinier percentage of
those who can't conveniently fix it by using the manual override that
we already have in place - which presumably means people who have
gigantic tables that are regularly rewritten with massive changes in
the data distribution that affect plan choice.  Is that more than the
empty set?

Maybe the idea here is that this wouldn't fix just ndistinct estimates
but would also help with multi-column statistics.  Even if that's the
case, I think it's almost certainly a dead end from a performance
standpoint.  Some kind of manual ANALYZE process that can be invoked
when needed to scan the entire table would be painful but maybe
acceptable for certain people with a problem they can't fix any other
way, but doing this automatically for everyone seems like a really bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Cédric Villemain
2011/1/18 Magnus Hagander :
> On Tue, Jan 18, 2011 at 17:31, Tom Lane  wrote:
>> Magnus Hagander  writes:
> Actually, after some IM chats, I think pg_streamrecv should be
> renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
> is a lot more specific than that)
>>
 pg_stream_log
 pg_stream_backup
>>
>>> Those seem better.
>>
>>> Tom, would those solve your concerns about it being clear which side
>>> they are on? Or do you think you'd still risk reading them as the
>>> sending side?
>>
>> It's still totally unclear what they do.  How about "pg_receive_log"
>> etc?
>
> I agree with whoever said using "wal" is better than "log" to be unambiguous.
>
> So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
> others? (it's easy to rename so far, so I'll keep plugging away under
> the name pg_basebackup based on Fujii-sans comments until such a time
> as we have a reasonable consensus :-)

pg_receive_wal is good for me.
As for pg_receive_base_backup: in French, "base" is a shortcut for "database",
and here we back up the whole cluster, so I would suggest
pg_receive_cluster(_backup ?)


>
>
> --
>  Magnus Hagander
>  Me: http://www.hagander.net/
>  Work: http://www.redpill-linpro.com/
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 17:31, Tom Lane  wrote:
> Magnus Hagander  writes:
 Actually, after some IM chats, I think pg_streamrecv should be
 renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
 is a lot more specific than that)
>
>>> pg_stream_log
>>> pg_stream_backup
>
>> Those seem better.
>
>> Tom, would those solve your concerns about it being clear which side
>> they are on? Or do you think you'd still risk reading them as the
>> sending side?
>
> It's still totally unclear what they do.  How about "pg_receive_log"
> etc?

I agree with whoever said using "wal" is better than "log" to be unambiguous.

So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
others? (it's easy to rename so far, so I'll keep plugging away under
the name pg_basebackup based on Fujii-sans comments until such a time
as we have a reasonable consensus :-)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane  wrote:
>> Oh, I misread Itagaki-san's comment to imply that that *was* in the
>> patch.  Maybe I should go read it.

> Perhaps.  :-)

> While you're at it you might commit it.  :-)

Yeah, as penance I'll take this one.

regards, tom lane



Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 10:53 AM, Tom Lane  wrote:
> David Fetter  writes:
>> I'm guessing there's a Policy® at Red Hat that software made on its
>> dime be GPL (v2, I'd guess), and that getting an exception would
>> involve convening its board or similarly drastic action.
>
> It's company policy, and while it *might* be possible to get an
> exception, the effort involved would far exceed the benefit we'd get out
> of it.  Moreover, despite Mark's creative argument, I really doubt that
> Red Hat would perceive any benefit to themselves in making an exception.

I'm not sure why they'd care, but it certainly doesn't seem worth
spending the amount of time arguing about it that we are.  David and
Mark are, of course, free to spend their time petitioning Red Hat for
relicensing if they are so inclined, but they aren't entitled to tell
you how to spend yours.  And even if they were, I would hope that
they'd want you to spend it committing patches rather than arguing
with your employer about relicensing of a utility that's freely
available anyway and of use to 0.1% of our user base.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane  wrote:
>>> No, I don't think so.  Has any evidence been submitted that that part of
>>> the patch is of benefit?
>
>> I think you might be mixing up what's actually in the patch with
>> another idea that was proposed but isn't actually in the patch.  The
>> patch itself does nothing controversial.
>
> Oh, I misread Itagaki-san's comment to imply that that *was* in the
> patch.  Maybe I should go read it.

Perhaps.  :-)

While you're at it you might commit it.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] test_fsync open_sync test

2011-01-18 Thread Bruce Momjian
Greg Smith wrote:
> Bruce Momjian wrote:
> > Is there a value to this test_fsync test?
> >
> > Compare open_sync with different sizes:
> > (This is designed to compare the cost of one large
> > sync'ed write and two smaller sync'ed writes.)
> > open_sync 16k write   242.563 ops/sec
> > 2 open_sync 8k writes 752.752 ops/sec
> >
> > It compares the cost of doing larger vs. two smaller open_sync writes.
> >   
> 
> Might be some value for determining things like what the optimal WAL 
> block size to use is.  All these tests are kind of hard to use 
> effectively still, I'm not sure if it's time to start trimming tests yet 
> until we've made more progress on interpreting results first.

OK, thanks for the feedback.  I just wanted to make sure it wasn't a
stupid test.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane  wrote:
>> No, I don't think so.  Has any evidence been submitted that that part of
>> the patch is of benefit?

> I think you might be mixing up what's actually in the patch with
> another idea that was proposed but isn't actually in the patch.  The
> patch itself does nothing controversial.

Oh, I misread Itagaki-san's comment to imply that that *was* in the
patch.  Maybe I should go read it.

regards, tom lane



Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
 wrote:
> Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
>>
>> Hello Postgres Hackers,
>>
>> In reference to this todo item about clustering system table indexes,
>> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
>> I have been studying the system tables to see which would benefit  from
>> clustering.  I have some index suggestions and a question if you have a
>> moment.
>
> Wow, this is really old stuff.  I don't know if this is really of any
> benefit, given that these catalogs are loaded into syscaches anyway.
> Furthermore, if you cluster at initdb time, they will soon lose the
> ordering, given that updates move tuples around and inserts put them
> anywhere.  So you'd need the catalogs to be re-clustered once in a
> while, and I don't see how you'd do that (except by asking the user to
> do it, which doesn't sound so great).

The idea of the TODO seems to have been to set the default clustering
to something reasonable.  That doesn't necessarily seem like a bad
idea even if we can't automatically maintain the cluster order, but
there's some question in my mind whether we'd get any measurable
benefit from the clustering.  Even on a database with a gigantic
number of tables, it seems likely that the relevant system catalogs
will stay fully cached and, as you point out, the system caches will
further blunt the impact of any work in this area.  I think the first
thing to do would be to try to come up with a reproducible test case
where clustering the tables improves performance.  If we can't, that
might mean it's time to remove this TODO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Replication logging

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 2:15 AM, Heikki Linnakangas
 wrote:
> I also find it weird that incoming replication connections are logged by
> default. In the standby, we also log "streaming replication successfully
> connected to primary", which serves much of the same debugging purpose.

Oh, good point.  All right, I withdraw my objection.  Let's just make
it all controlled by log_connections and go home.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane  wrote:
>> It's a very light-weight alternative of memcmp the byte data,
>> but there is still the same issue -- we might have different
>> compressed results if we use different algorithm for TOASTing.
>
> Which makes it a lightweight waste of cycles.
>
>> So, it would be better to apply the present patch as-is.
>
> No, I don't think so.  Has any evidence been submitted that that part of
> the patch is of benefit?

I think you might be mixing up what's actually in the patch with
another idea that was proposed but isn't actually in the patch.  The
patch itself does nothing controversial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Tom Lane
Magnus Hagander  writes:
>>> Actually, after some IM chats, I think pg_streamrecv should be
>>> renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
>>> is a lot more specific than that)

>> pg_stream_log
>> pg_stream_backup

> Those seem better.

> Tom, would those solve your concerns about it being clear which side
> they are on? Or do you think you'd still risk reading them as the
> sending side?

It's still totally unclear what they do.  How about "pg_receive_log"
etc?

regards, tom lane



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Itagaki Takahiro  writes:
> On Tue, Jan 18, 2011 at 05:39, Tom Lane  wrote:
>> I haven't looked at this patch, but it seems to me that it would be
>> reasonable to conclude A != B if the va_extsize values in the toast
>> pointers don't agree.

> It's a very light-weight alternative of memcmp the byte data,
> but there is still the same issue -- we might have different
> compressed results if we use different algorithm for TOASTing.

Which makes it a lightweight waste of cycles.

> So, it would be better to apply the present patch as-is.

No, I don't think so.  Has any evidence been submitted that that part of
the patch is of benefit?

regards, tom lane



Re: [HACKERS] Replication logging

2011-01-18 Thread Tom Lane
Magnus Hagander  writes:
> Is there *any* use case for setting them differently though?

I can't believe we're still engaged in painting this bikeshed.  Let's
just control it off log_connections and have done.

BTW, what about log_disconnections --- does a walsender emit a message
according to that?  If not, why not?  If we go through with something
fancy on the connection side, are we going to invent the same extra
complexity for log_disconnections?  And if we do, what happens when
they're set inconsistently?

regards, tom lane



Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Tom Lane
David Fetter  writes:
> I'm guessing there's a Policy® at Red Hat that software made on its
> dime be GPL (v2, I'd guess), and that getting an exception would
> involve convening its board or similarly drastic action.

It's company policy, and while it *might* be possible to get an
exception, the effort involved would far exceed the benefit we'd get out
of it.  Moreover, despite Mark's creative argument, I really doubt that
Red Hat would perceive any benefit to themselves in making an exception.

regards, tom lane



Re: [HACKERS] Replication logging

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 10:56, Fujii Masao  wrote:
> On Tue, Jan 18, 2011 at 5:17 PM, Magnus Hagander  wrote:
>>> We should treat log_disconnections the same?
>>
>> We could keep it a boolean, but then only log disconnections for the
>> cases that are mentioned in log_connections?
>>
>> It doesn't make sense to log disconnection for a connection we didn't
>> log the connection for...
>
> Maybe true. But, at least for me, it's more intuitive to provide both as
> enum parameters.

Is there *any* use case for setting them differently though? (other
than connections being  and disconnections being ?)
If not, aren't we just encouraging people to configure in a way that
makes no sense?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] SQL/MED - file_fdw

2011-01-18 Thread Shigeru HANADA
On Tue, 18 Jan 2011 15:17:12 +0900
Itagaki Takahiro  wrote:

> On Sat, Jan 15, 2011 at 08:35, Shigeru HANADA  
> wrote:
> > Interface of NextCopyFrom() is fixed to return HeapTuple, to support
> > tableoid without any change to TupleTableSlot.
> >
> > 3) 20110114-copy_export_HeapTupe.patch
> > This patch fixes interface of NextCopyFrom() to return results as
> > HeapTuple.
> 
> I think file_fdw can return tuples in virtual tuples forms,
> and ForeignNext() calls ExecMaterializeSlot() to store tableoid.

Thanks for the comment.  I've fixed file_fdw to use tts_values and
tts_isnull to receive results from NextCopyFrom, and to store a virtual
tuple in the slot.

Attached patch requires FDW API patches and copy_export-20110114.patch.
Please see also:
http://archives.postgresql.org/message-id/20110119002615.8316.69899...@metrosystems.co.jp

Also, the cost estimation of file_fdw has been simplified following your
comments below.

On Tue, 21 Dec 2010 21:32:17 +0900
Itagaki Takahiro  wrote:
> #3. Why do you re-open a foreign table in estimate_costs() ?
> Since the caller seems to have the options for them, you can
> pass them directly, no?
> 
> In addition, passing a half-initialized fplan to estimate_costs()
> is a bad idea. If you think it is an OUT parameter, the OUT params
> should be *startup_cost and *total_cost.

In that message, you also pointed out that FDW must generate
explainInfo in every PlanRelScan call even if the planning is not for
EXPLAIN.  I'll try to defer generating explainInfo until EXPLAIN
VERBOSE really uses it.  It might need new hook point in expalain.c,
though.

Regards,
--
Shigeru Hanada


20110118-file_fdw.patch.gz
Description: Binary data



Re: [HACKERS] review: FDW API

2011-01-18 Thread Shigeru HANADA
On Sun, 16 Jan 2011 01:55:19 +0100
Jan Urbański  wrote:

> In general, the feature looks great and I hope it'll make it into 9.1.
> And if we'd get the possibility to write FDW handlers in other PLs than
> C, it would rock so hard...
> 
> I'm going to mark this a Waiting for Author because of the typos, the
> BeginScan/EndScan issue, and the nested loop stopping halfway issue. The
> rest are suggestions or just thoughts, and if you don't see them as
> justified, I'll mark the next patch Ready for Committer.

Thanks a lot for the comments.  I've (hopefully) fixed the issues above. 
Please find attached patches.

== patch list ==

1) 20110118-no_fdw_perm_check.patch - This patch was not included in the
last post.  It was first proposed on 2011-01-05, but maybe has
not been reviewed yet.  I re-propose this patch for SQL standard
conformance.  This patch removes the permission check that requires USAGE
on the foreign-data wrapper at CREATE FOREIGN TABLE.
Please see original post for details.
http://archives.postgresql.org/message-id/20110105145206.30fd.69899...@metrosystems.co.jp

2) 20110118-fdw_catalog_lookup.patch - This patch adds GetForeignTables. 
Fixed lack of pg_foreign_table.h inclusion.

3) 20110118-fdw_handler.patch - This patch adds support for HANDLER
option to FOREIGN DATA WRAPPER object.

4) 20110118-foreign_scan.patch - This patch adds ForeignScan executor
node and FDW API hooks based on Heikki's proposal.  As Itagaki-san
suggested on 2010-12-21, FDW must generate information for EXPLAIN
VERBOSE in every PlanRelScan() call.  It's better to avoid such overhead,
so a new EXPLAIN hook would be needed.  I'll try to make it cleaner.



And I'll reply to the rest of your comments.

> We could use comments about how to return tuples from Iterate and how to
> finish returning them. I had to look at the example to figure out that
> you need ExecClearTuple(slot) in your last Iterate. If Iterate's
> interface were to change, we could just return NULL instead of a tuple
> to say that we're done.

I've added some comments for FDW developers to fdwapi.h, though they
may not be enough.

> We could be a bit more explicit about how to allocate objects, for
> instance if I'm allocating a FdwPlan in my PlanRelScan with a palloc,
> will it not go away too soon, or too late (IOW leak)?

For that example, the answer is no.  Objects are allocated in
MessageContext if you don't switch contexts, and are released when the query
has finished.  I agree that more documentation or comments for
FDW developers should be added.

> Maybe PlanNative should get the foreign table OID, not the server OID,
> to resemble PlanRelScan more. The server OID is just a syscache lookup
> away, anyway.

You would miss the case where multiple foreign tables are used in a
query.  The main purpose of PlanNative is to support pass-through
execution of a remote query.  In pass-through mode, you can use syntax
as if you were directly connected to the external server, so you can't use
PostgreSQL's parser.

> If you do EXPLAIN SELECT * FROM foregintable, BeginScan is not called,
> but EndScan is. That's weird, and I noticed because I got a segfault
> when EndScan tried to free things that BeginScan allocated. Maybe just
> don't call EndScan for EXPLAIN?

Fixed to not call EndScan if it was EXPLAIN execution.

Regards,
--
Shigeru Hanada


20110118-no_fdw_perm_check.patch.gz
Description: Binary data


20110118-fdw_catalog_lookup.patch.gz
Description: Binary data


20110118-fdw_handler.patch.gz
Description: Binary data


20110118-foreign_scan.patch.gz
Description: Binary data



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 15:49, Alvaro Herrera
 wrote:
> Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:
>
>> Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
>> time for another patch, PFA :-)
>
> Thanks ...  Message nitpick:
> +   if (compresslevel > 0)
> +   {
> +       fprintf(stderr,
> +               _("%s: this build does not support compression\n"),
> +               progname);
> +       exit(1);
> +   }
>
> pg_dump uses the following wording:
>
> "WARNING: archive is compressed, but this installation does not support "
> "compression -- no data will be available\n"
>
> So perhaps yours should s/build/installation/

That shows up when the *archive* is compressed, though? There are a
number of other cases that use build in the backend, such as:
src/backend/utils/misc/guc.c:   errmsg("assertion checking is not supported by this build")));
src/backend/utils/misc/guc.c:   errmsg("Bonjour is not supported by this build")));
src/backend/utils/misc/guc.c:   errmsg("SSL is not supported by this build")));


> Also, in messages of this kind,
> +               if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
> +               {
> +                   fprintf(stderr, _("%s: could not set compression level %i\n"),
> +                           progname, compresslevel);
>
> Shouldn't you also be emitting the gzerror()? ... oh I see you're
> already doing it for most gz calls.

It's not clear from the zlib documentation I have that gzerror() works
after a gzsetparams(). Do you have any docs that say differently by
any chance?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Alvaro Herrera
Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:

> Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
> time for another patch, PFA :-)

Thanks ...  Message nitpick:
+   if (compresslevel > 0)
+   {
+   fprintf(stderr,
+   _("%s: this build does not support compression\n"),
+   progname);
+   exit(1);
+   }

pg_dump uses the following wording:

"WARNING: archive is compressed, but this installation does not support "
"compression -- no data will be available\n"

So perhaps yours should s/build/installation/


Also, in messages of this kind,
+   if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
+   {
+       fprintf(stderr, _("%s: could not set compression level %i\n"),
+               progname, compresslevel);

Shouldn't you also be emitting the gzerror()? ... oh I see you're
already doing it for most gz calls.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


