Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Heikki Linnakangas

On 17.01.2011 22:33, Tom Lane wrote:

Peter Eisentraut <pete...@gmx.net> writes:

On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:

In fact, aren't there cases where the *length test* also fails?



Currently, two text values are only equal if strcoll() considers them
equal and the bits are the same.  So this patch is safe in that regard.



There is, however, some desire to loosen this.


That isn't ever going to happen, unless you'd like to give up hash joins
and hash aggregation on text values.


You could canonicalize the string first in the hash function. I'm not 
sure if we have all the necessary information at hand there, but at 
least with some encoding/locale-specific support functions it'd be possible.
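
For illustration, a hedged sketch of that canonicalization idea, not backend code: transform the string with strxfrm() before hashing, so that two strings the current locale's strcoll() considers equal hash to the same value. hash_bytes stands in for whatever hash function the caller already uses, and error handling is omitted.

#include <stdlib.h>
#include <string.h>

static unsigned int
hash_text_canonical(const char *s,
                    unsigned int (*hash_bytes)(const void *, size_t))
{
    /* strxfrm(NULL, s, 0) reports the transformed length needed */
    size_t        len = strxfrm(NULL, s, 0) + 1;
    char         *buf = malloc(len);
    unsigned int  h;

    strxfrm(buf, s, len);                 /* locale-aware canonical form */
    h = hash_bytes(buf, strlen(buf));     /* equal under strcoll => equal hash */
    free(buf);
    return h;
}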


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Itagaki Takahiro
On Tue, Jan 18, 2011 at 05:39, Tom Lane <t...@sss.pgh.pa.us> wrote:
 I haven't looked at this patch, but it seems to me that it would be
 reasonable to conclude A != B if the va_extsize values in the toast
 pointers don't agree.

It's a very lightweight alternative to memcmp'ing the byte data,
but the same issue remains -- we might get different compressed
results if different algorithms are used for TOASTing.

So it would be better to apply the present patch as-is.
We can improve the comparison logic on top of the patch in another
development cycle if possible.
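
For reference, a hedged sketch of the shortcut being discussed -- compare the uncompressed sizes recorded in the varlena/TOAST headers and only do the byte comparison when they match. The names follow backend conventions, but this is an illustration of the idea, not the patch itself (PG_FREE_IF_COPY and error handling omitted):

#include "postgres.h"
#include "access/tuptoaster.h"
#include "fmgr.h"

Datum
texteq_sketch(PG_FUNCTION_ARGS)
{
    Datum   arg1 = PG_GETARG_DATUM(0);
    Datum   arg2 = PG_GETARG_DATUM(1);
    bool    result;

    /* toast_raw_datum_size() only looks at the header; no full detoast */
    if (toast_raw_datum_size(arg1) != toast_raw_datum_size(arg2))
        result = false;
    else
    {
        text   *t1 = DatumGetTextPP(arg1);
        text   *t2 = DatumGetTextPP(arg2);

        result = (memcmp(VARDATA_ANY(t1), VARDATA_ANY(t2),
                         VARSIZE_ANY_EXHDR(t1)) == 0);
    }
    PG_RETURN_BOOL(result);
}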

-- 
Itagaki Takahiro



Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Mark Kirkwood

On 18/01/11 18:04, Tom Lane wrote:

David Fetter <da...@fetter.org> writes:

Who's the copyright holder(s)?  If it's all individual contributors,
Red Hat policy is not in play.

Sorry David, it was written on the company's dime.



However, I doubt that Red Hat derives any value from this useful product 
being excluded from contrib by the choice of license - would they be 
receptive to the idea that it would be free marketing to have it in the 
main tarball/rpm/deb (etc.), with merely a decision to change the license from GPL to BSD?


regards

Mark



Re: [HACKERS] Replication logging

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 08:21, Fujii Masao <masao.fu...@gmail.com> wrote:
 On Tue, Jan 18, 2011 at 4:15 PM, Heikki Linnakangas
 <heikki.linnakan...@enterprisedb.com> wrote:
 I also find it weird that incoming replication connections are logged by
 default. In the standby, we also log "streaming replication successfully
 connected to primary", which serves much of the same debugging purpose. That
 standby-side message is ok, we have a tradition of being more verbose during
 recovery, we also emit the "restored log file \"%s\" from archive" message
 for every WAL segment restored from archive, for example.

 We could turn log_connections into an enum, like log_statement:

 log_connections = 'none'        # none, replication, regular, all

It almost seems overkill, but probably less so than a completely new guc :)


 We should treat log_disconnections the same?

We could keep it a boolean, but then only log disconnections for the
cases that are mentioned in log_connections?

It doesn't make sense to log disconnection for a connection we didn't
log the connection for...
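
For illustration, a hedged sketch of how an enum form of log_connections might be declared, modelled on other enum GUCs (it relies on the backend's struct config_enum_entry from utils/guc.h); the LOG_CONNECTIONS_* symbols and the option table are hypothetical, not from any posted patch:

typedef enum
{
    LOG_CONNECTIONS_NONE,
    LOG_CONNECTIONS_REPLICATION,
    LOG_CONNECTIONS_REGULAR,
    LOG_CONNECTIONS_ALL
} LogConnectionsOption;

/* maps the string values accepted in postgresql.conf to the enum above */
static const struct config_enum_entry log_connections_options[] = {
    {"none", LOG_CONNECTIONS_NONE, false},
    {"replication", LOG_CONNECTIONS_REPLICATION, false},
    {"regular", LOG_CONNECTIONS_REGULAR, false},
    {"all", LOG_CONNECTIONS_ALL, false},
    {NULL, 0, false}
};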


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Anssi Kääriäinen

On 01/17/2011 07:58 PM, Kääriäinen Anssi wrote:

The issue I saw was this: assume you have an extension foo, containing one 
function, test().

CREATE EXTENSION foo;
DROP FUNCTION test();
-- restricted due to dependency

ALTER FUNCTION test() RENAME TO test2;
DROP FUNCTION test2();
-- not restricted!

The same can be done using CREATE OR REPLACE.

I hope this is not an error on my part. It is possible because I had a lot of 
schemas and my search_path might have been wrong...
The rename is an error on my part, sorry for that. Renaming can be done, 
but dropping is not possible even after rename. But a function in a 
package can be CREATE OR REPLACEd, and after that the function can be 
dropped. COR should be restricted in the same way as DROP is. I will 
check if this is still a problem with the latest patch.


Another problem is that you can ALTER FUNCTION test() SET SCHEMA 
something_else, and you can alter the function's search_path, which could 
be a problem for non-relocatable extensions. Even if the schema is 
changed, dropping / altering the extension will work as expected; the 
function is just in a different schema than the extension. But both 
of these IMO fall into the "don't do that" category.


 - Anssi



Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Jim Nasby
On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
 Robert Haas  wrote:
 
 a quick-and-dirty attempt to limit the amount of I/O caused by hint
 bits. I'm still very interested in knowing what people think about
 that.
 
 I found the elimination of the response-time spike promising.  I
 don't think I've seen enough data yet to feel comfortable endorsing
 it, though.  I guess the question in my head is: how much of the
 lingering performance hit was due to having to go to clog and how
 much was due to competition with the deferred writes?  If much of it
 is due to repeated recalculation of visibility based on clog info, I
 think there would need to be some way to limit how many times that
 happened before the hint bits were saved.

What if we sped up the case where hint bits aren't set? Has anyone collected 
data on the actual pain points of checking visibility when hint bits aren't 
set? How about when setting hint bits is intentionally delayed? I wish we had 
some more infrastructure around the XIDCACHE counters; having that info 
available for people's general workloads might be extremely valuable. Even if I 
was to compile with it turned on, it seems the only way to get at it is via 
stderr, which is very hard to deal with.

Lacking performance data (and for my own education), I've spent the past few 
hours studying HeapTupleSatisfiesNow(). If I'm understanding it correctly, the 
three critical functions from a performance standpoint are 
TransactionIdIsCurrentTransactionId, TransactionIdIsInProgress and 
TransactionIdDidCommit. Note that all 3 can potentially be called twice; once 
to check xmin and once to check xmax.

ISTM TransactionIdIsCurrentTransactionId is missing a shortcut: shouldn't we be 
able to immediately return false if the XID we're checking is older than some 
value, like global xmin? Maybe it's only worth checking that case if we hit a 
subtransaction, but if the check is faster than one or two loops through the 
binary search... I would think this at least warrants a one-XID cache a la 
cachedFetchXidStatus (though it would need to be a different cache...). Another 
issue is that TransactionIdIsInProgress will call this function as well, unless 
it skips out because the transaction is < RecentXmin.
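
To make that proposed shortcut concrete, a hedged sketch (not backend code) of the early exit; the XID arithmetic follows the backend's circular comparison, and the check would simply sit in front of the existing per-subtransaction search:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* circular XID comparison, as TransactionIdPrecedes() does it */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * An XID older than RecentXmin (or GlobalXmin) cannot belong to this
 * backend's still-open transaction or any of its subtransactions, so
 * TransactionIdIsCurrentTransactionId() could answer "no" here and skip
 * the binary search over the subtransaction XID array entirely.
 */
static bool
definitely_not_current_xact(TransactionId xid, TransactionId recent_xmin)
{
    return xid_precedes(xid, recent_xmin);
}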

TransactionIdIsInProgress does a fair amount of easy checking already... the 
biggest thing is that if it's less than RecentXmin we bounce out immediately. 
If we can't bounce out immediately though, this routine gets pretty expensive 
unless the XID is currently running and is top-level. It's worse if there are 
subxacts and can be horribly bad if any subxact caches have overflowed. Note 
that if anything has overflowed, then we end up going to clog and possibly 
pg_subtrans.

Finally, TransactionIdDidCommit hits clog.

So the degenerate cases seem to be:

- Really old XIDs. These suck because there's a good chance we'll have to read 
from clog.
- XIDs > RecentXmin that are not currently running top-level transactions. The 
pain here increases with subtransaction use.

For the second case, if we can ensure that RecentXmin is not very old then 
there's generally a smaller chance that TransactionIdIsInProgress has to do a 
lot of work. My experience is that most systems that have a high transaction 
rate don't end up with a lot of long-running transactions. Storing a list of 
the X oldest transactions would allow us to keep RecentXmin closer to the most 
recent XID.

For the first case, we should be able to create a more optimized clog lookup 
method that works for older XIDs. If we restrict this to XIDs that are older 
than GlobalXmin then we can simplify things because we don't have to worry 
about transactions that are in-progress. We also don't need to differentiate 
between subtransactions and their parents (though, we obviously need to figure 
out whether a subtransaction is considered to be committed or not). Because 
we're restricting this to XIDs that we know we can determine the state of, we 
only need to store a maximum of 1 bit per XID. That's already half the size of 
clog. But because we don't have to build this list on the fly (we don't need 
to update it on every commit/abort as long as we know the range of XIDs that 
are stored), we don't have to support random writes. That means we can use a 
structure that's more complex to maintain than a simple bitmap. Or maybe we 
stick with a bitmap but compress it.
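
To illustrate the first-case idea, a hedged sketch of a one-bit-per-XID lookup for transactions older than GlobalXmin; all names and the layout are hypothetical, and wraparound handling is omitted:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct OldXidCommitMap
{
    TransactionId  oldest_xid;   /* first XID covered by the map */
    TransactionId  next_xid;     /* one past the newest XID covered */
    const uint8_t *bits;         /* one bit per XID: 1 = committed */
} OldXidCommitMap;

/* Returns true iff xid is covered by the map and recorded as committed;
 * XIDs outside the covered range must fall back to the regular clog path. */
static bool
old_xid_did_commit(const OldXidCommitMap *map, TransactionId xid, bool *covered)
{
    uint32_t idx;

    *covered = (xid >= map->oldest_xid && xid < map->next_xid);
    if (!*covered)
        return false;

    idx = xid - map->oldest_xid;
    return (map->bits[idx / 8] >> (idx % 8)) & 1;
}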
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Confusing comment in TransactionIdIsInProgress

2011-01-18 Thread Heikki Linnakangas

On 18.01.2011 07:15, Jim Nasby wrote:

Shouldn't the comment read "If first time through"?

/*
 * If not first time through, get workspace to remember main XIDs in. We
 * malloc it permanently to avoid repeated palloc/pfree overhead.
 */
if (xids == NULL)
{
...
xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));


Huh, yes, I'm amazed that no-one has noticed. I must've read that piece 
of code dozens of times in the last couple of years myself, and that 
sentence was even copy-pasted to GetConflictingVirtualXIDs() later in 
that file, including that thinko.


Thanks, fixed both copies.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Anssi Kääriäinen

On 01/17/2011 06:53 PM, Dimitri Fontaine wrote:


Usability review:

The patch implements a way to create extensions. While the patch is labeled
"extensions support for pg_dump", it actually implements more. It implements a
new way to package and install extensions, and changes contrib extensions to
use the new way.

Well, all there's in the patch is infrastructure to be able to issue
those single lines in your dump :)
Is this supposed to be used mainly by contrib and PGXN extensions? When 
I saw the documentation, I immediately thought that this is a nice way 
to package my application's stored procedures. If this is not one of the 
intended usages, it should be documented. I can see that this could be 
problematic when updating PostgreSQL and when recovering from backups.


When recovering from backup, you need to have the locally created 
extension available. But it might be that the extension files are lost 
when the system went down in flames. Now, the backup is unusable 
(right?) until extension files can be recovered from source control or 
wherever they might be stored. This is why I suggested having multiple 
locations for the extensions. It would be easy to back up locally created 
extensions if those were in a single directory. All in all, I have a 
nervous feeling that somebody someday will have an unusable dump because 
they used this feature, but do not have the extension files available...


Also, this part of documentation:

The goal of using extensions is so that <application>pg_dump</application> knows
not to dump all the object definitions into your backups, but rather
issue a single <xref linkend="SQL-CREATEEXTENSION"> command.

From that, it is not entirely clear to me why this is actually wanted 
in PostgreSQL. I suppose this will make dump/restore easier to use. But 
from that paragraph, I get the feeling the only advantage is that your 
dump will be smaller. And I still have a feeling this implements more. 
Not that it is a bad thing at all.

It used to work this way with \i, obviously.  Should the extension patch
care about that and how?  Do we want to RESET search_path or to manually
manage it like we do for the client_min_messages and log_min_messages?
It was unintuitive to me to have search_path changed by SQL command as a 
side effect. When using \i, not so much.

When trying to load extensions which contain identically signatured functions,
whether the loading will succeed depends on the search path. If there is a
conflicting function in search path (first extension loaded to schema
public), then the second extension load will fail. But if the order is
reversed (first extension loaded to schema foo which is not in search path),
then the second extension can be loaded to the public schema.

Well I think that's an expected limitation here.  In the future we might
want to add support for inter-extension dependencies and conflicts, but
we're certainly not there yet.
OK with this as is. It is just a bit surprising that you can create the 
extensions in one order but not in another.

There is no validation for the extension names in share/contrib/. It is
possible to have extension control files named .control, .name.control,
name''.control etc. While it is stupid to name them so, it might be better
to give an explicit warning / error in these cases. It is of course possible
to use these extensions.

I don't have a strong opinion here, will wait for some votes.
Yeah, this is not a big problem in practice. Just don't name them like 
that. And if you do, you will find out soon enough that you made a 
mistake. By the way, in my comment above, "It is of course possible to 
use these extensions" should read "It is of course NOT possible".

I haven't had time to review the pg_dump part of the patch yet, will do that
next (tomorrow). I hope it is OK to post a partial review...

 From here, it's very good!  Will you continue from the git repository,
where the fixes are available, or do you want a v26 already?
It is easy for me to continue from the Git repo. I will next continue 
doing the pg_dump part of the review. I hope I have time to complete 
that today.


 - Anssi




Re: [HACKERS] Determining client_encoding from client locale

2011-01-18 Thread Susanne Ebrecht

On 17.01.2011 20:18, Peter Eisentraut wrote:

That may be worth investigating, but I don't think it's related to the
present patch.



As I already said - not at all.
The patch was ok for me.

Susanne

--
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com




Re: [HACKERS] Warning compiling pg_dump (MinGW, Windows XP)

2011-01-18 Thread Pavel Golub
Hello, Andrew.

You wrote:



AD On 01/17/2011 03:51 PM, Tom Lane wrote:
Andrew Dunstan <and...@dunslane.net> writes:
 On 01/17/2011 07:18 AM, Pavel Golub wrote:
 So you think I should just ignore these warnings? Because I can't
 remember the same behaviour on 8.x branches...
 We've had them all along, as I said. See
http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=frogmouth&dt=2011-01-04%2023%3A54%3A16&stg=make
 for the same thing in an 8.2 build.
 I wonder why mingw's gcc is complaining about %m when other versions of
 gcc do not?  If you can't get it to shut up about that, there's not
 going to be much percentage in silencing warnings about %lld.

   

AD We could add -Wno-format to the flags. That makes it shut up, but maybe
AD we don't want to use such a sledgehammer.

I want to understand only one thing: are these warnings really
dangerous, or should I just ignore them?

AD cheers

AD andrew



-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com




Re: [HACKERS] MingW + GCC 4.5.2 + Relocate libpq.dll = SegFault

2011-01-18 Thread Pavel Golub
Hello, Charlie.

Can you please express your opinion about my request "Warning
compiling pg_dump (MinGW, Windows XP)" sent to pgsql-hackers on Thu, 13 Jan
2011. Do you have the same warnings using the MinGW environment?

You wrote:

CS I'm compiling postgresql 9.0.2 using msys + mingw + gcc 4.5.2 (latest 
CS official release from mingw).  This is on Windows 7 64-bit.

CS Unfortunately the built dlls, at least libpq.dll, crash if they need to
CS be relocated.  This happens to me when loading libpq.dll into a project
CS that has a number of other dll requirements.  Note this does NOT happen
CS when building with gcc 3.4.5.

CS Using GDB to track down the problem, the error occurs in 
CS __gcc_register_frame and looks to be the same error described here:

CS 
http://old.nabble.com/Bad-DLL-relocation---reproducible-w--test-case-td18292380.html

CS Note a similar sounding error described, and fixed, in newer releases of
CS binutils (which mingw provides and I am using) is described here:

CS 
http://lists-archives.org/mingw-users/11369-error-0xc005-is-now-fixed.html

CS Looking at the postgresql Makefiles, the dlls are built using dllwrap.
CS In particular, see src/Makefile.shlib line 413 and src/Makefile.port 
CS line 25.

CS When I change these to not use dllwrap and instead just use gcc -shared
CS the problem goes away.  Therefore I'd like to propose:

CS 1.  Change line 413 in src/Makefile.shlib

CS $(DLLWRAP) -o $@ --dllname $(shlib) $(DLLWRAP_FLAGS) --def 
CS $(DLL_DEFFILE) $(OBJS) $(LDFLAGS) $(LDFLAGS_SL) $(SHLIB_LINK)

CS To:

CS $(CC) -shared -o $@ $(DLL_DEFFILE) $(OBJS) $(LDFLAGS) $(LDFLAGS_SL) 
CS $(SHLIB_LINK)

CS 2.  Change line 73 in src/Makefile.port:

CS $(DLLWRAP) -o $@ --def $*.def $< $(LDFLAGS) $(LDFLAGS_SL) $(BE_DLLLIBS)

CS To:

CS $(CC) -shared -o $@ $*.def $< $(LDFLAGS) $(LDFLAGS_SL) $(BE_DLLLIBS)

CS I tested this by intentionally compiling libpq.dll using the link flag
CS -Wl,--image-base to make sure that its base address conflicted with 
CS another dll loaded by my program.  With the proposed changes, windows 
CS successfully relocated libpq.dll without causing a segmentation fault.

CS I don't claim to know why dllwrap is producing dlls that can't be 
CS relocated while gcc -shared does not.  But it appears to have been deprecated
CS back in 2006 according to the binutils mailing list:

CS http://www.mail-archive.com/bug-binutils@gnu.org/msg01470.html

CS So between being deprecated and not producing relocatable dlls, it seems
CS like it's best to stop using dllwrap.  If this seems like a reasonable 
CS change I can put together a quick patch if that helps.

CS Thanks,

CS Charlie




-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com




Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Fujii Masao
On Mon, Jan 17, 2011 at 10:30 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
 Though I haven't seen the core part of the patch (i.e.,
 ReceiveTarFile, etc..) yet,
 here are my comments on the other parts.

Here are some more comments:


When I untar the tar file taken by pg_basebackup, I got the following
messages:

$ tar xf base.tar
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors

Is this a bug? This happens only when I create $PGDATA by using
initdb -X (i.e., I relocated the pg_xlog directory elsewhere than
$PGDATA).


+   if (compresslevel > 0)
+   {
+       snprintf(fn, sizeof(fn), "%s/%s.tar.gz", tardir,
+                PQgetvalue(res, rownum, 0));
+       ztarfile = gzopen(fn, "wb");

Though I'm not familiar with zlib, isn't gzsetparams() required
here?
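
For illustration, a hedged sketch of what that would look like with zlib's documented API; "fn" and "compresslevel" stand in for the patch's variables, and error handling is minimal:

#include <zlib.h>

static gzFile
open_gz_tarfile(const char *fn, int compresslevel)
{
    gzFile ztarfile = gzopen(fn, "wb");

    /* apply the requested compression level right after opening */
    if (ztarfile != NULL && compresslevel > 0)
        gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY);
    return ztarfile;
}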


+#ifdef HAVE_LIBZ
+   if (!tarfile && !ztarfile)
+#else
+   if (!tarfile)
+#endif
+   {
+       fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
+               progname, fn, strerror(errno));

Instead of strerror, get_gz_error seems required when using zlib.


+   if (!res || PQresultStatus(res) != PGRES_COPY_OUT)

The check for !res is not needed since PQresultStatus checks that.


+   r = PQgetCopyData(conn, copybuf, 0);
+   if (r == -1)

Since -1 from PQgetCopyData might indicate an error, in this case
wouldn't we need to call PQgetResult?


ReceiveTarFile seems refactorable by using GZWRITE and GZCLOSE
macros.


+       fprintf(stderr, _("%s: could not write to file '%s': %m\n"),

%m in fprintf is portable?


Can't you change '%s' to \"%s\" for consistency?


+   /*
+* Make sure we're unpacking into an empty directory
+*/
+   verify_dir_is_empty_or_create(current_path);

Can pg_basebackup take a backup of $PGDATA including a tablespace
directory, without an error? The above code seems to prevent that.


+   if (compresslevel <= 0)
+   {
+       fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),

It's better to check compresslevel > 9 here.


+/*
+ * Print a progress report based on the global variables. If verbose output
+ * is disabled, also print the current file name.

Typo: s/disabled/enabled


I request a new option which specifies whether pg_start_backup
executes an immediate checkpoint or not. Currently it always executes
an immediate one, but I'd like to run a smoothed one on a busy system.
What's your opinion?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread tv
 On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
 1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

 Yeah, when I looked at the fork stuff I was disappointed to find out
 there's essentially no support for dynamically adding forks. There's two
 other possible uses for that I can think of:

 - Forks are very possibly a more efficient way to deal with TOAST than
 having separate tables. There's a fair amount of overhead we pay for the
 current setup.
 - Dynamic forks would make it possible to do a column-store database, or
 at least something approximating one.

 Without some research, there's no way to know if either of the above makes
 sense; but without dynamic forks we're pretty much dead in the water.

 So I wonder what it would take to support dynamically adding forks...

Interesting ideas, but a bit out of scope. I think I'll go with one fork
containing all the estimators for now, although it might be inconvenient
in some cases. I was thinking about rebuilding a single estimator with
increased precision - in that case the size changes so that all the other
data has to be shifted. But this won't be very common (usually all the
estimators will be rebuilt at the same time), and it's actually doable.

So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only in case of commit. Does
that sound reasonable?
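
A hedged sketch of that per-backend buffering, with every name hypothetical: new keys are appended to a private array and only handed over in one batch when the transaction commits (and simply discarded on abort).

#include <stdint.h>
#include <stdlib.h>

typedef struct PendingKeys
{
    uint32_t *hashes;    /* hashes of new/updated column values */
    int       nused;
    int       nalloc;
} PendingKeys;

static void
pending_add(PendingKeys *p, uint32_t hash)
{
    if (p->nused == p->nalloc)
    {
        p->nalloc = p->nalloc ? p->nalloc * 2 : 64;
        p->hashes = realloc(p->hashes, p->nalloc * sizeof(uint32_t));
    }
    p->hashes[p->nused++] = hash;
}

static void
pending_forward_at_commit(PendingKeys *p,
                          void (*forward)(const uint32_t *, int))
{
    forward(p->hashes, p->nused);   /* push the whole batch to the estimator */
    p->nused = 0;                   /* reset for the next transaction */
}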

regards
Tomas




Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Anssi Kääriäinen

On 01/18/2011 11:42 AM, Dimitri Fontaine wrote:

I've fixed the case by having the code remember the function's extension
if any, and restore it along with the other dependencies.
The only question here is should CREATE OR REPLACE be allowed. I just 
realized this could present a new problem. If I am not mistaken, when 
loading from dump, you suddenly get the extension's version back, not 
the one you defined in CREATE OR REPLACE. If this is the case, this 
should NOT be allowed. And by the same reasoning, ALTER FUNCTION 
[anything] should not be allowed either. Or at least then the 
function/(or any object for that matter) should be restored somehow from 
the backup, not from the extension files.


I still haven't had the time to start pg_dump reviewing, so I haven't 
verified if this is really a problem. But I suspect so...


 - Anssi



Re: [HACKERS] Replication logging

2011-01-18 Thread Fujii Masao
On Tue, Jan 18, 2011 at 5:17 PM, Magnus Hagander <mag...@hagander.net> wrote:
 We should treat log_disconnections the same?

 We could keep it a boolean, but then only log disconnections for the
 cases that are mentioned in log_connections?

 It doesn't make sense to log disconnection for a connection we didn't
 log the connection for...

Maybe true. But, at least for me, it's more intuitive to provide both as
enum parameters.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Dimitri Fontaine
Anssi Kääriäinen <anssi.kaariai...@thl.fi> writes:
 Is this supposed to be used mainly by contrib and PGXN extensions? When I
 saw the documentation, I immediately thought that this is a nice way to
 package my application's stored procedures. If this is not one of the
 intended usages, it should be documented. I can see that this could be
 problematic when updating PostgreSQL and when recovering from backups.

Sure, private applications' stored procedures are meant to be fully
supported by the extension facility.

 When recovering from backup, you need to have the locally created extension
 available. But it might be that the extension files are lost when the system
 went down in flames. Now, the backup is unusable (right?) until extension
 files can be recovered from source control or where ever they might be
 stored. This is why I suggested having multiple locations for the
 extensions. It would be easy to backup locally created extensions if those
 were in a single directory. All in all, I have a nervous feeling that
 somebody someday will have an unusable dump because they used this feature,
 but do not have the extension files available...

Well, as said in the documentation, extensions are to be used for
objects you are *not* maintaining in your database, but elsewhere.
Typically you are maintaining your stored procedure code in some VCS,
and you have some packaging (cat *.sql > my-ext.sql in the Makefile
would be the simplest to imagine).

So yes if you tell PostgreSQL that your procedures are managed elsewhere
so that their code won't be part of your dumps, and then fail to manage
them anywhere else, you're hosed.

My viewpoint here is that when you want to use extensions, you want to
package them for your OS of choice (mine is debian, and I've been
working on easing things on this side too with pg_buildext to be found
in the postgresql-server-dev-all package).  If you're an occasional user
just wanting to use new shining facilities… well, think twice…

 Also, this part of documentation:

 The goal of using extensions is so that <application>pg_dump</application> knows
 not to dump all the object definitions into your backups, but rather
 issue a single <xref linkend="SQL-CREATEEXTENSION"> command.

So maybe we want to extend this little sentence to add the warnings
around it, that if you're not packaging your extension's delivery to
your servers, you're likely shooting yourself in the foot?

 From that, it is not entirely clear to me why this is actually wanted in
 PostgreSQL. I suppose this will make dump/restore easier to use. But from
 that paragraph, I get the feeling the only advantage is that your dump will
 be smaller. And I still have a feeling this implements more. Not that it is
 a bad thing at all.

Well try to upgrade from 8.4 to 9.0 with some extensions installed in
there and used in your tables.  Pick any contrib, such as hstore or
ltree or cube, or some external code, such as ip4r or prefix or such.
Then compare that to upgrading with the extension facility, and tell me what's
best :)

Hint: the dump contains the extension's script, but does not contain the
  shared object file.  If you're upgrading the OS and the contribs, as
  you often do when upgrading major versions, you're hosed.  You would
  think that pg_upgrade alleviates the concerns here, but you still have
  to manually upgrade the extension's .so.

  All in all, those extensions (contribs, ip4r, etc) are *not*
  maintained in your database and pretending they are by putting their
  scripts in your dumps is only building problems.  This patch aims to
  offer a solution here.

 It used to work this way with \i, obviously.  Should the extension patch
 care about that and how?  Do we want to RESET search_path or to manually
 manage it like we do for the client_min_messages and log_min_messages?
 It was unintuitive to me to have search_path changed by SQL command as a
 side effect. When using \i, not so much.

Agreed.  Will code the manual management way (that is already used for
log settings) later today, unless told to see RESET and how to do that
at the statement level rather than the transaction one.

 It is easy for me to continue from the Git repo. I will next continue doing
 the pg_dump part of the review. I hope I have time to complete that today.

Excellent, will try to continue following your pace :)

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Cédric Villemain
2011/1/18 Greg Smith <g...@2ndquadrant.com>:
 Bruce Momjian wrote:

 Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?


 The idea of having a dead period doing no work at all between write phase
 and sync phase may have some merit.  I don't have enough test data yet on
 some more fundamental issues in this area to comment on whether that smaller
 optimization would be valuable.  It may be a worthwhile concept to throw
 into the sequencing.

Are we able to have some pause without strict rules like 'stop for 30
sec'? (Case in point: my hardware is very good and I can write 400MB/sec
without interruption, XXX IOPS.)

I wonder whether we are going to have issues with RAID firmware + BBU
+ the Linux scheduler because we are adding 'unexpected' behavior in the
middle.

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Cédric Villemain
2011/1/18 Fujii Masao <masao.fu...@gmail.com>:
 On Mon, Jan 17, 2011 at 10:30 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
 Though I haven't seen the core part of the patch (i.e.,
 ReceiveTarFile, etc..) yet,
 here is the comments against others.

 Here are another comments:


 When I untar the tar file taken by pg_basebackup, I got the following
 messages:

    $ tar xf base.tar
    tar: Skipping to next header
    tar: Archive contains obsolescent base-64 headers
    tar: Error exit delayed from previous errors

 Is this a bug? This happens only when I create $PGDATA by using
 initdb -X (i.e., I relocated the pg_xlog directory elsewhere than
 $PGDATA).


 +               if (compresslevel > 0)
 +               {
 +                       snprintf(fn, sizeof(fn), "%s/%s.tar.gz", tardir,
 +                                PQgetvalue(res, rownum, 0));
 +                       ztarfile = gzopen(fn, "wb");

 Though I'm not familiar with zlib, isn't gzsetparams() required
 here?


 +#ifdef HAVE_LIBZ
 +       if (!tarfile && !ztarfile)
 +#else
 +       if (!tarfile)
 +#endif
 +       {
 +               fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
 +                               progname, fn, strerror(errno));

 Instead of strerror, get_gz_error seems required when using zlib.


 +       if (!res || PQresultStatus(res) != PGRES_COPY_OUT)

 The check for !res is not needed since PQresultStatus checks that.


 +               r = PQgetCopyData(conn, copybuf, 0);
 +               if (r == -1)

 Since -1 of PQgetCopyData might indicate an error, in this case,
 we would need to call PQgetResult?.


 ReceiveTarFile seems refactorable by using GZWRITE and GZCLOSE
 macros.


 +                                       fprintf(stderr, _("%s: could not write to file '%s': %m\n"),

 %m in fprintf is portable?


 Can't you change '%s' to \"%s\" for consistency?


 +       /*
 +        * Make sure we're unpacking into an empty directory
 +        */
 +       verify_dir_is_empty_or_create(current_path);

 Can pg_basebackup take a backup of $PGDATA including a tablespace
 directory, without an error? The above code seems to prevent that


 +                               if (compresslevel <= 0)
 +                               {
 +                                       fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),

 It's better to check compresslevel > 9 here.


 +/*
 + * Print a progress report based on the global variables. If verbose output
 + * is disabled, also print the current file name.

 Typo: s/disabled/enabled


 I request new option which specifies whether pg_start_backup
 executes immediate checkpoint or not. Currently it always executes
 immediate one. But I'd like to run smoothed one in busy system.
 What's your opinion?


*if* it is possible, this is welcome; the checkpoint hit due to
pg_start_backup is visible even outside pg_basebackup.
(It syncs everything and then blasts the cache memory.)

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Anssi Kääriäinen

On 01/18/2011 12:11 PM, Anssi Kääriäinen wrote:

The only question here is should CREATE OR REPLACE be allowed. I just
realized this could present a new problem. If I am not mistaken, when
loading from dump, you suddenly get the extension's version back, not
the one you defined in CREATE OR REPLACE. If this is the case, this
should NOT be allowed. And by the same reasoning, ALTER FUNCTION
[anything] should not be allowed either. Or at least then the
function/(or any object for that matter) should be restored somehow from
the backup, not from the extension files.

I still haven't had the time to start pg_dump reviewing, so I haven't
verified if this is really a problem. But I suspect so...
Ok, verified at least for CREATE OR REPLACE, ALTER FUNCTION RENAME and 
ALTER FUNCTION SET search_path. You will get the extension's version back 
when restoring from a plain SQL dump, not the CORed function; the rename is 
lost, and the same goes for search_path. I suspect this is a problem for any 
object type supported in extensions. But unfortunately I do not have 
time to verify that.


One more problem with pg_dump. If you have CREATE EXTENSION in your 
extension's .sql file, you will have problems restoring. I know it is 
stupid to do so, but still CREATE EXTENSION inside CREATE EXTENSION 
should be disallowed, as it is possible you find out too late that this 
is a stupid thing to do. Also, the functions created in the recursive 
CREATE EXTENSION will be dumped, and the dependencies are not created 
correctly.


Unfortunately I have used up all the time I have for reviewing this 
patch. I can arrange some more time, maybe late this week, maybe a bit 
later. So, I do not have time to do the pg_dump part review in full 
detail right now. Still, I hope the work I have done is helpful.


Should I write up a post that contains all the current outstanding 
issues in one post, or is it enough to just link this thread in the 
CommitFest application?


 - Anssi



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Dimitri Fontaine
Anssi Kääriäinen <anssi.kaariai...@thl.fi> writes:
 The only question here is should CREATE OR REPLACE be allowed. I just

Yes.  Think ALTER EXTENSION UPGRADE, the next patch in the series
(already proposed for this CF too).

 realized this could present a new problem. If I am not mistaken, when
 loading from dump, you suddenly get the extension's version back, not the
 one you defined in CREATE OR REPLACE. If this is the case, this should NOT
 be allowed. And by the same reasoning, ALTER FUNCTION [anything] should not
 be allowed either. Or at least then the function/(or any object for that
 matter) should be restored somehow from the backup, not from the extension
 files.

Well ideally those will get into extension's upgrade scripts, not be
typed interactively by superusers.  But I don't think we should limit
the capability of superusers to quickly fix a packaging mistake…

 I still haven't had the time to start pg_dump reviewing, so I haven't
 verified if this is really a problem. But I suspect so...

Both a problem when badly used and a good thing to have sometime, as in
the upgrade scripts :)
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Dimitri Fontaine
Anssi Kääriäinen <anssi.kaariai...@thl.fi> writes:
 Ok, verified at least for CREATE OR REPLACE, ALTER FUNCTION RENAME and ALTER
 FUNCTION SET search_path. You will get the extensions version back when
 restoring from plain sql dump, not the CORed function, rename is lost and
 same for search_path. I suspect this is a problem for any object type
 supported in extensions. But unfortunately I do not have time to verify
 that.

Yes it's the same, if you edit an extension's object directly the edited
version is not in the dump.  There's provision to have those objects in
the dump but the function to make it so is currently restricted to only
be called from the extension's script.

  pg_extension_flag_dump() changes the dependency type between an
  extension and one of its objects, given by OID. The default behavior
  of an extension is to skip objects in pg_dump, but in some cases
  that's not what you want to achieve, as an extension author. If your
  extension provides user data (in a table, for example), then flag this
  table so that it's part of the backups, like so:

SELECT pg_extension_flag_dump('schema.tablename'::regclass);

Maybe we should open this function so that it's usable even outside of
the extension's script, but I'm not keen on doing so.

Again, editing the extension's objects out of the scripts is still
limited to superusers and not the only way to shoot yourself in the
foot…

 One more problem with pg_dump. If you have CREATE EXTENSION in you
 extensions .sql file, you will have problems restoring. I know it is stupid
 to do so, but still CREATE EXTENSION inside CREATE EXTENSION should be
 disallowed, as it is possible you find out too late that this is stupid
 thing to do. Also, the functions created in the recursive CREATE EXTENSION
 will be dumped, and the dependencies are not created correctly.

That will be handled later, it's called inter-extension dependencies.
We said we won't address that yet…

 Unfortunately I have used up all the time I have for reviewing this patch. I
 can arrange some more time, maybe late this week, maybe a bit later. So, I
 do not have time to do the pg_dump part review in full detail right
 now. Still, I hope the work I have done is helpful.

Very much so, thanks a lot for the time you already spent on it!

 Should I write up a post that contains all the current outstanding issues in
 one post, or is it enough to just link this thread in the CommitFest
 application?

I'd appreciate a list of yet-to-fix items.  What I have is the
search_path issue where CREATE EXTENSION foo; can leave it changed for
the current session, I intend to fix that later today.

Other than that, I have no further already agreed on code fix to make.
What's your list?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Greg Smith

Robert Haas wrote:

Idea #4: For ext3 filesystems that like to dump the entire buffer
cache instead of only the requested file, write a little daemon that
runs alongside of (and completely independently of) PostgreSQL.  Every
30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
closes the file, thus dumping the cache and preventing a ridiculous
growth in the amount of data to be sync'd at checkpoint time.
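
A minimal sketch of such a helper daemon, with an arbitrary file name and only token error handling; it illustrates the quoted idea and is not part of any patch:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char byte = 0;

    for (;;)
    {
        /* rewrite one byte and fsync it, forcing ext3 to flush dirty data
         * continuously instead of all at once at checkpoint time */
        int fd = open("/var/tmp/fsync-tickle", O_WRONLY | O_CREAT, 0600);

        if (fd >= 0)
        {
            byte ^= 1;
            if (write(fd, &byte, 1) != 1)
                perror("write");
            if (fsync(fd) != 0)
                perror("fsync");
            close(fd);
        }
        sleep(30);
    }
    return 0;
}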
  


Today's data suggests this problem has been resolved in the latest 
kernels.  I saw the giant flush/series of small flushes pattern quite 
easily on the CentOS5 system I last did heavy pgbench testing on.  The 
one I'm testing now has kernel 2.6.23 (Ubuntu 10.04), and it doesn't 
show it at all.


Here's what a bad checkpoint looks like on this system:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG:  checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG:  checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG:  checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG:  checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG:  checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG:  checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG:  checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG:  checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG:  checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG:  checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG:  checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG:  checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG:  checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG:  checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG:  checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG:  checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG:  checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG:  checkpoint sync: number=19 file=base/24746/36596.10 
time=9759.898 msec
DEBUG:  checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 
msec

DEBUG:  checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG:  checkpoint sync: number=22 file=base/24746/36596.11 
time=9153.437 msec

DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1057833 of relation base/24746/36596

[800 more of these]

DEBUG:  checkpoint sync: number=23 file=base/24746/36601.1 
time=48697.179 msec

DEBUG:  could not forward fsync request because request queue is full
DEBUG:  checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG:  checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG:  checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 20619 of relation base/24746/36601
DEBUG:  checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG:  checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 278967 of relation base/24746/36596
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1582400 of relation base/24746/36596
DEBUG:  checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG:  checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG:  checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG:  checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG:  checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG:  checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG:  checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG:  checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG:  checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log 
file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, 
total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s


Here the file that's been brutally delayed via backend contention is 
#23, after already seeing quite long delays on the earlier ones.  That 
I've never seen with earlier kernels running ext3.


This is good in that it makes it more likely a spread sync approach that 
works on XFS will also work on these newer kernels with ext4.  Then the 
only group we wouldn't be able to help if that works the ext3 

Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 03:14, Fujii Masao <masao.fu...@gmail.com> wrote:
 On Mon, Jan 17, 2011 at 10:50 PM, Magnus Hagander <mag...@hagander.net> wrote:
 +       printf(_("  -D, --pgdata=directory    receive base backup into directory\n"));
 +       printf(_("  -T, --tardir=directory    receive base backup into tar files\n"
 +                "                            stored in specified directory\n"));
 +       printf(_("  -Z, --compress=0-9        compress tar output\n"));
 +       printf(_("  -l, --label=label         set backup label\n"));
 +       printf(_("  -p, --progress            show progress information\n"));
 +       printf(_("  -v, --verbose             output verbose messages\n"));

 Can we list those options in alphabetical order as other tools do?

 Certainly. But it makes more sense to have -D and -T next to each
 other, I think - they'd end up way elsewhere. Perhaps we need a group
 that says "target"?

 I agree with you if we end up choosing -D and -T for a target rather
 than pg_dump-like options I proposed.

Yeah. What do others think between those two options? -D/-T followed
by directory, or -D dir and -F format?


 Requiring PQfinish() might be more reasonable since it will give you a
 log on the server if you don't, but I'm not convinced that's necessary
 either?

 At least it's required for each password-retry. Otherwise, previous
 connections remain during backup. You can see this problem easily

Oh yeah, I've put that one in my git branch already.

 by repeating the password input in pg_basebackup.

    LOG:  could not send data to client: Connection reset by peer
    LOG:  could not send data to client: Broken pipe
    FATAL:  base backup could not send data, aborting backup

 As you said, if PQfinish is not called at exit(1), the above messages
 would be output. Those messages look ugly and should be
 suppressed whenever we *can*. Also I believe other tools would
 do that.

Yeah, agreed. I'll add that, shouldn't be too hard.


 +       keywords[2] = "fallback_application_name";
 +       values[2] = "pg_basebackup";

 Using the progname variable seems better rather than the fixed word
 pg_basebackup.

 I don't think so - that turns into argv[0], which means that if you
 use the full path of the program (/usr/local/pgsql/bin/pg_basebackup
 for example), the entire path goes into fallback_application_name -
 not just the program name.

 No. get_progname extracts the actual name.

Hmm. I see it does. I wonder what I did to make that not work.

Then I agree with the change :-)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 10:49, Fujii Masao <masao.fu...@gmail.com> wrote:
 On Mon, Jan 17, 2011 at 10:30 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
 Though I haven't seen the core part of the patch (i.e.,
 ReceiveTarFile, etc..) yet,
 here is the comments against others.

 Here are another comments:

Thanks! These are all good and useful comments!


 When I untar the tar file taken by pg_basebackup, I got the following
 messages:

    $ tar xf base.tar
    tar: Skipping to next header
    tar: Archive contains obsolescent base-64 headers
    tar: Error exit delayed from previous errors

 Is this a bug? This happens only when I create $PGDATA by using
 initdb -X (i.e., I relocated the pg_xlog directory elsewhere than
 $PGDATA).

Interesting. What version of tar and what platform? I can't reproduce
that here...

It certainly is a bug, that should not happen.


 +               if (compresslevel > 0)
 +               {
 +                       snprintf(fn, sizeof(fn), "%s/%s.tar.gz", tardir,
 +                                PQgetvalue(res, rownum, 0));
 +                       ztarfile = gzopen(fn, "wb");

 Though I'm not familiar with zlib, isn't gzsetparams() required
 here?

Uh. It certainly is! I clearly forgot it there...


 +#ifdef HAVE_LIBZ
 +       if (!tarfile && !ztarfile)
 +#else
 +       if (!tarfile)
 +#endif
 +       {
 +               fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
 +                               progname, fn, strerror(errno));

 Instead of strerror, get_gz_error seems required when using zlib.

Indeed it is. I think it needs to be this:

#ifdef HAVE_LIBZ
    if (compresslevel > 0 && !ztarfile)
    {
        /* Compression is in use */
        fprintf(stderr, _("%s: could not create compressed file \"%s\": %s\n"),
                progname, fn, get_gz_error(ztarfile));
        exit(1);
    }
    else
#endif
    {
        /* Either no zlib support, or zlib support but compresslevel = 0 */
        if (!tarfile)
        {
            fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
                    progname, fn, strerror(errno));
            exit(1);
        }
    }


 +       if (!res || PQresultStatus(res) != PGRES_COPY_OUT)

 The check for !res is not needed since PQresultStatus checks that.

Hah. I still keep doing that from old habit. I know you've pointed
that out before, with libpqwalreceiver :-)


 +               r = PQgetCopyData(conn, copybuf, 0);
 +               if (r == -1)

 Since -1 of PQgetCopyData might indicate an error, in this case,
 we would need to call PQgetResult?.

Uh, -1 means end of data, no? -2 means error?
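
For reference, a hedged sketch of the libpq COPY OUT loop being discussed: -1 marks the end of the data, -2 an error, and even after -1 the final status still has to be collected with PQgetResult(); writing the bytes out is elided.

#include <stdio.h>
#include <libpq-fe.h>

static int
drain_copy_stream(PGconn *conn)
{
    PGresult *res;
    int       ok;

    for (;;)
    {
        char *copybuf = NULL;
        int   r = PQgetCopyData(conn, &copybuf, 0);

        if (r == -1)
            break;                      /* end of the COPY stream */
        if (r == -2)
        {
            fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
            return -1;
        }
        /* ... here the r bytes in copybuf would go to the tar file ... */
        PQfreemem(copybuf);
    }

    /* pick up the final result so we know whether the COPY really succeeded */
    res = PQgetResult(conn);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;
    PQclear(res);
    return ok;
}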


 ReceiveTarFile seems refactorable by using GZWRITE and GZCLOSE
 macros.

You mean the ones from pg_dump? I don't think so. We can't use
gzwrite() with compression level 0 on the tar output, because it will
still write a gz header. With pg_dump, that is ok because it's our
format, but with a .tar (without .gz) I don't think it is.

at least that's how I interpreted the function.


 +                                       fprintf(stderr, _("%s: could not write to file '%s': %m\n"),

 %m in fprintf is portable?

Hmm. I just assumed it was because we use it elsewhere, but I now see
we only really use it for ereport() stuff. Bottom line is, I don't
know - perhaps it needs to be changed to use strerror()?


 Can't you change '%s' to \"%s\" for consistency?

Yeah, absolutely. Clearly I was way inconsistent, and it got worse
with some copy/paste :-(

 +       /*
 +        * Make sure we're unpacking into an empty directory
 +        */
 +       verify_dir_is_empty_or_create(current_path);

 Can pg_basebackup take a backup of $PGDATA including a tablespace
 directory, without an error? The above code seems to prevent that

Uh, how do you mean it would prevent that? It requires that the
directory you're writing the tablespace to is empty or nonexistent,
but that shouldn't prevent a backup, no? It will prevent you from
overwriting things with your backup, but that's intentional - if you
don't need the old dir, just remove it.


 +                               if (compresslevel <= 0)
 +                               {
 +                                       fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),

 It's better to check compresslevel > 9 here.

Agreed (well, check for both of them of course)


 +/*
 + * Print a progress report based on the global variables. If verbose output
 + * is disabled, also print the current file name.

 Typo: s/disabled/enabled

Indeed.


 I request new option which specifies whether pg_start_backup
 executes immediate checkpoint or not. Currently it always executes
 immediate one. But I'd like to run smoothed one in busy system.
 What's your opinion?

Yeah that sounds like a good idea. Shouldn't be too hard to do (will
require a backend patch as well, of course). Should we use -f for
"fast"? Though 

Re: [HACKERS] auto-sizing wal_buffers

2011-01-18 Thread Greg Smith

Fujii Masao wrote:

+/* Minimum setting used for a lower bound on wal_buffers */
+#define XLOG_BUFFER_MIN 4

Why didn't you use XLOG_BUFFER_MIN instead of XLOGbuffersMin?
XLOG_BUFFER_MIN is not used anywhere for now.
  


That's a typo; will fix.


+   if (XLOGbuffers < (XLOGbuffersMin * 2))
+       XLOGbuffers = XLOGbuffersMin * 2;
+   }

Why is the minimum value 64kB only when wal_buffers is set to
-1? This seems confusing for users.
  


That's because the current default on older versions is 64kB.  Since the 
automatic selection is going to be the new default, I hope, I don't want 
it to be possible for it to pick a number smaller than the default of 
older versions.  So the automatic lower limit is 64kB, while the actual 
manually set lower limit remains 32kB, as before.
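
To make the two limits concrete, a hedged sketch of the auto-tuning rule as described in this thread (wal_buffers = -1 picks a fraction of shared_buffers, clamped below by 64kB -- twice the old 32kB floor -- and above by one 16MB WAL segment); the 1/32 ratio and the helper name are assumptions, not taken from the posted patch:

#define XLOG_BLCKSZ     8192                    /* default WAL block size */
#define XLOG_SEG_SIZE   (16 * 1024 * 1024)      /* default WAL segment size */
#define XLOGbuffersMin  4                       /* 32kB, the manual floor */

static int
auto_wal_buffers(int shared_buffers)            /* both values in 8kB pages */
{
    int xlogbuffers = shared_buffers / 32;      /* assumed ratio */

    if (xlogbuffers < XLOGbuffersMin * 2)       /* automatic floor: 64kB */
        xlogbuffers = XLOGbuffersMin * 2;
    if (xlogbuffers > XLOG_SEG_SIZE / XLOG_BLCKSZ)  /* cap: one WAL segment */
        xlogbuffers = XLOG_SEG_SIZE / XLOG_BLCKSZ;
    return xlogbuffers;
}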


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] auto-sizing wal_buffers

2011-01-18 Thread Greg Smith

Josh Berkus wrote:

I think we can be more specific on that last sentence; is there even any
*theoretical* benefit to settings above 16MB, the size of a WAL segment?
 Certainly there have been no test results to show any.
  


There was the set Marti just reminded about.  The old wording suggested 
big enough to fit a single transaction was big enough, and that led to 
many people being confused and setting this parameter way too low.  
Since it's possible I'm wrong about 16MB being the upper limit, I didn't 
want the wording to specifically rule out people testing that size to 
see what happens.  We believe  there's never any advantage due to the 
forced wal segment switch, but having test results to the contrary 
floating around keeps me from being too aggressive in how the wording 
there goes.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Warning compiling pg_dump (MinGW, Windows XP)

2011-01-18 Thread Andrew Dunstan



On 01/18/2011 04:40 AM, Pavel Golub wrote:


AD  We could add -Wno-format to the flags. That makes it shut up, but maybe
AD  we don't want to use such a sledgehammer.

I just want to understand one thing: are these warnings really
dangerous, or should I just ignore them?





As I pointed out previously, we have had these warnings for years and 
years. I am not aware of a single issue that has been reported as 
arising from them.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Anssi Kääriäinen

On 01/18/2011 01:03 PM, Dimitri Fontaine wrote:

I'd appreciate a list of yet-to-fix items.  What I have is the
search_path issue where CREATE EXTENSION foo; can leave it changed for
the current session, I intend to fix that later today.

Other than that, I have no further already agreed on code fix to make.
What's your list?
There is only documentation fixes, and I am not sure if even those are 
agreed to be necessary. It might be good if the documentation contained:


 - A warning that you need to have the files for your extensions 
readily available to be able to restore from a dump. This might be 
obvious, but better safe than sorry...
 - A warning that you will be restored to the extension's version if 
you ALTER or CREATE OR REPLACE a function.


From the current documentation, it is maybe too easy to miss these 
risks. I am seeing this from non-experienced user's angle, and thus see 
these as potential foot guns.


Other than that, I don't think there is anything more. I am a little 
nervous of restoring to extension's version of a function when the 
function has been CREATE OR REPLACEd, but that might be just me over 
thinking this. Also, from the previous posts, there is just the control 
file naming issue, and the issue of load order if two extensions contain 
similarly named and signatured functions. But these were agreed to be 
issues not needing any further work.


Now, I need help what to do next. Should I leave the status as Needs 
Review as the pg_dump part is almost completely non-reviewed? And then 
attach this thread as a comment? Or as a review?


 - Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Alvaro Herrera
Excerpts from Dimitri Fontaine's message of mar ene 18 07:01:55 -0300 2011:
 Anssi Kääriäinen anssi.kaariai...@thl.fi writes:

  It used to work this way with \i, obviously.  Should the extension patch
  care about that and how?  Do we want to RESET search_path or to manually
  manage it like we do for the client_min_messages and log_min_messages?
  It was unintuitive to me to have search_path changed by SQL command as a
  side effect. When using \i, not so much.

If the CREATE EXTENSION stuff all works in a transaction, perhaps it
would be easier if you used SET LOCAL.  At the end of the transaction it
would reset to the original value automatically.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Mon, Jan 17, 2011 at 16:27, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2011-01-17 at 16:20 +0100, Magnus Hagander wrote:
 On Mon, Jan 17, 2011 at 16:18, Robert Haas robertmh...@gmail.com wrote:
  On Mon, Jan 17, 2011 at 8:55 AM, Magnus Hagander mag...@hagander.net 
  wrote:
  Hmm. I don't like those names at all :(
 
  I agree.  I don't think your original names are bad, as long as
  they're well-documented.  I sympathize with Simon's desire to make it
  clear that these use the replication framework, but I really don't
  want the command names to be that long.

 Actually, after some IM chats, I think pg_streamrecv should be
 renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
 is a lot more specific than that)

 pg_stream_log
 pg_stream_backup

Those seem better.

Tom, would those solve your concerns about it being clear which side
they are on? Or do you think you'd still risk reading them as the
sending side?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 12:40, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Jan 18, 2011 at 10:49, Fujii Masao masao.fu...@gmail.com wrote:
 Yeah that sounds like a good idea. Shouldn't be too hard to do (will
 require a backend patch as well, of course). Should we use -f for
 fast? Though that may be an unfortunate overload of the usual usecase
 for -f, so maybe -c fast|slow for checkpoint fast/slow?

Was easy, done with -c fast|slow.
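
For illustration, a minimal sketch of the client-side handling; the helper
name, the flag, and the error wording are assumptions, and only the
-c fast|slow values come from this thread:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map -c fast|slow onto a fast-checkpoint flag. */
static bool
parse_checkpoint_mode(const char *arg, const char *progname)
{
	if (strcmp(arg, "fast") == 0)
		return true;
	if (strcmp(arg, "slow") == 0)
		return false;

	fprintf(stderr, "%s: invalid checkpoint argument \"%s\", must be \"fast\" or \"slow\"\n",
			progname, arg);
	exit(1);
}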


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-18 Thread Dimitri Fontaine
Anssi Kääriäinen anssi.kaariai...@thl.fi writes:

 On 01/18/2011 01:03 PM, Dimitri Fontaine wrote:
 I'd appreciate a list of yet-to-fix items.  What I have is the
 search_path issue where CREATE EXTENSION foo; can leave it changed for
 the current session, I intend to fix that later today.

After some reading of backend/catalog/namespace.c and some testing, my
take would be to have extension authors use the following form when that
is necessary:

  SET LOCAL search_path TO public;

Now, none of the contribs use that form (they all are relocatable except
for adminpack which forces its objects to get installed in pg_catalog),
so I've only updated the documentation.

 There is only documentation fixes, and I am not sure if even those are
 agreed to be necessary. It might be good if the documentation contained:

  - A warning that you need to have the files for your extensions readily
 available to be able to restore from a dump. This might be obvious, but
 better safe than sorry...

Added a § about that in the docs.

  - A warning that you will be restored to the extension's version if you
 ALTER or CREATE OR REPLACE a function.

Well I don't want to encourage people to manually edit any extension
objects directly, so I've added a note about not doing that.

 Other than that, I don't think there is anything more. I am a little nervous
 of restoring to extension's version of a function when the function has been
 CREATE OR REPLACEd, but that might be just me over thinking this. Also, from

Well with the next patch in the series, ALTER EXTENSION UPGRADE, you
will have a way to edit extension's objects and have the new version
installed next time, but there's no magic bullet here, it means work to
do by the extension's author (preparing upgrade scripts and new
version's install-from-scratch scripts).

 the previous posts, there is just the control file naming issue, and the
 issue of load order if two extensions contain similarly named and signatured
 functions. But these were agreed to be issues not needing any further work.

Yes, extension dependencies and conflicts are not meant to be covered yet.

 Now, I need help what to do next. Should I leave the status as Needs Review
 as the pg_dump part is almost completely non-reviewed? And then attach this
 thread as a comment? Or as a review?

My guess would be to leave it as Needs Review and add this thread as a
review.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Alvaro Herrera
Excerpts from Magnus Hagander's message of mar ene 18 08:40:50 -0300 2011:
 On Tue, Jan 18, 2011 at 10:49, Fujii Masao masao.fu...@gmail.com wrote:

  +                                       fprintf(stderr, _("%s: could not write to file '%s': %m\n"),
 
  %m in fprintf is portable?
 
 Hmm. I just assumed it was because we use it elsewhere, but I now see
 we only really use it for ereport() stuff. Bottom line is, I don't
 know - perhaps it needs to be changed to use strerror()?

Some libc's (such as glibc) know about %m, others presumably don't (it's
a GNU extension, according to my manpage).  ereport does the expansion
by itself, see expand_fmt_string().  Probably just using strerror() is
the easiest.
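
For illustration, a sketch of the strerror() form; errno has to be read
right after the failing call, and the _() gettext wrapper is left out here
for brevity:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: report a write failure without relying on %m. */
static void
report_write_error(const char *progname, const char *filename)
{
	fprintf(stderr, "%s: could not write to file \"%s\": %s\n",
			progname, filename, strerror(errno));
}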

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Alvaro Herrera
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
 
 Hello Postgres Hackers,
 
 In reference to this todo item about clustering system table indexes, 
   
 ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) 
 I have been studying the system tables to see which would benefit  from 
 clustering.  I have some index suggestions and a question if you have a 
 moment.

Wow, this is really old stuff.  I don't know if this is really of any
benefit, given that these catalogs are loaded into syscaches anyway.
Furthermore, if you cluster at initdb time, they will soon lose the
ordering, given that updates move tuples around and inserts put them
anywhere.  So you'd need the catalogs to be re-clustered once in a
while, and I don't see how you'd do that (except by asking the user to
do it, which doesn't sound so great).

I think you need some more discussion on the operational details before
engaging in the bootstrap bison stuff (unless you just want to play with
Bison for educational purposes, of course, which is always a good thing
to do).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Alvaro Herrera
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
 
 Hello Postgres Hackers,

BTW whatever you do, don't start a new thread by replying to an existing
message and just changing the subject line.  It will mess up the
threading for some readers, and some might not even see your message.
Compose a fresh message instead.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Merlin Moncure
On Tue, Jan 18, 2011 at 3:47 AM, Jim Nasby j...@nasby.net wrote:
 On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
 Robert Haas  wrote:

 a quick-and-dirty attempt to limit the amount of I/O caused by hint
 bits. I'm still very interested in knowing what people think about
 that.

 I found the elimination of the response-time spike promising.  I
 don't think I've seen enough data yet to feel comfortable endorsing
 it, though.  I guess the question in my head is: how much of the
 lingering performance hit was due to having to go to clog and how
 much was due to competition with the deferred writes?  If much of it
 is due to repeated recalculation of visibility based on clog info, I
 think there would need to be some way to limit how many times that
 happened before the hint bits were saved.

 What if we sped up the case where hint bits aren't set? Has anyone collected 
 data on the actual pain points of checking visibility when hint bits aren't 
 set? How about when setting hint bits is intentionally delayed? I wish we had 
 some more infrastructure around the XIDCACHE counters; having that info 
 available for people's general workloads might be extremely valuable. Even if 
 I was to compile with it turned on, it seems the only way to get at it is via 
 stderr, which is very hard to deal with.

 Lacking performance data (and for my own education), I've spent the past few 
 hours studying HeapTupleSatisfiesNow(). If I'm understanding it correctly, 
 the three critical functions from a performance standpoint are 
 TransactionIdIsCurrentTransactionId, TransactionIdIsInProgress and 
 TransactionIdDidCommit. Note that all 3 can potentially be called twice; once 
 to check xmin and once to check xmax.

hint bits give you two benefits: you don't have to lwlock the clog and
you don't have to go look them up.  a lookup is either a lru cache
lookup or an i/o lookup on the clog.  the cost of course is extra
writing out the bits.  in most workloads they are not even noticed but
in particular cases they are an i/o multiplier.
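
For illustration, a much-simplified sketch of the xmin half of that fast
path, in the style of tqual.c's internal code (it assumes the usual backend
headers and leaves out the in-progress and current-transaction cases listed
above):

static bool
xmin_known_committed(HeapTupleHeader tuple, Buffer buffer)
{
	/* hint already set: no clog lock, no clog lookup */
	if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
		return true;

	if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
	{
		/* cache the answer; writing this bit back is the extra i/o */
		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
					HeapTupleHeaderGetXmin(tuple));
		return true;
	}
	return false;
}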

a few weeks back I hacked an experimental patch that removed the hint
bit action completely.  the results were very premature and/or
incorrect, but my initial findings suggested that hint bits might not
be worth the cost from performance standpoint.  i'd like to see some
more investigation in this direction before going with a complex
application mechanism (although that would be beneficial vs the status
quo).

an ideal testing environment to compare would be a mature database
(full clog) with some verifiable performance tests and a mixed
olap/oltp workload.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread David Fetter
On Tue, Jan 18, 2011 at 09:14:41PM +1300, Mark Kirkwood wrote:
 On 18/01/11 18:04, Tom Lane wrote:
 David Fetterda...@fetter.org  writes:
 Who's the copyright holder(s)?  If it's all individual
 contributors, Red Hat policy is not in play.
 Sorry David, it was written on the company's dime.
 
 However, I doubt that Red Hat derives any value from this useful
 product being excluded from contrib by the choice of license - would
 they be receptive to the idea that it would be free marketing to
 have it in the main tarball/rpm/deb (etc) with merely a decision to
 change it GPL-BSD?

I'm guessing there's a Policy® at Red Hat that software made on its
dime be GPL (v2, I'd guess), and that getting an exception would
involve convening its board or similarly drastic action.

Is that about right?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Alvaro Herrera
Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:

 Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
 time for another patch, PFA :-)

Thanks ...  Message nitpick:
+   if (compresslevel > 0)
+   {
+   fprintf(stderr,
+           _("%s: this build does not support compression\n"),
+   progname);
+   exit(1);
+   }

pg_dump uses the following wording:

WARNING: archive is compressed, but this installation does not support 
compression -- no data will be available\n

So perhaps yours should s/build/installation/


Also, in messages of this kind,
+   if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
+   {
+       fprintf(stderr, _("%s: could not set compression level %i\n"),
+   progname, compresslevel);

Shouldn't you also be emitting the gzerror()? ... oh I see you're
already doing it for most gz calls.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 15:49, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:

 Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
 time for another patch, PFA :-)

 Thanks ...  Message nitpick:
 +   if (compresslevel > 0)
 +   {
 +       fprintf(stderr,
 +               _("%s: this build does not support compression\n"),
 +               progname);
 +       exit(1);
 +   }

 pg_dump uses the following wording:

 WARNING: archive is compressed, but this installation does not support 
 compression -- no data will be available\n

 So perhaps yours should s/build/installation/

That shows up when the *archive* is compressed, though? There are a
number of other cases that use build in the backend, such as:
src/backend/utils/misc/guc.c:  errmsg("assertion checking is not supported by this build")));
src/backend/utils/misc/guc.c:  errmsg("Bonjour is not supported by this build")));
src/backend/utils/misc/guc.c:  errmsg("SSL is not supported by this build")));


 Also, in messages of this kind,
 +               if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
 +               {
 +                   fprintf(stderr, _("%s: could not set compression level %i\n"),
 +                           progname, compresslevel);

 Shouldn't you also be emitting the gzerror()? ... oh I see you're
 already doing it for most gz calls.

It's not clear from the zlib documentation I have that gzerror() works
after a gzsetparams(). Do you have any docs that say differently by
any chance?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: FDW API

2011-01-18 Thread Shigeru HANADA
On Sun, 16 Jan 2011 01:55:19 +0100
Jan Urbański wulc...@wulczer.org wrote:
snip
 In general, the feature looks great and I hope it'll make it into 9.1.
 And it we'd get the possibility to write FDW handlers in other PLs than
 C, it would rock so hard...
 
 I'm going to mark this a Waiting for Author because of the typos, the
 BeginScan/EndScan issue, and the nested loop stopping halfway issue. The
 rest are suggestions or just thoughts, and if you don't see them as
 justified, I'll mark the next patch Ready for Committer.

Thanks a lot for the comments.  I've (hopefully) fixed issues above. 
Please find attached patches.

== patch list ==

1) 20110118-no_fdw_perm_check.patch - This patch was not included in
the last post.  It was first proposed on 2011-01-05, but maybe has
not been reviewed yet.  I re-propose this patch for SQL standard
conformance.  It removes the permission check that requires USAGE
on the foreign-data wrapper at CREATE FOREIGN TABLE.
Please see original post for details.
http://archives.postgresql.org/message-id/20110105145206.30fd.69899...@metrosystems.co.jp

2) 20110118-fdw_catalog_lookup.patch - This patch adds GetForeignTables. 
Fixed lack of pg_foreign_table.h inclusion.

3) 20110118-fdw_handler.patch - This patch adds support for HANDLER
option to FOREIGN DATA WRAPPER object.

4) 20110118-foreign_scan.patch - This patch adds ForeignScan executor
node and FDW API hooks based on Heikki's proposal.  As Itagaki-san
suggested on 2010-12-21, the FDW must generate information for EXPLAIN
VERBOSE on every PlanRelScan() call.  It's better to avoid such overhead,
so a new EXPLAIN hook would be needed.  I'll try to make it cleaner.



And I'll reply to the rest of your comments.

 We could use comments about how to return tuples from Iterate and how to
 finish returning them. I had to look at the example to figure out that
 you need ExecClearTuple(slot) in your last Iterate. If Iterate's
 interface were to change, we could just return NULL instead of a tuple
 to say that we're done.

I've added some comments for FDW developers to fdwapi.h, though they
probably aren't enough.
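
For illustration, a hypothetical Iterate skeleton following that
convention; the node type, slot location, and the fetch helper are
assumptions about the API, not code taken from the patch:

static TupleTableSlot *
example_Iterate(ForeignScanState *node)
{
	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

	/* example_fetch_row() is a hypothetical per-FDW helper */
	if (!example_fetch_row(node, slot))
		return ExecClearTuple(slot);	/* cleared slot signals end of scan */

	return slot;
}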

 We could be a bit more explicit about how to allocate objects, for
 instance if I'm allocating a FdwPlan in my PlanRelScan with a palloc,
 will it not go away too soon, or too late (IOW leak)?

For that example, the answer is no.  Objects are allocated in
MessageContext if you don't switch contexts, and are released when the query
has finished.  I agree that more documentation or comments for
FDW developers should be added.

 Maybe PlanNative should get the foreign table OID, not the server OID,
 to resemble PlanRelScan more. The server OID is just a syscache lookup
 away, anyway.

You would miss the case where multiple foreign tables are used in a
query.  The main purpose of PlanNative is to support pass-through
execution of a remote query.  In pass-through mode, you can use syntax
as if you had connected directly to the external server, so PostgreSQL's
parser can't be used.

 If you do EXPLAIN SELECT * FROM foregintable, BeginScan is not called,
 but EndScan is. That's weird, and I noticed because I got a segfault
 when EndScan tried to free things that BeginScan allocated. Maybe just
 don't call EndScan for EXPLAIN?

Fixed to not call EndScan if it was EXPLAIN execution.

Regards,
--
Shigeru Hanada


20110118-no_fdw_perm_check.patch.gz
Description: Binary data


20110118-fdw_catalog_lookup.patch.gz
Description: Binary data


20110118-fdw_handler.patch.gz
Description: Binary data


20110118-foreign_scan.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL/MED - file_fdw

2011-01-18 Thread Shigeru HANADA
On Tue, 18 Jan 2011 15:17:12 +0900
Itagaki Takahiro itagaki.takah...@gmail.com wrote:

 On Sat, Jan 15, 2011 at 08:35, Shigeru HANADA han...@metrosystems.co.jp 
 wrote:
  Interface of NextCopyFrom() is fixed to return HeapTuple, to support
  tableoid without any change to TupleTableSlot.
 
  3) 20110114-copy_export_HeapTupe.patch
  This patch fixes interface of NextCopyFrom() to return results as
  HeapTuple.
 
 I think file_fdw can return tuples in virtual tuples forms,
 and ForeignNext() calls ExecMaterializeSlot() to store tableoid.

Thanks for the comment.  I've fixed file_fdw to use tts_values and
tts_isnull to receive results from NextCopyFrom, and to store a virtual
tuple in the slot.
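
For illustration, a sketch of that flow; the helper name is made up and the
NextCopyFrom() signature shown is an assumption, since the copy-export
patch under discussion is what actually defines it:

static TupleTableSlot *
file_fdw_fetch_one(CopyState cstate, TupleTableSlot *slot)
{
	ExecClearTuple(slot);

	/* fill tts_values/tts_isnull directly; no heap tuple header is built */
	if (NextCopyFrom(cstate, slot->tts_values, slot->tts_isnull, NULL))
		ExecStoreVirtualTuple(slot);

	return slot;
}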

Attached patch requires FDW API patches and copy_export-20110114.patch.
Please see also:
http://archives.postgresql.org/message-id/20110119002615.8316.69899...@metrosystems.co.jp

Also, the cost estimation in file_fdw has been simplified along the lines
of your comments below.

On Tue, 21 Dec 2010 21:32:17 +0900
Itagaki Takahiro itagaki.takah...@gmail.com wrote:
 #3. Why do you re-open a foreign table in estimate_costs() ?
 Since the caller seems to have the options for them, you can
 pass them directly, no?
 
 In addition, passing a half-initialized fplan to estimate_costs()
 is a bad idea. If you think it is an OUT parameter, the OUT params
 should be *startup_cost and *total_cost.

In that message, you also pointed out that FDW must generate
explainInfo in every PlanRelScan call even if the planning is not for
EXPLAIN.  I'll try to defer generating explainInfo until EXPLAIN
VERBOSE really uses it.  It might need a new hook point in explain.c,
though.

Regards,
--
Shigeru Hanada


20110118-file_fdw.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 10:56, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jan 18, 2011 at 5:17 PM, Magnus Hagander mag...@hagander.net wrote:
 We should treat log_disconnections the same?

 We could keep it a boolean, but then only log disconnections for the
 cases that are mentioned in log_connections?

 It doesn't make sense to log disconnection for a connection we didn't
 log the connection for...

 Maybe true. But, at least for me, it's more intuitive to provide both as
 enum parameters.

Is there *any* usecase for setting them differently though? (other
than connections being something and disconnections being none?)
If not, aren't we just encouraging people to configure in a way that
makes no sense?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Tom Lane
David Fetter da...@fetter.org writes:
 I'm guessing there's a Policy® at Red Hat that software made on its
 dime be GPL (v2, I'd guess), and that getting an exception would
 involve convening its board or similarly drastic action.

It's company policy, and while it *might* be possible to get an
exception, the effort involved would far exceed the benefit we'd get out
of it.  Moreover, despite Mark's creative argument, I really doubt that
Red Hat would perceive any benefit to themselves in making an exception.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Is there *any* usecase for setting them differently though?

I can't believe we're still engaged in painting this bikeshed.  Let's
just control it off log_connections and have done.

BTW, what about log_disconnections --- does a walsender emit a message
according to that?  If not, why not?  If we go through with something
fancy on the connection side, are we going to invent the same extra
complexity for log_disconnections?  And if we do, what happens when
they're set inconsistently?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 On Tue, Jan 18, 2011 at 05:39, Tom Lane t...@sss.pgh.pa.us wrote:
 I haven't looked at this patch, but it seems to me that it would be
 reasonable to conclude A != B if the va_extsize values in the toast
 pointers don't agree.

 It's a very light-weight alternative of memcmp the byte data,
 but there is still the same issue -- we might have different
 compressed results if we use different algorithm for TOASTing.

Which makes it a lightweight waste of cycles.

 So, it would be better to apply the present patch as-is.

No, I don't think so.  Has any evidence been submitted that that part of
the patch is of benefit?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Actually, after some IM chats, I think pg_streamrecv should be
 renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
 is a lot more specific than that)

 pg_stream_log
 pg_stream_backup

 Those seem better.

 Tom, would those solve your concerns about it being clear which side
 they are on? Or do you think you'd still risk reading them as the
 sending side?

It's still totally unclear what they do.  How about pg_receive_log
etc?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's a very light-weight alternative of memcmp the byte data,
 but there is still the same issue -- we might have different
 compressed results if we use different algorithm for TOASTing.

 Which makes it a lightweight waste of cycles.

 So, it would be better to apply the present patch as-is.

 No, I don't think so.  Has any evidence been submitted that that part of
 the patch is of benefit?

I think you might be mixing up what's actually in the patch with
another idea that was proposed but isn't actually in the patch.  The
patch itself does nothing controversial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 2:15 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I also find it weird that incoming replication connections are logged by
 default. In the standby, we also log streaming replication successfully
 connected to primary, which serves much of the same debugging purpose.

Oh, good point.  All right, I withdraw my objection.  Let's just make
it all controlled by log_connections and go home.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:

 Hello Postgres Hackers,

 In reference to this todo item about clustering system table indexes,
 ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
 I have been studying the system tables to see which would benefit  from
 clustering.  I have some index suggestions and a question if you have a
 moment.

 Wow, this is really old stuff.  I don't know if this is really of any
 benefit, given that these catalogs are loaded into syscaches anyway.
 Furthermore, if you cluster at initdb time, they will soon lose the
 ordering, given that updates move tuples around and inserts put them
 anywhere.  So you'd need the catalogs to be re-clustered once in a
 while, and I don't see how you'd do that (except by asking the user to
 do it, which doesn't sound so great).

The idea of the TODO seems to have been to set the default clustering
to something reasonable.  That doesn't necessarily seem like a bad
idea even if we can't automatically maintain the cluster order, but
there's some question in my mind whether we'd get any measurable
benefit from the clustering.  Even on a database with a gigantic
number of tables, it seems likely that the relevant system catalogs
will stay fully cached and, as you point out, the system caches will
further blunt the impact of any work in this area.  I think the first
thing to do would be to try to come up with a reproducible test case
where clustering the tables improves performance.  If we can't, that
might mean it's time to remove this TODO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 No, I don't think so.  Has any evidence been submitted that that part of
 the patch is of benefit?

 I think you might be mixing up what's actually in the patch with
 another idea that was proposed but isn't actually in the patch.  The
 patch itself does nothing controversial.

Oh, I misread Itagaki-san's comment to imply that that *was* in the
patch.  Maybe I should go read it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync open_sync test

2011-01-18 Thread Bruce Momjian
Greg Smith wrote:
 Bruce Momjian wrote:
  Is there a value to this test_fsync test?
 
  Compare open_sync with different sizes:
  (This is designed to compare the cost of one large
  sync'ed write and two smaller sync'ed writes.)
  open_sync 16k write   242.563 ops/sec
  2 open_sync 8k writes 752.752 ops/sec
 
  It compares the cost of doing larger vs. two smaller open_sync writes.

 
 Might be some value for determining things like what the optimal WAL 
 block size to use is.  All these tests are kind of hard to use 
 effectively still, I'm not sure if it's time to start trimming tests yet 
 until we've made more progress on interpreting results first.

OK, thanks for the feedback.  I just wanted to make sure it wasn't a
stupid test.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 No, I don't think so.  Has any evidence been submitted that that part of
 the patch is of benefit?

 I think you might be mixing up what's actually in the patch with
 another idea that was proposed but isn't actually in the patch.  The
 patch itself does nothing controversial.

 Oh, I misread Itagaki-san's comment to imply that that *was* in the
 patch.  Maybe I should go read it.

Perhaps.  :-)

While you're at it you might commit it.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 10:53 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Fetter da...@fetter.org writes:
 I'm guessing there's a Policy® at Red Hat that software made on its
 dime be GPL (v2, I'd guess), and that getting an exception would
 involve convening its board or similarly drastic action.

 It's company policy, and while it *might* be possible to get an
 exception, the effort involved would far exceed the benefit we'd get out
 of it.  Moreover, despite Mark's creative argument, I really doubt that
 Red Hat would perceive any benefit to themselves in making an exception.

I'm not sure why they'd care, but it certainly doesn't seem worth
spending the amount of time arguing about it that we are.  David and
Mark are, of course, free to spend their time petitioning Red Hat for
relicensing if they are so inclined, but they aren't entitled to tell
you how to spend yours.  And even if they were, I would hope that
they'd want you to spend it committing patches rather than arguing
with your employer about relicensing of a utility that's freely
available anyway and of use to 0.1% of our user base.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast [REVIEW]

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Oh, I misread Itagaki-san's comment to imply that that *was* in the
 patch.  Maybe I should go read it.

 Perhaps.  :-)

 While you're at it you might commit it.  :-)

Yeah, as penance I'll take this one.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 17:31, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Actually, after some IM chats, I think pg_streamrecv should be
 renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
 is a lot more specific than that)

 pg_stream_log
 pg_stream_backup

 Those seem better.

 Tom, would those solve your concerns about it being clear which side
 they are on? Or do you think you'd still risk reading them as the
 sending side?

 It's still totally unclear what they do.  How about pg_receive_log
 etc?

I agree with whoever said using wal is better than log to be unambiguous.

So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
others? (it's easy to rename so far, so I'll keep plugging away under
the name pg_basebackup based on Fujii-san's comments until such a time
as we have a reasonable consensus :-)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Cédric Villemain
2011/1/18 Magnus Hagander mag...@hagander.net:
 On Tue, Jan 18, 2011 at 17:31, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Actually, after some IM chats, I think pg_streamrecv should be
 renamed, probably to pg_walstream (or pg_logstream, but pg_walstream
 is a lot more specific than that)

 pg_stream_log
 pg_stream_backup

 Those seem better.

 Tom, would those solve your concerns about it being clear which side
 they are on? Or do you think you'd still risk reading them as the
 sending side?

 It's still totally unclear what they do.  How about pg_receive_log
 etc?

 I agree with whomever said using wal is better than log to be unambiguous.

 So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
 others? (it's easy to rename so far, so I'll keep plugging away under
 the name pg_basebackup based on Fujii-sans comments until such a time
 as we have a reasonable consensus :-)

pg_receive_wal is good for me.
As for pg_receive_base_backup: in French, base is a shortcut for database, but here
we back up the whole cluster, so I would suggest
pg_receive_cluster(_backup ?)




 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 4:53 AM,  t...@fuzzy.cz wrote:
 So the most important question is how to intercept the new/updated rows,
 and where to store them. I think each backend should maintain it's own
 private list of new records and forward them only in case of commit. Does
 that sound reasonable?

At the risk of sounding demoralizing, nothing about this proposal
sounds very promising to me, and that sounds like a particularly bad
idea.  What happens if the transaction updates a billion records?  Or
even a million records?  Are you going to store all of those in
backend-local memory until commit time?  Or spool them in a
potentially-gigantic disk file somewhere?  That memory allocation - or
file - could grow to be larger than the size of the entire database in
the worst case.  And COMMIT could take an awfully long time if it has
to spool megabytes or gigabytes of data off to some other process.
And what happens if there's a crash after the COMMIT but before all
that data is sent?  The estimates become permanently wrong?

And are we doing all of this just to get a more accurate estimate of
ndistinct?  For the amount of effort that it will probably take to get
this working at all, you could probably implement index-only scans and
have enough bandwidth left over to tackle global temporary tables.
And unless I'm missing something, the chances that the performance
consequences will be tolerable are pretty much zero.  And it would
only benefit the tiny fraction of users for whom bad n_distinct
estimates cause bad plans, and then the even tinier percentage of
those who can't conveniently fix it by using the manual override that
we already have in place - which presumably means people who have
gigantic tables that are regularly rewritten with massive changes in
the data distribution that affect plan choice.  Is that more than the
empty set?

Maybe the idea here is that this wouldn't fix just ndistinct estimates
but would also help with multi-column statistics.  Even if that's the
case, I think it's almost certainly a dead end from a performance
standpoint.  Some kind of manual ANALYZE process that can be invoked
when needed to scan the entire table would be painful but maybe
acceptable for certain people with a problem they can't fix any other
way, but doing this automatically for everyone seems like a really bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Cédric Villemain
2011/1/18 Robert Haas robertmh...@gmail.com:
 On Tue, Jan 18, 2011 at 10:53 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Fetter da...@fetter.org writes:
 I'm guessing there's a Policy® at Red Hat that software made on its
 dime be GPL (v2, I'd guess), and that getting an exception would
 involve convening its board or similarly drastic action.

 It's company policy, and while it *might* be possible to get an
 exception, the effort involved would far exceed the benefit we'd get out
 of it.  Moreover, despite Mark's creative argument, I really doubt that
 Red Hat would perceive any benefit to themselves in making an exception.

 I'm not sure why they'd care, but it certainly doesn't seem worth
 spending the amount of time arguing about it that we are.  David and
 Mark are, of course, free to spend their time petitioning Red Hat for
 relicensing if they are so inclined, but they aren't entitled to tell
 you how to spend yours.  And even if they were, I would hope that
 they'd want you to spend it committing patches rather than arguing
 with your employer about relicensing of a utility that's freely
 available anyway and of use to 0.1% of our user base.


Still good; thanks Tom and RH for having pushed it nearer to the other PostgreSQL tools.



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 8:24 AM, Merlin Moncure wrote:
 a few weeks back I hacked an experimental patch that removed the hint
 bit action completely.  the results were very premature and/or
 incorrect, but my initial findings suggested that hint bits might not
 be worth the cost from performance standpoint.  i'd like to see some
 more investigation in this direction before going with a complex
 application mechanism (although that would be beneficial vs the status
 quo).

If you're not finding much benefit to hint bits, that's *very* interesting. 
Everything I outlined certainly looks like a pretty damn expensive code path; 
it's really surprising that hint bits don't help.

I think it would be very valuable to profile the cost of the different code 
paths involved in the HeapTupleSatisfies* functions, even if the workload is 
just pgBench.

 an ideal testing environment to compare would be a mature database
 (full clog) with some verifiable performance tests and a mixed
 olap/oltp workload.

We're working on setting such a framework up. Unfortunately it will only be 8.3 
to start, but we hope to be on 9.0 soon.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.

The FSM would take some effort, but I don't think vacuum would be that hard to 
deal with; you'd just have to free up the space in any referenced toast forks 
at the same time that you vacuumed the heap.

 - Dynamic forks would make it possible to do a column-store database, or at 
 least something approximating one.
 
 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

Possibly, but you'd be paying tuple overhead twice, which is what I was looking 
to avoid with forks.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.

 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.

 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.

How's that different from what vacuum does on a TOAST table now?

 - Dynamic forks would make it possible to do a column-store database, or at 
 least something approximating one.

 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.

What exactly do you mean by tuple overhead?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.

 How's that different from what vacuum does on a TOAST table now?

Even more to the point: Jim hasn't provided one single reason to suppose
that this would be better-performing than the existing approach.  It
looks to me like a large amount of work, and loss of on-disk
compatibility, for nothing at all except the sake of change.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.
 
 How's that different from what vacuum does on a TOAST table now?

TOAST vacuum is currently an entirely separate vacuum. It might run at the same 
time as the main table vacuum, but it still has all the work that would be 
associated with vacuuming a table with the definition of a toast table. In 
fact, at one point vacuuming toast took two passes: the first deleted the toast 
rows that were no longer needed, then you had to go back and vacuum those 
deleted rows.

 - Dynamic forks would make it possible to do a column-store database, or 
 at least something approximating one.
 
 I've been wondering whether we could do something like this by
 treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
 tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
 columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
 SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
 
 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.
 
 What exactly do you mean by tuple overhead?

typedef struct HeapTupleHeaderData. With only two tables it might not be that 
bad, depending on the fields. Beyond two tables it's almost certainly a loser.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
 On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby j...@nasby.net wrote:
 - Forks are very possibly a more efficient way to deal with TOAST than 
 having separate tables. There's a fair amount of overhead we pay for the 
 current setup.
 
 That seems like an interesting idea, but I actually don't see why it
 would be any more efficient, and it seems like you'd end up
 reinventing things like vacuum and free space map management.
 
 The FSM would take some effort, but I don't think vacuum would be that hard 
 to deal with; you'd just have to free up the space in any referenced toast 
 forks at the same time that you vacuumed the heap.
 
 How's that different from what vacuum does on a TOAST table now?
 
 Even more to the point: Jim hasn't provided one single reason to suppose
 that this would be better-performing than the existing approach.  It
 looks to me like a large amount of work, and loss of on-disk
 compatibility, for nothing at all except the sake of change.

Yes, I was only pointing out that there were possible uses for allowing a 
variable number of forks per relation if Tomas felt the need to go that 
direction. Changing toast would certainly require some very convincing 
arguments as to the benefits.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:47 AM, Jim Nasby j...@nasby.net wrote:
 On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
 Robert Haas  wrote:

 a quick-and-dirty attempt to limit the amount of I/O caused by hint
 bits. I'm still very interested in knowing what people think about
 that.

 I found the elimination of the response-time spike promising.  I
 don't think I've seen enough data yet to feel comfortable endorsing
 it, though.  I guess the question in my head is: how much of the
 lingering performance hit was due to having to go to clog and how
 much was due to competition with the deferred writes?  If much of it
 is due to repeated recalculation of visibility based on clog info, I
 think there would need to be some way to limit how many times that
 happened before the hint bits were saved.

 What if we sped up the case where hint bits aren't set? Has anyone collected 
 data on the actual pain points of checking visibility when hint bits aren't 
 set?

I think that's worth looking into, but I don't have any present plan
to actually do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure mmonc...@gmail.com wrote:
 a few weeks back I hacked an experimental patch that removed the hint
 bit action completely.  the results were very premature and/or
 incorrect, but my initial findings suggested that hint bits might not
 be worth the cost from performance standpoint.  i'd like to see some
 more investigation in this direction before going with a complex
 application mechanism (although that would be beneficial vs the status
 quo).

I think it's not very responsible to allege that hint bits aren't
providing a benefit without providing the patch that you used and the
tests that you ran.  This is a topic that needs careful analysis, and
I think that saying hint bits don't provide a benefit... maybe...
doesn't do anything but confuse the issue.  How about doing some tests
with the patch from my OP and posting the results?  If removing hint
bits entirely doesn't degrade performance, then surely the
less-drastic approach I've taken here ought to be OK too.  But in my
testing, it didn't look too good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby j...@nasby.net wrote:
 How's that different from what vacuum does on a TOAST table now?

 TOAST vacuum is currently an entirely separate vacuum. It might run at the 
 same time as the main table vacuum, but it still has all the work that would 
 be associated with vacuuming a table with the definition of a toast table. In 
 fact, at one point vacuuming toast took two passes: the first deleted the 
 toast rows that were no longer needed, then you had to go back and vacuum 
 those deleted rows.

I don't know whether it still works that way or not, but if it does,
fixing it could perhaps be done with far less drastic surgery than
what you're proposing here.

 Possibly, but you'd be paying tuple overhead twice, which is what I was 
 looking to avoid with forks.

 What exactly do you mean by tuple overhead?

 typedef struct HeapTupleHeaderData. With only two tables it might not be that 
 bad, depending on the fields. Beyond two tables it's almost certainly a loser.

A struct definition by itself doesn't cause overhead.  Are you talking
about storage on disk?  CPU usage for MVCC visibility testing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:03 PM, Magnus Hagander mag...@hagander.net wrote:
 So it'd be pg_receive_wal and pg_receive_base_backup then? Votes from
 others? (it's easy to rename so far, so I'll keep plugging away under
 the name pg_basebackup based on Fujii-sans comments until such a time
 as we have a reasonable consensus :-)

I like pg_receive_wal.  pg_receive_base_backup I would be inclined to
shorten to pg_basebackup or pg_streambackup, but I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Simone Aiken

On Jan 18, 2011, at 6:35 AM, Alvaro Herrera wrote:
 
 
 Wow, this is really old stuff.  I don't know if this is really of any
 benefit, given that these catalogs are loaded into syscaches anyway.


The benefit is educational primarily.  I was looking for a todo list item
that would expose me to the system tables.  Learning the data model
of a new system is always step 1 for me.  So that one was perfect as
it would have me study and consider each one to determine if there
was any benefit from clustering on its initial load into cache.  


 Furthermore, if you cluster at initdb time, they will soon lose the
 ordering, given that updates move tuples around and inserts put them
 anywhere.  So you'd need the catalogs to be re-clustered once in a
 while, and I don't see how you'd do that (except by asking the user to
 do it, which doesn't sound so great).


I did discover that last night.  I'm used to databases that keep up their
clustering.  One that falls apart over time is distinctly strange.  And the
way you guys do your re-clustering logic is overkill if just a few rows
are out of place.  On the upside, a call to mass re-clustering goes
and updates all the clustered indexes in the system and that includes
these tables.  Will have to study auto-vacuum as well to consider that.


  (unless you just want to play with
 Bison for educational purposes, of course, which is always a good thing
 to do).

Pretty much, yeah.  


- Simone Aiken







Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Simone Aiken


On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera alvhe...@commandprompt.com
wrote:
 Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:

 Hello Postgres Hackers,

 In reference to this todo item about clustering system table indexes, 
 ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) 

 Wow, this is really old stuff.  I don't know if this is really of any 

If we can't, that might mean it's time to remove this TODO.

When I'm learning a new system I like to first learn how to use it,
second learn its data model, third start seriously looking at the code.
So that Todo is ideal for my learning method.  

If there is something else that would also involve studying all the system
tables it would also be great.  For example, I noticed we have column 
level comments on the web but not in the database itself.  This seems
silly.  Why not have the comments in the database and have the web
query the tables of template databases for the given versions?

That way \d+ pg_tablename would provide instant gratification for users.
And we all like our gratification to be instant.  They could be worked into
the .h files as inserts into pg_description, though they wouldn't provide an
excuse to learn bison.

I'm open to other suggestions as well.

-Simone Aiken



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Alvaro Herrera
Excerpts from Magnus Hagander's message of mar ene 18 11:53:55 -0300 2011:
 On Tue, Jan 18, 2011 at 15:49, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Magnus Hagander's message of mar ene 18 10:47:03 -0300 2011:
 
  Ok, thanks for clarifying. I've updated to use strerror(). Guess it's
  time for another patch, PFA :-)
 
  Thanks ...  Message nitpick:
  +   if (compresslevel > 0)
  +   {
  +       fprintf(stderr,
  +               _("%s: this build does not support compression\n"),
  +               progname);
  +       exit(1);
  +   }
 
  pg_dump uses the following wording:
 
  WARNING: archive is compressed, but this installation does not support 
  compression -- no data will be available\n
 
  So perhaps yours should s/build/installation/
 
 That shows up when the *archive* is compressed, though? There are a
 number of other cases that use "build" in the backend, such as:
 src/backend/utils/misc/guc.c:   errmsg("assertion checking is not supported by this build")));
 src/backend/utils/misc/guc.c:   errmsg("Bonjour is not supported by this build")));
 src/backend/utils/misc/guc.c:   errmsg("SSL is not supported by this build")));

Hmm.  I think I'd s/build/installation/ on all those messages for
consistency.

  Also, in messages of this kind,
  +               if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
  +               {
  +                   fprintf(stderr, _("%s: could not set compression level %i\n"),
  +                           progname, compresslevel);
 
  Shouldn't you also be emitting the gzerror()? ... oh I see you're
  already doing it for most gz calls.
 
 It's not clear from the zlib documentation I have that gzerror() works
 after a gzsetparams(). Do you have any docs that say differently by
 any chance?

Ah, no.  I was reading zlib.h, which is ambiguous as you say:

ZEXTERN int ZEXPORT gzsetparams OF((gzFile file, int level, int strategy));
/*
 Dynamically update the compression level or strategy. See the description
   of deflateInit2 for the meaning of these parameters.
 gzsetparams returns Z_OK if success, or Z_STREAM_ERROR if the file was not
   opened for writing.
*/

ZEXTERN const char * ZEXPORT gzerror OF((gzFile file, int *errnum));
/*
 Returns the error message for the last error which occurred on the
   given compressed file. errnum is set to zlib error number. If an
   error occurred in the file system and not in the compression library,
   errnum is set to Z_ERRNO and the application may consult errno
   to get the exact error code.


... but a quick look at the code says that it sets gz_stream-z_err
which is what gzerror returns:

int ZEXPORT gzsetparams (file, level, strategy)
    gzFile file;
    int level;
    int strategy;
{
    gz_stream *s = (gz_stream*)file;

    if (s == NULL || s->mode != 'w') return Z_STREAM_ERROR;

    /* Make room to allow flushing */
    if (s->stream.avail_out == 0) {

        s->stream.next_out = s->outbuf;
        if (fwrite(s->outbuf, 1, Z_BUFSIZE, s->file) != Z_BUFSIZE) {
            s->z_err = Z_ERRNO;
        }
        s->stream.avail_out = Z_BUFSIZE;
    }

    return deflateParams (&(s->stream), level, strategy);
}
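
For what it's worth, a hedged usage sketch based on that reading (illustrative
only; it assumes the gzsetparams/gzerror behavior shown above and reuses the
patch's ztarfile, compresslevel and progname):

    /* Sketch: report gzerror() output if gzsetparams() fails. */
    if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
    {
        int         gzerrnum;
        const char *gzmsg = gzerror(ztarfile, &gzerrnum);

        if (gzerrnum == Z_ERRNO)
            gzmsg = strerror(errno);    /* error came from the OS, not zlib */

        fprintf(stderr, _("%s: could not set compression level %d: %s\n"),
                progname, compresslevel, gzmsg);
        exit(1);
    }
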

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid

2011-01-18 Thread Greg Smith

David Fetter wrote:

I think I haven't communicated clearly what I'm suggesting, which is
that we ship with both an UPSERT and a MERGE, the former being ugly,
crude and simple, and the latter festooned with dire warnings about
isolation levels and locking.
  


I don't know that I completely agree with the idea that the UPSERT 
version should always be crude and the MERGE one necessarily heavy from 
a performance perspective.  However, I do feel you raise a legitimate 
point that once the hard stuff is done, and the locking issues around 
SQL proper MERGE sorted, having an UPSERT/REPLACE synonym that maps to 
it under the hood may be a useful way to provide a better user 
experience.  The way I see this, the right goal is to first provide the 
full spec behavior with good performance, and get all that plumbing 
right.  There's nothing that says we can't provide an easier syntax than 
the admittedly complicated MERGE one to the users though.  (I am tempted 
to make a joke about how you could probably ...)


So, as for this patch...there's about half a dozen significant open 
issues with the current code, along with a smattering of smaller ones.  
The problems remaining are deep enough that I think it would be 
challenging to work through them for this CF under the best conditions.  
And given that we haven't heard from Boxuan since early December, we're 
definitely understaffed to tackle major revisions.


My hope was that we'd get an updated patch from him before the CF 
deadline.  Even without it, maybe we'd get some more internal help here.  
Given my focus on checkpoint issues, Simon on Sync Rep, and Dimitri 
still on his Extensions push, that second part isn't going to happen.


I am marking MERGE officially Returned With Feedback for 9.1.  Lots of 
progress made, much better community understanding of the unresolved 
issues now than when we started, but not in shape to go into this 
release.  Since we still have some significant interest here in getting 
this finished, I'll see if I can put together a better game-plan for 
how to get this actually done for 9.2 in time to discuss at release 
planning time.  The main thing that's become much more obvious to me 
just recently is how the remaining issues left here relate to the true 
serialization work, so worrying about that first is probably the right 
order anyway.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think you may be confused about what the patch does - currently,
 pages with hint bit changes are considered dirty, period.
 Therefore, they are written whenever any other dirty page would be
 written: by the background writer cleaning scan, at checkpoints,
 and when a backend must write a dirty buffer before reallocating it
 to hold a different page. The patch keeps the first of these and
 changes the second two

While I was trying to performance-test the texteq patch, it occurred to
me that this proposed hint-bit change has got a serious drawback.  To
wit, that it will totally destroy reproducibility of any performance
test that involves table scans.  Right now, you know that you can take
hint bits out of the equation by doing a vacuum analyze and checkpoint;
after that, all hint bits in the table are known to be set and written
to disk.  Then you can get on with comparing the effects of some patch
or other.  With the proposed patch, it will never be clear whether
all the hint bits are set, because the patch specifically removes the
deterministic ways to get a hint bit written out.  So you'll never be
very sure whether a performance difference you think you see is real,
or whether one case or the other got affected by extra clog lookups.
It's hard enough already to be sure about performance changes on the
order of 1%, but this will make it impossible.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: Add \dL to show languages

2011-01-18 Thread Andreas Karlsson
Hi Josh,

Nope, I do not have any better ideas than "DO Blocks?".

Everything looks good with the exception one bug now.

\dL foo
* QUERY **
SELECT l.lanname AS "Name",
   pg_catalog.pg_get_userbyid(l.lanowner) as "Owner",
   l.lanpltrusted AS "Trusted"
FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'

ORDER BY 1;
**

ERROR:  syntax error at or near "l"
LINE 4: FROM pg_catalog.pg_language lWHERE l.lanname ~ '^(foo)$'


I believe the fix is to move \n from before the WHERE clause to after
the FROM, and from before ORDER BY to after WHERE.
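
For illustration, the corrected construction might look roughly like this (a
sketch in the style of describe.c; buf and pattern are assumed names, not
necessarily what the patch uses):

    /* Keep the newline attached to the end of the FROM clause so the WHERE
     * produced by processSQLNamePattern() starts on its own line. */
    appendPQExpBuffer(&buf, "\nFROM pg_catalog.pg_language l\n");

    processSQLNamePattern(pset.db, &buf, pattern, false, false,
                          NULL, "l.lanname", NULL, NULL);

    appendPQExpBuffer(&buf, "ORDER BY 1;");
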

Fix this bug and I believe this patch is ready for a committer.

PS. You added some trailing whitespace after printACLColumn. A
recommendation if you want to avoid it is to either have a git commit
hook which checks for that and/or have colouring of git diffs so you can
see it marked in red. I use both. :)

Andreas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-18 Thread Simon Riggs
On Tue, 2011-01-18 at 18:03 +0100, Magnus Hagander wrote:

 So it'd be pg_receive_wal and pg_receive_base_backup then?

OK for me.

Maybe even pg_receive_wal_stream

Don't see any reason why command names can't be long. We have many
function names already that long.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] pgbench to the MAXINT

2011-01-18 Thread Greg Smith

Euler Taveira de Oliveira wrote:
(i) If we want to support a scale factor greater than 21474 we have 
to convert some columns to bigint; it will change the test. From the 
portability point of view it is a pity, but as we have never supported it I'm 
not too worried about it. Why? Because it will use bigint columns only 
if the scale factor is greater than 21474. Is it a problem? I don't 
think so because generally people compare tests with the same scale 
factor.


(ii) From the performance perspective, we need to test if the 
modifications don't impact performance. I didn't create another code 
path for the 64-bit modifications (it is too ugly) and I'm afraid some 
modifications may affect 32-bit performance. I'm not in a position to test 
it though because I don't have a big machine ATM. Greg, could you lead 
these tests?


(iii) I decided to copy scanint8() (called strtoint64 there) from 
the backend (Robert's suggestion [1]) because Tom pointed out that strtoll() 
has portability issues. I replaced atoi() with strtoint64() but didn't 
do any performance tests.


(i):  Completely agreed.

(ii):  There is no such thing as a big machine that is 32 bits now; 
anything that's 32-bit is a tiny system here in 2011.  What I can do is 
check for degradation on the only 32-bit system I have left here, my 
laptop.  I'll pick a sensitive test case and take a look.


(iii) This is an important thing to test, particularly given it has the 
potential to impact 64-bit results too.
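
As a back-of-the-envelope check on the 21474 ceiling from (i), since
pgbench_accounts holds scale * 100,000 rows keyed by an int4 aid (a throwaway
sketch, not part of the patch):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* pgbench_accounts has scale * 100,000 rows; aid is an int4 column */
        long long rows_per_scale = 100000;
        long long max_scale = INT_MAX / rows_per_scale;

        printf("largest scale whose aid still fits in int4: %lld\n", max_scale);
        /* prints 21474; anything above that needs bigint columns */
        return 0;
    }
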


Thanks for picking this up again and finishing the thing off.  I'll add 
this into my queue of performance tests to run and we can see if this is 
worth applying.  Probably take a little longer than the usual CF review 
time.  But as this doesn't interfere with other code people are working 
on and is sort of a bug fix, I don't think it will be a problem if it 
takes a little longer to get this done.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Performance bug in DO blocks

2011-01-18 Thread Tom Lane
I just noticed that if you execute the same DO command over and over
within a session, it gets slower and slower.  And if you keep it up
you'll notice the backend's RAM consumption bloating too.  The cause
appears to be that we leak the cached plans created for any SQL
statements or expressions within the DO command --- the next iteration
won't reuse those, but rather create its own set.  Probably ought to
look into releasing those when the DO block is over.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync open_sync test

2011-01-18 Thread Bruce Momjian
Greg Smith wrote:
 Bruce Momjian wrote:
  Is there a value to this test_fsync test?
 
  Compare open_sync with different sizes:
  (This is designed to compare the cost of one large
  sync'ed write and two smaller sync'ed writes.)
  open_sync 16k write   242.563 ops/sec
  2 open_sync 8k writes 752.752 ops/sec
 
  It compares the cost of doing larger vs. two smaller open_sync writes.

 
 Might be some value for determining things like what the optimal WAL 
 block size to use is.  All these tests are still kind of hard to use 
 effectively; I'm not sure it's time to start trimming tests 
 until we've made more progress on interpreting the results.

FYI, I beefed up the testing for this in test_fsync:

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16k
in different write open_sync sizes.)
 1 16k open_sync write              723.824 ops/sec
 2  8k open_sync writes             347.979 ops/sec
 4  4k open_sync writes             215.114 ops/sec
 8  2k open_sync writes             174.613 ops/sec
16  1k open_sync writes              87.496 ops/sec

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PgEast: 2011, 2nd call for papers

2011-01-18 Thread Joshua D. Drake
Hey folks,

PgEast is being held in NYC this year from 03/22-03-25. Get your papers
in, the deadline is soon!

http://www.postgresqlconference.org/

Joshua D. Drake

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Magnus Hagander
On Tue, Jan 18, 2011 at 17:33, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 18, 2011 at 2:15 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 I also find it weird that incoming replication connections are logged by
 default. In the standby, we also log streaming replication successfully
 connected to primary, which serves much of the same debugging purpose.

 Oh, good point.  All right, I withdraw my objection.  Let's just make
 it all controlled by log_connections and go home.

Done.

Oh, and the proper answer is yellow. Everybody knows that.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Josh Berkus

 To be frank, I really don't care about fixing this behavior on ext3,
 especially in the context of that sort of hack.  That filesystem is not
 the future, it's not possible to ever really make it work right, and
 every minute spent on pandering to its limitations would be better spent
 elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide
 some proper context for the worst-case behavior people can see right
 now, and to make sure refactoring here doesn't make things worse on it. 
 My target is same or slightly better on ext3, much better on XFS and ext4.

Please don't forget that we need to avoid performance regressions on
NTFS and ZFS as well.  They don't need to improve, but we can't let them
regress.  I think we can ignore BSD/UFS and Solaris/UFS, as well as
HFS+, though.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] texteq/byteaeq: avoid detoast

2011-01-18 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 texteq, textne, byteaeq and byteane detoast their arguments, then check for
 equality of length.  Unequal lengths imply the answer trivially; given equal
 lengths, the functions proceed to compare the actual bytes.  We can skip
 detoasting entirely when the lengths are unequal.  The attached patch 
 implements
 this.
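
(For illustration, a minimal sketch of that shortcut; not the submitted or
committed code:)

    /* Sketch: decide on length alone when possible; only detoast when the
     * lengths match and the bytes must actually be compared. */
    Datum
    texteq(PG_FUNCTION_ARGS)
    {
        Datum       arg1 = PG_GETARG_DATUM(0);
        Datum       arg2 = PG_GETARG_DATUM(1);
        bool        result;
        Size        len1 = toast_raw_datum_size(arg1);
        Size        len2 = toast_raw_datum_size(arg2);

        if (len1 != len2)
            result = false;         /* no need to detoast either argument */
        else
        {
            text       *targ1 = DatumGetTextPP(arg1);
            text       *targ2 = DatumGetTextPP(arg2);

            result = (memcmp(VARDATA_ANY(targ1), VARDATA_ANY(targ2),
                             len1 - VARHDRSZ) == 0);

            PG_FREE_IF_COPY(targ1, 0);
            PG_FREE_IF_COPY(targ2, 1);
        }

        PG_RETURN_BOOL(result);
    }
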

Applied with stylistic changes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 1:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think you may be confused about what the patch does - currently,
 pages with hint bit changes are considered dirty, period.
 Therefore, they are written whenever any other dirty page would be
 written: by the background writer cleaning scan, at checkpoints,
 and when a backend must write a dirty buffer before reallocating it
 to hold a different page. The patch keeps the first of these and
 changes the second two

 While I was trying to performance-test the texteq patch, it occurred to
 me that this proposed hint-bit change has got a serious drawback.  To
 wit, that it will totally destroy reproducibility of any performance
 test that involves table scans.  Right now, you know that you can take
 hint bits out of the equation by doing a vacuum analyze and checkpoint;
 after that, all hint bits in the table are known to be set and written
 to disk.  Then you can get on with comparing the effects of some patch
 or other.  With the proposed patch, it will never be clear whether
 all the hint bits are set, because the patch specifically removes the
 deterministic ways to get a hint bit written out.  So you'll never be
 very sure whether a performance difference you think you see is real,
 or whether one case or the other got affected by extra clog lookups.
 It's hard enough already to be sure about performance changes on the
 order of 1%, but this will make it impossible.

True.  You could perhaps fix that by adding a GUC, but that feels
awfully like making it the user's problem to fix our broken
implementation.  Maybe we could live with it if the GUC were only
something developers ever needed to use, but I expect different people
would have different ideas about the correct setting in production.

If I'm not failing to understand the situation, the problem with the
first sequential scan after a bulk load is that we're cycling through
a ring of buffers that all have hint-bit changes and therefore all
have to be written.  The first pass through the ring is OK, but after
that every new buffer we bring in requires evicting a buffer that we
first have to write.  Of course, with the patch, this bottleneck is
removed by skipping all those writes, but that now causes a second
problem: the pages only get written if the background writer happens
to notice them before the backend gets all the way around the ring,
and that's pretty hit-or-miss, so we basically dribble hint bits out
to disk here and there but the steady state never really converges to
all hint bits on disk.

Maybe we could work around this by making the algorithm a little more
sophisticated.  Instead of the rather unilateral policy "backends
don't write pages that are only dirty due to hint bit changes!" we
could have some more nuanced rules.  For example, we might decree that
a backend will maintain a counter of the number of non-dirty pages
it's allocated.  Once it's allocated 20 pages that are either clean or
dirty-only-for-hint-bits, it writes that (or the next)
dirty-only-for-hint-bits it encounters.  That way, the effort of hint
bit setting would be spread out over the first 20 table scans, and
after that you converge to steady state.  We could also possibly
special-case vacuum to always write dirty-only-for-hint bits pages, on
the theory that the work is going to have to be done at some point,
and we're better off doing it during a maintenance task than
elsewhere.
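
For illustration, a rough sketch of that counting rule (hint_skip_count,
HINT_WRITE_INTERVAL and the helper name are invented for the sketch, not a
concrete proposal for specific values):

    /*
     * Hypothetical sketch: each backend tracks how many clean or
     * hint-only-dirty buffers it has reallocated, and pays for one
     * hint-only write every HINT_WRITE_INTERVAL allocations.
     */
    #define HINT_WRITE_INTERVAL 20

    static int  hint_skip_count = 0;    /* per-backend counter (assumed) */

    static bool
    should_write_hint_only_buffer(void)
    {
        if (++hint_skip_count >= HINT_WRITE_INTERVAL)
        {
            hint_skip_count = 0;
            return true;        /* write this dirty-only-for-hint-bits page */
        }
        return false;           /* skip the write, keep only the memory copy */
    }
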

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication logging

2011-01-18 Thread Josh Berkus
All,

Just speaking as someone who does a lot of log-crunching for performance
review, I don't find having a separate log_connections option for
replication terribly useful.  It's easy enough for me to just log all
connections and filter for the type of connections I want.

Even in cases where there's a ton of connection activity (e.g. PHP
webapp without a connection pool) logging all connections still doesn't
generate that many MB of output, in my experience.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Sun, Jan 16, 2011 at 06:49:27PM +0100, Pavel Stehule wrote:
 I am sending a updated version with little bit more comments. But I am
 sure, so somebody with good English have to edit my comments.
 Minimally I hope, so your questions will be answered.

 Thanks.  I edited the comments and white space somewhat, as attached.  I'll go
 ahead and mark it Ready for Committer.

I looked at this patch and found it fairly awkward.  What is the point
of adding an additional flag to every variable, as opposed to just
forcibly detoasting during assignment?  If it's to avoid detoasting when
the variable is read in a way that doesn't require detoasting, it fails
rather completely IMO, since exec_eval_datum certainly doesn't know
that.

The added memory context variable seems redundant as well.
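
(For illustration, forcibly detoasting during assignment might look roughly
like this; the exact placement in exec_assign_value and the long-lived context
named func_cxt are assumptions, not the patch:)

    /*
     * Hypothetical sketch of "detoast during assignment": if the incoming
     * varlena value is toasted or compressed, expand it once here so later
     * reads never have to.  func_cxt stands in for whatever suitably
     * long-lived per-function memory context is appropriate.
     */
    if (!isnull && var->datatype->typlen == -1 &&
        VARATT_IS_EXTENDED(DatumGetPointer(newvalue)))
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(func_cxt);

        newvalue = PointerGetDatum(PG_DETOAST_DATUM(newvalue));
        MemoryContextSwitchTo(oldcxt);
    }
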

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Heikki Linnakangas

On 18.01.2011 21:16, Robert Haas wrote:

On Tue, Jan 18, 2011 at 1:32 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

While I was trying to performance-test the texteq patch, it occurred to
me that this proposed hint-bit change has got a serious drawback.  To
wit, that it will totally destroy reproducibility of any performance
test that involves table scans.  Right now, you know that you can take
hint bits out of the equation by doing a vacuum analyze and checkpoint;
after that, all hint bits in the table are known to be set and written
to disk.  Then you can get on with comparing the effects of some patch
or other.  With the proposed patch, it will never be clear whether
all the hint bits are set, because the patch specifically removes the
deterministic ways to get a hint bit written out.  So you'll never be
very sure whether a performance difference you think you see is real,
or whether one case or the other got affected by extra clog lookups.
It's hard enough already to be sure about performance changes on the
order of 1%, but this will make it impossible.


True.  You could perhaps fix that by adding a GUC, but that feels
awfully like making it the user's problem to fix our broken
implementation.  Maybe we could live with it if the GUC were only
something developers ever needed to use, but I expect different people
would have different ideas about the correct setting in production.


VACUUM (SET HINT BITS) table

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 
 There's a few remaining TODO comments in the code, which obviously
 need to be resolved one way or another
 
Not all of these are must haves for 9.1.  Here's how they break
down:
 
The two in predicate_internals.h mark places which would need to be
touched if we further modify the btree AM to do index-tuple level
predicate locking.  I know that Dan was eager to tackle that because
his group at MIT has predicate locking working at that granularity
for their transaction-aware caching which works with PostgreSQL.  At
this point, though, I'm very skeptical about starting that effort
for 9.1; it seems like something for 9.2.
 
There's one TODO related to whether we can drop the volatile
modifier from MySerializableXact.  I've started reviewing code
around this, but am not yet done.  It wouldn't be terrible to leave
it there and worry about this for 9.2, but it will be even better if
we can clear it up one way or the other now.
 
Three are marking the spots we want to touch if a patch becomes
available which lets us cancel the transaction on one connection
from another.  I'd love to take care of these, but we can't without
a separate patch that I know *I'm* not in a position to write for
9.1.  We may need to leave these for 9.2, as well.
 
One is about the desirability of taking some action (probably
reporting a warning) in the SLRU summarization code if we've burned
through over a billion transaction IDs while one read write
transaction has remained active.  That puts us close to where we'd
need to start canceling attempts to start a new transaction due to
resource issues.  Seems like we should issue that warning for 9.1. 
Not big or dangerous.  I'll do it.
 
One is related to the heuristics for promoting the granularity of
predicate locks.  We picked numbers out of the air which seemed
reasonable at the time.  I'm sure we can make this more
sophisticated, but what we have seems to be working OK.  I'm tempted
to suggest that we leave this alone until 9.2 unless someone has a
suggestion for a particular improvement.
 
One is related to the number of structures to allocate for detailed
conflict checking.  We've never seen a problem with this limit, at
least that has reached me.  On the other hand, it's certainly at
least theoretically possible to hit the limit and cause confusing
errors.  Perhaps we can put this one on a GUC and make sure the
error message is clear with a hint to think about adjusting the GUC?
The GUC would be named something like
max_conflicts_per_serializable_transaction (or something to that
general effect).  We should probably do that or bump up the limit.
If my back-of-the-envelope math is right, a carefully constructed
pessimal load could need up to (max_connections / 2)^2 -- so 100
connections could conceivably require 2500 structures, although such
a scenario would be hard to create.  Current picked from thin air
numbers would have space for 500.  The structure is six times the
pointer size, so 24 bytes each at 32 bit and 48 bytes each at 64
bit.
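 
For illustration, the error-plus-hint could look roughly like this (a sketch;
the GUC named here is only the proposal above, not an existing setting, and
the wording is invented):
 
    /* Hypothetical report if the conflict-tracking pool is exhausted */
    ereport(ERROR,
            (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
             errmsg("not enough elements to record a read/write conflict for this serializable transaction"),
             errhint("You might need to increase max_conflicts_per_serializable_transaction.")));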
 
Three TODOs have popped up since Heikki's post because I pushed
Dan's 2PC WIP to the repo -- it's at a point where behavior should
be right for cases where the server keeps running.  He's working on
the persistence aspect now, and there are TODOs in the new stubs
he's filling out.  Definitely 9.1.
 
Then there's one still lurking outside the new predicate* files, in
heapam.c.  It's about 475 lines into the heap_update function (25
lines from the bottom).  In reviewing where we needed to acquire
predicate locks, this struck me as a place where we might need to
duplicate the predicate lock from one tuple to another, but I wasn't
entirely sure.  I tried for a few hours one day to construct a
scenario which would show a serialization anomaly if I didn't do
this, and failed create one.  If someone who understands both the
patch and heapam.c wants to look at this and offer an opinion, I'd
be most grateful.  I'll take another look and see if I can get my
head around it better, but failing that, I'm disinclined to either
add locking I'm not sure we need or to remove the comment that says
we *might* need it there.
 
That's all of them.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:00 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 18.01.2011 21:16, Robert Haas wrote:

 On Tue, Jan 18, 2011 at 1:32 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

 While I was trying to performance-test the texteq patch, it occurred to
 me that this proposed hint-bit change has got a serious drawback.  To
 wit, that it will totally destroy reproducibility of any performance
 test that involves table scans.  Right now, you know that you can take
 hint bits out of the equation by doing a vacuum analyze and checkpoint;
 after that, all hint bits in the table are known to be set and written
 to disk.  Then you can get on with comparing the effects of some patch
 or other.  With the proposed patch, it will never be clear whether
 all the hint bits are set, because the patch specifically removes the
 deterministic ways to get a hint bit written out.  So you'll never be
 very sure whether a performance difference you think you see is real,
 or whether one case or the other got affected by extra clog lookups.
 It's hard enough already to be sure about performance changes on the
 order of 1%, but this will make it impossible.

 True.  You could perhaps fix that by adding a GUC, but that feels
 awfully like making it the user's problem to fix our broken
 implementation.  Maybe we could live with it if the GUC were only
 something developers ever needed to use, but I expect different people
 would have different ideas about the correct setting in production.

 VACUUM (SET HINT BITS) table

Something along those lines could work too, but I don't see much
problem with making VACUUM doing it unconditionally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 3:18 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 That's all of them.

Our existing code has plenty of TODOs in it already, so I see no
problem with continuing to comment places where future enhancements
are possible, as long as they don't reflect deficiencies that are
crippling in the present.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch version 12

2011-01-18 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 If my back-of-the-envelope math is right, a carefully constructed
 pessimal load could need up to (max_connections / 2)^2 -- so 100
 connections could conceivably require 2500 structures, although
 such a scenario would be hard to create.  Current picked from
 thin air numbers would have space for 500.
 
Er, actually, we would have space for 5000, because it's five times
the number of SERIALIZABLEXACT structures which is ten times
max_connections. I guess that would explain why I've never seen a
report of a problem.
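 
For illustration, the arithmetic again (a throwaway check, not code from the
patch; 100 connections is just the example above):
 
    #include <stdio.h>

    int
    main(void)
    {
        int max_connections = 100;
        int sxact_slots     = 10 * max_connections;   /* 1000 SERIALIZABLEXACTs */
        int conflict_slots  = 5 * sxact_slots;        /* 5000 available */
        int worst_case      = (max_connections / 2) * (max_connections / 2);  /* 2500 */

        printf("available %d, worst case %d\n", conflict_slots, worst_case);
        return 0;
    }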
 
Still, someone who creates a very large number of connections and
pounds them heavily with SERIALIZABLE transactions might conceivably
run into it.  Since that's something the docs explicitly warn you
*not* to do with serializable transactions, I'm not sure we need to
do more than make sure the error message and hint are good. 
Thoughts?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] test_fsync label adjustments

2011-01-18 Thread Bruce Momjian
I have modified test_fsync to use test labels that match wal_sync_method
values, and and added more tests for open_sync with different sizes. 
This should make the program easier for novices to understand.  Here is
a test run for Ubuntu 11.04:

$ ./test_fsync
2000 operations per test

Compare file sync methods using one 8k write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync (non-direct I/O)*     85.127 ops/sec
open_datasync (direct I/O)          87.119 ops/sec
fdatasync                           81.006 ops/sec
fsync                               82.621 ops/sec
fsync_writethrough                     n/a
open_sync (non-direct I/O)*         84.412 ops/sec
open_sync (direct I/O)              91.006 ops/sec
* This non-direct I/O mode is not used by Postgres.

Compare file sync methods using two 8k writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync (non-direct I/O)*     42.721 ops/sec
open_datasync (direct I/O)          45.296 ops/sec
fdatasync                           76.665 ops/sec
fsync                               78.361 ops/sec
fsync_writethrough                     n/a
open_sync (non-direct I/O)*         42.311 ops/sec
open_sync (direct I/O)              45.247 ops/sec
* This non-direct I/O mode is not used by Postgres.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16k
in different write open_sync sizes.)
 1 16k open_sync write              86.740 ops/sec
 2  8k open_sync writes             44.709 ops/sec
 4  4k open_sync writes             22.096 ops/sec
 8  2k open_sync writes             10.856 ops/sec
16  1k open_sync writes              5.434 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close                 86.802 ops/sec
write, close, fsync                 85.766 ops/sec

Non-sync'ed 8k writes:
write                               83.068 ops/sec



Applied patch attached.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index b1cec74..9c829ba 100644
*** /tmp/pgdiff.3331/yCQf2a_test_fsync.c	Tue Jan 18 15:06:39 2011
--- src/tools/fsync/test_fsync.c	Tue Jan 18 14:43:58 2011
*** void		test_open(void);
*** 47,52 
--- 47,53 
  void		test_non_sync(void);
  void		test_sync(int writes_per_op);
  void		test_open_syncs(void);
+ void		test_open_sync(const char *msg, int writes_size);
  void		test_file_descriptor_sync(void);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
  void		die(char *str);
*** main(int argc, char *argv[])
*** 61,68 
  
  	test_open();
  	
- 	test_non_sync();
- 	
  	/* Test using 1 8k write */
  	test_sync(1);
  
--- 62,67 
*** main(int argc, char *argv[])
*** 73,78 
--- 72,79 
  
  	test_file_descriptor_sync();
  	
+ 	test_non_sync();
+ 	
  	unlink(filename);
  
  	return 0;
*** handle_args(int argc, char *argv[])
*** 105,111 
  	}
  
  	while ((option = getopt_long(argc, argv, "f:o:",
  !  long_options, &optindex)) != -1)
  	{
  		switch (option)
  		{
--- 106,112 
  	}
  
  	while ((option = getopt_long(argc, argv, "f:o:",
  ! 			long_options, &optindex)) != -1)
  	{
  		switch (option)
  		{
*** handle_args(int argc, char *argv[])
*** 126,132 
  		}
  	}
  
  ! 	printf("%d operations per test\n\n", ops_per_test);
  }
  
  void
--- 127,133 
  		}
  	}
  
  ! 	printf("%d operations per test\n", ops_per_test);
  }
  
  void
*** test_open(void)
*** 162,201 
  }
  
  void
- test_non_sync(void)
- {
- 	int			tmpfile, ops;
- 
- 	/*
- 	 * Test a simple write without fsync
- 	 */
- 	printf("Simple non-sync'ed write:\n");
- 	printf(LABEL_FORMAT, "8k write");
- 	fflush(stdout);
- 
- 	gettimeofday(&start_t, NULL);
- 	for (ops = 0; ops < ops_per_test; ops++)
- 	{
- 		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
- 			die("Cannot open output file.");
- 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
- 			die("write failed");
- 		close(tmpfile);
- 	}
- 	gettimeofday(&stop_t, NULL);
- 	print_elapse(start_t, stop_t);
- }
- 
- void
  test_sync(int writes_per_op)
  {
  	int			tmpfile, ops, writes;
  	

Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Bruce Momjian
Robert Haas wrote:
 On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
 
  Hello Postgres Hackers,
 
  In reference to this todo item about clustering system table indexes,
  ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
  I have been studying the system tables to see which would benefit from
  clustering.  I have some index suggestions and a question if you have a
  moment.
 
  Wow, this is really old stuff.  I don't know if this is really of any
  benefit, given that these catalogs are loaded into syscaches anyway.
  Furthermore, if you cluster at initdb time, they will soon lose the
  ordering, given that updates move tuples around and inserts put them
  anywhere.  So you'd need the catalogs to be re-clustered once in a
  while, and I don't see how you'd do that (except by asking the user to
  do it, which doesn't sound so great).
 
 The idea of the TODO seems to have been to set the default clustering
 to something reasonable.  That doesn't necessarily seem like a bad
 idea even if we can't automatically maintain the cluster order, but
 there's some question in my mind whether we'd get any measurable
 benefit from the clustering.  Even on a database with a gigantic
 number of tables, it seems likely that the relevant system catalogs
 will stay fully cached and, as you point out, the system caches will
 further blunt the impact of any work in this area.  I think the first
 thing to do would be to try to come up with a reproducible test case
 where clustering the tables improves performance.  If we can't, that
 might mean it's time to remove this TODO.

I think CLUSTER is a win when you are looking up multiple rows in the
same table, either using a non-unique index or a range search.  What
places do such lookups?  Having them all in adjacent pages would be a
win --- single-row lookups are usually not.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 One of the reasons our client wants GIN for the integer[] column so bad is 
 because recreating the GiST integer[] index is quite painful. Before I duped 
 the table, I was just dropping and recreating the index on the original 
 table. It was great to create the GIN index; for 400K rows, it took 1300 ms. 
 After my initial round of testing, I dropped it and created the GiST index. 
 That ran for…well, *hours*. Four at least, maybe six or seven (I forgot 
 \timing and was letting it run on screen while I did some iOS hacking). I 
 think something might be really wrong with GiST index creation for integer 
 arrays, because the difference was just appalling. 

I poked into this a bit, although it's really off-topic for what I was
doing in GIN, and I agree that there is something fundamentally rotten
in the gist__int_ops logic.  oprofile shows that the code is spending
all its time in g_int_compress and g_int_union during index build:

samples  %        app name         symbol name
52747    40.2486  _int.so          g_int_compress
27092    20.6726  libc-2.12.2.so   _wordcopy_fwd_aligned
18938    14.4506  _int.so          compASC
18256    13.9302  postgres         pg_qsort
5741      4.3807  no-vmlinux       /no-vmlinux
3297      2.5158  postgres         swapfunc
864       0.6593  _int.so          g_int_decompress
826       0.6303  _int.so          _int_unique
700       0.5341  _int.so          inner_int_union
617       0.4708  postgres         med3
588       0.4487  libc-2.12.2.so   memset
371       0.2831  _int.so          g_int_same
143       0.1091  libc-2.12.2.so   memcpy
123       0.0939  postgres         ArrayGetNItems
67        0.0511  _int.so          inner_int_inter
48        0.0366  postgres         AllocSetAlloc
47        0.0359  libc-2.12.2.so   memmove
40        0.0305  postgres         MemoryContextAllocZero

The time in g_int_compress is all on this loop:

    while (len > MAXNUMRANGE * 2)
    {
        min = 0x7fffffff;
        for (i = 2; i < len; i += 2)
            if (min > (dr[i] - dr[i - 1]))
            {
                min = (dr[i] - dr[i - 1]);
                cand = i;
            }
        memmove((void *) &dr[cand - 1], (void *) &dr[cand + 1],
                (len - cand - 1) * sizeof(int32));
        len -= 2;
    }

which is not too surprising since that's obviously O(N^2).  The other
high-percentage functions are qsort and subsidiaries, and those calls
are coming from g_int_union.  That part could probably be sped up, since
most of the calls are unioning two inputs, which could be done a lot
cheaper than this (the inputs are always presorted no?).  But there is
something really fundamentally wrong in the logic, I think, because
while poking at it with gdb I saw it union-ing fifty-thousand-element
arrays.  Surely it should get lossy before it gets to that.  I'm also
wondering how it tells the difference between lossy and non-lossy keys
--- although it's hard to tell what such spectacularly uncommented code
is supposed to be doing, it looks like the lossy case is supposed to be
pairs of values representing ranges, in which case g_int_compress is
surely doing the Wrong Thing with already-lossy inputs, and union'ing
lossy inputs is an entirely insane operation as well.
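
For illustration, unioning two presorted, duplicate-free int32 arrays can be
done with a single linear merge rather than concatenate-and-qsort, roughly like
this (a sketch, not tied to the existing _int.so entry points):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Merge two sorted, duplicate-free int32 arrays into dst (which must
     * have room for na + nb elements).  Returns the number of elements
     * written.  O(na + nb), versus sorting the concatenation.
     */
    static size_t
    merge_sorted_unique(const int32_t *a, size_t na,
                        const int32_t *b, size_t nb,
                        int32_t *dst)
    {
        size_t      i = 0, j = 0, n = 0;

        while (i < na && j < nb)
        {
            if (a[i] < b[j])
                dst[n++] = a[i++];
            else if (a[i] > b[j])
                dst[n++] = b[j++];
            else                        /* equal: emit once */
            {
                dst[n++] = a[i++];
                j++;
            }
        }
        while (i < na)
            dst[n++] = a[i++];
        while (j < nb)
            dst[n++] = b[j++];

        return n;
    }
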

At the moment my opinion is that gist__int_ops is too broken to be
usable, and it's also too uncommented to be fixable by anyone other
than the original author.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-18 Thread Pavel Stehule
2011/1/18 Tom Lane t...@sss.pgh.pa.us:
 Noah Misch n...@leadboat.com writes:
 On Sun, Jan 16, 2011 at 06:49:27PM +0100, Pavel Stehule wrote:
 I am sending a updated version with little bit more comments. But I am
 sure, so somebody with good English have to edit my comments.
 Minimally I hope, so your questions will be answered.

 Thanks.  I edited the comments and white space somewhat, as attached.  I'll 
 go
 ahead and mark it Ready for Committer.

 I looked at this patch and found it fairly awkward.  What is the point
 of adding an additional flag to every variable, as opposed to just
 forcibly detoasting during assignment?  If it's to avoid detoasting when
 the variable is read in a way that doesn't require detoasting, it fails
 rather completely IMO, since exec_eval_datum certainly doesn't know
 that.

I am not sure about the overhead of needless detoasting. This is a safe
variant; I didn't want to decrease performance. Not sure if it's
necessary.

But detoasting on assignment isn't enough:

A very simple example - searching for a maximum:

for i in array_lower(a,1) .. array_upper(a,1)
loop
  if x  a[i] then
x = a[i];
  end if;
end loop;

In this loop the variable a is never modified, yet any access to it
means a detoast and decompress. So it is necessary to modify the
process: detoast not on assignment, but on usage.

Regards

Pavel Stehule


 The added memory context variable seems redundant as well.

I didn't find a pointer to the top execution context available from
the execution state. I am sure we have to switch to this context
explicitly.

Regards

Pavel Stehule


                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump moved to pgfoundry

2011-01-18 Thread Mark Kirkwood

On 19/01/11 05:51, Robert Haas wrote:


I'm not sure why they'd care, but it certainly doesn't seem worth
spending the amount of time arguing about it that we are.  David and
Mark are, of course, free to spend their time petitioning Red Hat for
relicensing if they are so inclined, but they aren't entitled to tell
you how to spend yours.  And even if they were, I would hope that
they'd want you to spend it committing patches rather than arguing
with your employer about relicensing of a utility that's freely
available anyway and of use to 0.1% of our user base.



Funny how people can read an email and derive a completely different 
meaning from what you intended - just to be clear: it certainly wasn't 
my intention to tell anyone how to spend their time.


Anyway, good to have pg_filedump on pgfoundry, thanks Tom and RH.

regards

Mark




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:16 PM, Simone Aiken sai...@ulfheim.net wrote:
 When I'm learning a new system I like to first learn how to use it,
 second learn its data model, third start seriously looking at the code.
 So that Todo is ideal for my learning method.

Sure - my point is just that we usually have as a criteria for any
performance related patch that it actually does improve performance.
So, we'd need a test case.

 If there is something else that would also involve studying all the system
 tables it would also be great.  For example, I noticed we have column
 level comments on the web but not in the database itself.  This seems
 silly.  Why not have the comments in the database and have the web
 query the tables of template databases for the given versions?

Uh... I don't know what this means.

 I'm open to other suggestions as well.

Here are a few TODO items that look relatively easy to me (they may
not actually be easy when you dig in, of course):

Clear table counters on TRUNCATE
Allow the clearing of cluster-level statistics
Allow ALTER TABLE ... ALTER CONSTRAINT ... RENAME
Allow ALTER TABLE to change constraint deferrability and actions

Unfortunately we don't have a lot of easy TODOs.  People keep doing
the ones we think up...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] test_fsync label adjustments

2011-01-18 Thread A.M.

On Jan 18, 2011, at 3:55 PM, Bruce Momjian wrote:

 I have modified test_fsync to use test labels that match wal_sync_method
 values, and and added more tests for open_sync with different sizes. 
 This should make the program easier for novices to understand.  Here is
 a test run for Ubuntu 11.04:
 
   $ ./test_fsync
   2000 operations per test
   
   Compare file sync methods using one 8k write:
   (in wal_sync_method preference order, except fdatasync
   is Linux's default)
   open_datasync (non-direct I/O)*     85.127 ops/sec
   open_datasync (direct I/O)          87.119 ops/sec
   fdatasync                           81.006 ops/sec
   fsync                               82.621 ops/sec
   fsync_writethrough                     n/a
   open_sync (non-direct I/O)*         84.412 ops/sec
   open_sync (direct I/O)              91.006 ops/sec
   * This non-direct I/O mode is not used by Postgres.

I am curious how this is targeted at novices. A naive user might enable the 
fastest option which could be exactly wrong. For this to be useful to 
novices, I suspect the tool will need to generate platform-specific 
suggestions, no?

Cheers,
M


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 These numbers are a bit crazy-making, but the upshot is that Gist is
 slow out of the gate, but with data cached, it's pretty speedy. With
 indexscan and bitmapscan disabled, these queries all took 300-400
 ms. So GIN was never better performing than a table scan.

I could not replicate that here at all --- GIN indexscans were
consistently better than seqscans for me, eg

regression=# set enable_bitmapscan TO 1;
SET
Time: 0.673 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.964..23.964 rows=1 loops=1)
   ->  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) (actual time=23.014..23.876 rows=772 loops=1)
         Recheck Cond: ((features @@ '1368799 & 1368800 & 1369043'::query_int) AND (deleted_at IS NULL) AND (status = 1))
         ->  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 width=0) (actual time=22.913..22.913 rows=772 loops=1)
               Index Cond: (features @@ '1368799 & 1368800 & 1369043'::query_int)
 Total runtime: 24.040 ms
(6 rows)

Time: 24.968 ms
regression=# set enable_bitmapscan TO 0;
SET
Time: 0.458 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(1368799&1368800&1369043)'::query_int
AND deleted_at IS NULL AND status = 1;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual time=145.121..145.121 rows=1 loops=1)
   ->  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual time=0.025..144.982 rows=772 loops=1)
         Filter: ((deleted_at IS NULL) AND (features @@ '1368799 & 1368800 & 1369043'::query_int) AND (status = 1))
 Total runtime: 145.177 ms
(4 rows)

Time: 146.228 ms

I'm noticing also that I get different rowcounts than you do, although
possibly that has something to do with the partial-index conditions,
which I'm not trying to duplicate here (all rows in my table pass those
two tests).
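
(For the record, I'm guessing your partial index looks more or less like
the following -- the predicate is inferred from the WHERE clause above, and
gin__int_ops is the intarray opclass needed for @@ to use the index:

    CREATE INDEX idx_gin_features ON listings
        USING gin (features gin__int_ops)
        WHERE deleted_at IS NULL AND status = 1;

If the predicate is something else entirely, that could explain part of the
difference too.)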

 * Why does it take 3-4x longer to create the GIN than the GiST index
 on tsvector?

Perhaps more maintenance_work_mem would help with that; although the
fine manual says specifically that GIN text search indexes take about
three times longer to build than equivalent GiST indexes, so maybe that
behavior is as designed.
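
Worth a quick experiment, anyway -- something like this, where the value,
index name, and tsvector column name are just placeholders:

    SET maintenance_work_mem = '512MB';   -- session-local, only for the build
    CREATE INDEX listings_tsv_gin ON listings USING gin (tsv);
    RESET maintenance_work_mem;

and see whether the build-time ratio against GiST changes much.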

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch: fix performance problems with repeated decompression of varlena values in plpgsql

2011-01-18 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 2011/1/18 Tom Lane t...@sss.pgh.pa.us:
 I looked at this patch and found it fairly awkward.  What is the point
 of adding an additional flag to every variable, as opposed to just
 forcibly detoasting during assignment?

 But detoasting on assignment isn't enough:

 for i in array_lower(a,1) .. array_upper(a,1)
 loop
   if x < a[i] then
     x = a[i];
   end if;
 end loop;

 in this loop the variable a wasn't modified. Any access to this
 variable means a detoast and decompress.

How so?  In what I'm envisioning, a would have been decompressed when it
was originally assigned to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:

 At the moment my opinion is that gist__int_ops is too broken to be
 usable, and it's also too uncommented to be fixable by anyone other
 than the original author.

That seems to jibe with your comments from last year:

  http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Best,

David

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >