Re: [HACKERS] Odd out of memory problem.

2012-03-30 Thread Hitoshi Harada
On Thu, Mar 29, 2012 at 7:38 PM, Peter Eisentraut  wrote:
> On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote:
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function but I think all or nearly all
>> the builtin aggregates can. Certainly min,max, count, sum, avg,
>> stddev, array_agg can which are most of what people do. That would be
>> a function which can take two state variables and produce a new state
>> variable.
>
> This information could also be useful to have in PL/Proxy (or similar
> FDWs) to be able to integrate aggregate computation into the language.
> Currently, you always have to do the state merging yourself.
>

I don't know exactly how PL/Proxy or pgpool accomplish multi-phase
aggregation, but in theory the proposal above is a state-merge function,
so it doesn't apply to aggregate results that have already passed through
the final function.  Of course, for aggregates that have no final
function, it is fine to call the state-merge function on the results.
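For illustration, here is a minimal, self-contained C sketch (not
PostgreSQL code; the struct and function names are made up) of what a
state-merge step could look like for avg(), whose transition state is a
(sum, count) pair:

#include <stdio.h>

/* Hypothetical transition state for avg(): a running (sum, count) pair. */
typedef struct AvgState
{
    double sum;
    long   count;
} AvgState;

/* A state-merge ("combine") step: fold two partial states into one.
 * Each node computes its own AvgState; merging just adds the components,
 * and the final function (sum / count) runs once on the merged state. */
static AvgState
avg_state_merge(AvgState a, AvgState b)
{
    AvgState r;

    r.sum = a.sum + b.sum;
    r.count = a.count + b.count;
    return r;
}

int
main(void)
{
    AvgState n1 = {10.0, 4};    /* partial result from node 1 */
    AvgState n2 = {26.0, 8};    /* partial result from node 2 */
    AvgState m = avg_state_merge(n1, n2);

    printf("avg = %f\n", m.sum / m.count);      /* 36 / 12 = 3.0 */
    return 0;
}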


Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] measuring lwlock-related latency spikes

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas  wrote:
> Actually, what is really bugging me is that I cannot find any way of
> getting a profile that reflects the *time* spent waiting rather than
> merely the *number* of waits.  That seems like an obvious thing to
> want, and I cannot find a way to get it.

I ended up taking a stab at writing some userspace code for this - see
attached.  It measures the amount of time taken for each contended
lwlock acquisition and prints a crude histogram of the results at
postmaster-shutdown time.  At least on the IBM POWER7 machine, running
pgbench with 32 threads, where pg_test_timing shows that getting the
time takes less than a microsecond 96%+ of the time, this seemed to
have no real impact on the tps numbers - perhaps because the workload
is I/O bound.  Risking the possible ire of people who object to large
attachments, I'm attaching the results this generated on a 30-minute,
32-thread pgbench run at scale factor 300.  To minimize said ire, I've
run the file through bzip2.

What's interesting about this is that, while there is plenty of
waiting for the usual suspects - ProcArrayLock (4), WALInsertLock (7),
and CLogControlLock (11), the waits are all pretty short:

2012-03-31 02:33:25 UTC [50305] LOG:  lock 0: 2:520, 3:
2012-03-31 02:33:25 UTC [50305] LOG:  lock 3: 1:9, 2:36838
2012-03-31 02:33:25 UTC [50305] LOG:  lock 4: 1:33, 2:216964
2012-03-31 02:33:25 UTC [50305] LOG:  lock 7: 1:39, 2:406249
2012-03-31 02:33:25 UTC [50305] LOG:  lock 8: 1:4, 2:34
2012-03-31 02:33:25 UTC [50305] LOG:  lock 11: 1:99, 2:374559
2012-03-31 02:33:25 UTC [50305] LOG:  lock 17: 2:24, 3:24

That's saying that there were over 400,000 contended acquisitions of
WALInsertLock, but the longest one had fls(wait_time_in_us) = 2, or in
other words it took less than 4us to get the lock.  So what happens if
we grep the log file for the biggest offenders?

2012-03-31 02:33:25 UTC [50305] LOG:  lock 204610: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 272958: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325412: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325784: 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 360016: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 444886: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 499890: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 533418: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 610484: 6:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 897798: 22:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1027128: 7:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1074256: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1132586: 5:1, 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1312178: 16:1, 22:1

If I'm reading this right, fls(wait_time_in_us) = 20 means a 1 second
delay, which I think means that a couple of those waits were >= *8
seconds*.  gdb reveals that in the test configuration, all odd
numbered locks between 169 and 2097319 are some buffer's
io_in_progress_lock, and all even numbered locks between 170 and
2097320 are some buffer's content_lock, which means, if I'm not
confused here, that every single lwlock-related stall > 1s happened
while waiting for a buffer content lock.  Moreover, each event
affected a different buffer.  I find this result so surprising that I
have a hard time believing that I haven't screwed something up, so if
anybody can check over the patch and this analysis and suggest what
that thing might be, I would appreciate it.  I would be a lot less
surprised if the waits revolved around the IO-in-progress locks, since
it's not that hard to imagine an I/O taking a really long time on a
busy system.  But I didn't think we were ever supposed to hold content
locks for that long.
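For readers unfamiliar with the histogram format above, here is a small
standalone C sketch (not the attached patch; the exact bucket layout there
may differ) showing how fls(wait_time_in_us) turns wait times into the
"fls:count" keys seen in the log lines:

#include <stdio.h>

/* fls(): 1-based index of the highest set bit, 0 for input 0.
 * Portable fallback; many platforms provide fls() in <strings.h>. */
static int
my_fls(unsigned long x)
{
    int bit = 0;

    while (x != 0)
    {
        bit++;
        x >>= 1;
    }
    return bit;
}

int
main(void)
{
    /* Example contended-acquisition wait times, in microseconds.  A 3us
     * wait lands in bucket 2 (< 4us), ~0.95s in bucket 20, and ~7.8s in
     * bucket 23, matching the buckets discussed above. */
    unsigned long waits_us[] = {3, 700, 45000, 950000, 7800000};
    int           histogram[33] = {0};
    unsigned      i;

    for (i = 0; i < sizeof(waits_us) / sizeof(waits_us[0]); i++)
        histogram[my_fls(waits_us[i])]++;

    /* Print buckets in the same "fls:count" style as the log lines. */
    for (i = 1; i < 31; i++)
        if (histogram[i] > 0)
            printf("%u:%d  (waits < %luus)\n", i, histogram[i], 1UL << i);
    return 0;
}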

The other thing that baffles me about these numbers is that they don't
provide any fodder for explaining the periodic drops in throughput
that happen when the system checkpoints.  I had assumed they would
show up as long lwlock waits, like somebody hanging on to
WALInsertLock while everybody else in the system piles up behind them.
 That's not reflected in these numbers - the few very long waits show
just ONE guy waiting a really long time for the lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


lwhistogram.patch
Description: Binary data


log.wp.lwhistogram.32.300.1800.bz2
Description: BZip2 compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 10:07 PM, Jeff Janes  wrote:
> Which I've now done, and it actually goes the other way.  with -F50,
> not only do almost all pgbench_accounts updates still lead to a clean,
> but a good chunk of the updates to pgbench_tellers lead to a clean as
> well.  If cleans are going to happen so often, would it make sense to
> write a combined record for clean+hot_update?

I think that'd be hard to arrange.  You might think that the prune
happens when we go to update and find that there isn't room on the
page, but that's actually not how it works at all:

# Ideally we would do defragmenting only when we are about to attempt
# heap_update on a HOT-safe tuple.  The difficulty with this approach
# is that the update query has certainly got a pin on the old tuple, and
# therefore our attempt to acquire a buffer cleanup lock will always fail.
# (This corresponds to the idea that we don't want to move the old tuple
# out from under where the query's HeapTuple pointer points.  It might
# be possible to finesse that, but it seems fragile.)

So what we do instead is:

# The currently planned heuristic is to prune and defrag when first accessing
# a page that potentially has prunable tuples (as flagged by the pd_prune_xid
# page hint field) and that either has free space less than MAX(fillfactor
# target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
# find enough free space to store an updated tuple version.  (These rules
# are subject to change.)

So, in other words, the prune happens before we have any idea that
we're going to attempt an update.
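Condensed into a standalone C sketch (the function and parameter names here
are illustrative, not the actual pruning code), the heuristic quoted above
amounts to something like this:

#include <stdbool.h>
#include <stdio.h>

#define BLCKSZ 8192                     /* default PostgreSQL block size */

/* Illustrative decision mirroring the quoted heuristic: prune when the
 * page is flagged as potentially prunable AND it is either low on free
 * space or a recent UPDATE already failed to find room on it. */
static bool
should_prune(bool has_prunable_hint,        /* pd_prune_xid hint set? */
             int  free_space,               /* bytes free on the page */
             int  fillfactor_target_free,   /* bytes reserved by fillfactor */
             bool recent_update_failed)     /* an UPDATE couldn't fit? */
{
    int threshold = fillfactor_target_free > BLCKSZ / 10
        ? fillfactor_target_free
        : BLCKSZ / 10;

    if (!has_prunable_hint)
        return false;
    return free_space < threshold || recent_update_failed;
}

int
main(void)
{
    /* A packed page (fillfactor 100, almost no free space) gets pruned on
     * first access, which matches the pgbench_accounts behaviour discussed
     * elsewhere in this thread. */
    printf("%d\n", should_prune(true, 60, 0, false));    /* 1 */
    printf("%d\n", should_prune(true, 4000, 0, false));  /* 0 */
    return 0;
}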

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Jeff Janes
On Fri, Mar 30, 2012 at 9:27 AM, Robert Haas  wrote:
>
> The more interesting waits, in my view anyway, are the ones that come
> from LWLockAcquire, which account for nearly all of the semaphore
> sleeps.  As you can see, XLogInsert accounts for over half of those,
> and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
> most of the rest.  Just out of curiosity, I expanded a couple of those
> another level, and you can see what's going on there, too.  I was
> slightly surprised by the number of calls to log_heap_clean, since I
> wouldn't have expected vacuuming to generate that much work; it turns
> out that 99.9% of those are HOT pruning events happening during query
> execution.

I've generally seen a one to one correspondence between log_heap_clean
and updates to pgbench_accounts (not by sampling waits, but by
wal_debug=1).

I figured this was because the pgbench_accounts blocks are packed to
the gills by default, so the only way to make room for the next HOT
update is to prune the dead tuple left over from the previous HOT
update in that same block.

It might be interesting to see if those go away if run with pgbench -i
-F90 or some other value less than 100.

Which I've now done, and it actually goes the other way.  With -F50,
not only do almost all pgbench_accounts updates still lead to a clean,
but a good chunk of the updates to pgbench_tellers lead to a clean as
well.  If cleans are going to happen so often, would it make sense to
write a combined record for clean+hot_update?

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-03-30 Thread Jay Levitt

Tom Lane wrote:

Ants Aasma writes:

A user complained on pgsql-performance that SELECT col FROM table
GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
to return tuples from hash-aggregate as they are found when no
aggregate functions are in use. Attached is a first shot at that.


As I commented in the other thread, the user would be a lot better off
if he'd had an index on the column in question. I'm not sure it's worth
complicating the hashagg logic when an indexscan + groupagg would
address the case better.


Would this patch help in the case where "table" is actually a set-returning 
function, and thus can't have an index? (I don't yet know enough about the 
tree to know when hashaggs get used). I'm wondering if this is a useful 
exception to the "restrictions can't get pushed down through GROUP BYs" rule.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standbys, txid_current_snapshot, wraparound

2012-03-30 Thread Andrey Velikoredchanin
2012/3/29 Marko Kreen 

> I can't find a place where WAL replay updates values under XLogCtl.
> If that really does not happen, that would explain why standbys can
> see wrong epoch.
>
> No clue yet how master can get broken.
>
>
Details about environment:

Debian

Linux db 2.6.32-5-amd64 #1 SMP Fri Sep 9 20:23:16 UTC 2011 x86_64 GNU/Linux

# SELECT version();
 PostgreSQL 9.1.1 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real
(Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-03-30 Thread Albe Laurenz
I wrote:
>> Changes from previous version
>> =
>>
>> 1) Don't use remote EXPLAIN for cost/rows estimation, so now planner
>> estimates result rows and costs on the basis of local statistics such
>> as pg_class and pg_statistic.  To update local statistics, I added
>> pgsql_fdw_analyze() SQL function which updates local statistics of a
>> foreign table by retrieving remote statistics, such as pg_class and
>> pg_statistic, via libpq.  This would make the planning of pgsql_fdw
>> simple and fast.  This function can be easily modified to handle
>> ANALYZE command invoked for a foreign table (Fujita-san is proposing
>> this as common feature in another thread).

> I see the advantage of being able to do all this locally, but
> I think there are a lot of downsides too:

I found another limitation of this approach:
pgsql_fdw_analyze() has to run as a user who can update
pg_statistic, and this user needs a user mapping to a remote
user who can read pg_statistic.

This is not necessary for normal operation and needs
to be configured specifically for getting remote statistics.
This is cumbersome, and people might be unhappy to have to
create user mappings for highly privileged remote users.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-03-30 Thread Albe Laurenz
I wrote:
>> How about getting # of rows estimate by executing EXPLAIN for
>> fully-fledged remote query (IOW, contains pushed-down WHERE clause), and
>> estimate selectivity of local filter on the basis of the statistics
>> which are generated by FDW via do_analyze_rel() and FDW-specific
>> sampling function?  In this design, we would be able to use quite
>> correct rows estimate because we can consider filtering stuffs done on
>> each side separately, though it requires expensive remote EXPLAIN for
>> each possible path.
> 
> That sounds nice.

... but it still suffers from the problems of local statistics
for remote tables I pointed out.

I think that these shortcomings are not justified by the gain of
one client-server round trip less during planning.  I'd prefer
if pgsql_fdw were not dependent on remote statistics stored in the
local database.

Yours,
Laurenz Albe


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen  writes:
> On Sat, Mar 31, 2012 at 1:13 AM, Tom Lane  wrote:
>> Not if the message is a constant string, which seems like the typical
>> situation (think "out of memory").  If the row processor does need a
>> buffer for a constructed string, it could make use of some space in its
>> "void *param" area, for instance.

> If it's specified as string that libpq does not own, then I'm fine with it.

Check.  Let's make it "const char **" in fact, just to be clear on that.

>> (This is why getAnotherTuple is careful to clear the failed result
>> before it tries to build a new one.  But that trick isn't going to be
>> available to an external row processor.)

> Kyotaro's original idea was to assume out-of-memory if error
> string was not set, thus the callback needed to set the string
> only when it really had something to say.

Hmm.  We could still do that in conjunction with the idea of returning
the string from the row processor, but I'm not sure if it's useful or
just overly cute.

[ thinks... ]  A small advantage of assuming NULL means that is that
we could postpone the libpq_gettext("out of memory") call until after
clearing the overflowed PGresult, which would greatly improve the odds
of getting a nicely translated result and not just ASCII.  Might be
worth it just for that.
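To make the shape under discussion concrete, here is a hypothetical,
self-contained sketch of a row processor following that convention
(constant message, NULL meaning out of memory).  The callback signature
and the PGrowValue layout are assumptions based on this thread, not a
settled libpq API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins so the sketch compiles on its own; the real definitions
 * would come from libpq headers. */
typedef struct PGresult PGresult;
typedef struct { const char *value; int len; } PGrowValue;

/* On failure, return -1 and optionally point *errmsgp at a constant
 * string that libpq does not own; leaving it NULL means out of memory. */
static int
my_row_processor(PGresult *res, const PGrowValue *columns, int ncolumns,
                 void *param, const char **errmsgp)
{
    char **dest = (char **) param;
    char  *copy;

    (void) res;                 /* unused in this sketch */

    if (ncolumns != 1)
    {
        *errmsgp = "unexpected number of columns";
        return -1;
    }

    copy = malloc(columns[0].len + 1);
    if (copy == NULL)
        return -1;              /* *errmsgp stays NULL => out of memory */
    memcpy(copy, columns[0].value, columns[0].len);
    copy[columns[0].len] = '\0';
    *dest = copy;
    return 1;                   /* row consumed, keep going */
}

int
main(void)
{
    PGrowValue  cols[1] = {{"hello", 5}};
    char       *out = NULL;
    const char *errmsg = NULL;

    if (my_row_processor(NULL, cols, 1, &out, &errmsg) == 1)
        puts(out);
    free(out);
    return 0;
}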

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Sat, Mar 31, 2012 at 1:13 AM, Tom Lane  wrote:
> Marko Kreen  writes:
>> On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
>>> I'm pretty dissatisfied with the error reporting situation for row
>>> processors.  You can't just decide not to solve it, which seems to be
>>> the current state of affairs.  What I'm inclined to do is to add a
>>> "char **" parameter to the row processor, and say that when the
>>> processor returns -1 it can store an error message string there.
>
>> But such API seems to require specifying allocator, which seems ugly.
>
> Not if the message is a constant string, which seems like the typical
> situation (think "out of memory").  If the row processor does need a
> buffer for a constructed string, it could make use of some space in its
> "void *param" area, for instance.

If it's specified as string that libpq does not own, then I'm fine with it.

>> I think it would be better to just use Kyotaro's original idea
> of PQsetRowProcessorError(), which is nicer to use.
>
> I don't particularly care for that idea because it opens up all sorts of
> potential issues when such a function is called at the wrong time.
> Moreover, you have to remember that the typical situation here is that
> we're going to be out of memory or otherwise in trouble, which means
> you've got to be really circumspect about what you assume will work.
> Row processors that think they can do a lot of fancy message
> construction should be discouraged, and an API that requires
> construction of a new PGresult in order to return an error is right out.
> (This is why getAnotherTuple is careful to clear the failed result
> before it tries to build a new one.  But that trick isn't going to be
> available to an external row processor.)

Kyotaro's original idea was to assume out-of-memory if error
string was not set, thus the callback needed to set the string
only when it really had something to say.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Andres Freund
On Friday, March 30, 2012 06:27:36 PM Robert Haas wrote:
> I've been playing around
> with "perf record" on the IBM POWER7 machine, and it's pretty cool.  One
> of the things I don't like is that some of the tools don't produce very
> nice text reports that you can cut and paste into an email - it's kind of
> a text-based GUI where you can zoom around and look at things.  Still,
> it's nice.
You can use --stdio to force a plainer text interface (you need to run perf
annotate manually if you want a line-by-line view).  That is especially
handy if you pipe the output through cat or similar.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen  writes:
> On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
>> I'm pretty dissatisfied with the error reporting situation for row
>> processors.  You can't just decide not to solve it, which seems to be
>> the current state of affairs.  What I'm inclined to do is to add a
>> "char **" parameter to the row processor, and say that when the
>> processor returns -1 it can store an error message string there.

> But such API seems to require specifying allocator, which seems ugly.

Not if the message is a constant string, which seems like the typical
situation (think "out of memory").  If the row processor does need a
buffer for a constructed string, it could make use of some space in its
"void *param" area, for instance.

> I think it would be better to just use Kyotaro's original idea
> of PQsetRowProcessorError(), which is nicer to use.

I don't particularly care for that idea because it opens up all sorts of
potential issues when such a function is called at the wrong time.
Moreover, you have to remember that the typical situation here is that
we're going to be out of memory or otherwise in trouble, which means
you've got to be really circumspect about what you assume will work.
Row processors that think they can do a lot of fancy message
construction should be discouraged, and an API that requires
construction of a new PGresult in order to return an error is right out.
(This is why getAnotherTuple is careful to clear the failed result
before it tries to build a new one.  But that trick isn't going to be
available to an external row processor.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 05:18:42PM -0400, Tom Lane wrote:
> Marko Kreen  writes:
> > On Wed, Mar 07, 2012 at 03:14:57PM +0900, Kyotaro HORIGUCHI wrote:
> >>> My suggestion - check in getAnotherTuple whether resultStatus is
> >>> already error and do nothing then.  This allows internal pqAddRow
> >>> to set regular "out of memory" error.  Otherwise give generic
> >>> "row processor error".
> 
> >> Current implement seems already doing this in
> >> parseInput3(). Could you give me further explanation?
> 
> > The suggestion was about getAnotherTuple() - currently it sets
> > always "error in row processor".  With such check, the callback
> > can set the error result itself.  Currently only callbacks that
> > live inside libpq can set errors, but if we happen to expose
> > error-setting function in outside API, then the getAnotherTuple()
> > would already be ready for it.
> 
> I'm pretty dissatisfied with the error reporting situation for row
> processors.  You can't just decide not to solve it, which seems to be
> the current state of affairs.  What I'm inclined to do is to add a
> "char **" parameter to the row processor, and say that when the
> processor returns -1 it can store an error message string there.
> If it does so, that's what we report.  If it doesn't (which we'd detect
> by presetting the value to NULL), then use a generic "error in row
> processor" message.  This is cheap and doesn't prevent the row processor
> from using some application-specific error reporting method if it wants;
> but it does allow the processor to make use of libpq's error mechanism
> when that's preferable.

Yeah.

But such API seems to require specifying allocator, which seems ugly.
I think it would be better to just use Kyotaro's original idea
of PQsetRowProcessorError(), which is nicer to use.

Few thoughts on the issue:

--

As libpq already provides quite good coverage of PGresult
manipulation APIs, then how about:

  void PQsetResultError(PGresult *res, const char *msg);

that does:

  res->errMsg = pqResultStrdup(msg);
  res->resultStatus = PGRES_FATAL_ERROR;

that would also cause minimal fuss in getAnotherTuple().



I would actually like even more:

  void PQsetConnectionError(PGconn *conn, const char *msg);

that does full-blown libpq error logic.  Thus it would be
useful everywhere in libpq.  But it seems a bit too disruptive,
so I would like an ACK from somebody who knows libpq better.
(well, from you...)

---

Another thought - if we have API to set error from *outside*
of row-processor callback, that would immediately solve the
"how to skip incoming resultset without buffering it" problem.

And it would be usable for PQgetRow()/PQrecvRow() too.

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen  writes:
> On Wed, Mar 07, 2012 at 03:14:57PM +0900, Kyotaro HORIGUCHI wrote:
>>> My suggestion - check in getAnotherTuple whether resultStatus is
>>> already error and do nothing then.  This allows internal pqAddRow
>>> to set regular "out of memory" error.  Otherwise give generic
>>> "row processor error".

>> Current implement seems already doing this in
>> parseInput3(). Could you give me further explanation?

> The suggestion was about getAnotherTuple() - currently it sets
> always "error in row processor".  With such check, the callback
> can set the error result itself.  Currently only callbacks that
> live inside libpq can set errors, but if we happen to expose
> error-setting function in outside API, then the getAnotherTuple()
> would already be ready for it.

I'm pretty dissatisfied with the error reporting situation for row
processors.  You can't just decide not to solve it, which seems to be
the current state of affairs.  What I'm inclined to do is to add a
"char **" parameter to the row processor, and say that when the
processor returns -1 it can store an error message string there.
If it does so, that's what we report.  If it doesn't (which we'd detect
by presetting the value to NULL), then use a generic "error in row
processor" message.  This is cheap and doesn't prevent the row processor
from using some application-specific error reporting method if it wants;
but it does allow the processor to make use of libpq's error mechanism
when that's preferable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace

2012-03-30 Thread Jeff Davis
On Thu, 2012-03-22 at 14:55 +0200, Ants Aasma wrote:
> Hi,
> 
> while working on a support case I stumbled upon a bug in pg_upgrade.
> Upgrade fails with "No such file or directory" when a database is
> moved to a non-default tablespace and contains a table that is moved
> to pg_default. The cause seems to be that the following test
> incorrectly equates empty spclocation with database tablespace:
> 
> tblspace = PQgetvalue(res, relnum, i_spclocation);
> /* if no table tablespace, use the database tablespace */
> if (strlen(tblspace) == 0)
> tblspace = dbinfo->db_tblspace;
> 
> Patch to fix this is attached.

I confirmed this bug upgrading 9.1 to master, and that this patch fixes
it. Thank you for the report!

Patch looks good to me as well, with one very minor nitpick: the added
comment is missing an apostrophe.

Bruce, can you take a look at this?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] psql: tab-completion fails SET var=

2012-03-30 Thread Erik Rijkers
(in hopes that the current changes to tab-completion will help to get this 
fixed)

tab-completion goes wrong on SET setting=...

example:

If you want to input "set search_path=myschema;" without spaces around '=',
and you try tab-completion halfway through the schema name:

set search_path=mysch
  ^tab

then the initial part of the schema name ('mysch') gets removed and replaced 
with 'TO'.

So that you now have:

set search_path=TO 

You lose the part of the schema name that was already input.

With spaces that doesn't happen, but after all, tab completion is about
avoiding keystrokes and errors.

Tab-completion is great and this search_path-annoyance happens to me all the 
time; my fingers
can't seem to learn this exception.

(pgsql 9.2devel, problem also in latest 9.1.3)


Thanks,


Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 1:30 PM, Greg Stark  wrote:
> On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas  wrote:
>> If you expand that branch of the call tree, you find that all of them
>> are coming eventually from secure_read; the server is waiting for a
>> new query from the client.  This is, obviously, overhead we can't
>> eliminate from this test; waiting for the client is part of the job.
>
> Fwiw this isn't necessarily true. How does the absolute number of
> these events compare with the number of pg_bench operations done?

Since perf is a stochastic profiler, I don't think there's any way to
measure that using this tool.

> If
> it's significantly more the server could be reading on sockets while
> there are partial commands there and it might be more efficient to
> wait until the whole command is ready before reading. It may be that
> this indicates that pg_bench is written in an inefficient way and it
> should pipeline more commands but of course optimizing pg_bench is
> kind of missing the point.

Well, to some degree.  I think most clients are going to send commands
one at a time and wait for the responses.  OTOH, if for example libpq
is sending the data inefficiently in some way, that would be worth
fixing; lots of people use libpq.

> Also incidentally context switches is one of the things getrusage
> shows so I'm still hoping to have that per-plan-node though that's
> orthogonal to what this tool gives you with the call graph.

Yeah.

Actually, what is really bugging me is that I cannot find any way of
getting a profile that reflects the *time* spent waiting rather than
merely the *number* of waits.  That seems like an obvious thing to
want, and I cannot find a way to get it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Tom Lane
Greg Stark  writes:
> On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas  wrote:
>> If you expand that branch of the call tree, you find that all of them
>> are coming eventually from secure_read; the server is waiting for a
>> new query from the client.  This is, obviously, overhead we can't
>> eliminate from this test; waiting for the client is part of the job.

> Fwiw this isn't necessarily true. How does the absolute number of
> these events compare with the number of pg_bench operations done? If
> it's significantly more the server could be reading on sockets while
> there are partial commands there and it might be more efficient to
> wait until the whole command is ready before reading. It may be that
> this indicates that pg_bench is written in an inefficient way and it
> should pipeline more commands but of course optimizing pg_bench is
> kind of missing the point.

Well, that would be on libpq's head if it were true, but I believe we're
fairly careful to not flush the output buffer until we're sending a
complete message.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina  wrote:
> Any enhancement here that can't be used with libpq via, say, drop-in
> .so seems unworkable to me, and that's why any solution that is
> basically proxying to the database is basically a non-starter outside
> the very earliest prototyping stages.  The tuple scanning and protocol
> semantics can and even should remain the same, especially at first.

I should add: proxying could work well if libpq had all the right
hooks. The server could remain ignorant.  Regardless, upstream changes
result.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking context switches with perf record

2012-03-30 Thread Greg Stark
On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas  wrote:
> If you expand that branch of the call tree, you find that all of them
> are coming eventually from secure_read; the server is waiting for a
> new query from the client.  This is, obviously, overhead we can't
> eliminate from this test; waiting for the client is part of the job.

Fwiw this isn't necessarily true. How does the absolute number of
these events compare with the number of pg_bench operations done? If
it's significantly more the server could be reading on sockets while
there are partial commands there and it might be more efficient to
wait until the whole command is ready before reading. It may be that
this indicates that pg_bench is written in an inefficient way and it
should pipeline more commands but of course optimizing pg_bench is
kind of missing the point.

Also incidentally context switches is one of the things getrusage
shows so I'm still hoping to have that per-plan-node though that's
orthogonal to what this tool gives you with the call graph.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Fri, Mar 30, 2012 at 9:11 AM, Andrew Dunstan  wrote:
>
>
> On 03/30/2012 11:41 AM, Robert Haas wrote:
>>
>> On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer
>>  wrote:
>>>
>>> Well, in our case HTTP is a clear win (but not replacement) and SPDY a
>>> potential one (even as a replacement).  Even if SPDY is not widely
>>> adopted
>>> it could still replace FEBE if there's a clear advantage to using it, I
>>> don't know enough to make the call right now.
>>
>> I can see that there are some advantages to having an HTTP interface
>> to the database, but I think throwing our existing protocol out the
>> window or relegating it to the status of a second-class citizen would
>> be foolish.
>
>
>
> Right, I can't imagine it happening. And I wouldn't really be keen to add an
> alternative protocol either.
>
> I could imagine a client which presented a SPDY interface to the world and
> translated it into standard calls, possibly via libpq.

Any enhancement here that can't be used with libpq via, say, drop-in
.so seems unworkable to me, and that's why any solution that is
basically proxying to the database is basically a non-starter outside
the very earliest prototyping stages.  The tuple scanning and protocol
semantics can and even should remain the same, especially at first.

What I really want is for a mechanism to push FEBE messages down into
a transport in an extendable way (not unlike SSL, except this time
sensitive to FEBE message boundaries), taking unmodified libpq
binaries (but with a hook, of course...) into consideration.

> It's well to remember that we are not a green fields project here.

Perhaps not, which is a good reason not to put the cart before the horse --
I've tried to be very careful to suggest that really useful
caching/proxy/routing software needs to appear that can work,
unmodified, with hypothetical FEBE-over-SPDY communications, and
accomplish some interesting use cases, and also very careful to
suggest this is a *transport*-level feature.  I think it would be fair
to demand: "Okay, when Apache or NGINX or a credible embedded web
server in some language can accomplish something useful for us if we
absorb this complexity, let's put this on the table".

Having to continuously write poolers, cachers and redirectors
(for any reason: HA, partitioning, or even just clean-looking domain
names...) is a burden on community resources. The criterion for success
is to expand the number of projects attacking these problems and
decrease the cost of writing one, and I think demanding a straw-man
implementation of, say, a query cacher or pooler and seeing how that
implementation feels is not unreasonable, along with accepting the
burden of maintenance.

Robert Haas Wrote:
>> HTTP is a non-trivial protocol that tends to impose lots
>> of escaping and de-escaping overhead which is unnecessary for people
>> who just want to connect to the database and run queries.  I can
>> completely understand that someone might want the ability to do GET
>> /db/table/pk and have that return an answer very, very quickly, by
>> bypassing the usual parser and planner and just firing off an
>> index-scan and returning the results as JSON or somesuch.  But I think
>> it would be a serious mistake to assume that GET /q?q=myquery is going
>> to come out better than what we have now in the general case.

I don't think this holds for SPDY, which is the only way I even
entertain thoughts of doing this.  I want FEBE, at least at first and
probably even then for quite a long while, just with framing and
headers (during start-up) that other software will understand.

Control frames are like this:

+----------------------------------+
|C| Version(15bits) | Type(16bits) |
+----------------------------------+
| Flags (8)  |  Length (24 bits)   |
+----------------------------------+
|               Data               |
+----------------------------------+

Data frames are like this:

+----------------------------------+
|C|       Stream-ID (31bits)       |
+----------------------------------+
| Flags (8)  |  Length (24 bits)   |
+----------------------------------+
|               Data               |
+----------------------------------+
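As a concrete illustration of those layouts, here is a small standalone C
sketch (not part of any proposed patch; the example byte values are made
up) that decodes the common 8-byte header:

#include <stdio.h>

/* Decode the 8-byte SPDY frame header shown above (fields are big-endian
 * on the wire, per the draft).  Illustration only, not a SPDY library. */
static void
decode_spdy_header(const unsigned char h[8])
{
    int           is_control = (h[0] & 0x80) != 0;
    unsigned      flags = h[4];
    unsigned long length = ((unsigned long) h[5] << 16) | (h[6] << 8) | h[7];

    if (is_control)
    {
        int version = ((h[0] & 0x7f) << 8) | h[1];
        int type = (h[2] << 8) | h[3];

        printf("control frame: version=%d type=%d flags=%u len=%lu\n",
               version, type, flags, length);
    }
    else
    {
        unsigned long stream_id = ((unsigned long) (h[0] & 0x7f) << 24) |
            ((unsigned long) h[1] << 16) | (h[2] << 8) | h[3];

        printf("data frame: stream=%lu flags=%u len=%lu\n",
               stream_id, flags, length);
    }
}

int
main(void)
{
    /* A control header: C=1, version 3, type 1, no flags, length 10. */
    unsigned char ctrl[8] = {0x80, 0x03, 0x00, 0x01, 0x00, 0x00, 0x00, 0x0a};
    /* A data frame on stream 1 with flag 0x01 set, length 5. */
    unsigned char data[8] = {0x00, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x05};

    decode_spdy_header(ctrl);
    decode_spdy_header(data);
    return 0;
}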

The (XML, not HTML, I read this with "view source", or try "curl") RFC
draft is here:

http://mbelshe.github.com/SPDY-Specification/draft-mbelshe-spdy-00.xml


-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Uppercase tab completion keywords in psql?

2012-03-30 Thread Peter Eisentraut
On fre, 2012-03-23 at 07:52 -0700, David Fetter wrote:
> On Thu, Mar 22, 2012 at 06:05:30PM -0400, Andrew Dunstan wrote:
> > On 03/22/2012 05:49 PM, Bruce Momjian wrote:
> > >Robert Haas and I are disappointed by this change.  I liked the
> > >fact that I could post nice-looking SQL queries without having to
> > >use my capslock key (which I use as a second control key).  Any
> > >chance of reverting this change?
> > >
> > 
> > Should it be governed by a setting?
> 
> Something like (upper|lower|preserve) ?

How about this patch then?  (There are actually four possible settings,
see patch.)

diff --git i/doc/src/sgml/ref/psql-ref.sgml w/doc/src/sgml/ref/psql-ref.sgml
index b849101..be9d37d 100644
--- i/doc/src/sgml/ref/psql-ref.sgml
+++ w/doc/src/sgml/ref/psql-ref.sgml
@@ -2652,6 +2652,22 @@ bar
   
 
   
+COMP_KEYWORD_CASE
+
+
+Determines which letter case to use when completing an SQL key word.
+If set to lower or upper, the
+completed word will be in lower or upper case, respectively.  If set
+to preserve-lower
+or preserve-upper (the default), the completed word
+will be in the case of the word already entered, but words being
+completed without anything entered will be in lower or upper case,
+respectively.
+
+
+  
+
+  
 DBNAME
 
 
diff --git i/src/bin/psql/tab-complete.c w/src/bin/psql/tab-complete.c
index 6f481bb..00d87d5 100644
--- i/src/bin/psql/tab-complete.c
+++ w/src/bin/psql/tab-complete.c
@@ -682,7 +682,7 @@ static char **complete_from_variables(char *text,
 		const char *prefix, const char *suffix);
 static char *complete_from_files(const char *text, int state);
 
-static char *pg_strdup_same_case(const char *s, const char *ref);
+static char *pg_strdup_keyword_case(const char *s, const char *ref);
 static PGresult *exec_query(const char *query);
 
 static void get_previous_words(int point, char **previous_words, int nwords);
@@ -3048,7 +3048,7 @@ create_or_drop_command_generator(const char *text, int state, bits32 excluded)
 	{
 		if ((pg_strncasecmp(name, text, string_length) == 0) &&
 			!(words_after_create[list_index - 1].flags & excluded))
-			return pg_strdup_same_case(name, text);
+			return pg_strdup_keyword_case(name, text);
 	}
 	/* if nothing matches, return NULL */
 	return NULL;
@@ -3335,9 +3335,9 @@ complete_from_list(const char *text, int state)
 			if (completion_case_sensitive)
 return pg_strdup(item);
 			else
-/* If case insensitive matching was requested initially, return
- * it in the case of what was already entered. */
-return pg_strdup_same_case(item, text);
+/* If case insensitive matching was requested initially, adjust
+ * the case according to setting. */
+return pg_strdup_keyword_case(item, text);
 		}
 	}
 
@@ -3374,9 +3374,9 @@ complete_from_const(const char *text, int state)
 		if (completion_case_sensitive)
 			return pg_strdup(completion_charp);
 		else
-			/* If case insensitive matching was requested initially, return it
-			 * in the case of what was already entered. */
-			return pg_strdup_same_case(completion_charp, text);
+			/* If case insensitive matching was requested initially, adjust the
+			 * case according to setting. */
+			return pg_strdup_keyword_case(completion_charp, text);
 	}
 	else
 		return NULL;
@@ -3484,27 +3484,48 @@ complete_from_files(const char *text, int state)
 
 
 /*
- * Make a pg_strdup copy of s and convert it to the same case as ref.
+ * Make a pg_strdup copy of s and convert the case according to
+ * COMP_KEYWORD_CASE variable, using ref as the text that was already entered.
  */
 static char *
-pg_strdup_same_case(const char *s, const char *ref)
+pg_strdup_keyword_case(const char *s, const char *ref)
 {
 	char *ret, *p;
 	unsigned char first = ref[0];
+	int		tocase;
+	const char *varval;
+
+	varval = GetVariable(pset.vars, "COMP_KEYWORD_CASE");
+	if (!varval)
+		tocase = 0;
+	else if (strcmp(varval, "lower") == 0)
+		tocase = -2;
+	else if (strcmp(varval, "preserve-lower") == 0)
+		tocase = -1;
+	else if (strcmp(varval, "preserve-upper") == 0)
+		tocase = +1;
+	else if (strcmp(varval, "upper") == 0)
+		tocase = +2;
+	else
+		tocase = 0;
 
-	if (isalpha(first))
-	{
-		ret = pg_strdup(s);
-		if (islower(first))
-			for (p = ret; *p; p++)
-*p = pg_tolower((unsigned char) *p);
-		else
-			for (p = ret; *p; p++)
-*p = pg_toupper((unsigned char) *p);
-		return ret;
-	}
+	/* default */
+	if (tocase == 0)
+		tocase = +1;
+
+	ret = pg_strdup(s);
+
+	if (tocase == -2
+		|| ((tocase == -1 || tocase == +1) && islower(first))
+		|| (tocase == -1 && !isalpha(first))
+		)
+		for (p = ret; *p; p++)
+			*p = pg_tolower((unsigned char) *p);
 	else
-		return pg_strdup(s);
+		for (p = ret; *p; p++)
+			*p = pg_toupper((unsigned char) *p);
+
+	return ret;
 }
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] checkpointer code behaving strangely on postmaster -T

2012-03-30 Thread Alvaro Herrera

I noticed while doing some tests that the checkpointer process does not
recover very nicely after a backend crashes under postmaster -T (after
all processes have been kill -CONTd, of course, and postmaster told to
shutdown via Ctrl-C on its console).  For some reason it seems to get
stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
do a checkpoint, but it was progressing a single page each time and then
sleeping.  In that condition, the checkpoint took a very long time to
finish.

Pressing Ctrl-C in the postmaster console at that point does not have
any effect.

-- 
Álvaro Herrera 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 7:04 PM, Marko Kreen  wrote:
> Have you looked at the examples?  PQgetResult() is pretty good hint
> that one resultset finished...

Ok, the demos are scattered around this long thread and hard to find,
so here is a summary of links:

Original design mail:

  http://archives.postgresql.org/message-id/20120224154616.ga16...@gmail.com

First patch with quick demos:

  http://archives.postgresql.org/message-id/20120226221922.ga6...@gmail.com

Demos as diff:

  http://archives.postgresql.org/message-id/20120324002224.ga19...@gmail.com

Demos/experiments/tests (bit messier than the demos-as-diffs):

  https://github.com/markokr/libpq-rowproc-demos

Note - the point is that the user *must* call PQgetResult() when the
resultset ends.  Hence also the "PQgetRow() does not return errors" decision.

I'll put this mail into commitfest page too, seems I've forgotten to
put some mails there.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] tracking context switches with perf record

2012-03-30 Thread Robert Haas
I've been playing around with "perf record" on the IBM POWER7 machine,
and it's pretty cool.  One of the things I don't like is that some of
the tools don't produce very nice text reports that you can cut and
paste into an email - it's kind of a text-based GUI where you can zoom
around and look at things.  Still, it's nice.

One of the neat things I've just discovered is that you can use a
call-graph report on context-switches to find out what user code is
resulting in context-switching.  For example, I ran a 32-client
pgbench test at scale factor 300 for half an hour and captured data
for 10 minutes relatively near the beginning.  The partially expanded
call tree for postgres looks like this:

  - 87.79% .schedule_timeout
 + 99.99% .unix_stream_recvmsg
  - 11.63% .SyS_semtimedop
   .SyS_ipc
   syscall_exit
   semop
 - PGSemaphoreLock
- 98.48% LWLockAcquire
   - 53.11% XLogInsert
  + 48.39% heap_update
  + 18.15% heap_insert
  + 16.36% RecordTransactionCommit
  + 15.67% log_heap_clean
  + 0.98% _bt_insertonpg
   + 18.71% ProcArrayEndTransaction
   - 14.77% SimpleLruReadPage_ReadOnly
TransactionIdGetStatus
TransactionLogFetch
  - TransactionIdDidCommit
 + 53.52% HeapTupleSatisfiesVacuum
 + 45.00% HeapTupleSatisfiesMVCC
 + 0.82% HeapTupleHeaderAdvanceLatestRemovedXid
   + 6.08% TransactionIdSetPageStatus
   + 2.63% LockBuffer
   + 1.92% GetSnapshotData
   + 1.51% SimpleLruReadPage
+ 1.52% ProcSleep
  + 0.50% do_work

So what you can see here is that all the context switches that
happened in postgres were the result either of a .schedule_timeout or
a .SyS_semtimedop.  87.79% of the context switches came from
.schedule_timeout, and 99.99% of those came from unix_stream_recvmsg.
If you expand that branch of the call tree, you find that all of them
are coming eventually from secure_read; the server is waiting for a
new query from the client.  This is, obviously, overhead we can't
eliminate from this test; waiting for the client is part of the job.
The other 11.63% of the context switches result from PGSemaphoreLock
calling semop.  Of those, 98.48% come from LWLockAcquire and 1.52%
come from ProcSleep.  The ProcSleep calls are all from blocking while
attempting to acquire a heavyweight lock; and nearly all of them are
from conflicting tuple locks.  However, they account for such a small
portion of the overall context switches that they're probably not
worth worrying about.

The more interesting waits, in my view anyway, are the ones that come
from LWLockAcquire, which account for nearly all of the semaphore
sleeps.  As you can see, XLogInsert accounts for over half of those,
and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
most of the rest.  Just out of curiosity, I expanded a couple of those
another level, and you can see what's going on there, too.  I was
slightly surprised by the number of calls to log_heap_clean, since I
wouldn't have expected vacuuming to generate that much work; it turns
out that 99.9% of those are HOT pruning events happening during query
execution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Andrew Dunstan



On 03/30/2012 11:41 AM, Robert Haas wrote:

On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer  wrote:

Well, in our case HTTP is a clear win (but not replacement) and SPDY a
potential one (even as a replacement).  Even if SPDY is not widely adopted
it could still replace FEBE if there's a clear advantage to using it, I
don't know enough to make the call right now.

I can see that there are some advantages to having an HTTP interface
to the database, but I think throwing our existing protocol out the
window or relegating it to the status of a second-class citizen would
be foolish.



Right, I can't imagine it happening. And I wouldn't really be keen to 
add an alternative protocol either.


I could imagine a client which presented a SPDY interface to the world 
and translated it into standard calls, possibly via libpq.


It's well to remember that we are not a green fields project here.



HTTP is a non-trivial protocol that tends to impose lots
of escaping and de-escaping overhead which is unnecessary for people
who just want to connect to the database and run queries.  I can
completely understand that someone might want the ability to do GET
/db/table/pk and have that return an answer very, very quickly, by
bypassing the usual parser and planner and just firing off an
index-scan and returning the results as JSON or somesuch.  But I think
it would be a serious mistake to assume that GET /q?q=myquery is going
to come out better than what we have now in the general case.



Indeed.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Fri, Mar 30, 2012 at 11:59:12AM -0400, Tom Lane wrote:
> Marko Kreen  writes:
> > On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
> >> Marko Kreen  writes:
> >>> Second conclusion is that current dblink row-processor usage is broken
> >>> when user uses multiple SELECTs in SQL as dblink uses plain PQexec().
> 
> >> Yeah.  Perhaps we should tweak the row-processor callback API so that
> >> it gets an explicit notification that "this is a new resultset".
> >> Duplicating PQexec's behavior would then involve having the dblink row
> >> processor throw away any existing tuplestore and start over when it
> >> gets such a call.
> >> 
> >> There's multiple ways to express that but the most convenient thing
> >> from libpq's viewpoint, I think, is to have a callback that occurs
> >> immediately after collecting a RowDescription message, before any
> >> rows have arrived.  So maybe we could express that as a callback
> >> with valid "res" but "columns" set to NULL?
> >> 
> >> A different approach would be to add a row counter to the arguments
> >> provided to the row processor; then you'd know a new resultset had
> >> started if you saw rowcounter == 0.  This might have another advantage
> >> of not requiring the row processor to count the rows for itself, which
> >> I think many row processors would otherwise have to do.
> 
> > Try to imagine how final documentation will look like.
> 
> > Then imagine documentation for PGrecvRow() / PQgetRow().
> 
> What's your point, exactly?  PGrecvRow() / PQgetRow() aren't going to
> make that any better as currently defined, because there's noplace to
> indicate "this is a new resultset" in those APIs either.

Have you looked at the examples?  PQgetResult() is pretty good hint
that one resultset finished...

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Tom Lane
Marko Kreen  writes:
> On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
>> Marko Kreen  writes:
>>> Second conclusion is that current dblink row-processor usage is broken
>>> when user uses multiple SELECTs in SQL as dblink uses plain PQexec().

>> Yeah.  Perhaps we should tweak the row-processor callback API so that
>> it gets an explicit notification that "this is a new resultset".
>> Duplicating PQexec's behavior would then involve having the dblink row
>> processor throw away any existing tuplestore and start over when it
>> gets such a call.
>> 
>> There's multiple ways to express that but the most convenient thing
>> from libpq's viewpoint, I think, is to have a callback that occurs
>> immediately after collecting a RowDescription message, before any
>> rows have arrived.  So maybe we could express that as a callback
>> with valid "res" but "columns" set to NULL?
>> 
>> A different approach would be to add a row counter to the arguments
>> provided to the row processor; then you'd know a new resultset had
>> started if you saw rowcounter == 0.  This might have another advantage
>> of not requiring the row processor to count the rows for itself, which
>> I think many row processors would otherwise have to do.

> Try to imagine how final documentation will look like.

> Then imagine documentation for PGrecvRow() / PQgetRow().

What's your point, exactly?  PGrecvRow() / PQgetRow() aren't going to
make that any better as currently defined, because there's noplace to
indicate "this is a new resultset" in those APIs either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-03-30 Thread Marko Kreen
On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote:
> Marko Kreen  writes:
> > My conclusion is that row-processor API is low-level expert API and
> > quite easy to misuse.  It would be preferable to have something more
> > robust as end-user API, the PQgetRow() is my suggestion for that.
> > Thus I see 3 choices:
> 
> > 1) Push row-processor as main API anyway and describe all dangerous
> >scenarios in documentation.
> > 2) Have both PQgetRow() and row-processor available in ,
> >PQgetRow() as preferred API and row-processor for expert usage,
> >with proper documentation what works and what does not.
> > 3) Have PQgetRow() in , move row-processor to .
> 
> I still am failing to see the use-case for PQgetRow.  ISTM the entire
> point of a special row processor is to reduce the per-row processing
> overhead, but PQgetRow greatly increases that overhead.

No, decreasing CPU overhead is a minor win.  I guess in a realistic
application, like dblink, you can't even measure the difference.

The *major* win comes from avoiding buffering of all rows in a PGresult.
Of course, this is noticeable only with bigger resultsets.
I guess such buffering pessimizes memory usage: the code always
works on a cold cache.  And simply keeping RSS low is good for the
long-term health of a process.

The second major win is avoiding the need to use a cursor with small
chunks to access a resultset of unknown size, which stalls the
application until the next block arrives from the network.


The "PGresult *PQgetRow()" is for applications that do not convert
rows immediately to some internal format, but keep using PGresult.
So they can be converted to row-by-row processing with minimal
changes to actual code.

Note that PGrowValue is a temporary struct that the application *must*
move data away from.  If the app internally uses PGresult, then it's
pretty annoying to have to invent a new internal format for long-term
storage.

But maybe I'm overestimating the number of such applications.

> And it doesn't
> reduce complexity much either IMO: you still have all the primary risk
> factors arising from processing rows in advance of being sure that the
> whole query completed successfully.

It avoids the complexity of:

* How to pass error from callback to upper code

* Needing to know how exceptions behave

* How to use early exit to pass rows to upper code one-by-one,
  (by storing the PGresult and PGrowValue in temp place
   and later checking their values)

* How to detect that new resultset has started. (keeping track
  of previous PGresult or noting some quirky API behaviour
  we may invent for such case)

* Needing to make sure the callback does not leak to call-sites
  that expect regular libpq behaviour.
  ("Always call PQregisterRowProcessor(db, NULL, NULL) after query finishes" )
  ["But now I'm in exception handler, how do I find the connection?"]


I've now reviewed the callback code and even done some coding with it
and IMHO it's too low-level to be an end-user API.

Yes, the "query-may-still-fail" complexity remains, but thats not unlike
the usual "multi-statement-transaction-is-not-guaranteed-to-succeed"
complexity.

Another complexity that remains is "how-to-skip-current-resultset",
but that is a problem only on sync connections and the answer is
simple - "call PQgetResult()".  Or "call PQgetRow/PQrecvRow" if
user wants to avoid buffering.

> Plus it conflates "no more data"
> with "there was an error receiving the data" or "there was an error on
> the server side".

Well, the current PQgetRow() is written in the style: "return only a single-row
PGresult; to see errors the user must call PQgetResult()".  Basically
so that the user is forced to fall back to the familiar libpq usage pattern.

It can be changed, so that PQgetRow() returns also errors.

Or we can drop it and just keep PQrecvRow().

> PQrecvRow alleviates the per-row-overhead aspect of
> that but doesn't really do a thing from the complexity standpoint;
> it doesn't look to me to be noticeably easier to use than a row
> processor callback.

> I think PQgetRow and PQrecvRow just add more API calls without making
> any fundamental improvements, and so we could do without them.  "There's
> more than one way to do it" is not necessarily a virtue.

Please re-read the above list of problematic situations that this API
fixes.  Then, if you still think that PQrecvRow() is pointless, sure,
let's drop it.

We can also postpone it to 9.3 and poll users on whether they want an
easier API or whether maximum performance is more important.  (PQrecvRow()
*does* have a few cycles of overhead compared to callbacks.)

The only option that we have on the table for 9.2 but not later
is moving the callback API to .

> > Second conclusion is that current dblink row-processor usage is broken
> > when user uses multiple SELECTs in SQL as dblink uses plain PQexec().
> 
> Yeah.  Perhaps we should tweak the row-processor callback API so that
> it gets an explicit notification that "this is a new resultset".
> Duplicating PQexec's behavior would then involve having the d

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer  wrote:
> Well, in our case HTTP is a clear win (but not replacement) and SPDY a
> potential one (even as a replacement).  Even if SPDY is not widely adopted
> it could still replace FEBE if there's a clear advantage to using it, I
> don't know enough to make the call right now.

I can see that there are some advantages to having an HTTP interface
to the database, but I think throwing our existing protocol out the
window or relegating it to the status of a second-class citizen would
be foolish.  HTTP is a non-trivial protocol that tends to impose lots
of escaping and de-escaping overhead which is unnecessary for people
who just want to connect to the database and run queries.  I can
completely understand that someone might want the ability to do GET
/db/table/pk and have that return an answer very, very quickly, by
bypassing the usual parser and planner and just firing off an
index-scan and returning the results as JSON or somesuch.  But I think
it would be a serious mistake to assume that GET /q?q=myquery is going
to come out better than what we have now in the general case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Dimitri Fontaine
Robert Haas  writes:
> Oh, right: I remember that now.  I still think it's a bad way to do
> it, because the trigger potentially has a lot of work to do to
> reconstruct a working command string, and it still ends up getting
> executed by the wrong user.  For CREATE EXTENSION it's not that bad,

That's true; I'm only kind of saying that the INSTEAD OF keyword still
makes sense (instead of security_checks). I agree that the feature is
simpler to use the way you propose it.

>>   context = 'DROP TYPE'
>>   tag = 'DROP FUNCTION'
>
> I'm not sure I understand how you're using the words "context" and

Yeah, "context" is not explicit; we could call that "toplevel": the command
tag of the command that the user typed. When toplevel is null, the event
trigger is fired on a command the user sent; when it's not null, the
trigger is fired on some inner command operation.

> "tag".  I think for a drop trigger I would want the function to
> receive this information: type of object dropped, OID of object
> dropped, column number in the case of a column drop, flag indicating
> whether it's a toplevel drop or a cascaded drop.  I wouldn't object to
> also making the currently-in-context toplevel command tag available,
> but I think most drop triggers wouldn't really care, so I wouldn't
> personally spend much implementation effort on it if it turns out to
> be hard.

I'm not sure it would be hard, as I see only a single depth possible
here, so a single per-backend static string variable would do.

> But in general, I don't really know what a "proper" subcommand is or
> why some subcommands should be more proper than others, or why we
> should even be concerned about whether something is a subcommand at
> all.  I think it's fine and useful to have triggers that fire in those
> kinds of places, but I don't see why we should limit ourselves to
> that.  For applications like replication, auditing, and enhanced
> security, the parse tree and subcommand/non-subcommand status of a
> particular operation are irrelevant.  What you need is an exact

Not really. When replicating, you could perfectly well say that you only
replicate the toplevel DROP, because the replica will also do the cascade
dance and you might have decided not to replicate all related objects on
the other side.

The information you really don't want to miss is when only the
cascaded object is part of the replication, not the main one. That was
not covered by my previous patch, but now we have a way to cover it.
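
To make that concrete, a replication filter under this proposal could look
roughly like the sketch below.  None of this is committed syntax: the timing
spec, the spelling of the tag/context values inside the function (tg_tag and
tg_context here), and enqueue_command() are all placeholders for whatever we
end up shipping.

  create function replicate_toplevel_drops() returns event_trigger
  language plpgsql as $$
  begin
    -- Only ship the toplevel command: for cascaded drops (context not null)
    -- the replica will do its own cascade dance.
    if tg_context is null then
      -- hypothetical queueing call provided by the replication system
      perform replication.enqueue_command(tg_tag);
    end if;
  end;
  $$;

  create event trigger replicate_drops
      preceding ddl_command_start
          when tag in ('DROP TABLE', 'DROP TYPE', 'DROP FUNCTION')
       execute procedure replicate_toplevel_drops();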

> description of the operation that got performed (e.g. the default on
> table X column Y got dropped); you might be able to reverse-engineer
> that from the parse tree, but it's much better to have the system pass
> you the information you need more directly.  Certainly, there are
> cases where you might want to have the parse tree, or even the raw
> command text, available, but I'm not even convinced that that those
> cases will be the most commonly used ones unless, of course, they're
> the only ones we offer, in which case everyone will go down that path
> by necessity.

There are far too many variants and cases of our commands to be able to
extract their parameters in a flat way (a bunch of variables, compared to
a nested description a la JSON or XML), and I don't think such a flat
representation is going to be much better than the parse tree.

Now, we will later be able to offer a normalized rewritten command
string from the parse tree to the user, but I don't see us adding support
for that for cascaded drops, which is one more reason why I like to expose
them as subcommands.

> Again, I'm not understanding the distinction between toplevel events
> and sub-events.  I don't see any need for such a distinction.  I think
> there are just events, and some of them happen at command start/end
> and others happen somewhere in the middle.  As long as it's a safe and
> useful place to fire a trigger, who cares?

I guess your head is too heavily in the code side of things as opposed
to the SQL user's viewpoint. Maybe my attempt to reconcile both views is
not appropriate, but I really do think it is.

>> Given the scope of this mini expression language, we can easily bypass
>> calling the executor in v1 here, and reconsider later if we want to
>> allow calling a UDF in the WHEN clause… I don't think it's an easy
>> feature to add in, though.
>
> Or a necessary one.  AFAICS, the main benefit of WHEN clauses on

Exactly.

> regular triggers is that you can prevent the AFTER trigger queue from
> getting huge, and maybe a save a little on the cost of invoking a
> trigger function just to exit again.  But neither of those should be
> relevant here - nobody does that much DDL, and anybody writing command
> triggers should understand that this is advanced magic not intended
> for beginners.  Wizards below level 10 need not apply.

Here it's only a facility to manage your event trigger code
organization, and I insisted on having it in the syntax because in the
event trigger grammar

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Dobes Vandermeer
On Fri, Mar 30, 2012 at 9:57 PM, Merlin Moncure  wrote:

> On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan 
> wrote:
> > 1. I've been in discussion with some people about adding simple JSON
> extract
> > functions. We already have some (i.e. xpath()) for XML.
>
> Your json work is a great start in marrying document level database
> features with a relational backend.  My take is that storing rich data
> inside the database in json format, while tempting, is generally a
> mistake.  Unless the document is black box it should be decomposed and
> stored relationally and marked back up into a document as it goes out
> the door.
>

I found storing documents (with no schema) in CouchDB very sexy.  For
analytics purposes it's very handy to throw whatever data you've got into a
document, save it to the database, and sort out the schema later, when
you have a need for it.  It could be that you want to allow hundreds of
different fields in the document, and the document has a hierarchical
structure, and so on ... and yet there's no interest in spending time
figuring out how to map all that effectively into a SQL schema.  Some
things might start out as a black box but become interesting for reporting
purposes later.

While I was doing this I always thought this would have been a better
approach for my previous project, an accounting application.  If I could
just have stored entities like invoice & customer as a single document that
is inserted, updated, etc. atomically it would be a lot simpler and faster
than having to break things out into columns and rows spread over various
tables.  Some fields are just "information" that isn't searched for or used
relationally but ends up getting a field (or a whole bunch of fields, like
your address) in the DB schema anyway.  Adding, removing, and changing
columns is always scary to me in the SQL database - when there's no schema
you can fix these things as part of the definition of your "view" instead.

So I think the document database kind of bridges the gap between an
OODBMS and an RDBMS: the document is like a little cluster of
objects, and you populate your relations by mapping a view of those
documents and using that for analysis and search.

> What I'm saying is that jsonpath probably isn't the whole story:
> another way of bulk moving json into native backend structures without
> parsing would also be very helpful.  For example, being able to cast a
> json document into a record or a record array would be just amazing.
>

Hmm whatever that exactly is, it does sound cool!


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Dobes Vandermeer
On Fri, Mar 30, 2012 at 4:30 PM, Daniel Farina  wrote:

> On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer 
> wrote:
> >> Virtual hosts. Same port.
> >> I think SPDY or like-protocols [...] give a crisp treatment to
> >> interactive, stateful workloads involving
> >> back-and-forth between client and server with multiplexing, fixing
> >> some problems with the hacks in HTTP-land from before.
> >
> > It sounds like at some level you're really talking about replacing the
> > built-in protocol with SPDY because SPDY is possibly a better baseline
> than
> > updating the existing protocol.  That's an interesting idea, I think this
> > project could evolve in that direction if there's demand for it.
>
> If only so there is a smaller set of arbitrary decisions to make about
> how to delimit messages...but if SPDY doesn't get widely deployed, or
> exacts an unacceptable performance penalty, it is game over.
>

Well, in our case HTTP is a clear win (but not replacement) and SPDY a
potential one (even as a replacement).  Even if SPDY is not widely adopted
it could still replace FEBE if there's a clear advantage to using it, I
don't know enough to make the call right now.


Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Andrew Dunstan



On 03/30/2012 09:57 AM, Merlin Moncure wrote:
> What I'm saying is that jsonpath probably isn't the whole story:
> another way of bulk moving json into native backend structures without
> parsing would also be very helpful. For example, being able to cast a
> json document into a record or a record array would be just amazing.


It should be possible. After all, hstore's populate_record() does that. 
If we married that logic up with the json parser we should be half way 
there.
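
For comparison, the record-filling half already works today with
contrib/hstore, and a json version could end up having much the same shape.
A quick sketch -- json_populate_record() below is purely hypothetical, just
to illustrate what "casting a json document into a record" might look like:

  create table invoice (id int, customer text, total numeric);

  -- works today with contrib/hstore
  select * from populate_record(null::invoice,
         'id=>1, customer=>acme, total=>99.50'::hstore);

  -- a hypothetical json analogue of the same call
  select * from json_populate_record(null::invoice,
         '{"id": 1, "customer": "acme", "total": 99.50}'::json);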


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Tom Lane
Arun Chaitanya  writes:
> The link to the paper is
> http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Given the authorship of that paper, I'd have to wonder whether Microsoft
has filed for any patents regarding these ideas.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Thanks Robert,

Yes, I think I am being overambitious, as I have never had any open source
development experience.
Anyway, please go through the idea. I have posted the link to the
paper in one of the replies.

Please suggest other options that I could take up as a GSoC 2012 project.

On Fri, Mar 30, 2012 at 5:34 PM, Robert Haas  wrote:
> On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya  
> wrote:
>> I wanted to take up this as a GSOC 2012 project.
>
> This would be a great query planner optimization but the chances of
> getting it done in one summer as a GSoC project seem to me to be nil.
> You've never had a patch accepted before; picking a massive
> reorganization of the query planner as your first project is not going
> to work out well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Thanks a lot Heikki.
I have already posted an example in the mail.
The link to the paper is
http://www.iith.ac.in/~ravig/courses/cs5050/papers/decorrelation-cesar.pdf

Hope this helps,
Arun

On Fri, Mar 30, 2012 at 5:32 PM, Heikki Linnakangas
 wrote:
> (off-list)
>
> You'll want to post a link to the paper, otherwise people who are not GSoC
> mentors will have no idea what you're talking about ;-). Posting an example
> query and access plans would illustrate the point, too.
>
>
> On 30.03.2012 14:33, Arun Chaitanya wrote:
>>
>> Hi,
>>
>> I wanted to take up this as a GSOC 2012 project.
>>
>> SQL supports nested queries. When the inner query contains a
>> correlation variable the present optimizer takes an iterative
>> execution plan. If the inner query scans over a relation, the
>> iterative plan chosen can be sub-optimal.
>>
>> The goal of this project is to enable De-correlation for all possible
>> cases. The iterative execution plan can be converted to a set oriented
>> plan by taking a join over the base relations involved. Sufficient
>> work has already been done in this area and has been implemented in
>> SQL Server.
>>
>> The changes required to incorporate the above mentioned strategy is in
>> rewriting phase to the best of my knowledge. The key idea is to
>> introduce the APPLY operator in the raw parse tree. In the above
>> mentioned Papers, the author has mentioned the process of removing the
>> apply. The author has proposed a set of rules which will allow us to
>> achieve the goal. The present postgresql optimizer community has done
>> some work in these lines for simple subqueries involving =,>  ,<  in
>> the predicates [ I observed it by seeing the result of EXPLAIN for
>> relevant queries ]. The optimization is not done for subqueries
>> containing aggregate queries and existential and containment queries.
>>
>> An example query from TPCH benchmark discussed by the author:
>>
>> select c_custkey
>> from customer
>> where 100<  (select sum(o_totalprice)
>> from orders
>> where o_custkey = c_custkey)
>>
>> In the above case, c_custkey is a correlation variable (parameterized)
>> coming from the outer query. Hence in the present system, the inner
>> query is executed as many times as the tuples in the customer
>> relation. As the subQuery involves a scan over orders relation, the
>> total I/O cost involved is pretty high.
>>
>> Using the transformation proposed by the author, the query can be
>> re-written as
>>
>> select c_custkey
>> from customer left outer join
>> orders on o_custkey = c_custkey
>> group by c_custkey
>> having 100<  sum(o_totalprice)
>>
>> This allows the optimizer to choose a better plan for left outer join
>> and avoid iterative execution. The idea here is not to get a rewritten
>> query as output but to generate a plan tree that does the same as the
>> above query.
>>
>> Regards,
>> Arun
>>
>
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 4:32 AM, Dimitri Fontaine
 wrote:
> I did that another way in previous incarnations of the patch, which was
> to allow for INSTEAD OF event trigger backed by a SECURITY DEFINER
> function. When the extension is whitelisted, prevent against recursion
> then CREATE EXTENSION in the security definer function, then signal that
> the execution should now be aborted.
>
> That was too dangerous given the lack of policy about where exactly the
> user code is fired, but I think we could now implement that for some of
> the event timing specs we're listing. Only some of them, I guess only
> those that are happening before we lock the objects.

Oh, right: I remember that now.  I still think it's a bad way to do
it, because the trigger potentially has a lot of work to do to
reconstruct a working command string, and it still ends up getting
executed by the wrong user.  For CREATE EXTENSION it's not that bad,
because the arguments to the command are so simple, but of course any
time we extend the CREATE EXTENSION syntax, the trigger needs to know
about it too whether it's security-relevant or not, and doing
something similar with, say, ALTER TABLE would be ridiculously
complicated.  I think there is a use case for what you called an
INSTEAD OF trigger, but I don't believe in this one.  It seems to me
that there's a lot of power in being able to *just* intercept the
security decision and then let the rest of the command go about its
business.  Of course, you have to avoid mixing up security checks (like,
you must own the table in order to drop a column) with integrity
checks (like, you can't drop a column from pg_class), but I think
that's not very hard to get right.

>> More generally, my thought on the structure of this is that you're
>> going to have certain toplevel events, many of which will happen at
>> only a single place in the code, like "an object got dropped" or "a
>> DDL command started" or "a DDL command ended".  So we give those
>> names, like sql_drop, ddl_command_start, and ddl_command_end.  Inside
>
> I really dislike mixing sql_drop and ddl_command_start as being the same
> kind of objects here, even if I can bend my head in the right angle and
> see that it's a fair view when looking at how it's implemented. I can't
> see a way to explain that to users without having to explain them how
> drop cascade is implemented.
>
> So my proposal here is to “fake” a “proper“ subcommand thanks to the new
> context variable. If you DROP TYPE foo CASCADE and that in turn drops a
> function foo_in(), then an event trigger is fired with
>
>   context = 'DROP TYPE'
>   tag = 'DROP FUNCTION'
>
> Same idea when you DROP TABLE … CASCADE and a SEQUENCE and a bunch of
> index need to disappear too, you get an usual event trigger fired with
> the context set to 'DROP TABLE' this time.
>
> I don't think we need to arrange for explicitly publishing the context
> specific information here. If we need to, we have to find the right
> timing spec where we can guarantee still being in the top level command
> and where we already have the details filled in, then users can attach a
> trigger here and register the information for themselves.

I'm not sure I understand how you're using the words "context" and
"tag".  I think for a drop trigger I would want the function to
receive this information: type of object dropped, OID of object
dropped, column number in the case of a column drop, flag indicating
whether it's a toplevel drop or a cascaded drop.  I wouldn't object to
also making the currently-in-context toplevel command tag available,
but I think most drop triggers wouldn't really care, so I wouldn't
personally spend much implementation effort on it if it turns out to
be hard.

But in general, I don't really know what a "proper" subcommand is or
why some subcommands should be more proper than others, or why we
should even be concerned about whether something is a subcommand at
all.  I think it's fine and useful to have triggers that fire in those
kinds of places, but I don't see why we should limit ourselves to
that.  For applications like replication, auditing, and enhanced
security, the parse tree and subcommand/non-subcommand status of a
particular operation are irrelevant.  What you need is an exact
description of the operation that got performed (e.g. the default on
table X column Y got dropped); you might be able to reverse-engineer
that from the parse tree, but it's much better to have the system pass
you the information you need more directly.  Certainly, there are
cases where you might want to have the parse tree, or even the raw
command text, available, but I'm not even convinced that that those
cases will be the most commonly used ones unless, of course, they're
the only ones we offer, in which case everyone will go down that path
by necessity.

>> your trigger procedure, the set of magic variables that is available
>> will depend on which toplevel event you set the trigger on, but
>> hopefully al

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Merlin Moncure
On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan  wrote:
> 1. I've been in discussion with some people about adding simple JSON extract
> functions. We already have some (i.e. xpath()) for XML.

I've built a couple of applications that push data in and out of xml
via manual composition going out and xpath coming in.  TBH, I found
this to be a pretty tedious way of developing a general application
structure and a couple of notches down from the more sql driven
approach.  Not that jsonpath/xpath aren't wonderful functions -- but I
thing for general information passing there's a better way.

Your json work is a great start in marrying document level database
features with a relational backend.  My take is that storing rich data
inside the database in json format, while tempting, is generally a
mistake.  Unless the document is black box it should be decomposed and
stored relationally and marked back up into a document as it goes out
the door.  This is why brevity and flexibility of syntax is so
important when marshaling data in and out of transport formats.  It
encourages people to take the right path and get the best of both
worlds -- a rich backend with strong constraints that can natively
speak such that writing data driven web services is easy.

What I'm saying is that jsonpath probably isn't the whole story:
another way of bulk moving json into native backend structures without
parsing would also be very helpful.  For example, being able to cast a
json document into a record or a record array would be just amazing.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: pg_test_timing utility, to measure clock monotonicity and timing

2012-03-30 Thread Robert Haas
On Wed, Mar 28, 2012 at 10:43 AM, Fujii Masao  wrote:
> On Wed, Mar 28, 2012 at 9:19 PM, Robert Haas  wrote:
>> On Tue, Mar 27, 2012 at 10:10 PM, Fujii Masao  wrote:
>>> On Wed, Mar 28, 2012 at 5:17 AM, Robert Haas  wrote:
 pg_test_timing utility, to measure clock monotonicity and timing cost.
>>>
>>> When I compiled this, I got a compiler warning. Attached patch
>>> silences the warning.
>>
>> Unfortunately, that *produces* a warning on my machine.  Normally, I
>> think we handle this using INT64_FORMAT, but the fact that it's %10ld
>> here and not just %lld makes that awkward.  I guess we maybe need to
>> insert some kludgy workaround here - write it into a separate buffer,
>> and then blank-pad it, or something like that.
>
> This seems a simplest workaround. How about attached patch?

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Robert Haas
On Fri, Mar 30, 2012 at 7:33 AM, Arun Chaitanya  wrote:
> I wanted to take up this as a GSOC 2012 project.

This would be a great query planner optimization but the chances of
getting it done in one summer as a GSoC project seem to me to be nil.
You've never had a patch accepted before; picking a massive
reorganization of the query planner as your first project is not going
to work out well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

2012-03-30 Thread Arun Chaitanya
Hi,

I wanted to take up this as a GSOC 2012 project.

SQL supports nested queries. When the inner query contains a
correlation variable the present optimizer takes an iterative
execution plan. If the inner query scans over a relation, the
iterative plan chosen can be sub-optimal.

The goal of this project is to enable De-correlation for all possible
cases. The iterative execution plan can be converted to a set oriented
plan by taking a join over the base relations involved. Sufficient
work has already been done in this area and has been implemented in
SQL Server.

The changes required to incorporate the above-mentioned strategy are in
the rewriting phase, to the best of my knowledge. The key idea is to
introduce the APPLY operator in the raw parse tree. In the above-mentioned
papers, the author describes the process of removing the APPLY and
proposes a set of rules which will allow us to achieve the goal. The
PostgreSQL community has already done some work along these lines in the
optimizer for simple subqueries involving =, > and < in the predicates
[ I observed it by looking at the result of EXPLAIN for relevant
queries ]. The optimization is not done for subqueries containing
aggregates, or for existential and containment queries.

An example query from TPCH benchmark discussed by the author:

select c_custkey
from customer
where 100 < (select sum(o_totalprice)
from orders
where o_custkey = c_custkey)

In the above case, c_custkey is a correlation variable (parameterized)
coming from the outer query. Hence, in the present system, the inner
query is executed as many times as there are tuples in the customer
relation. As the subquery involves a scan over the orders relation, the
total I/O cost involved is pretty high.

Using the transformation proposed by the author, the query can be re-written as

select c_custkey
from customer left outer join
orders on o_custkey = c_custkey
group by c_custkey
having 100 < sum(o_totalprice)

This allows the optimizer to choose a better plan for the left outer join
and avoid iterative execution. The idea here is not to get a rewritten
query as output but to generate a plan tree that does the same as the
above query.

Regards,
Arun

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Finer Extension dependencies

2012-03-30 Thread Dimitri Fontaine
Tom Lane  writes:
>> That's how I did it first, but Alvaro opposed to that because it allows
>> for more than one extension to provide for the same feature name.
>>   http://archives.postgresql.org/pgsql-hackers/2012-03/msg01425.php
>
> Right, but the question that has to be considered is how often would
> that be intentional as opposed to an undesirable name collision.
> I think Hitoshi was right upthread that it will seldom if ever be
> the case that somebody is independently reimplementing somebody
> else's API, so the use-case for intentional substitution seems thin.

I reverted that change and we're now back to:

Table "pg_catalog.pg_extension_feature"
   Column   | Type | Modifiers 
+--+---
 extoid | oid  | not null
 extfeature | name | not null
Indexes:
"pg_extension_feature_index" UNIQUE, btree (extoid, extfeature)
"pg_extension_feature_oid_index" UNIQUE, btree (oid)

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



extension-provides.v7.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] poll: CHECK TRIGGER?

2012-03-30 Thread Pavel Stehule
Hello

2012/3/28 Heikki Linnakangas :
> Ok, seems that the API issue is settled, so I'm now looking at the code
> actually doing the checking. My first impression is that this is a lot of
> code. Can we simplify it?
>

I played with this, and it cannot be reduced without obscuring the current
code of pl_exec.c. So I moved the code related to checking out into a
new file, pl_check.c. This code is relatively large - about 50 kB - but it
is relatively simple and I hope it is readable. I am afraid this cannot
be reduced (significantly) by sharing code with the other pl_*.c files.
Recursive iteration over the nodes is a relatively small part of this
patch (~25%), and a general stmt walker doesn't help too much.

> Since this is deeply integrated into the PL/pgSQL interpreter, I was
> expecting that this would run through the normal interpreter, in a special
> mode that skips all the actual execution of queries, and shortcuts all loops
> and other control statements so that all code is executed only once. That
> would mean sprinkling some "if (check_only)" code into the normal exec_*
> functions. I'm not sure how invasive that would be, but it's worth
> considering. I think you would be able to more easily catch more errors that
> way, and the check code would stay better in sync with the execution code.

-1

There are a few places that are very difficult: basic blocks with
exception handling (exception handlers), the CASE stmt, ...  Another issue
is the increased complexity of some routines like exec_eval*.

>
> Another thought is that check_stmt() and all its subroutines are very
> similar to the plpgsql_dumptree() code. Would it make sense to merge those?
> You could have an output mode, in addition to the xml and plain-text
> formats, that would just dump the whole tree like plpgsql_dumptree() does.
>

It is difficult now - without changes to plpgsql_stmt_if,
plpgsql_stmt_case and plpgsql_stmt_block it is not possible to write a
general walker that is usable for both checking and dumptree. It would
need a redesign of these nodes first.

> In prepare_expr(), you use a subtransaction to catch any ERRORs that happen
> during parsing the expression. That's a good idea, and I think many of the
> check_* functions could be greatly simplified by adopting a similar
> approach. Just ereport() any errors you find, and catch them at the
> appropriate level, appending the error to the output string. Your current
> approach of returning true/false depending on whether there was any errors
> seems tedious.

It cannot be implemented in the AST interpreter without removing some
requested functionality - fatal_errors.

>
> If you create a function with an invalid body (ie. set
> check_function_bodies=off; create function ... $$ bogus $$;) ,
> plpgsql_check_function() still throws an error. It's understandable that it
> cannot do in-depth analysis if the function cannot be parsed, but I would
> expect the syntax error to be returned as a return value like other errors
> that it complains about, not thrown as a hard ERROR. That would make it more
> useful to bulk-check all functions in a database with something like "select
> plpgsql_check_function(oid) from pg_class". As it is, the checking stops at
> the first invalid function with an error.

done

postgres=> select plpgsql_check_function('sss'::regproc, 0);
              plpgsql_check_function
--------------------------------------------------------------
 error:42601:syntax error at or near "adasdfsadf"
 Query:  adasdfsadf
 --      ^
 Context: compilation of PL/pgSQL function "sss" near line 1
(4 rows)
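
Since the syntax error now comes back as a result row instead of a hard
ERROR, bulk-checking a whole database becomes practical.  A sketch,
assuming the two-argument signature shown above (trigger functions would
presumably need the relation OID as the second argument, so they are
skipped here):

  select p.oid::regprocedure as function,
         plpgsql_check_function(p.oid, 0) as report
    from pg_catalog.pg_proc p
    join pg_catalog.pg_language l on l.oid = p.prolang
   where l.lanname = 'plpgsql'
     and p.prorettype <> 'pg_catalog.trigger'::pg_catalog.regtype;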


>
> PS. I think plpgsql_check_function() belongs in pl_handler.c

done

Regards

Pavel Stehule

>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com


plpgsql_check_function2-2012-03-30.diff.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers patch v18

2012-03-30 Thread Dimitri Fontaine
Robert Haas  writes:
> I'm thinking of things like extension whitelisting.  When some
> unprivileged user says "CREATE EXTENSION harmless", and harmless is
> marked as superuser-only, we might like to have a hook that gets
> called *at permissions-checking time* and gets to say, oh, well, that
> extension is on the white-list, so we're going to allow it.  I think
> you can come up with similar cases for other commands, where in
> general the operation is restricted to superusers or database owners
> or table owners but in specific cases you want to allow others to do
> it.

I did that another way in previous incarnations of the patch, which was
to allow for an INSTEAD OF event trigger backed by a SECURITY DEFINER
function. When the extension is whitelisted, you prevent recursion,
then run CREATE EXTENSION in the security definer function, then signal
that the execution should now be aborted.

That was too dangerous given the lack of policy about where exactly the
user code is fired, but I think we could now implement that for some of
the event timing specs we're listing. Only some of them, I guess only
those that are happening before we lock the objects.

I would then prefer using the INSTEAD OF keywords, which are much easier
to grasp than AT.

> CREATE EVENT TRIGGER name ON event_name (event_subtype_name [, ...])
> EXECUTE PROCEDURE function_name(args);

  create event trigger prohibit_some_ddl
     preceding 
          when tag in ('CREATE TABLE', 'ALTER TABLE')
       execute procedure throw_an_error();
>
> I guess that would make sense if you think there would ever be more
> than one choice for .  I'm not immediately seeing a
> use case for that, though - I was explicitly viewing the syntax foo

So, the variables in question are tag, objectid, objectname, schemaname
and, from a very recent email, context. On reflection, I think the variable
here would only be either tag or context, and that's it.

> More generally, my thought on the structure of this is that you're
> going to have certain toplevel events, many of which will happen at
> only a single place in the code, like "an object got dropped" or "a
> DDL command started" or "a DDL command ended".  So we give those
> names, like sql_drop, ddl_command_start, and ddl_command_end.  Inside

I really dislike mixing sql_drop and ddl_command_start as being the same
kind of object here, even if I can bend my head at the right angle and
see that it's a fair view when looking at how it's implemented. I can't
see a way to explain that to users without having to explain to them how
drop cascade is implemented.

So my proposal here is to “fake” a “proper” subcommand thanks to the new
context variable. If you DROP TYPE foo CASCADE and that in turn drops a
function foo_in(), then an event trigger is fired with

   context = 'DROP TYPE'
   tag = 'DROP FUNCTION'

Same idea when you DROP TABLE … CASCADE and a SEQUENCE and a bunch of
indexes need to disappear too: you get a usual event trigger fired with
the context set to 'DROP TABLE' this time.

I don't think we need to arrange for explicitly publishing the
context-specific information here. If we need to, we have to find the
right timing spec where we can guarantee still being in the top-level
command and where we already have the details filled in; then users can
attach a trigger there and register the information for themselves.

> your trigger procedure, the set of magic variables that is available
> will depend on which toplevel event you set the trigger on, but
> hopefully all firings of that toplevel event can provide the same
> magic variables.  For example, at ddl_command_start time, you're just
> gonna get the command tag, but at ddl_command_end time you will get
> the command tag plus maybe some other stuff.

With my proposal above, you could get the same set of information when
being called as a toplevel event or a subevent (one where the context is
not null). That would mean adding object name and schema name lookups in
the drop cascade code, though. We can also decide not to do that extra
lookup and just publish the object id, which we certainly do have.

This way, the timing spec of a sub-event can still be of the same kind
as the top-level event ones, we still have before and after lock entry
points, same with lookup if we add that feature, etc.

> Now, we COULD stop there.  I mean, we could document that you can
> create a trigger on ddl_command_start and every DDL command will fire
> that trigger, and if the trigger doesn't care about some DDL
> operations, then it can look at the command tag and return without
> doing anything for the operations it doesn't care about.  The only
> real disadvantage of doing it that way is speed, and maybe a bit of
> code complexity within the trigger.  So my further thought was that

Without my “context proposal”, you also lose the ability to refer to
sub-events as plain events with a context, which I find so much cleaner.

> we'd allow you to specify 

Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)

2012-03-30 Thread Daniel Farina
On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer  wrote:
>> Virtual hosts. Same port.
>
> In that case, the frontend would not be tied to a specific PostgreSQL
> server, then?  I think initially this might complicate things a bit, and you
> could solve it by putting an HTTP proxy in front to do the virtual hosts for
> you.

I think these problems are treatable, as you mention, and if there is
to be any value at all it comes from using mostly off-the-shelf components
(which, at this time, are rarer for the new-generation stuff).  That's
the draw, for me.

> Ah, I get it - you want a way to add some extra information to the protocol
> in a backwards compatible way.  HTTP (and SPDY) provides a "standard" way to
> do that.  Makes sense.
>
> I'll make a note of these and hash out the details a bit more once there's
> something working to add them to.

A lot of them are old ideas, but it would be nice to encourage
experimentation by getting over some of the
small-matter-of-programming and backwards-compatibility issues.

>> I think SPDY or like-protocols [...] give a crisp treatment to
>> interactive, stateful workloads involving
>>
>> back-and-forth between client and server with multiplexing, fixing
>> some problems with the hacks in HTTP-land from before.
>
> It sounds like at some level you're really talking about replacing the
> built-in protocol with SPDY because SPDY is possibly a better baseline than
> updating the existing protocol.  That's an interesting idea, I think this
> project could evolve in that direction if there's demand for it.

If only so there is a smaller set of arbitrary decisions to make about
how to delimit messages...but if SPDY doesn't get widely deployed, or
exacts an unacceptable performance penalty, it is game over.  The
worst possible outcome is that the thing that becomes common is also
impractical for session-oriented sequential SQL execution, but given
the use cases driving this standardization process I am hopeful that
this is less likely to happen.  FEBE's framing/message format
doesn't seem like an area where people are itching to try wild and
crazy changes unique to Postgres (maybe someone has...), so using a
more commonly seen delimitation format seems like a concession that
opens a lot of useful and interesting doors.

This discussion is somewhat premature because HTTP 2.0 isn't
standardized, nor has SPDY become a widespread de facto format
(although the percentage of well-known web-infrastructure projects
implementing it has grown impressively, both recently and
overall), and there are not even straw-man bandwidth/latency
measurements (versus FEBE).  But as long as someone is
bringing up HTTP, I thought it worth discussing in a little more
depth, because it's something I regularly poll mentally, looking for a
sign that It's Time.  It wasn't ready enough for me to start a thread,
but clearly I couldn't quite resist replying to one...

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers