[HACKERS] Fwd: Extended query protocol and exact types matches.
-- Forwarded message -- From: Dmitriy Igrishin Date: 2010/12/10 Subject: Fwd: Extended query protocol and exact types matches. To: postgres list

Hey sql@,

-- Forwarded message -- From: Dmitriy Igrishin Date: 2010/12/9 Subject: Extended query protocol and exact types matches. To: pgsql-gene...@postgresql.org

Hey general@,

Just to be sure, and for my own peace of mind.

Problem:
1. CREATE TABLE test_tab (id integer, dat varchar(64));
2. INSERT INTO test_tab VALUES($1, $2) via PQexecParams, where paramTypes[0] == OID of bigint, paramTypes[1] == OID of text.

Questions:
Does this case fall under http://www.postgresql.org/docs/9.0/static/typeconv-query.html ?
Are such cases safe, or is it recommended (best) to specify OIDs that are exact matches?

PS. I know that queries like SELECT $1 do not work without specifying an OID or without rewriting them to e.g. SELECT $1::text.

Thanks.

--
// Dmitriy.
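For readers who want to try this, here is a minimal libpq sketch of the scenario described above (not part of the original mail). It assumes a database named "test" containing the table from step 1, and hard-codes the pg_type.h OIDs 20 (int8) and 25 (text); passing 0 instead of an OID would let the server infer the parameter type, which is the case the type-conversion rules linked above describe.

    /* Minimal sketch: bind an int8/text parameter pair against
     * integer/varchar columns via PQexecParams with explicit,
     * non-exact OIDs (20 = int8, 25 = text from pg_type.h). */
    #include <stdio.h>
    #include <libpq-fe.h>

    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=test");
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        const Oid   paramTypes[2]  = {20, 25};          /* int8, text */
        const char *paramValues[2] = {"1", "some text"};

        PGresult *res = PQexecParams(conn,
                                     "INSERT INTO test_tab VALUES ($1, $2)",
                                     2,            /* number of parameters */
                                     paramTypes,   /* explicit parameter OIDs */
                                     paramValues,
                                     NULL, NULL,   /* text-format parameters */
                                     0);           /* text-format result */
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }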
Re: [HACKERS] To Signal The postmaster
Hi All,

Can anybody tell me what steps Postgres follows after it finds the trigger file? When and how does it set the recovery mode to false?

On Thu, Dec 9, 2010 at 3:51 PM, Fujii Masao wrote: > On Wed, Dec 8, 2010 at 6:22 PM, Fujii Masao wrote: > > On Wed, Dec 8, 2010 at 4:59 PM, Heikki Linnakangas > > wrote: > >> For 9.1, we should think of a better way to do this, perhaps using > SIGUSR1 > >> to wake up. Maybe we won't even need the trigger file anymore. > > > > If we use SIGUSR1, the mechanism to allow the users to specify the event > type > > seems to be required. For example, we should make the SIGUSR1 handler > > check not only the shmem (i.e., PMSignalStat) but also the file? > > What I'm thinking is to make something like "pg_ctl promote" create the > event > file specifying the standby promotion in $PGDATA/pg_event or elsewhere, > and send SIGUSR1 to postmaster. OTOH, when SIGUSR1 arrives, postmaster > checks whether that event file exists. If it does, postmaster removes it > and > sends the signal to startup process for standby promotion. > > Thought? > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center

-- Thanks & Regards, Aaliya Zarrin (+91)-9160665888
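To make the quoted proposal a bit more concrete, here is a rough sketch of the postmaster-side check Fujii describes. It is illustrative only: the event-file location, the StartupPID variable, and the choice of signal to the startup process are assumptions, not actual PostgreSQL code.

    /* Hypothetical sketch of the "pg_ctl promote" idea: on SIGUSR1 the
     * postmaster looks for an event file under the data directory and,
     * if present, consumes it and tells the startup process to promote. */
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define PROMOTE_EVENT_FILE "pg_event/promote"   /* assumed, relative to $PGDATA */

    extern pid_t StartupPID;                        /* assumed: pid of the startup process */

    void
    sigusr1_handler(int sig)
    {
        (void) sig;

        /* Was this SIGUSR1 accompanied by a promotion event file? */
        if (access(PROMOTE_EVENT_FILE, F_OK) == 0)
        {
            unlink(PROMOTE_EVENT_FILE);             /* consume the event */
            if (StartupPID != 0)
                kill(StartupPID, SIGUSR2);          /* ask startup to leave recovery */
        }

        /* otherwise fall through to the existing shmem (PMSignal) checks */
    }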
[HACKERS] Anyone for SSDs?
Hi all,

Most of you already know I am new to this list and newer still to any OSS development. However, while browsing the source code (of 9.0.1) I find that there is only one way to store relations on disk - the magnetic disk. This came to mind suddenly, so I am asking the experts here. Considering that SSDs will be common (at least for the enterprise) in the coming years because of (of course you know the reason) their lower seek times and higher transfer rates, is anyone working on an ssd.c? In almost all cases, even using md.c, the kernel will handle it effectively, but would it not be better to be well prepared to ask the kernel for more? Or has such an attempt already begun?

- Vaibhav (*_*)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] serializable read only deferrable
On Tue, Dec 07, 2010 at 10:14:24AM -0600, Kevin Grittner wrote: > > Essentially, instead of adding dependencies as you go along and > > abort once you hit a conflict, SERIALIZABLE READ ONLY DEFERRED > > transactions would assume the worst case from the start and thus > > be able to bypass the more detailed checks later on. > > Right -- such a transaction, having acquired a good snapshot, could > release all SSI resources and run without any of the SSI overhead. Yes, this makes sense. If no running transaction has ever read, and will never read before COMMIT, any value that's modified by a concurrent transaction, then they will not create snapshot anomalies, and the current snapshot has a place in the serial ordering. > > With this scheme, you'd at least stand some chance of eventually > > acquiring a consistent snapshot, even in the case of an endless > > stream of overlapping READ WRITE transactions. > > Yeah, I'd been twisting ideas around trying to find a good way to do > this; you've got it right at the conceptual level, I think. The only thing I'm worried about here is how much risk of starvation remains. You'd need to wait until there are no running r/w transactions accessing overlapping data sets; for some applications that might not be any better than waiting for the system to be idle. But I think there's no way around that, it's just the price you have to pay to get a snapshot that can never see an anomaly. > Pseudo-code of idea (conveniently ignoring locking issues and > non-serializable transactions): This seems reasonable to me. Let me know if you need help implementing it; I have some spare cycles right now. Dan -- Dan R. K. Ports MIT CSAILhttp://drkp.net/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
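For readers following the thread, here is a purely illustrative C-style sketch of the retry loop being discussed. Every name in it is hypothetical — this is neither Kevin's pseudo-code nor the real SSI implementation — but it shows the shape of the idea: keep taking snapshots until one is provably safe, then drop all SSI bookkeeping.

    /* Illustrative sketch only: how a SERIALIZABLE READ ONLY DEFERRABLE
     * transaction might obtain a snapshot that can never see an anomaly. */
    typedef struct Snapshot Snapshot;    /* stand-in for the real snapshot type */

    extern Snapshot *AcquireCandidateSnapshot(void);                        /* hypothetical */
    extern int       AnyOverlappingReadWriteSerializableXacts(Snapshot *);  /* hypothetical */
    extern void      ReleaseAllSSIResources(void);                          /* hypothetical */
    extern void      WaitForSomeOverlappingXactToFinish(void);              /* hypothetical */

    Snapshot *
    GetSafeReadOnlySnapshot(void)
    {
        for (;;)
        {
            Snapshot *snap = AcquireCandidateSnapshot();

            /*
             * If no concurrent read-write serializable transaction overlaps
             * this snapshot, nothing it reads can be part of a dangerous
             * structure, so the snapshot is safe: drop the SSI bookkeeping.
             */
            if (!AnyOverlappingReadWriteSerializableXacts(snap))
            {
                ReleaseAllSSIResources();
                return snap;        /* run with no SSI overhead from here on */
            }

            /*
             * Otherwise block until one of the overlapping transactions ends
             * and try again.  As noted above, an endless stream of overlapping
             * read-write transactions can starve this loop; that is the price
             * of a provably anomaly-free snapshot.
             */
            WaitForSomeOverlappingXactToFinish();
        }
    }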
[HACKERS] SynchRep; wait-forever and shutdown
Hi,

In previous discussion, some people wanted the "wait-forever" option which blocks all the transactions on the master until a sync'd standby has appeared, in order to reduce the risk of data loss in synchronous replication. What I'm not clear on is: how does a smart or fast shutdown advance while all the transactions are being blocked?

1. Shutdown should wait for all the transactions to end by appearance of sync'd standby? * Problem is that shutdown would take very long.

2. Shutdown should commit all the blocking transactions? * Problem is that a client thinks that those transactions have successfully been committed even though they have not been replicated to the standby.

3. Shutdown should abort all the blocking transactions? * Problem is that a client thinks that those transactions have been aborted even though those WAL records have been written on the master. But this is a very common problem for DBMSs, so we don't need to worry about it in the context of replication.

ISTM smart and fast shutdown fit in with #1 and #3, respectively.

Thought?

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Josh Berkus wrote:

> Did you rerun test_fsync with O_DIRECT enabled, using my patch? The figures you had from test_fsync earlier were without O_DIRECT.

No--I was just focused on testing the changes that had already been committed. The use of O_DIRECT in the server but not test_fsync could very well be the reason for the difference; don't know yet. I'm trying to get through this CF before I start getting distracted by newer patches, I'll get to yours soon I hope.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Greg,

> This is interesting, because test_fsync consistently reported a rate of > about half this when using open_datasync instead of the equal > performance I'm getting from the database. I'll see if I can reproduce > that further, but it's no reason to be concerned about the change that's > been made I think. Just more evidence that test_fsync has quirks left > to be sorted out. But that's not backbranch material, it should be part > of 9.1 only refactoring, already in progress via the patch Josh > submitted. There's a bit of time left to get that done.

Did you rerun test_fsync with O_DIRECT enabled, using my patch? The figures you had from test_fsync earlier were without O_DIRECT.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Postgresql 9.1 pg_last_xact_replay_timestamp limitations
On Fri, Dec 10, 2010 at 1:24 AM, Gabi Julien wrote: > On Wednesday 08 December 2010 21:58:46 you wrote: >> On Thu, Dec 9, 2010 at 1:37 AM, Gabi Julien >> wrote: >> > slave# /etc/init.d/postgresql start >> > slave# psql -hlocalhost my_db -c "select pg_last_xact_replay_timestamp(), >> > now() as not_modified_since;" >> > pg_last_xact_replay_timestamp | not_modified_since >> > ---+--- >> > | 2010-12-08 16:06:09.920219+00 > >> We should return the timestamp of last valid checkpoint rather than NULL in >> that >> case? > > Well, I think this behavior would be more appreciated by postgresql users in > general. The case where the slave can be restarted after a clean shutdown is > rare but we need to consider it nonetheless. In my case I implemented a > custom function that reads the last returned timestamp from a new file on > disk. This is not a perfect solution since the value returned might be older > than the actual state of the replication but it's good enough for my needs.

The second question is: what should be returned when the server has been started normally without recovery? NULL? The timestamp of the last valid checkpoint?

The third question is: what should be returned while replaying WAL records which exist between the REDO starting point and the checkpoint? In this case, it seems bad to return the timestamp of the checkpoint whenever there is no replayed transaction, since the result timestamp would go back once at least one transaction has been replayed before reaching the checkpoint record.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote:

> In my test cases, the syncs that the backends were doing were almost always to the same file that the checkpoint writer was already choking on (so they are entangled simply by virtue of that). So very quickly all the backends hit the same wall and thunked to a halt. This is probably a feature of trying to use pgbench as the basis to get a very artificial model.

Yes--pgbench has some problems like you describe, ones that are a bit different than the way I've seen fsync writes get in each other's way in the production systems I've looked at. That's good if you really want to provoke this behavior, which is one reason why I've used it as an example for my patches so far (the other being that it's already available in everyone's installation). But it's tough to get it to act more like real-world systems, which don't have quite so many localized updates, without cranking the scale way up. And that then tends to aggravate other problems too.

The 8.3 checkpoint spreading work also got some useful results using the dbt-2 benchmark. I'm at the point where I think I need to return to that test program for what I'm doing now. I'd encourage you to try that out too if you get a chance.

Thanks for the feedback and the review. I hope you appreciate now why I suggested you wait for the stuff I was submitting before getting back into the sorted checkpoint topic again. That should be a lot easier to make sense of with this instrumentation in place.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Since any Windows refactoring has been postponed for now (I'll get back to performance checks on that platform later), during my testing time this week instead I did a round of pre-release review of the change Tom has now committed. All looks good to me, including the docs changes. I confirmed that:

-Ubuntu system with an older kernel still has the same wal_sync_method (fdatasync) and performance after pulling the update
-RHEL6 system changes as planned from using open_datasync to fdatasync once I updated to a HEAD after the commit

On the RHEL6 system, I also checked the commit rate using pgbench with the attached INSERT only script, rather than relying on test_fsync. This is 7200 RPM drive, so theoretical max of 120 commits/second, on ext4; this is the same test setup I described in more detail back in http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com

$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 fdatasync
(1 row)

$ pgbench -i -s 10 pgbench

[gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6733
tps = 112.208795 (including connections establishing)
tps = 112.216904 (excluding connections establishing)

And then manually switched over to test performance of the troublesome old default:

[gsm...@meddle ~]$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync

[gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6672
tps = 111.185802 (including connections establishing)
tps = 111.195089 (excluding connections establishing)

This is interesting, because test_fsync consistently reported a rate of about half this when using open_datasync instead of the equal performance I'm getting from the database. I'll see if I can reproduce that further, but it's no reason to be concerned about the change that's been made I think. Just more evidence that test_fsync has quirks left to be sorted out. But that's not backbranch material, it should be part of 9.1 only refactoring, already in progress via the patch Josh submitted. There's a bit of time left to get that done.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote: > I wrote: > > We're throwing away one tuple at a time as we advance forward through > > the tuplestore, and moving 10+ tuple pointers each time. Ugh. > > This code was all right when written, because (IIRC) the mergejoin > > case was actually the only caller. But it's not all right for > > WindowAgg's less-predictable usage patterns. > > > I thought for a bit about changing things around so that the first-used > > tuple slot isn't necessarily state->memtuples[0], but just like the > > comment says, that complicates a lot of other logic. And there isn't > > any easy place to reclaim the wasted slots later. > > > What seems like the best bet is to put in a heuristic to make > > tuplestore_trim simply not do anything until nremove reaches some > > reasonably large amount, perhaps 10% of the number of stored tuples. > > This wastes up to 10% of the alloted memory, but that seems tolerable. > > On reflection I think just not doing anything isn't a very good idea. > The problem with that is that a mis-coded caller could try to fetch > tuples that it had already told the tuplestore could be trimmed away; > and this would work, most of the time, until you got unlucky and the > trim operation had actually deleted them. I think it's pretty important > for bug-catching purposes that the tuplestore enforce that those tuples > are not available anymore. > > Hence the attached patch, which combines the two ideas by recycling > tuples immediately but not sliding the pointer array until a reasonable > amount of movement has occurred. This fixes the complained-of > performance problem AFAICT. > > I'm not sure whether or not to back-patch this into 9.0 and 8.4. The > code in tuplestore.c hasn't changed at all since 8.4, so there's not > much risk of cross-version bugs, but if I did miss anything we could > be shipping a buggy version next week. Thoughts? > > regards, tom lane > +1 for back patching. Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
Tom Lane wrote: > Do you have reason to think that anybody is likely to exercise > window functions in HEAD, beyond what the regression tests do, in > the next couple of months? Not specifically, no. From the description (not having read the patch) I was somewhat concerned that it might affect something outside that narrow use case. If that's not possible, then I'm more comfortable putting it in a release that soon after it hits the repository. It's a judgment call, and you're clearly in the best position to make it. You asked for thoughts, so I shared my concerns. :-) -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
"Kevin Grittner" writes: > Tom Lane wrote: >> I'm not sure whether or not to back-patch this into 9.0 and 8.4. >> The code in tuplestore.c hasn't changed at all since 8.4, so >> there's not much risk of cross-version bugs, but if I did miss >> anything we could be shipping a buggy version next week. >> Thoughts? > Is there a performance regression involved, or is it a new feature > which hasn't performed well on this type of query until your patch? Well, since window functions didn't exist before 8.4, it's difficult to argue that there was a regression. It's certainly a performance bug though: nobody would expect that giving a query *more* work_mem would cause it to run many times slower. > If the latter, I'd be inclined to give it some time on HEAD and > release it in the *following* minor release unless you're *very* > confident it couldn't break anything. Well, I'm reasonably confident in the patch, and it does pass regression tests. But I've been wrong before. I'm not terribly thrilled with that suggestion though. Do you have reason to think that anybody is likely to exercise window functions in HEAD, beyond what the regression tests do, in the next couple of months? Slipping the application of the patch to back branches by a little bit doesn't make a lot of management sense to me. I think either we trust it and put it into back branches, or we don't trust it and put it only in HEAD, so it goes through a beta cycle before hitting production. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
Tom Lane wrote: > I'm not sure whether or not to back-patch this into 9.0 and 8.4. > The code in tuplestore.c hasn't changed at all since 8.4, so > there's not much risk of cross-version bugs, but if I did miss > anything we could be shipping a buggy version next week. > Thoughts?

Is there a performance regression involved, or is it a new feature which hasn't performed well on this type of query until your patch?

If the latter, I'd be inclined to give it some time on HEAD and release it in the *following* minor release unless you're *very* confident it couldn't break anything. It's an uphill battle to convince managers that it's safe to apply minor upgrades with minimal testing. It doesn't take too many slips for the boulder to roll all the way back to the bottom of that hill.

-Kevin

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
I wrote: > We're throwing away one tuple at a time as we advance forward through > the tuplestore, and moving 10+ tuple pointers each time. Ugh. > This code was all right when written, because (IIRC) the mergejoin > case was actually the only caller. But it's not all right for > WindowAgg's less-predictable usage patterns. > I thought for a bit about changing things around so that the first-used > tuple slot isn't necessarily state->memtuples[0], but just like the > comment says, that complicates a lot of other logic. And there isn't > any easy place to reclaim the wasted slots later. > What seems like the best bet is to put in a heuristic to make > tuplestore_trim simply not do anything until nremove reaches some > reasonably large amount, perhaps 10% of the number of stored tuples. > This wastes up to 10% of the alloted memory, but that seems tolerable. On reflection I think just not doing anything isn't a very good idea. The problem with that is that a mis-coded caller could try to fetch tuples that it had already told the tuplestore could be trimmed away; and this would work, most of the time, until you got unlucky and the trim operation had actually deleted them. I think it's pretty important for bug-catching purposes that the tuplestore enforce that those tuples are not available anymore. Hence the attached patch, which combines the two ideas by recycling tuples immediately but not sliding the pointer array until a reasonable amount of movement has occurred. This fixes the complained-of performance problem AFAICT. I'm not sure whether or not to back-patch this into 9.0 and 8.4. The code in tuplestore.c hasn't changed at all since 8.4, so there's not much risk of cross-version bugs, but if I did miss anything we could be shipping a buggy version next week. Thoughts? regards, tom lane diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c index 9bbaba43771f495fdf24e9f2afd545b69a22ecbd..8c8139c897679892e0d4ad13e69ae8d814484206 100644 *** a/src/backend/utils/sort/tuplestore.c --- b/src/backend/utils/sort/tuplestore.c *** struct Tuplestorestate *** 145,152 --- 145,159 /* * This array holds pointers to tuples in memory if we are in state INMEM. * In states WRITEFILE and READFILE it's not used. + * + * When memtupdeleted > 0, the first memtupdeleted pointers are already + * released due to a tuplestore_trim() operation, but we haven't expended + * the effort to slide the remaining pointers down. These unused pointers + * are set to NULL to catch any invalid accesses. Note that memtupcount + * includes the deleted pointers. */ void **memtuples; /* array of pointers to palloc'd tuples */ + int memtupdeleted; /* the first N slots are currently unused */ int memtupcount; /* number of tuples currently present */ int memtupsize; /* allocated length of memtuples array */ *** tuplestore_begin_common(int eflags, bool *** 252,257 --- 259,265 state->context = CurrentMemoryContext; state->resowner = CurrentResourceOwner; + state->memtupdeleted = 0; state->memtupcount = 0; state->memtupsize = 1024; /* initial guess */ state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *)); *** tuplestore_clear(Tuplestorestate *state) *** 401,407 state->myfile = NULL; if (state->memtuples) { ! for (i = 0; i < state->memtupcount; i++) { FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i])); pfree(state->memtuples[i]); --- 409,415 state->myfile = NULL; if (state->memtuples) { ! 
for (i = state->memtupdeleted; i < state->memtupcount; i++) { FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i])); pfree(state->memtuples[i]); *** tuplestore_clear(Tuplestorestate *state) *** 409,414 --- 417,423 } state->status = TSS_INMEM; state->truncated = false; + state->memtupdeleted = 0; state->memtupcount = 0; readptr = state->readptrs; for (i = 0; i < state->readptrcount; readptr++, i++) *** tuplestore_end(Tuplestorestate *state) *** 432,438 BufFileClose(state->myfile); if (state->memtuples) { ! for (i = 0; i < state->memtupcount; i++) pfree(state->memtuples[i]); pfree(state->memtuples); } --- 441,447 BufFileClose(state->myfile); if (state->memtuples) { ! for (i = state->memtupdeleted; i < state->memtupcount; i++) pfree(state->memtuples[i]); pfree(state->memtuples); } *** tuplestore_gettuple(Tuplestorestate *sta *** 774,787 } else { ! if (readptr->current <= 0) { Assert(!state->truncated); return NULL; } readptr->current--; /* last returned tuple */ } ! if (readptr->current <= 0) { Assert(!state->truncated); return NULL; --- 783,796 } else {
Re: [HACKERS] Extensions, patch v16
On Dec 9, 2010, at 12:34 PM, Dimitri Fontaine wrote: > - add support for 'relocatable' boolean property in the control file, > as discussed on list > > this controls what happens at create extension time, by doing a > relocation of the extension objects when the extension is relocatable > and the asked schema isn't the first of the search_path (you can't > relocate an object to a schema where it already is) > > when the extension is not relocatable, the mechanism used is the > @extschema@ replacement in the script so that the user still has a > say, but at create time only This still isn't ideal, but I think it's a big improvement. Thanks. > - nothing is done for the psql commands \dx and \dx+, here's an idea: > > \dx lists only installed extensions > \dx+ lists the objects, calling pg_extension_objects() > \dX lists available extensions (and installed too) +1 I think that's much more like existing psql commands. > - we still depend on extension authors providing a control file. Do we > want to spend some efforts on trying to get rid of this file? I know > David desperately want to, but that's at the cost of making it much > harder to manage more than one extension in a single directory, for > once, and the Makefile mechanisms to make than happen (include a rule > depending on the presence of some variables, keep track of it for the > cleaning, etc) doesn't seem to me to worth it. I don't think it makes it any harder to manage multiple extension in a single directory because one can create the control file explicitly (or perhaps rely on .control.in for that), just as they do now. Everyone else can do less work. So: * If $extension.control.in exists, use that * If it doesn't, generate $extension.control from the Makefile variables * Always remove $extension.control in the `clean` targets Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb failure with Postgres 8.4.4
On 12/09/2010 03:36 PM, BRUSSER Michael wrote: No, we do not use --with-system-tzdata option. I looked at the makefile and at the output of pg_config. We may need to do some cleanup there, but I did not pick any clues. The problem occurs on all our UNIX platforms. Is there anything I could do to shed more light on it? I can post the output of pg_config if it helps, or try to run initdb with truss or strace. Yes, please show us the pg_config (and please don't top-post). cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why percent_rank is so slower than rank?
Jie Li writes: > I'm new to window functions. Recently I run some simple queries but > surprised to find percent_rank is so slower than rank, could anybody tell me > why?

Huh, interesting. I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 10 from generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;

The example is *not* particularly slow if you leave work_mem at default. But if you bump up work_mem enough so that the WindowAgg's internal tuplestore fits into memory, it slows down like crazy. A bit of quality time with oprofile shows that all the time is going into this memmove() in tuplestore_trim():

    /*
     * Slide the array down and readjust pointers. This may look pretty
     * stupid, but we expect that there will usually not be very many
     * tuple-pointers to move, so this isn't that expensive; and it keeps a
     * lot of other logic simple.
     *
     * In fact, in the current usage for merge joins, it's demonstrable that
     * there will always be exactly one non-removed tuple; so optimize that
     * case.
     */
    if (nremove + 1 == state->memtupcount)
        state->memtuples[0] = state->memtuples[nremove];
    else
        memmove(state->memtuples, state->memtuples + nremove,
                (state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through the tuplestore, and moving 10+ tuple pointers each time. Ugh.

This code was all right when written, because (IIRC) the mergejoin case was actually the only caller. But it's not all right for WindowAgg's less-predictable usage patterns.

I thought for a bit about changing things around so that the first-used tuple slot isn't necessarily state->memtuples[0], but just like the comment says, that complicates a lot of other logic. And there isn't any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make tuplestore_trim simply not do anything until nremove reaches some reasonably large amount, perhaps 10% of the number of stored tuples. This wastes up to 10% of the allotted memory, but that seems tolerable. We could complicate things a bit more by remembering that so-and-so many slots are authorized to be removed, and forcing a trim operation to discard them if we find ourselves in memory trouble. I'm not sure that extra complication is worthwhile though.

Comments?

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
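As a toy restatement of that heuristic (combined with the immediate-recycling refinement that shows up in the patch posted elsewhere in this thread), the trim path could free tuples right away but defer the slide until enough dead leading slots accumulate. This is only an illustration with a simplified structure, not the code that was committed:

    /* Toy sketch: free trimmed tuples immediately, but defer sliding the
     * pointer array until the dead leading slots reach roughly 10% of the
     * stored tuples.  The real fix lives in src/backend/utils/sort/tuplestore.c. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct ToyStore
    {
        void **memtuples;       /* pointers to stored tuples */
        int    memtupdeleted;   /* leading slots already freed but not slid out */
        int    memtupcount;     /* total slots in use, including deleted ones */
    } ToyStore;

    static void
    toy_trim(ToyStore *state, int nremove)
    {
        int i;

        /* release the tuples themselves immediately */
        for (i = state->memtupdeleted; i < state->memtupdeleted + nremove; i++)
        {
            free(state->memtuples[i]);
            state->memtuples[i] = NULL;     /* catch stray accesses */
        }
        state->memtupdeleted += nremove;

        /* slide the array down only once enough dead slots have accumulated */
        if (state->memtupdeleted >= state->memtupcount / 10 + 1)
        {
            memmove(state->memtuples,
                    state->memtuples + state->memtupdeleted,
                    (state->memtupcount - state->memtupdeleted) * sizeof(void *));
            state->memtupcount -= state->memtupdeleted;
            state->memtupdeleted = 0;
            /* the real code must also readjust its read pointers here */
        }
    }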
Re: [HACKERS] Patch to add a primary key using an existing index
On Thu, Dec 9, 2010 at 2:51 PM, Kevin Grittner wrote: > Tom Lane wrote: >> If the constraint name is not specified, we should certainly use >> the existing index name, not randomly rename it. > > +1 +1 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb failure with Postgres 8.4.4
No, we do not use the --with-system-tzdata option. I looked at the makefile and at the output of pg_config. We may need to do some cleanup there, but I did not pick up any clues. The problem occurs on all our UNIX platforms.

Is there anything I could do to shed more light on it? I can post the output of pg_config if it helps, or try to run initdb with truss or strace.

Thanks, Michael.

-Original Message- From: Tom Lane [mailto:t...@sss.pgh.pa.us] Sent: Thursday, December 09, 2010 2:56 PM To: BRUSSER Michael Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] initdb failure with Postgres 8.4.4

"BRUSSER Michael" writes: > Initdb fails for me when host machine has no access to the Postgres build > location. > LOG: could not open directory .../install/share/timezone": No such file or > directory

Moving the install tree works for me. Did you do something odd with the --with-system-tzdata configuration option?

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
On 12/6/10 6:13 PM, Tom Lane wrote: > Josh Berkus writes: >> OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails. >> What should I have it do instead? > > Report that it fails, and keep testing the other methods. Patch attached. Includes a fair amount of comment cleanup, since existing comments did not meet our current project standards. Tests all 6 of the methods we support separately. Some questions, though: (1) Why are we doing the open_sync different-size write test? AFAIK, this doesn't match any behavior which PostgreSQL has. (2) In this patch, I'm stepping down the number of loops which fsync_writethrough does by 90%. The reason for that was that on the platforms where I tested writethrough (desktop machines), doing 10,000 loops took 15-20 *minutes*, which seems hard on the user. Would be easy to revert if you think it's a bad idea. Possibly auto-sizing the number of loops based on the first fsync test might be a good idea, but seems like going a bit too far. (3) Should the multi-descriptor test be using writethrough on platforms which support it? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile index 252c087..2ddbbe9 100644 *** a/src/tools/fsync/Makefile --- b/src/tools/fsync/Makefile *** *** 4,10 # # Copyright (c) 2003-2010, PostgreSQL Global Development Group # ! # src/tools/fsync/Makefile # #- --- 4,10 # # Copyright (c) 2003-2010, PostgreSQL Global Development Group # ! # $PostgreSQL: pgsql/src/tools/fsync/Makefile,v 1.9 2010/07/05 18:54:38 tgl Exp $ # #- *** override CPPFLAGS := -I$(libpq_srcdir) $ *** 16,24 OBJS= test_fsync.o ! all: test_fsync ! test_fsync: test_fsync.o | submake-libpq submake-libpgport $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $...@$(X) clean distclean maintainer-clean: --- 16,24 OBJS= test_fsync.o ! all: submake-libpq submake-libpgport test_fsync ! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $...@$(X) clean distclean maintainer-clean: diff --git a/src/tools/fsync/README b/src/tools/fsync/README index 6d9acd3..5b45581 100644 *** a/src/tools/fsync/README --- b/src/tools/fsync/README *** *** 1,4 ! src/tools/fsync/README fsync = --- 1,4 ! $PostgreSQL: pgsql/src/tools/fsync/README,v 1.5 2009/11/28 15:04:54 momjian Exp $ fsync = *** fsync *** 6,11 This program tests fsync. The tests are described as part of the program output. Usage: test_fsync [-f filename] [loops] ! Loops defaults to 5000. The default output file is /var/tmp/test_fsync.out. ! Consider that /tmp or /var/tmp might be memory-based file systems. --- 6,25 This program tests fsync. The tests are described as part of the program output. Usage: test_fsync [-f filename] [loops] + + test_fsync is intended to give you a reasonable idea of what the fastest + fsync_method is on your specific system, as well as supplying diagnostic + information in the event of an identified I/O problem. However, differences + shown by test_fsync may not make any difference in real database throughput, + especially since many database servers are not speed-limited by their + transaction logs. ! Filename defaults to test_fsync.out in the current directory. test_fsync ! should be run on the same filesystem where your transaction log currently ! resides. ! ! Loops default to 1, except for writethrough tests, where there are 1/10 of ! that in order to make the user not wait forever. 
You should lower loops if you ! have a slow system and the tests are taking more than 5 minutes each. You should ! raise loops if your system is faster than 5000/second, in order to get useful ! statistics. diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c index 28c2119..5980b70 100644 *** a/src/tools/fsync/test_fsync.c --- b/src/tools/fsync/test_fsync.c *** *** 3,9 * * * test_fsync.c ! * test various fsync() methods */ #include "postgres.h" --- 3,9 * * * test_fsync.c ! * tests all supported fsync() methods */ #include "postgres.h" *** *** 22,55 #include #include ! ! #ifdef WIN32 #define FSYNC_FILENAME "./test_fsync.out" - #else - /* /tmp might be a memory file system */ - #define FSYNC_FILENAME "/var/tmp/test_fsync.out" - #endif #define WRITE_SIZE (8 * 1024) /* 8k */ #define LABEL_FORMAT "\t%-30s" int loops = 1; void die(char *st
Re: [HACKERS] initdb failure with Postgres 8.4.4
"BRUSSER Michael" writes: > Initdb fails for me when host machine has no access to the Postgres build > location. > LOG: could not open directory .../install/share/timezone": No such file or > directory Moving the install tree works for me. Did you do something odd with the --with-system-tzdata configuration option? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BufFreelistLock
On Dec 8, 2010, at 11:44 PM, Jeff Janes wrote: >>> For the clock sweep algorithm, I think you could access >>> nextVictimBuffer without any type of locking. >> >> This is wrong, mainly because you wouldn't have any security against two >> processes decrementing the usage count of the same buffer because they'd >> fetched the same value of nextVictimBuffer. That would probably happen >> often enough to severely compromise the accuracy of the usage counts and >> thus the accuracy of the LRU eviction behavior. See above. > > Ah, I hadn't considered that. Ideally, the clock sweep would be run by bgwriter and not individual backends. In that case it shouldn't matter much what the performance of the sweep is. To do that I think we'd want the bgwriter to target there being X number of buffers on the free list instead of (or in addition to) targeting how many dirty buffers need to be written. This would mirror what operating systems do; they strive to keep X number of pages on the free list so that when a process needs memory it can get it quickly. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
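A loose, conceptual sketch of that idea follows. All of the names are invented, and it deliberately ignores the buffer-header and BufFreelistLock locking the rest of this thread is about; it only shows the "refill the free list in the background" shape of the proposal:

    /* Conceptual sketch: have the background writer run the clock sweep
     * itself and keep a target number of buffers on the free list, so
     * individual backends rarely have to sweep at all.  Names are hypothetical. */
    #define FREELIST_TARGET 256                  /* assumed tuning knob */

    extern int  FreeListLength(void);            /* hypothetical */
    extern int  ClockSweepFindVictim(void);      /* hypothetical: buffer id, or -1 */
    extern void PushOntoFreeList(int buf_id);    /* hypothetical */

    void
    BgWriterMaintainFreeList(void)
    {
        /*
         * Refill the free list up to the target, mirroring what operating
         * systems do with their page free lists: pay the sweep cost in the
         * background so a backend needing a buffer can simply pop one.
         */
        while (FreeListLength() < FREELIST_TARGET)
        {
            int buf_id = ClockSweepFindVictim();

            if (buf_id < 0)
                break;                           /* nothing evictable right now */
            PushOntoFreeList(buf_id);
        }
    }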
Re: [HACKERS] Patch to add a primary key using an existing index
Tom Lane wrote: > If the constraint name is not specified, we should certainly use > the existing index name, not randomly rename it. +1 -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to add a primary key using an existing index
Gurjeet Singh writes: > But I still hold a bias towards renaming the index to match constraint name > (with a NOTICE), rather than require that the constraint name match the > index name, because the constraint name is optional and when it is not > provided system has to generate a name and we have to rename the index > anyway to maintain consistency. No. If the constraint name is not specified, we should certainly use the existing index name, not randomly rename it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to add a primary key using an existing index
On Sun, Dec 5, 2010 at 2:09 PM, Peter Eisentraut wrote: > On fre, 2010-12-03 at 15:27 -0500, Robert Haas wrote: > > On Fri, Dec 3, 2010 at 2:56 PM, r t wrote: > > > What exactly was the objection to the following --> > > > ALTER TABLE table_name ADD PRIMARY KEY (column_list) USING index_name; > > > Is the objection that you might have been trying to specify a > constraint > > > named "using" ? I'm willing to make that option more difficult. :-) > > > > I think it's that someone might expect the word after USING to be the > > name of an index AM. > > That could be avoided by writing > > USING INDEX > > Allowing USING INDEX along with USING INDEX TABLESPACE is causing shift/reduce conflicts. I liked the proposal upthread of providing alternate syntax where user does not have to specify column-list and system picks up that list from the index. ALTER TABLE table_name ADD [CONSTRAINT cons_name] PRIMARY KEY (column_list) [WITH (...)] [USING INDEX TABLESPACE tblspcname]; ALTER TABLE table_name ADD [CONSTRAINT cons_name] PRIMARY KEY [WITH (...)] [USING INDEX index_name]; This would also help avoid the bug that Itagaki found, where the user wants to use an existing index, and also specifies USING INDEX TABLESPACE. But I still hold a bias towards renaming the index to match constraint name (with a NOTICE), rather than require that the constraint name match the index name, because the constraint name is optional and when it is not provided system has to generate a name and we have to rename the index anyway to maintain consistency. Following are the gram.y changes that I am going to start with: %type constraints_set_mode -%type OptTableSpace OptConsTableSpace OptTableSpaceOwner +%type OptTableSpace OptConsTableSpace OptConsIndex OptTableSpaceOwner %typeopt_check_option [...] | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace ConstraintAttributeSpec { Constraint *n = makeNode(Constraint); n->contype = CONSTR_UNIQUE; n->location = @1; n->keys = $3; n->options = $5; n->indexspace = $6; n->deferrable = ($7 & 1) != 0; n->initdeferred = ($7 & 2) != 0; $$ = (Node *)n; } + | UNIQUE opt_definition OptConsIndex ConstraintAttributeSpec + { + Constraint *n = makeNode(Constraint); + n->contype = CONSTR_UNIQUE; + n->location = @1; + n->options = $2; + n->indexname = $3; + n->deferrable = ($4 & 1) != 0; + n->initdeferred = ($4 & 2) != 0; + $$ = (Node *)n; + } | PRIMARY KEY '(' columnList ')' opt_definition OptConsTableSpace ConstraintAttributeSpec { Constraint *n = makeNode(Constraint); n->contype = CONSTR_PRIMARY; n->location = @1; n->keys = $4; n->options = $6; n->indexspace = $7; n->deferrable = ($8 & 1) != 0; n->initdeferred = ($8 & 2) != 0; $$ = (Node *)n; } + | PRIMARY KEY opt_definition OptConsIndex ConstraintAttributeSpec + { + Constraint *n = makeNode(Constraint); + n->contype = CONSTR_PRIMARY; + n->location = @1; + n->options = $3; + n->indexname = $4; + n->deferrable = ($5 & 1) != 0; + n->initdeferred = ($5 & 2) != 0; + $$ = (Node *)n; + } | EXCLUDE access_method_clause '(' ExclusionConstraintList ')' [...] OptConsTableSpace: USING INDEX TABLESPACE name { $$ = $4; } | /*EMPTY*/ { $$ = NULL; } ; +OptConsIndex: USING INDEX name { $$ = $3; } + | /*EMPTY*/ { $$ = NULL; } + ; + Regards, -- gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company http://www.EnterpriseDB.com singh.gurj...@{ gmail | yahoo }.com Twitter/Skype: singh_gurjeet Mail sent from my BlackLaptop device
Re: XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]
> "JJ" == Jeff Janes writes:

JJ> So PG always writing 8K at a time is unlikely to make a difference JJ> than if it wrote a smaller amount.

Ah. Somehow I was thinking of the xlog files' 16M filesize rather than the internal 8k block size. If it is only writing 8k blocks then there is probably little hope of making efficient use of flash.

-JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] initdb failure with Postgres 8.4.4
Initdb fails for me when the host machine has no access to the Postgres build location.

LOG: could not open directory ".../install/share/timezone": No such file or directory
LOG: could not open directory ".../install/share/timezone": No such file or directory
WARNING: could not open directory ".../install/share/timezonesets": No such file or directory
HINT: This may indicate an incomplete PostgreSQL installation, or that the file ".../bin.Linux/postgres" has been moved away from its proper location.
FATAL: invalid value for parameter "timezone_abbreviations": "Default"
DEBUG: shmem_exit(1): 0 callbacks to make
DEBUG: proc_exit(1): 0 callbacks to make
DEBUG: exit(1)

The postgres executable is in the proper place. The truncated paths .../install/share/timezone, etc. point to the original build location. The timezone and other files are available in the local directory:

share % ls -1
conversion_create.sql
information_schema.sql
pg_hba.conf.sample
pg_ident.conf.sample
pg_service.conf.sample
postgres.bki
postgres.description
postgresql.conf.sample
postgres.shdescription
psqlrc.sample
recovery.conf.sample
snowball_create.sql
sql_features.txt
system_views.sql
timezone
timezonesets
tsearch_data

I call initdb with the -L option pointing to that directory; from the initdb log I can see that it is indeed being used for certain things -

POSTGRES_BKI=.../share/postgres.bki
POSTGRES_DESCR=.../share/postgres.description
POSTGRES_SHDESCR=.../share/postgres.shdescription
POSTGRESQL_CONF_SAMPLE=.../share/postgresql.conf.sample
PG_HBA_SAMPLE=.../share/pg_hba.conf.sample
PG_IDENT_SAMPLE=.../share/pg_ident.conf.sample

I am wondering why the timezone files cannot be found there as well? This is quite urgent and any help will be greatly appreciated.

Michael.
Re: [HACKERS] Optimize PL/Perl function argument passing [PATCH]
On Wed, Dec 08, 2010 at 09:21:05AM -0800, David E. Wheeler wrote: > On Dec 8, 2010, at 9:14 AM, Tim Bunce wrote: > > >> Do you have any more improvements in the pipeline? > > > > I'd like to add $arrayref = decode_array_literal('{2,3}') and > > maybe $hashref = decode_hstore_literal('x=>1, y=>2'). > > I don't know how much works would be involved in those though. > > Those would be handy, but for arrays, at least, I'd rather have a GUC > to turn on so that arrays are passed to PL/perl functions as array > references. Understood. At this stage I don't know what the issues are so I'm nervous of over promising (plus I don't know how much time I'll have). It's possible a blessed ref with string overloading would avoid backwards compatibility issues. Tim. > > Another possible improvement would be rewriting encode_array_literal() > > in C, so returning arrays would be much faster. > > +1 > > Best, > > David > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Solving sudoku using SQL
On Thu, Dec 9, 2010 at 11:56 AM, Dimitri Fontaine wrote: > Merlin Moncure writes: >> On Wed, Dec 8, 2010 at 8:57 AM, Tatsuo Ishii wrote: >>> >>> Then execute the huge SELECT: >>> http://codezine.jp/static/images/article/1629/html/sql.html >> >> benchmark what you've got against this (ported to postgres by marcin mank): >> http://www.pastie.org/684163 > > Is that this one? > > http://archives.postgresql.org/message-id/e08cc0400911042333o5361b21cu2c9438f82b1e5...@mail.gmail.com

sure is -- I missed the formatted version. The original query is also an Oracle original. If you remove the 'where ind = 0', you can watch the database solve the puzzle, which is pretty neat.

merlin

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Solving sudoku using SQL
Merlin Moncure writes: > On Wed, Dec 8, 2010 at 8:57 AM, Tatsuo Ishii wrote: >> >> Then execute the huge SELECT: >> http://codezine.jp/static/images/article/1629/html/sql.html > > benchmark what you've got against this (ported to postgres by marcin mank): > http://www.pastie.org/684163

Is that this one?

http://archives.postgresql.org/message-id/e08cc0400911042333o5361b21cu2c9438f82b1e5...@mail.gmail.com

-- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow BLOBs restoring
On 12/09/2010 10:05 AM, Tom Lane wrote:

> I think what we need to do is make fix_dependencies build a reverse lookup list of all the objects dependent on each TOC object, so that the searching behavior in reduce_dependencies can be eliminated outright. That will take O(N) time and O(N) extra space, which is a good tradeoff because you won't care if N is small, while if N is large you have got to have it anyway.
>
> Barring objections, I will do this and back-patch into 9.0. There is maybe some case for trying to fix 8.4 as well, but since 8.4 didn't make a separate TOC entry for each blob, it isn't as exposed to the problem. We didn't back-patch the last round of efficiency hacks in this area, so I'm thinking it's not necessary here either. Comments?

Sounds good.

Re 8.4: at a pinch people could probably use the 9.0 pg_restore with their 8.4 dump.

cheers

andrew

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow BLOBs restoring
On Thu, Dec 9, 2010 at 10:05 AM, Tom Lane wrote: > I wrote: >> One fairly simple, if ugly, thing we could do about this is skip calling >> reduce_dependencies during the first loop if the TOC object is a blob; >> effectively assuming that nothing could depend on a blob. But that does >> nothing about the point that we're failing to parallelize blob >> restoration. Right offhand it seems hard to do much about that without >> some changes to the archive representation of blobs. Some things that >> might be worth looking at for 9.1: > >> * Add a flag to TOC objects saying "this object has no dependencies", >> to provide a generalized and principled way to skip the >> reduce_dependencies loop. This is only a good idea if pg_dump knows >> that or can cheaply determine it at dump time, but I think it can. > > I had further ideas about this part of the problem. First, there's no > need for a file format change to fix this: parallel restore is already > groveling over all the dependencies in its fix_dependencies step, so it > could count them for itself easily enough. Second, the real problem > here is that reduce_dependencies processing is O(N^2) in the number of > TOC objects. Skipping it for blobs, or even for all dependency-free > objects, doesn't make that very much better: the kind of people who > really need parallel restore are still likely to bump into unreasonable > processing time. I think what we need to do is make fix_dependencies > build a reverse lookup list of all the objects dependent on each TOC > object, so that the searching behavior in reduce_dependencies can be > eliminated outright. That will take O(N) time and O(N) extra space, > which is a good tradeoff because you won't care if N is small, while if > N is large you have got to have it anyway. > > Barring objections, I will do this and back-patch into 9.0. There is > maybe some case for trying to fix 8.4 as well, but since 8.4 didn't > make a separate TOC entry for each blob, it isn't as exposed to the > problem. We didn't back-patch the last round of efficiency hacks in > this area, so I'm thinking it's not necessary here either. Comments? Ah, that sounds like a much cleaner solution. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Solving sudoku using SQL
Excerpts from Jan Urbański's message of mié dic 08 17:11:44 -0300 2010: > I'm pleasantly surprised that the SA code as it stands today, setting > the equlibrium factor to 8 and temperature reduction factor to 0.4, the > query takes 1799.662 ms in total. That's 5x better than Oracle :-) -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow BLOBs restoring
I wrote: > One fairly simple, if ugly, thing we could do about this is skip calling > reduce_dependencies during the first loop if the TOC object is a blob; > effectively assuming that nothing could depend on a blob. But that does > nothing about the point that we're failing to parallelize blob > restoration. Right offhand it seems hard to do much about that without > some changes to the archive representation of blobs. Some things that > might be worth looking at for 9.1: > * Add a flag to TOC objects saying "this object has no dependencies", > to provide a generalized and principled way to skip the > reduce_dependencies loop. This is only a good idea if pg_dump knows > that or can cheaply determine it at dump time, but I think it can. I had further ideas about this part of the problem. First, there's no need for a file format change to fix this: parallel restore is already groveling over all the dependencies in its fix_dependencies step, so it could count them for itself easily enough. Second, the real problem here is that reduce_dependencies processing is O(N^2) in the number of TOC objects. Skipping it for blobs, or even for all dependency-free objects, doesn't make that very much better: the kind of people who really need parallel restore are still likely to bump into unreasonable processing time. I think what we need to do is make fix_dependencies build a reverse lookup list of all the objects dependent on each TOC object, so that the searching behavior in reduce_dependencies can be eliminated outright. That will take O(N) time and O(N) extra space, which is a good tradeoff because you won't care if N is small, while if N is large you have got to have it anyway. Barring objections, I will do this and back-patch into 9.0. There is maybe some case for trying to fix 8.4 as well, but since 8.4 didn't make a separate TOC entry for each blob, it isn't as exposed to the problem. We didn't back-patch the last round of efficiency hacks in this area, so I'm thinking it's not necessary here either. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
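For illustration, a simplified sketch of that reverse-lookup pass (not the actual pg_restore change; the structure and field names are stand-ins for the real TocEntry) might look like this:

    /* Two linear passes over the TOC entries record, for each entry, which
     * other entries depend on it, so reduce_dependencies() can walk a short
     * list instead of scanning every TOC entry.  Assumes zero-initialized
     * entries; byDumpId maps a dump id to its entry. */
    #include <stdlib.h>

    typedef struct TocEntrySketch
    {
        int    nDeps;                      /* how many entries this one depends on */
        int   *dependencies;               /* dump ids this entry depends on */

        /* filled in by the pass below */
        int    nRevDeps;                   /* how many entries depend on this one */
        struct TocEntrySketch **revDeps;   /* ... and which ones they are */
        int    depCount;                   /* unsatisfied deps, decremented at restore time */
    } TocEntrySketch;

    static void
    build_reverse_deps(TocEntrySketch **byDumpId, TocEntrySketch **entries, int n)
    {
        int i, j;

        /* first pass: count, so each revDeps array can be sized exactly */
        for (i = 0; i < n; i++)
            for (j = 0; j < entries[i]->nDeps; j++)
                byDumpId[entries[i]->dependencies[j]]->nRevDeps++;

        for (i = 0; i < n; i++)
        {
            entries[i]->revDeps = malloc(entries[i]->nRevDeps * sizeof(TocEntrySketch *));
            entries[i]->nRevDeps = 0;      /* reused below as an insertion cursor */
            entries[i]->depCount = entries[i]->nDeps;
        }

        /* second pass: record who depends on whom */
        for (i = 0; i < n; i++)
            for (j = 0; j < entries[i]->nDeps; j++)
            {
                TocEntrySketch *target = byDumpId[entries[i]->dependencies[j]];

                target->revDeps[target->nRevDeps++] = entries[i];
            }
    }

With revDeps populated, finishing one TOC entry only needs to decrement depCount on the handful of entries in its revDeps list, so the per-object cost no longer involves scanning the whole TOC.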
Re: [HACKERS] [PERFORM] Slow BLOBs restoring
Robert Haas writes: > On Thu, Dec 9, 2010 at 12:28 AM, Tom Lane wrote: >> * Mark BLOB TOC entries as SECTION_DATA, or somehow otherwise make them >> parallelizable. Also break the BLOBS data item apart into an item per >> BLOB, so that that part's parallelizable. Maybe we should combine the >> metadata and data for each blob into one TOC item --- if we don't, it >> seems like we need a dependency, which will put us back behind the >> eight-ball. I think the reason it's like this is we didn't originally >> have a separate TOC item per blob; but now that we added that to support >> per-blob ACL data, the monolithic BLOBS item seems pretty pointless. >> (Another thing that would have to be looked at here is the dependency >> between a BLOB and any BLOB COMMENT for it.) > Is there any use case for restoring a BLOB but not the BLOB COMMENT or > BLOB ACLs? Can we just smush everything together into one section? The ACLs are already part of the main TOC entry for the blob. As for comments, I'd want to keep the handling of those the same as they are for every other kind of object. But that just begs the question of why comments are separate TOC entries in the first place. We could eliminate this problem if they became fields of object entries across the board. Which would be a non-backwards-compatible change in dump file format, but doing anything about the other issues mentioned above will require that anyway. I'm not certain however about whether it's safe to treat the object-metadata aspects of a blob as SECTION_DATA rather than SECTION_PRE_DATA. That will take a bit of investigation. It seems like there shouldn't be any fundamental reason for it not to work, but that doesn't mean there's not any weird assumptions buried someplace in pg_dump ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why percent_rank is so slower than rank?
Hi all,

I'm new to window functions. Recently I ran some simple queries and was surprised to find that percent_rank is so much slower than rank; could anybody tell me why?

The table schema:

test=# \d inventory1
           Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                    List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:

test=# explain analyze select inv_date_sk,inv_item_sk, rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:

test=# explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key (inv_date_sk) are the same, that is, there is only one window partition. I find that percent_rank needs to buffer all the tuples to get the total number of rows. But why is it so expensive?

I use 8.4.4, and I only increased work_mem to 100MB and left the other parameters untouched.

Thanks,
Li Jie
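One way to see why percent_rank cannot stream the way rank can: percent_rank is defined as (rank - 1) / (total rows in partition - 1), so the total partition row count must be known before the first result row can be emitted, which is why all the tuples of the (single, 189000-row) partition get buffered first. A small illustrative C sketch of the arithmetic follows; it is not PostgreSQL's implementation, just the formula:

#include <stdio.h>

/*
 * Illustrative sketch only (not PostgreSQL's window-function code).
 * rank() depends only on the rows seen so far, so it can be produced while
 * streaming through the sorted partition.  percent_rank() also needs the
 * partition's total row count, so the whole partition has to be read
 * before the first value can be returned.
 */
static double
percent_rank_of_row(long rank, long total_rows_in_partition)
{
	if (total_rows_in_partition <= 1)
		return 0.0;
	return (double) (rank - 1) / (double) (total_rows_in_partition - 1);
}

int
main(void)
{
	long	total = 189000;		/* one big partition, as in the example above */

	printf("rank=1      -> percent_rank=%f\n", percent_rank_of_row(1, total));
	printf("rank=189000 -> percent_rank=%f\n", percent_rank_of_row(total, total));
	return 0;
}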
[HACKERS] PS display and standby query conflict
Hi,

When I created a conflict between recovery and many read-only transactions in the standby server for testing purposes, I found that the keyword "waiting" disappeared from the PS display for just a moment even though the conflict had not been resolved yet. This seems strange to me.

This problem happens because ResolveRecoveryConflictWithVirtualXIDs resets the PS display for each read-only transaction that recovery waits for. Why do we need to reset it each time even though the conflict has not been resolved yet? The attached patch suppresses such needless resets. Comments?

BTW, ResolveRecoveryConflictWithVirtualXIDs calls pgstat_report_waiting(), which is also needless since the startup process doesn't have the shared memory entry (i.e., MyBEEntry) for pg_stat_activity. The attached patch removes that call.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

ps_display_v1.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
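A rough sketch of the kind of change being described, simplified and with an invented wait helper (sketch_wait_for_vxid_to_end); only set_ps_display(), get_ps_display(), update_process_title and the palloc/pfree calls are real backend facilities. The point is to append " waiting" to the PS display once before looping over the conflicting virtual transaction IDs, and reset it once afterwards, rather than per transaction:

#include "postgres.h"
#include "storage/lock.h"		/* VirtualTransactionId, VirtualTransactionIdIsValid */
#include "utils/ps_status.h"	/* set_ps_display, get_ps_display, update_process_title */

extern void sketch_wait_for_vxid_to_end(VirtualTransactionId vxid);	/* invented */

/* Simplified sketch of the idea only -- not the attached patch. */
static void
sketch_resolve_conflict(VirtualTransactionId *waitlist)
{
	char	   *new_status = NULL;

	/* Append " waiting" to the PS display once, before the loop. */
	if (update_process_title)
	{
		int			len;
		const char *old_status = get_ps_display(&len);

		new_status = (char *) palloc(len + 8 + 1);
		memcpy(new_status, old_status, len);
		strcpy(new_status + len, " waiting");
		set_ps_display(new_status, false);
		new_status[len] = '\0';		/* keep only the old text for restoring later */
	}

	/* Wait for every conflicting virtual transaction to finish. */
	while (VirtualTransactionIdIsValid(*waitlist))
	{
		sketch_wait_for_vxid_to_end(*waitlist);
		waitlist++;
	}

	/* Reset the PS display just once, after the whole conflict is resolved. */
	if (new_status)
	{
		set_ps_display(new_status, false);
		pfree(new_status);
	}
}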
Re: [HACKERS] [PERFORM] Slow BLOBs restoring
On Thu, Dec 9, 2010 at 12:28 AM, Tom Lane wrote:
> Vlad Arkhipov writes:
>> 08.12.2010 22:46, Tom Lane writes:
>>> Are you by any chance restoring from an 8.3 or older pg_dump file made
>>> on Windows? If so, it's a known issue.
>
>> No, I tried Linux only.
>
> OK, then it's not the missing-data-offsets issue.
>
>> I think you can reproduce it. First I created a database full of many
>> BLOBs on Postgres 8.4.5. Then I created a dump:
>
> Oh, you should have said how many was "many". I had tried with several
> thousand large blobs yesterday and didn't see any problem. However,
> with several hundred thousand small blobs, indeed it gets pretty slow
> as soon as you use -j.
>
> oprofile shows all the time is going into reduce_dependencies during the
> first loop in restore_toc_entries_parallel (ie, before we've actually
> started doing anything in parallel). The reason is that for each blob,
> we're iterating through all of the several hundred thousand TOC entries,
> uselessly looking for anything that depends on the blob. And to add
> insult to injury, because the blobs are all marked as SECTION_PRE_DATA,
> we don't get to parallelize at all. I think we won't get to parallelize
> the blob data restoration either, since all the blob data is hidden in a
> single TOC entry :-(
>
> So the short answer is "don't bother to use -j in a mostly-blobs restore,
> because it isn't going to help you in 9.0".
>
> One fairly simple, if ugly, thing we could do about this is skip calling
> reduce_dependencies during the first loop if the TOC object is a blob;
> effectively assuming that nothing could depend on a blob. But that does
> nothing about the point that we're failing to parallelize blob
> restoration. Right offhand it seems hard to do much about that without
> some changes to the archive representation of blobs. Some things that
> might be worth looking at for 9.1:
>
> * Add a flag to TOC objects saying "this object has no dependencies",
> to provide a generalized and principled way to skip the
> reduce_dependencies loop. This is only a good idea if pg_dump knows
> that or can cheaply determine it at dump time, but I think it can.
>
> * Mark BLOB TOC entries as SECTION_DATA, or somehow otherwise make them
> parallelizable. Also break the BLOBS data item apart into an item per
> BLOB, so that that part's parallelizable. Maybe we should combine the
> metadata and data for each blob into one TOC item --- if we don't, it
> seems like we need a dependency, which will put us back behind the
> eight-ball. I think the reason it's like this is we didn't originally
> have a separate TOC item per blob; but now that we added that to support
> per-blob ACL data, the monolithic BLOBS item seems pretty pointless.
> (Another thing that would have to be looked at here is the dependency
> between a BLOB and any BLOB COMMENT for it.)
>
> Thoughts?

Is there any use case for restoring a BLOB but not the BLOB COMMENT or BLOB ACLs? Can we just smush everything together into one section?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
Just wanted to say thanks for the review, since I haven't yet managed to fix and commit this. I expect to later this month.

On Mon, 2010-09-27 at 23:06 -0400, Robert Haas wrote:
> On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs wrote:
> > Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
> > just never implemented. XXX comments removed.
> >
> > Allows us to avoid reading in blocks during VACUUM replay that are only
> > required for correctness of index scans.
>
> Review:
>
> 1. The block comment in XLogConfirmBufferIsUnpinned appears to be
> copied from somewhere else, and doesn't really seem appropriate for a
> new function since it refers to "the original coding of this routine".
> I think you could just delete the parenthesized portion of the comment.
>
> 2. This bit from ConfirmBufferIsUnpinned looks odd to me.
>
> +     /*
> +      * Found it. Now, pin/unpin the buffer to prove it's unpinned.
> +      */
> +     if (PinBuffer(buf, NULL))
> +         UnpinBuffer(buf, false);
>
> I don't think pinning and unpinning the buffer is sufficient to
> prove that it isn't otherwise pinned. If the buffer isn't in shared
> buffers at all, it seems clear that no one can have it pinned. But if
> it's present in shared buffers, it seems like you still need
> LockBufferForCleanup().

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
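For what it's worth, a hedged sketch of the alternative being suggested in the review. The wrapper function below is invented; ReadBuffer, ConditionalLockBufferForCleanup, LockBuffer and ReleaseBuffer are the real buffer-manager calls. Merely pinning and unpinning says nothing about other backends' pins; LockBufferForCleanup() waits until the caller holds the sole pin, and ConditionalLockBufferForCleanup() is its non-blocking variant, so a check along these lines is what actually demonstrates "no one else has this buffer pinned right now" (in the real recovery path one would work from the buffer tag rather than a Relation):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Invented wrapper, for illustration only. */
static bool
sketch_buffer_is_unpinned_by_others(Relation rel, BlockNumber blkno)
{
	Buffer		buf;
	bool		unpinned;

	buf = ReadBuffer(rel, blkno);

	/* Succeeds only if our pin is the sole pin at this moment. */
	unpinned = ConditionalLockBufferForCleanup(buf);
	if (unpinned)
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);	/* drop the cleanup lock */

	ReleaseBuffer(buf);
	return unpinned;
}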
Re: [HACKERS] To Signal The postmaster
On Wed, Dec 8, 2010 at 6:22 PM, Fujii Masao wrote: > On Wed, Dec 8, 2010 at 4:59 PM, Heikki Linnakangas > wrote: >> For 9.1, we should think of a better way to do this, perhaps using SIGUSR1 >> to wake up. Maybe we won't even need the trigger file anymore. > > If we use SIGUSR1, the mechanism to allow the users to specify the event type > seems to be required. For example, we should make the SIGUSR1 handler > check not only the shmem (i.e., PMSignalStat) but also the file? What I'm thinking is to make something like "pg_ctl promote" create the event file specifying the standby promotion in $PGDATA/pg_event or elsewhere, and send SIGUSR1 to postmaster. OTOH, when SIGUSR1 arrives, postmaster checks whether that event file exists. If it does, postmaster removes it and sends the signal to startup process for standby promotion. Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
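To illustrate the proposed flow (the file location, helper names and the exact relay to the startup process are assumptions, not a worked-out design): the promoting side drops an event file under the data directory and sends SIGUSR1; the postmaster's handler checks for the file, removes it, and passes the request on.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical location under $PGDATA, for illustration only. */
#define PROMOTE_EVENT_FILE	"pg_event/promote"

/* "pg_ctl promote" side: create the event file, then signal the postmaster. */
static int
sketch_request_promotion(pid_t postmaster_pid)
{
	FILE	   *f = fopen(PROMOTE_EVENT_FILE, "w");

	if (f == NULL)
		return -1;
	fclose(f);

	return kill(postmaster_pid, SIGUSR1);
}

/* Postmaster side: on SIGUSR1, see whether a promotion was requested. */
static void
sketch_postmaster_sigusr1(pid_t startup_pid)
{
	if (access(PROMOTE_EVENT_FILE, F_OK) == 0)
	{
		unlink(PROMOTE_EVENT_FILE);
		kill(startup_pid, SIGUSR1);		/* relay: tell startup process to promote */
	}
}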
Re: [HACKERS] On-the-fly index tuple deletion vs. hot_standby
On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> On 29.11.2010 08:10, Noah Misch wrote:
> > I have a hot_standby system and use it to bear the load of various reporting
> > queries that take 15-60 minutes each. In an effort to avoid long pauses in
> > recovery, I set a vacuum_defer_cleanup_age constituting roughly three hours of
> > the master's transactions. Even so, I kept seeing recovery pause for the
> > duration of a long-running query. In each case, the culprit record was an
> > XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The
> > attached test script demonstrates the behavior (on HEAD); the index tuple
> > reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
> >
> > Since this inserting transaction aborts, HeapTupleSatisfiesVacuum reports
> > HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age. We go ahead and remove
> > the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does
> > not regard the inserting-transaction outcome, so btree_redo proceeds to
> > conflict with snapshots having visibility over that transaction. Could we
> > correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to
> > ignore tuples of aborted transactions and tuples inserted and deleted within
> > one transaction?

@Noah Easily the best bug report submitted in a long time. Thanks.

> Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need
> similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid()
> could just call HeapTupleHeaderAdvanceLatestRemovedXid().

Yes, it applies to other cases also. Thanks for the suggestion.

Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
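For readers following along, a simplified sketch of the idea, not the committed fix (hint-bit checks and the HEAP_MOVED cases are glossed over; see HeapTupleHeaderAdvanceLatestRemovedXid for the real logic): skip tuples whose inserting transaction aborted, or that were inserted and deleted by the same transaction, since such tuples were never visible to any standby snapshot and removing them cannot create a conflict.

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"

/* Sketch only -- simplified relative to the real function. */
static void
sketch_advance_latestRemovedXid(HeapTupleHeader tuple,
								TransactionId *latestRemovedXid)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
	TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

	/*
	 * A tuple whose inserting transaction aborted, or that was inserted and
	 * deleted within one transaction, was never visible to any other
	 * snapshot, so removing it need not advance the conflict horizon.
	 */
	if (!TransactionIdDidCommit(xmin) || TransactionIdEquals(xmin, xmax))
		return;

	if (TransactionIdFollows(xmax, *latestRemovedXid))
		*latestRemovedXid = xmax;
}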