Re: [HACKERS] PG qsort vs. Solaris
> So basically, glibc's qsort is bad enough that even a 10%-more-comparisons
> advantage doesn't save it. Do those numbers look very different if you have
> lots of columns or if you're sorting on something like an array or a ROW?

Imho, that is also an argument for using our own qsort. It can be extended to deal with high comparison-function cost directly. Thus I would opt to add a comparison-function-cost argument to qsort_arg iff we find scenarios where our qsort performs too badly. This cost could be used to switch to merge sort for very high cost values.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend
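The suggestion could be sketched roughly as below. This is only an illustration of the idea, under stated assumptions: the entry point `sort_with_cost` and the threshold `COSTLY_CMP_THRESHOLD` are hypothetical names, not part of the actual qsort_arg interface, and the merge sort is a plain textbook top-down variant standing in for "an algorithm that performs fewer comparisons".

```c
#include <stdlib.h>
#include <string.h>

#define COSTLY_CMP_THRESHOLD 100	/* assumed tuning knob, not a real GUC */

typedef int (*cmp_fn) (const void *, const void *);

/* plain top-down merge sort; tmp must hold n * width bytes */
static void
merge_sort(char *a, char *tmp, size_t n, size_t width, cmp_fn cmp)
{
	size_t		half = n / 2;
	char	   *lo = a,
			   *hi = a + half * width;
	size_t		nlo = half,
				nhi = n - half;
	char	   *out = tmp;

	if (n < 2)
		return;
	merge_sort(lo, tmp, nlo, width, cmp);
	merge_sort(hi, tmp, nhi, width, cmp);

	/* merge the two sorted halves into tmp, then copy back */
	while (nlo > 0 && nhi > 0)
	{
		if (cmp(lo, hi) <= 0)
		{
			memcpy(out, lo, width);
			lo += width;
			nlo--;
		}
		else
		{
			memcpy(out, hi, width);
			hi += width;
			nhi--;
		}
		out += width;
	}
	memcpy(out, lo, nlo * width);
	memcpy(out + nlo * width, hi, nhi * width);
	memcpy(a, tmp, n * width);
}

/* pick merge sort when the caller declares comparisons expensive */
void
sort_with_cost(void *a, size_t n, size_t width, cmp_fn cmp, int cmp_cost)
{
	if (cmp_cost >= COSTLY_CMP_THRESHOLD)
	{
		char	   *tmp = malloc(n * width);

		if (tmp != NULL)
		{
			merge_sort(a, tmp, n, width, cmp);
			free(tmp);
			return;
		}
		/* fall through to qsort if the scratch allocation fails */
	}
	qsort(a, n, width, cmp);
}

static int
cmp_int(const void *x, const void *y)
{
	int			a = *(const int *) x,
				b = *(const int *) y;

	return (a > b) - (a < b);
}
```

Both paths sort the same way; only the comparison count differs, which is exactly what matters when the comparator walks many columns or a ROW value.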
Re: [HACKERS] [PATCHES] vcbuild bison check
> Ok. So what you want is something that checks that it's >= 1.875 but
> specifically not 2.1? Might be a while before I can submit an updated patch
> for that, may need to rewrite the whole script in perl to do that :-(
> .bat files are horribly limited in what they can do.

Since we are on NT or higher you could use extensions (even in .bat files) to avoid converting to perl:

    IF %bversion% GEQ 1.875 IF %bversion% NEQ 2.1 goto use_bison

http://www.robvanderwoude.com/ntif.html

Sorry, I haven't looked at the file, so this needs adaptation.

Thank you for the work
Andreas
Re: [HACKERS] [PATCHES] vcbuild bison check
-----Original Message-----
From: Magnus Hagander [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 04, 2006 6:01 PM
To: Zeugswetter Andreas ADI SD
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [PATCHES] vcbuild bison check

> > > Ok. So what you want is something that checks that it's >= 1.875 but
> > > specifically not 2.1? Might be a while before I can submit an updated
> > > patch for that, may need to rewrite the whole script in perl to do
> > > that :-( .bat files are horribly limited in what they can do.
> >
> > Since we are on NT or higher you could use extensions:
> >     IF %bversion% GEQ 1.875 IF %bversion% NEQ 2.1 goto use_bison
> > http://www.robvanderwoude.com/ntif.html
>
> I thought that only worked if your locale was set to something that has
> dot as decimal separator. Mine has comma, as have many others...

Um, I think it does a string compare because point or comma is not a decimal digit, but that would imho also be sufficient. My locale is German, so my decimal separator should also be a comma, and it worked for the mentioned versions.

Andreas
Re: [HACKERS] [PATCHES] vcbuild bison check
> And given that, they're going to get the latest by default, or 1.875 if
> they read the (currently being written) README.

The point was that >= 2.2 won't be allowed when it comes out for win32, even if it should work.

Andreas
Re: [HACKERS] Faster StrNCpy
> I'm still interested to experiment with MemSet-then-strlcpy for namestrcpy,
> but given the LENCPY results this may be a loser too.

Um, why not strlcpy then MemSet the rest?

Andreas
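The "copy first, zero only the rest" suggestion could look like the sketch below. `namestrcpy_sketch` is an illustrative name, not the actual PostgreSQL function, and plain memcpy/memset stand in for strlcpy/MemSet so the sketch stays portable:

```c
#include <string.h>

#define NAMEDATALEN 64			/* fixed width of a Name, as in PostgreSQL */

/*
 * Copy src into a fixed-width name buffer: copy the (possibly truncated)
 * string first, then zero-pad only the remaining bytes, instead of zeroing
 * the whole NAMEDATALEN buffer up front and copying over it.
 */
void
namestrcpy_sketch(char *name, const char *src)
{
	size_t		len = strlen(src);

	if (len >= NAMEDATALEN)
		len = NAMEDATALEN - 1;	/* truncate, leave room for the NUL */
	memcpy(name, src, len);
	memset(name + len, 0, NAMEDATALEN - len);	/* zero-pad, incl. NUL */
}
```

For the typical short identifier this zeroes NAMEDATALEN - len bytes instead of NAMEDATALEN, which is the whole point of the "strlcpy then MemSet the rest" ordering.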
Re: [HACKERS] [PATCHES] Bad bug in fopen() wrapper code
> Magnus, is this the right fix?

Well, actually MSDN states:

    Return Value
    If successful, _setmode returns the previous translation mode.
    A return value of -1 indicates an error.

So shouldn't we be testing for -1 instead of 0? The thing is probably academic, since _setmode is only supposed to fail on an invalid file handle or an invalid mode. So basically, given our code, it should only fail if filemode has both flags set: (O_BINARY | O_TEXT).

Andreas
Re: [HACKERS] Faster StrNCpy
> > > I'm still interested to experiment with MemSet-then-strlcpy for
> > > namestrcpy, but given the LENCPY results this may be a loser too.
> >
> > Um, why not strlcpy then MemSet the rest?
>
> That's what strncpy() is supposed to be doing.

Yes, but it obviously does not in some ports, and that was the main problem as I interpreted it.

Andreas
Re: [HACKERS] @ versus ~, redux
> > > The existing geometric containment tests seem to be nonstrict, so if we
> > > wanted to leave room to add strict ones later, it might be best to
> > > settle on
> > >     x @>= y    x contains or equals y
> > >     x =<@ y    x is contained in or equals y
> > > reserving @> and <@ for future strict comparison operators.
> >
> > At first glance, it seems more intuitive to me to do:
> >     x @>= y    x contains or equals y
> >     x <@= y    x is contained in or equals y
>
> Hm, I've never seen anyone spell "less than or equal to" as "=<", so I'm
> not sure where you derive "=<@" from? Not saying no, but the other seems
> clearer to me.

Yes, but to me too =<@ seems more natural, since we started with @> and <@. Tom, your argument would more match your original <@ and @>, but then it would imply <@= and @>=, imho.

Andreas
Re: [HACKERS] @ versus ~, redux
> > x @ y means x is contained in y
> >     ltree @ ltree
>
> If you consider ltree entries to be sets containing all their children,
> then those sound consistent.

Now we get to decide whether @ was better than the now proposed <@ :-) I like <@. (Or we stay clear by using the inet ops.)

Andreas
Re: [HACKERS] insert/update/delete returning and rules
> With this approach, you still have to update your rules if you want to
> support RETURNING on your views --- but if you don't update them, you don't
> have a security hole. Basically the standard setup for an updatable view
> would use
>     ON INSERT DO INSTEAD INSERT INTO ... RETURNING ...
> where today you don't write any RETURNING.

I like that approach. And if the sections allow CASE WHEN, it should be possible to cover all use cases efficiently.

Andreas
Re: [HACKERS] Prepared statements considered harmful
> > > I'd wish that we reconsider when and how prepared statements are used.
> > > The JDBC interface and PL/pgSQL are frequently noticed perpetrators,
> > > but the problem is really all over the place.
> >
> > AFAIK those are the only two places where preparation is the default ...
> > what else were you thinking of?
>
> Perl DBI (DBD::Pg) defaults to prepared plans when connecting to a version
> 8.0 or higher server. Or at least, that's the way I read the documentation.

Yea, but if you close the statement or leave the scope of the statement variable, the plan is gone. So it is doing exactly what I would expect. If it is written $stmt->prepare('select 1'), what else would you expect? There are enough other functions to get a result without a plan sticking around, like $db->selectrow_array.

Andreas
Re: [HACKERS] Win32 hard crash problem
> > My bet is something depending on GetTickCount to measure elapsed time
> > (and no, it's not used in the core Postgres code, but you've got plenty
> > of other possible culprits in that stack).
>
> This doesn't quite make sense. The only reason we have to reboot is because
> PostgreSQL no longer responds. The system itself is fine.

The Windows kernel may still work, but that doesn't mean that everything Postgres depends on still works. It may be a listen socket that no longer reacts, perhaps because of a handle leak. Next time it blocks, look at the handle counts (e.g. with handle.exe from sysinternals). You could also watch the handle count now with Task Manager and see whether it increases constantly (handle.exe shows you the details).

Andreas
Re: [HACKERS] Prepared statements considered harmful
> > Anyway, your plan assumes that you have information to work with. The
> > current system plans prepared queries with no information at all about
> > parameters, and people are advocating to keep it that way.
>
> I think a good first step would be to plan on first execution, like Oracle
> does.

Yup. It is also possible to try to find an obvious plan, and only delay planning (or part of the plan) when different inputs make a big difference (like MaxDB and Informix).

Andreas
Re: [HACKERS] Prepared statements considered harmful
> > I don't chime in very often, but I do think the refusal to incorporate
> > hints into the planner system is fantastically stubborn and nonsensical.
>
> What is actually fantastically nonsensical about this is that the issues I
> outlined about prepared statements would merely become worse if planner
> hints were used. Then, you wouldn't only have to worry about plans that
> were created earlier during the session, you would be faced with plans that
> were created earlier during the application's development. In general, the
> solutions to the prepared statement issues need to effect that the plans
> are created more often, not less often.

I have yet to see one of our partial Informix hints (where the planner does its usual job, only with one path's costs lowered or elevated) fall foul of an unanticipated change in the underlying data. Thus I don't buy the argument that hints are always bad. Of course their use should be extremely rare and well thought out. Most of the time, SQL tuning involves a concerted effort between the programmer and a db performance expert, usually resulting in rewritten SQL or program logic without adding hints. I can see arguments for hints the dba can set himself centrally on the server, but in my experience the chances for substantial improvement are very limited in that case.

Andreas
Re: [HACKERS] Prepared statements considered harmful
> > How about "prepared" means really prepared... in the sense of parsed,
> > analyzed all sensible plans, and save a meta-plan which, based on current
> > statistics and parameter values, chooses one of the considered (and
> > cached) plans?
>
> I don't think this could solve one particularly frequent problem, which is
> that pattern matching queries don't get along with prepared plans if the
> search pattern isn't known at planning time.

I think what we would actually want is knowledge about how much difference different parameters actually make in the plan decision (the stats show an even distribution and join correlation). Then we could prepare the plan when there is not much difference, and postpone planning until we know the parameters when the difference is big. OLTP workload typically benefits from prepared plans, and the one plan is good for all possible inputs, so imho we cannot just assume all plans need replanning for different parameters.

Andreas
Re: [HACKERS] Tricky bugs in concurrent index build
> > What bothers me about what we have now is that we have optional keywords
> > before and after INDEX, rather than only between CREATE and INDEX.
>
> Yeah, putting them both into that space seems consistent to me, and it will
> fix the problem of making an omitted index name look like a valid command.
> I'm not sure I should be opening this can of worms, but do we want to use a
> different keyword than CONCURRENTLY to make it read better there?

Precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:

    CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;

I'd stick with that.

Andreas
Re: [HACKERS] Tricky bugs in concurrent index build
> > Precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
> >     CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;
>
> That was what the patch originally used, but it was changed because it made
> it difficult for psql to auto-complete.

That is imho not enough of a reason to diverge.

Andreas
Re: [HACKERS] Tricky bugs in concurrent index build
> > Is it not possible to brute force this, adding an AM method to insert
> > without the uniqueness check?
>
> Hm. Actually there already is a feature of aminsert to allow suppressing
> the unique check, but I'm not sure whether using it for RECENTLY_DEAD
> tuples helps. Seems like we have to wait to see whether DELETE_IN_PROGRESS
> deleters commit in any case.

Um, but if we wait for the DELETE_IN_PROGRESS tuple, after the wait we can add it either with or without the unique check (depending on commit/abort). Then at least we don't need to wait for readers in a 3rd pass?

Andreas
Re: [HACKERS] [PATCHES] WIP archive_timeout patch
> I noticed a minor annoyance while testing: when the system is completely
> idle, you get a forced segment switch every checkpoint_timeout seconds,
> even though there is nothing useful to log. The checkpoint code is smart
> enough not to do a checkpoint if nothing has happened since the last one,
> and the xlog switch code is smart enough not to do a switch if nothing has
> happened since the last one ... but they aren't talking to each other, and
> so each one's change looks like "something happened" to the other one. I'm
> not sure how much trouble it's worth taking to prevent this scenario,
> though. If you can't afford a WAL file switch every five minutes, you
> probably shouldn't be using archive_timeout anyway ...

Um, I would have thought practical timeouts would be rather more than 5 minutes, not less. So this does seem like a problem to me :-(

Andreas
Re: [HACKERS] Possible Typecasting Bug with coalesce()
> > =# select now() > coalesce('Jul 14 2006 9:16:47AM');
>
> The only bug I see here is that implicit coercions to text are a bad idea
> :-( --- IMHO it would be better if your first query failed instead of
> giving you unexpected behavior. :-)

We know that you think that, Tom, but a lot of us do not want to go casting all our SQL, especially where other dbs don't require it. Would an equivalent CASE statement also do the early conversion to text?

Andreas
Re: [HACKERS] [SQL] using constraint based paritioning to fix EAV type schemas? (view inheritance and constraints)
> I've seen a few EAV designs in practice. They've all been problematic. I'd
> like to have a better way of dealing with them. Which is why I'm
> tentatively suggesting support for inheritance and constraints in views.
> If there's some other way to achieve constraint-based exclusion across a
> UNION of heterogenous tables, I'd love to hear it.

If you do your own rules anyway, why can't you use inheritance and create the rules on the parent table and the constraints on the child tables? You can still use the child tables directly if you want.

Andreas
Re: [HACKERS] [SQL] using constraint based paritioning to fix EAV type schemas? (view inheritance and constraints)
> I've seen a few EAV designs in practice. They've all been problematic. I'd
> like to have a better way of dealing with them. Which is why I'm
> tentatively suggesting support for inheritance and constraints in views.
> If there's some other way to achieve constraint-based exclusion across a
> UNION of heterogenous tables, I'd love to hear it.

> > If you do your own rules anyway, why can't you use inheritance and create
> > the rules on the parent table and the constraints on the child tables?

Ah, sorry, I just saw that you want different column names in your subtables. Add me as another vote to extend the new constraint elimination to UNION ALL views :-)

Andreas
Re: [HACKERS] RESET CONNECTION?
> > Will this patch make it into 8.2?
> > http://archives.postgresql.org/pgsql-patches/2004-12/msg00228.php
> > It's a really nice feature, would be extremely useful with tools like
> > pgpool.
>
> No, it will not, because RESET CONNECTION can mess up interface code that
> doesn't want the connection reset. We are not sure how to handle that.

Imho, if it were at the protocol level, that would not be such an issue. If the interface gives access to the protocol level, it is already depending on good will.

Andreas
Re: [HACKERS] automatic system info tool?
> > If you can open a command shell, you can get the OS version with the
> > 'ver' command under Windows:
> >     C:\> ver
> >     Microsoft Windows XP [Version 5.1.2600]
>
> How do you do this from a program, though? Under UNIX, uname() is a
> function call as well as a program. It returns the os name, version,
> hostname and system type.

GetVersionEx() will get you the Windows version, service pack, etc., IIRC. In perl:

    use POSIX;
    print join(',', POSIX::uname()), "\n";

prints:

    Windows NT,hostname.domain.com,5.0,Build 2195 (Service Pack 4),x86

Works on all platforms. (More detail on Win with: use Win32; print join(' ', Win32::GetOSVersion()), "\n";)

Andreas
Re: [HACKERS] Implied Functional Index use
> > - add a new boolean to pg_operator to allow us to define which operators
> >   offer "true" equality ...
>
> This would be useful for other purposes too, as we keep coming up against
> "what's the equality operator for this datatype" problems. However, the
> restriction to "true" equality, such that we can assume x = y implies
> f(x) = f(y) for every immutable function f on the datatype ...

Maybe we could have a tri- (or more) state flag for the equality operators:

    ' ' .. not an equality op
    'e' .. equality
    's' .. strict equality (op only true iff the binary representation is equal)

Andreas
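The proposed flag could be sketched as a small enum plus the check an optimizer pass would make. Everything here is hypothetical naming for illustration, not the pg_operator catalog definition; it just encodes the rule that the rewrite f(x) -> f(y) is only safe under strict (binary) equality:

```c
/* proposed multi-state equality flag, using the character codes above */
typedef enum EqualityLevel
{
	EQ_NONE = ' ',				/* not an equality operator */
	EQ_PLAIN = 'e',				/* x = y, but f(x) may still differ from f(y) */
	EQ_STRICT = 's'				/* true only iff binary representations match */
} EqualityLevel;

/*
 * Can the planner assume x = y implies f(x) = f(y) for every immutable
 * function f?  Only under strict equality: e.g. numeric '1.0' = '1.00'
 * holds under plain equality while text output of the two values differs.
 */
static int
can_assume_func_equality(EqualityLevel lvl)
{
	return lvl == EQ_STRICT;
}
```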
Re: [HACKERS] Implied Functional Index use
> > There is a problem of implication here, AFAICS: when a user SQL asks
> > WHERE col1 = 7, which equality level is meant when several exist?
>
> Well, the operator must be unique, so there is no problem.

Unique in the sense that an operator with the same name ('=' in this case) and argument types cannot exist for more than one level of equality (and the level should not have an effect on the resolution). So, when we see col1 = 7, we look up the equality level of the operator and decide whether it is strict enough for the particular optimization.

Andreas
Re: [HACKERS] update/insert,
> OK, but the point of the question is that constantly updating a single row
> steadily degrades performance; would delete/insert also do the same?

Yes, there is currently no difference (so you should do the update). Of course performance only degrades if vacuum is not set up correctly.

Andreas
Re: [HACKERS] update/insert, delete/insert efficiency WRT vacuum and
> Is there a difference in PostgreSQL performance between these two different
> strategies:
>
>     if (!exec("update foo set bar='blahblah' where name = 'xx'"))
>         exec("insert into foo(name, bar) values('xx','blahblah')");
>
> or ...

In pg, this strategy is generally more efficient, since a pk-failing insert would create a tx abort and a heap tuple. (So in pg, I would choose the insert-first strategy only when the insert succeeds most of the time, say 95%.) Note however that the above error handling is not enough, because two different sessions can still both end up trying the insert (this is true for all db systems when using this strategy).

Andreas
Re: [HACKERS] Single Index Tuple Chain (SITC) method
> > Here is an overview of the SITC method:
> > http://momjian.us/cgi-bin/pgsitc
>
> A pretty fundamental problem is that the method assumes it's OK to change
> the CTID of a live tuple (by swapping its item pointer with some expired
> version). It is not --- this will break:

I am having difficulty visualizing that. The plan is not to change CTIDs (only the CTID's offset into the page is to be changed). The CTID of the new version is one that is up to now invisible to all backends, so no one can actually have remembered that CTID. Also, you would first insert the slot content and then change the CTID offset (this offset change might need to be made atomic).

Andreas
Re: [HACKERS] Single Index Tuple Chain (SITC) method
> > > And anyway, ctid is a usable unique row identifier only within
> > > read-only transactions, or not?
> >
> > Actually for as long as no vacuum comes along. This would change with
> > SITC. (Maybe it would help to only reuse old versions of the same row;
> > then anybody holding a ctid would at least still be looking at a version
> > of the same row, and should thus be able to follow the update chain.)
>
> Err, no. The ctid is the only identifier of a tuple in any case. When you
> do a delete, the tuple to be deleted is indicated by the ctid field which
> has been passed up from the base table through the rest of the query. When
> you reach the top, the ctid had better refer to the same tuple or you'll
> delete the wrong one. UPDATE is the same.

For all these purposes you will be holding the ctid of a (to someone) visible tuple. Those don't qualify for a new SITC tuple anyway.

> For all intents and purposes, the CTID of a tuple can't change unless
> you're 100% certain no-one is using it in any way.

For all I know, no one is using dead tuples except for visibility lookup. We would need to make sure that other backends see the new tuple either as dead or as tx-open as long as the contents are not valid. I think we could do that without a vacuum lock on platforms that support 4-byte atomic operations.

Andreas
Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2
Very nice explanation, thanks Alvaro.

> 2. Mark frozen databases specially somehow.
>    To mark databases frozen, we need a way to mark tables as frozen. How
>    do we do that? As I explain below, this allows some nice optimizations,
>    but it's a very tiny can full of a huge amount of worms.
>
> Marking a Table Frozen
> ======================
>
> Marking a table frozen is as simple as setting relminxid = FrozenXid for a
> table. As explained above, this cannot be done in a regular postmaster
> environment, because a concurrent transaction could be doing nasty stuff
> to a table. So we can do it only in a standalone backend.

Unless you lock the table exclusively during vacuum, in which case it could be done with vacuum freeze. I like that more than changing stuff that is otherwise completely frozen/static. (I see you wrote that below.)

> On the other hand, a frozen table must be marked with relminxid =
> a-regular-Xid as soon as a transaction writes some tuples on it. Note that
> this unfreezing must take place even if the offending transaction is
> aborted, because the Xid is written in the table nevertheless and thus it
> would be incorrect to lose the unfreezing.

The other idea was to need a special unfreeze command ...

> This is how pg_class_nt came into existence -- it would be a place where
> information about a table would be stored and not subject to the rolling
> back of the transaction that wrote it.

Oh, that puts it in another league, since it must guarantee commit. I am not sure we can do that. The previous discussion was about concurrency and data that was not so important, like tuple counts.

In short:
- I'd start with #1 (no relminxid = FrozenXid) like Tom suggested
- and then implement FREEZE/UNFREEZE with exclusive locks like Simon wrote (so it does not need pg_class_nt) and use that for the templates.

Simon wrote:
> Suggest that we prevent write operations on Frozen tables by revoking all
> INSERT, UPDATE or DELETE rights held, then enforcing a check during GRANT
> to prevent them being re-enabled. Superusers would need to check every
> time. If we don't do this, then we will have two contradictory states
> marked in the catalog - privileges saying Yes and freezing saying No.

No, I'd not mess with the permissions; I'd return a different error when trying to modify a frozen table. (It would also be complicated to unfreeze after CREATE DATABASE.) We should make it clear that freezing is no replacement for REVOKE.

Andreas
Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2
> > > Suggest that we prevent write operations on Frozen tables by revoking
> > > all INSERT, UPDATE or DELETE rights held, then enforcing a check
> > > during GRANT to prevent them being re-enabled. Superusers would need
> > > to check every time. If we don't do this, then we will have two
> > > contradictory states marked in the catalog - privileges saying Yes and
> > > freezing saying No.
> >
> > No, I'd not mess with the permissions; I'd return a different error when
> > trying to modify a frozen table. (It would also be complicated to
> > unfreeze after CREATE DATABASE.) We should make it clear that freezing
> > is no replacement for REVOKE.
>
> That was with a mind to performance. Checking every INSERT, UPDATE and
> DELETE statement to see if they are being done against a frozen table
> seems like a waste.

I'd think we would have relminxid in the relcache, so I don't buy the performance argument :-) (You could still do the actual check in the same place where the permission is checked.)

> There would still be a specific error message for frozen tables, just on
> the GRANT rather than the actual DML statements.

I'd still prefer to see the error on modify. Those that don't can revoke.

Andreas
Re: [HACKERS] vacuum, performance, and MVCC
On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > When you are using the update chaining, you can't mark that index row as
> > dead because it actually points to more than one row on the page, some
> > are non-visible, some are visible.
>
> Back up the truck ... you mean in the current code base we have heap
> tuples that are visible in index scans because of heap tuple chaining but
> without index tuples pointing directly at them? I don't know where this
> idea came from, but it's not true. All heap tuples, dead or otherwise,
> have index entries.

When using CITC, you would be reusing the index tuples from the current heap tuple, so you can only reuse free space or a dead member of a CITC chain. You cannot reuse a dead tuple that is not a member of a CITC chain, because that has separate (invalid) index tuples pointing at it. Part of the trick was moving slots (== ctid) around, so I still do not really see how you can represent the CITC chain as part of the update chain. Unless you intend to break dead parts of the update chain? Maybe that is ok?

Andreas
Re: [HACKERS] vacuum, performance, and MVCC
> > head of the chain yet. With an index scan, finding the head is easy, but
> > for a sequential scan, it seems more difficult, and we don't have any
> > free space in the tail of the chain to maintain a pointer to the head.
>
> Thinking some more, there will need to be a bit to uniquely identify the
> head of a CITC.

I don't think so. It would probably be sufficient to impose an order on the CITC, e.g. the oldest tuple version in the CITC is the head. (An idea just in case we can't spare a bit :-)

Andreas
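The "order instead of a flag bit" idea boils down to a scan over the chain members. A minimal sketch, with simplified stand-in types (the real tuple headers, xid comparison rules, and wraparound handling are deliberately ignored here):

```c
typedef unsigned int TransactionId;

/* simplified stand-in for a same-page chain member's header */
typedef struct ChainMember
{
	TransactionId xmin;			/* inserting transaction id */
} ChainMember;

/*
 * Identify the head of a chain as the member with the oldest xmin,
 * so no per-tuple "head" bit is needed.  A real implementation would
 * have to compare xids modulo wraparound, not with plain '<'.
 */
static int
chain_head(const ChainMember *members, int n)
{
	int			head = 0;
	int			i;

	for (i = 1; i < n; i++)
		if (members[i].xmin < members[head].xmin)
			head = i;
	return head;
}
```

This keeps sequential scans workable: any scan that has all chain members in hand can recompute the head without extra state on the page.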
Re: [HACKERS] vacuum, performance, and MVCC
> back and forth the data between an archive table and the live table, based
> on how active the groups are, I can't imagine any other way of partitioning
> it. And that would also mean some quite big load given the pretty high
> dynamics of the groups.

You said the activity comes in bursts per group, so the obvious partitioning would be per group. If you have too many groups to have one partition per group, you could try to find some modulo or other rule to spread them into separate partitions.

Andreas
Re: [HACKERS] vacuum, performance, and MVCC
> This could be a solution... but then I'm not sure how well queries would do
> which need the first 10 records based on some criteria which does not
> include the group id. I guess limit queries across the union of the
> partitions don't work too well for now, and we do have such queries.

You would query the parent (no union). Do you need ORDER BY? Without ORDER BY it is currently no problem. Do we push the limit down to the separate tables when we have an appropriate index for the ORDER BY (that could be a TODO item)? (You need a maximum of "limit" rows per child in the outer order.) Or we would need to implement an efficient index merge node for ORDER BY queries on parents (and UNION ALLs) with low limits and an appropriate index. Selecting the oldest x rows from a time-partitioned table is a frequent problem we need to work around here too (Informix db).

Andreas
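The "index merge node" idea above is essentially a k-way merge with an early stop: each child partition delivers rows in index order, and the merge only pulls until the LIMIT is satisfied. A minimal sketch, with int arrays standing in for per-partition ordered index scans (the function name and the 16-partition cap are illustrative assumptions):

```c
/*
 * Merge up to 'limit' smallest values from nparts individually sorted
 * inputs.  Each input stands in for one partition's ordered index scan;
 * note no input needs to be read past the limit.
 */
static int
merge_limit(const int **parts, const int *lens, int nparts,
			int *out, int limit)
{
	int			pos[16] = {0};	/* per-partition cursor; max 16 parts here */
	int			produced = 0;

	while (produced < limit)
	{
		int			best = -1;
		int			p;

		/* find the partition whose next row sorts first */
		for (p = 0; p < nparts; p++)
			if (pos[p] < lens[p] &&
				(best < 0 || parts[p][pos[p]] < parts[best][pos[best]]))
				best = p;
		if (best < 0)
			break;				/* all partitions exhausted */
		out[produced++] = parts[best][pos[best]++];
	}
	return produced;
}
```

With this shape, a LIMIT n over k partitions touches at most n rows per child instead of sorting the whole union, which is the win being asked for.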
Re: [HACKERS] vacuum, performance, and MVCC
> > > Each time the record is updated, a new version is created, thus
> > > lengthening the "correct version" search each time the row is accessed,
> > > until, of course, the next vacuum comes along and corrects the index to
> > > point to the latest version of the record. Is that a fair explanation?
> >
> > No, it's not. 1. The index points to all the versions, until they get
> > vacuumed out.
>
> It points to the last "current" version as updated by vacuum, or the first
> version of the row.

No, the index has one entry for each version of the row. This is why updating even only non-indexed columns is relatively expensive in pg.

Andreas
Re: [HACKERS] sync_file_range()
> Indeed, I've been wondering lately if we shouldn't resurrect
> LET_OS_MANAGE_FILESIZE and make that the default on systems with largefile
> support. If nothing else it would cut down on open/close overhead on very
> large relations.

I'd still put some limit on the filesize, else you cannot manually distribute a table across spindles anymore. Also, some backup solutions are not too happy with too-large files either (they have trouble with staging the backup). I would suggest something like 32 Gb.

Andreas
Re: [HACKERS] sync_file_range()
Tom Lane [EMAIL PROTECTED] writes:
> > > Indeed, I've been wondering lately if we shouldn't resurrect
> > > LET_OS_MANAGE_FILESIZE and make that the default on systems with
> > > largefile support. If nothing else it would cut down on open/close
> > > overhead on very large relations.
> >
> > I'd still put some limit on the filesize, else you cannot manually
> > distribute a table across spindles anymore. Also, some backup solutions
> > are not too happy with too-large files either (they have trouble with
> > staging the backup). I would suggest something like 32 Gb.
>
> Well, some people would find those arguments compelling and some wouldn't.
> We already have a manually configurable RELSEG_SIZE, so people who want a
> 32Gb or whatever segment size can have it. But if you're dealing with
> terabyte-sized tables that's still a lot of segments. What I'd be inclined
> to do is allow people to set RELSEG_SIZE = 0 in pg_config_manual.h to
> select the unsegmented option. That way we already have the infrastructure
> in pg_control etc to ensure that the database layout matches the backend.

That sounds perfect. Still leaves the question of what to default to? Another issue is that we would probably need to detect large-file support of the underlying filesystem, else we might fail at runtime :-(

Andreas
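As a rough illustration of the knob being discussed, a pg_config_manual.h fragment might look like the sketch below. This is a hypothetical rendering, not the actual header: it assumes RELSEG_SIZE stays expressed in BLCKSZ-sized pages and that 0 becomes the proposed "unsegmented" setting.

```c
/*
 * Hypothetical pg_config_manual.h fragment (illustrative only):
 * RELSEG_SIZE is the maximum relation segment size, measured in pages of
 * BLCKSZ bytes.  Under the proposal, 0 would select the unsegmented layout
 * (one file per relation), which requires largefile support from the OS
 * and filesystem.
 */
#ifndef BLCKSZ
#define BLCKSZ 8192
#endif

#define RELSEG_SIZE (0x40000000 / BLCKSZ)	/* default: 1 GB segments */

/* #define RELSEG_SIZE 0 */					/* proposed: unsegmented */
```

Because pg_control records the layout, a backend built with a different RELSEG_SIZE would refuse to start on a mismatched data directory, which is the safety net Tom refers to.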
Re: [HACKERS] MultiXacts WAL
I would like to see some checking of this, though. Currently I'm doing testing of PostgreSQL under very large numbers of connections (2000+) and am finding that there's a huge volume of xlog output ... far more than comparable RDBMSes. So I think we are logging stuff we don't really have to. I think you really have to lengthen the checkpoint interval to reduce WAL overhead (20 min or so). Also, imho you cannot compare only the log size/activity, since other db's write part of what pg writes to WAL to other areas (physical log, rollback segment, ...). If we cannot afford lengthening the checkpoint interval because of too heavy checkpoint load, we need to find ways to tune bgwriter, and not reduce the checkpoint interval. Andreas ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
This bothers me a bit, because in fact the effects if any of the tested query would have been rolled back. Not sure we have any choice though. If we expose the error then we'll have problems with clients not showing the EXPLAIN results. I think we should leave it in top level, throw the error and fix the clients. As I understood, the idea was that it only does that if you press ^C or on query timeout. In this case current clients would also not show the plan. Not if the clients are implemented per protocol spec. A client cannot assume that sending QueryCancel will make the current query fail. Sorry, I don't understand that comment. I did not say that it must fail; the case I meant was when it is interrupted (and thus fails). You stated that current clients won't show the EXPLAIN output if they get an error response. (Does the protocol not allow both data and an error?) We would need to teach clients to output the EXPLAIN result even if an error is returned. I stand by my comment: on ^C we should return the plan and return the error. We should not misuse automatic subtransactions for this. Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] More on inheritance and foreign keys
I think that the ONLY was wrong from day one :-( Well, sure, but until we have an implementation that actually *works* across multiple tables, it has to be there so that we can at least consistently support the current single-table semantics. Until we have some form of cross-table unique constraint (index or whatever) we can't support multi-table foreign keys --- taking off the ONLY is not a fix. Um, I think it would work for a special case, where the unique constraint includes the partitioning column[s], and the partitions (check constraints) don't overlap. In this case you can create simple unique indexes on the subtables. Looking at other db's, this is not such an exceptional requirement for unique indexes that share the same partitioning scheme as the table. And imho the case of all indexes sharing the table's partitioning scheme is the most important use case. Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
This bothers me a bit, because in fact the effects if any of the tested query would have been rolled back. Not sure we have any choice though. If we expose the error then we'll have problems with clients not showing the EXPLAIN results. I think we should leave it in top level, throw the error and fix the clients. As I understood, the idea was, that it only does that if you press ^C or query timeout. In this case current clients would also not show the plan. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] ADD/DROP constraints
On a separate note. The one major remaining piece here is in constraints. I'm thinking what I have to check is that every constraint present on the parent table is present on the child tables. And I'm thinking I should do that by looking at the constraint's textual definition (consrc). This doesn't allow you to get by with a single stronger constraint -- you would still need the redundant looser constraint to satisfy the inheritance. Yes, I think you would actually want either an identical or a stronger constraint on the child. But it does let you get by with constraint names that don't match the parent's. I'm not sure that's such a good thing, since pg_dump would then generate a redundant constraint when it generates the table. Maybe that would go if constraints got conislocal and coninh. Or maybe I should insist that a matching constraint name be present *and* that the source text match? That's more of a pain to code though. I think in the meantime I would check that either a source match is present OR a constraint with the same name. This would allow more flexibility and imho still enough safety checking. Until we have (or feel a need for) check logic for stronger constraints it would be the op's responsibility. Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] ADD/DROP constraints
Or maybe I should insist that a matching constraint name be present *and* that the source text match? That's more of a pain to code though. That could also break some partitioning schemes; I don't think it's a given that parents and children have matching constraints, and afaik a parent can have constraints that a child doesn't. Yea, but that is why we would have parent ONLY constraints, they would only apply when the tuple is actually stored in the parent relation. In the typical partitioning case it does not really matter since the parent ONLY is typically empty. Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] ADD/DROP INHERITS
But that's entirely inconsistent with the way inherited tables work in general. I don't see any basis for that conclusion. The properties of a table are set when it's created and you need to do pretty explicit ALTERs to change them. It just seems weird for: CREATE TABLE foo (x,y,z) INHERITS (bar) not to be equivalent to: CREATE TABLE foo (x,y,z) ALTER TABLE foo ADD INHERITS bar Imho the op should only choose that path if he wants to fill the table before adding the inheritance. It makes no sense to add columns with default values to existing rows of the child table, especially when you inherit the defaults from the parent. So I agree with Tom that ADD INHERITS should not add columns. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] More on inheritance and foreign keys
The solution to the foreign key problem seems easy if I modify PostgreSQL implementation and take off the ONLY word from the SELECT query, but it's not an option for me, as I'm I think that the ONLY was wrong from day one :-( The default in other areas is the table including children. (Not like in old pg where you had to use tab* to include children) (iirc leaving off ONLY is not sufficient because of locking problems) Of course then we would need REFERENCES tenk ONLY (unique1) to allow the current behavior. Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Updatable views/with check option parsing
While I don't think that making WITH a fully reserved word would cause any great damage, I'm unwilling to do it just to save a couple of lines of code. I think we should go ahead and promote WITH to a reserved keyword now. Oracle, MS-SQL, DB2, MySQL and Informix also have WITH reserved, so it would imho be ok to do it if it simplifies code. Andreas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] error-free disabling of individual child partition
Added to TODO: o Add ALTER TABLE tab ADD/DROP INHERITS parent Sounds good, more in line with ADD/DROP CONSTRAINT. pg_attribute.attislocal has to be set to 'false' for ADD, and attislocal: If you set this to false, you wouldn't be able to set it back again. Just curious, why is that? pg_attribute.attinhcount adjusted appropriately Do we have a plan on what to do with RI constraints on the parent when doing DROP INHERITS? It seems all FK tables would need to be checked for keys in the local table. (Not sure whether we have a PK on tab*, but it seems you could do it when the partitioning column is part of the PK) Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] LIKE, leading percent, bind parameters and indexes
AFAICS the problem is not restricted to LIKE, we can easily find a lot of similar problems caused by the actual parameters. For example, SeqScan vs. IndexScan vs. BitmapIndexScan for a range query. So an improvement is definitely needed. Another way is to generate a plan on the fly. What we do is to let some REPLAN nodes sit on top of some critical plan node: at the execution, we will compare the actual numbers we get and the estimated number we have Since we are deciding this on histogram data, it seems we could store the ranges (and exception values) where this plan is not good, and replan in case the new value does not fit. This would also imply that we postpone (part of the) planning until we get the first values, when the node cost largely depends on the supplied value. Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
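A minimal sketch of the replan-when-out-of-range idea discussed above. All names here (CachedPlan, valid_range, toy_replan) are hypothetical illustration, not PostgreSQL API; in the proposal the validity range would come from the histogram data the plan was costed against:

```python
class CachedPlan:
    """A cached plan annotated with the parameter range it is valid for."""
    def __init__(self, plan, valid_range):
        self.plan = plan
        self.low, self.high = valid_range

    def covers(self, param):
        return self.low <= param <= self.high

def execute(query, param, cache, replan):
    entry = cache.get(query)
    if entry is None or not entry.covers(param):
        # Parameter falls outside the range the cached plan was costed for:
        # replan with the actual value and remember the new validity range.
        entry = replan(query, param)
        cache[query] = entry
    return entry.plan

# Toy replanner: index scan for selective values, seq scan otherwise.
def toy_replan(query, param):
    if param < 100:
        return CachedPlan("IndexScan", (0, 99))
    return CachedPlan("SeqScan", (100, float("inf")))

cache = {}
print(execute("q1", 5, cache, toy_replan))     # IndexScan (planned)
print(execute("q1", 7, cache, toy_replan))     # IndexScan (cached, no replan)
print(execute("q1", 5000, cache, toy_replan))  # SeqScan (replanned)
```

The point is that a replan is triggered only when the supplied value leaves the stored range, so the common case still hits the cached plan.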
Re: [HACKERS] error-free disabling of individual child partition
table of another table. I propose a TODO item to allow this: ALTER TABLE childN INHERITS ( parent1, ... ); We don't need a disinherit do we? I propose: ALTER TABLE childN INHERITS (); Thus I also think, that the list should be complete, and is not an addition to existing inheritance. O, yes, I think we do. I can imagine that the ability to swap a table Agreed. Simon, were you testing how many ppl read to the end :-) Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Compression and on-disk sorting
1) Use n sort areas for n tapes making everything purely sequential access. Testing I did some time ago has shown that, if the IO block size is large enough (256k), it does not really matter much whether the blocks are at random locations. I think that is still true for current-model disks. So unless we parallelize, it is imho sufficient to see to it that we write (and read) large enough blocks with single calls. This also has no problem in highly concurrent scenarios, where you do not have enough spindles. Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
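A rough model of why large IO blocks make random placement cheap: the seek cost is paid once per block, so it is amortized away as the block grows. The drive numbers (8 ms seek, 60 MB/s sequential bandwidth) are assumptions for illustration only:

```python
def transfer_time_ms(total_mb, block_kb, seek_ms=8.0, bandwidth_mb_s=60.0):
    """Estimated time to read total_mb from randomly placed blocks of block_kb.

    One seek per block plus pure transfer time; hypothetical drive numbers.
    """
    n_blocks = total_mb * 1024.0 / block_kb
    return n_blocks * seek_ms + total_mb / bandwidth_mb_s * 1000.0

# Reading 100 MB in random 8 KB blocks is dominated by seeks, while random
# 256 KB blocks get within a factor of ~3 of pure sequential speed.
print(round(transfer_time_ms(100, 8)))    # ~104067 ms
print(round(transfer_time_ms(100, 256)))  # ~4867 ms
print(round(100 / 60.0 * 1000))           # ~1667 ms pure sequential
```

Under these assumptions the 256k figure quoted in the thread is roughly where the seek share drops below the transfer share.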
Re: [HACKERS] Compression and on-disk sorting
Certainly, if you can't prototype a convincing performance win using that algorithm, it's unlikely to be worth anyone's time to look harder. That should be easily possible with LZO. It would need to be a lib that we can optionally link to (--with-lzo), since the lib is GPL. LZO even allows for in-place decompression and overlapping compression. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Compression and on-disk sorting
Unfortunately, the interface provided by pg_lzcompress.c is probably insufficient for this purpose. You want to be able to compress tuples as they get inserted and start a new block once the output reaches a I don't think anything that compresses single tuples without context is going to be a win under realistic circumstances. I would at least compress whole pages. Allow a max ratio of 1:n, keep the pg buffer cache uncompressed, and only compress on write (the filesystem cache then holds compressed pages). The tricky part is predicting whether a tuple still fits in an n*8k uncompressed / 8k compressed page, but since LZO is fast you might even test it in corner cases. (probably logic that also needs to be in the available page freespace calculation) Choosing a good n is also tricky; probably 2 (or 3?) is good. You probably also want to always keep the header part of the page uncompressed. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pg_dump and backslash escapes
Very clear. The issue is that I can't find any of these emitted by a pg_dump version whose native backend doesn't understand them. I assume that it is expected that a cross-db dump/reload will generate errors, and it is done rarely for upgrades, but I assume same-version dump/restore is done more frequently and people don't expect errors. Is that not a significant distinction? I thought the suggested procedure (see migration doc) was to use the new pg_dump to dump the older db version, so why backpatch? Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pg_dump and backslash escapes
I thought the suggested procedure (see migration doc) was to use the new pg_dump to dump the older db version, so why backpatch ? Uh, you can suggest it, but I would guess 50% do it, and once the old database is gone, there is no way to re-do the dump. But you can still load the dump if you execute the two statements in the new db psql session before loading the dump file, no ? SET escape_string_warning = off; SET standard_conforming_strings = off; Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Compression and on-disk sorting
Given that any time that happens we end up caring much less about CPU usage and much more about disk IO, for any of these cases that use non-random access, compressing the data before sending it to disk would potentially be a sizeable win. Note however that what the code thinks is a spill to disk and what actually involves disk I/O are two different things. If you think of it as a spill to kernel disk cache then the attraction is a lot weaker... Yes, that is very true. However it would also increase the probability that spill to disk is not needed, since more data fits in RAM. It would probably need some sort of plugin architecture, since the fastest compression algorithms (LZO) that also reach good ratios are gpl. LZO is proven to increase physical IO write speed with low CPU overhead. Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Compression and on-disk sorting
Personally, I believe it would be worth it - but only to a few. And most of these few are likely using Oracle. So, no gain unless you can convince them to switch back... :-) We do know from commercial databases that support both raw and filesystem storage that raw storage is only a few percentage points faster. Imho it is really not comparable, because they all use direct or async IO that bypasses the OS buffer cache even when using filesystem files for storage. A substantial speed difference is allocation of space for restore (no format of fs and no file allocation needed). I am not saying this to advocate moving in that direction however. I do however think that there is substantial headroom in reducing the number of IO calls and reducing on-disk storage requirements. Especially in concurrent load scenarios. Andreas ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Wrong plan for simple join with index on FK
These are all minor aberrations though; on the whole the estimates are pretty good. Perhaps you need to tweak the values of random_page_cost and similar variables. Thank you. Is this a general problem or only mine? I have a 100% standard current PC. The default random_page_cost assumes some concurrent activity. If your PC does nothing else concurrently, the performance of a seq scan will be underestimated. Try the statement with some concurrent disk load and you will most likely see that the first plan is faster. (assuming the tables are not fully cached) Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] [HACKERS] Big IN() clauses etc : feature proposal
Something else worth considering is not using the normal catalog methods for storing information about temp tables, but hacking that together would probably be a rather large task. But the timings suggest that it cannot be the catalogs in the worst case he showed.

0.101 ms BEGIN
1.451 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP

1.4 ms is not great for CREATE TABLE; is that what we expect?

0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.443 ms ANALYZE tmp
0.365 ms SELECT * FROM tmp
0.310 ms DROP TABLE tmp
32.918 ms COMMIT

CREATING the table is OK, but what happens on COMMIT? I hear the disk seeking frantically. The 32 ms for COMMIT can hardly be catalog related. It seems the file is fsynced before it is dropped. Andreas ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] XLOG_BLCKSZ vs. wal_buffers table
I'm planning on continuing to increase XLOG_BLCKSZ and wal_buffers to determine when the throughput starts to level out or drop. I think for an even better comparison you should scale wal_buffers down with increasing XLOG_BLCKSZ, so that the xlog buffer has a fixed size in KB. Reasonable wal_buffers imho amount to at least 256 KB, better yet 512 KB or 1 MB, with sufficiently large transactions (and to try to factor out the difference between block sizes). AFAIK all the transactions in DBT2 are pretty small. I think all DML is single-row in fact, so I'm not sure that having wal_buffers much larger than the number of connections would help much. Well, but those updates wander around the whole table/index, so you will have a lot of before images to write. So I take back the "sufficiently large transactions" part of my comment. You want more wal_buffers in all higher load scenarios. (one test had 8 buffers of 2k each; this is not enough in any high-load scenario) Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
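The fixed-size-buffer scaling suggested above is just arithmetic; for example, holding the xlog buffer at 512 KB while varying the block size:

```python
def wal_buffers_for(target_kb, xlog_blcksz):
    """Number of wal_buffers that keeps the total xlog buffer at target_kb.

    This is the scaling suggested above: compare XLOG_BLCKSZ settings at a
    fixed buffer size instead of a fixed buffer count.
    """
    assert target_kb * 1024 % xlog_blcksz == 0
    return target_kb * 1024 // xlog_blcksz

# A fixed 512 KB xlog buffer across candidate block sizes:
for blcksz in (2048, 8192, 32768):
    print(blcksz, wal_buffers_for(512, blcksz))
# 2048 -> 256 buffers, 8192 -> 64, 32768 -> 16
```

Otherwise a fixed wal_buffers count silently grows the buffer with the block size and the two effects cannot be separated.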
Re: [HACKERS] XLOG_BLCKSZ vs. wal_buffers table
I'm planning on continuing to increase XLOG_BLCKSZ and wal_buffers to determine when the throughput starts to level out or drop. I think for an even better comparison you should scale wal_buffers down with increasing XLOG_BLCKSZ, so that the xlog buffer has a fixed size in KB. Reasonable wal_buffers imho amount to at least 256 KB, better yet 512 KB or 1 MB, with sufficiently large transactions (and to try to factor out the difference between block sizes). Andreas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] ANSI-strict pointer aliasing rules
4. Find the option for disabling strict alias and get configure to add that. You'll still lose performance, but the option is -qalias=noansi. My old xlc does not show that option, it is unfortunately version specific. The currently compatible option to turn it off would be -qnoansialias So we can use: xlc -qnoansialias The default cc options are: -qlanglvl=extended -qnoro -qnoroconst So I guess we could also use (but above is imho clearer/better): cc -qro -qroconst -qlanglvl=extc89 Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] ANSI-strict pointer aliasing rules
I ran afoul of these rules the other day when compiling pgsql 8.1 on AIX. The configure scripts are set up to look for xlc instead of cc, and that command invokes cc with -qalias=ansi, the ANSI-strict pointer aliasing mode. Can you please explain what exactly was not working ? xlc has in the past shown warnings that were actually problematic code that gcc did not show (and the cc variant of xlc also does not show). Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Concurrency problem building indexes
[ shrug... ] Worksforme. There is a short interval at the end of the first CREATE INDEX on the table where the problem would happen if another CREATE INDEX tries to modify the pg_class row before the first one's committed. I did a pg_dumpall and removed the index creation commands. The first time I run the index build, I usually get at least one occurrence. I think that narrows it down nicely. You create the table, load rows, then without analyze create the indexes; thus pg_class is not up to date, and the update needs to be done. My answer to this would be to (have an option to) omit this relpages and reltuples update. It is imho not the task of CREATE INDEX to update statistics in the first place. I have been burnt by that behavior when creating indexes on empty tables in Informix and never liked it (iirc pg has a workaround for empty tables though). Wes, you could most likely solve your immediate problem if you did an analyze before creating the indexes. Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [GENERAL] Concurrency problem building indexes
Wes, you could most likely solve your immediate problem if you did an analyze before creating the indexes. I can try that. Is that going to be a reasonable thing to do when there's 100 million rows per table? I obviously want to minimize the number of sequential passes through the database. No, I think it would only help if it gets the exact tuple count. For large tables it only gets an exact count with a full scan (use vacuum instead of analyze). Then again, when the table is large, the different create indexes should finish at sufficiently different times, so an analyze might be sufficient to fix the problem for small tables. (analyze is fast for large tables since it only does a sample) Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pg_dump -Ft failed on Windows XP
Apparently it won't work at all if TMP isn't set? I'm not *too* concerned about that, since TMP is normally set by the OS itself. There's one set in the system environment (to c:\windows\temp or whatever) and then it's overridden by one set by the OS when it loads a user profile. OK, then maybe not having it would be equivalent to /tmp-not-writable on Unix, ie, admin error. Also to the point, what would you fall back to? Current directory maybe? It tries \ (tested on Win 2000) if the dir argument is NULL and TMP is not set. But TMP is usually set. Attached is a working version not yet adapted to port/. - memory leak fixed - use _tempnam and _fdopen, not the compatibility tempnam and fdopen (imho only cosmetic) - EACCES fixed (Win2000 needs _S_IREAD | _S_IWRITE or fails with EACCES, even as Admin) - I suggest adding a prefix pg_temp_ (for leftover temp files after a crash; the name I get is then usually pg_temp_2) Andreas

Index: bin/pg_dump/pg_backup_tar.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_backup_tar.c,v
retrieving revision 1.50
diff -c -r1.50 pg_backup_tar.c
*** bin/pg_dump/pg_backup_tar.c  12 Feb 2006 06:11:50 -0000  1.50
--- bin/pg_dump/pg_backup_tar.c  21 Apr 2006 09:22:00 -0000
***************
*** 362,368 ****
--- 362,388 ----
  {
      tm = calloc(1, sizeof(TAR_MEMBER));
+ #ifndef WIN32
      tm->tmpFH = tmpfile();
+ #else
+     /*
+      * On win32, tmpfile() generates a filename in the root directory,
+      * which requires administrative permissions to write to.
+      */
+     while (1)
+     {
+         char    *tmpname;
+         int      fd;
+
+         tmpname = _tempnam(NULL, "pg_temp_");
+         if (tmpname == NULL)
+             break;
+         fd = _open(tmpname, _O_RDWR | _O_CREAT | _O_EXCL | _O_BINARY | _O_TEMPORARY,
+                    _S_IREAD | _S_IWRITE);
+         free(tmpname);
+         if (fd == -1 && errno == EEXIST)
+             continue;   /* try again with a new name if the file exists */
+         if (fd != -1)
+             tm->tmpFH = _fdopen(fd, "w+b");
+         break;
+     }
+ #endif
  if (tm->tmpFH == NULL) die_horribly(AH, modulename, "could not generate temporary file name: %s\n", strerror(errno)); ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Practical impediment to supporting multiple SSL libraries
Well, the psqlODBC driver apparently ran into a number of problems with libpq that resulted in them not using it for their purpose. Given libpq's primary purpose is to connect to PostgreSQL, it failing at that is something that should be fixed. I think you are forgetting that e.g. a JDBC driver will not want to depend on an external C DLL at all. It will want a native Java implementation (Group 4). Thus imho it is necessary to have a defined wire protocol, which we have. So if a driver needs to use the wire protocol, that is imho not a problem. If applications started using it because they don't find a suitable driver, now that would be a problem. Andreas ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] schema-qualified SET CONSTRAINTS
The attached patch allows SET CONSTRAINTS to take a schema-qualified constraint name (myschema.t1_fk_t2) and when given a bare constraint name it uses the search_path to determine the matching constraint instead of the previous behavior of disabling all identically named constraints. This patch seems egregiously non backwards compatible :-(. Yes, it does change the existing behavior, but egregiously? How many applications intentionally defer constraints in multiple schemas at once? intentionally defer specifically named constraints in multiple schemas (The default application would imho either defer all constraints, or one specific constraint) Not many. I would guess the more likely situation is that these applications don't even realize that they are deferring more than one constraint when it happens. I agree. I think the new behavior is more intuitive, and would even argue the old behavior gets it wrong. Andreas ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Accessing schema data in information schema
Plan C would be to say that we don't need to preserve SELECT * FROM seqname, but I'll bet there would be some hollering. I'd like to hear this hollering first, before we create tons of views :-) Imho it is not a problem to remove it, I am for Plan C. (Those with need for the select can still create their view by hand. A release note would be sufficient imho.) Of course if we still need one row in pg_class for the ACL's, that row might as well be a view. Andreas ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Merge algorithms for large numbers of tapes
Two pass will create the count of subfiles proportional to: Subfile_count = original_stream_size/sort_memory_buffer_size The merge pass requires (sizeof record * subfile_count) memory. That is true from an algorithmic perspective. But to make the merge efficient you would need to have enough RAM to cache a reasonably large block per subfile. Else you would need to reread the same page/block from one subfile multiple times. (If you had one disk per subfile you could also rely on the disk's own cache, but I think we can rule that out) Example: You have a 7 gigabyte table to sort and you have a 100 MB sort buffer. The number of subfiles will be: 7000 MB / 100 MB = 70 files To be efficient you need (70 + 1) * max(record_size, 256 KB) = about 18 MB. Plus you need a structure per subfile that points to the current record in the buffer. Andreas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
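The example's arithmetic as a sketch (ceiling division covers tables that are not an exact multiple of the sort buffer):

```python
def merge_memory_bytes(table_bytes, sort_buffer_bytes, read_block=256 * 1024):
    """Memory needed for an efficient single merge pass, per the example above.

    One read buffer per subfile (the sorted runs produced by the first pass)
    plus one output buffer; 256 KB per buffer is the block size argued for in
    this thread.
    """
    subfiles = -(-table_bytes // sort_buffer_bytes)  # ceiling division
    return (subfiles + 1) * read_block

mb = 1024 * 1024
need = merge_memory_bytes(7000 * mb, 100 * mb)  # 70 subfiles + 1 output
print(need // mb)  # ~17 MB, the "about 18 MB" of the example
```

So the merge memory grows with the subfile count, not the table size, which is why a larger first-pass sort buffer indirectly shrinks the merge-pass requirement.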
Re: [HACKERS] Merge algorithms for large numbers of tapes
Two pass will create the count of subfiles proportional to: Subfile_count = original_stream_size/sort_memory_buffer_size The merge pass requires (sizeof record * subfile_count) memory. That is true from an algorithmic perspective. But to make the merge efficient you would need to have enough RAM to cache a reasonably large block per subfile. Else you would need to reread the same page/block from one subfile multiple times. (If you had one disk per subfile you could also rely on the disk's own cache, but I think we can rule that out) But what about the OS cache? Linux will read up to the next 128KB of a file if it's contiguous on disk, which is likely with modern filesystems. It's likely to be much fairer than any way we can come up with to share memory. We were discussing how much RAM is needed, not how much the backend allocates itself. So if the backend needs to duplicate some of the OS cache, that will only add to the memory requirement. The most likely scenario is that the backend additionally holds one page per subfile. Question is, do we want our algorithm to rely on that caching? Currently we do, and I don't think that is so bad actually. The only optimization I would consider is adding a sequential-access hint to the tape file open :-) Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] problem with large maintenance_work_mem settings and
I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting. Well, since they are not actually tapes, why not? I wonder what the OS does when we repeatedly open and close those files because we are short on file descriptors? Will it replace cached pages of a file that we have closed *more* aggressively? Maybe we should limit the tapes to how many files we would actually be able to hold open in parallel? Or keep more than one tape in one file and remember a start offset into the file per tape. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
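A sketch of the several-tapes-per-file idea: the file holds fixed-size blocks from all tapes interleaved, and each logical tape remembers the offsets of its own blocks, so one file descriptor serves any number of tapes. The in-memory stand-in for the file and all names here are hypothetical:

```python
class TapeSet:
    """Many logical tapes multiplexed into one physical file."""
    BLOCK = 8192

    def __init__(self, ntapes):
        self.file = bytearray()                    # stand-in for the on-disk file
        self.blocks = [[] for _ in range(ntapes)]  # per-tape block offsets

    def write_block(self, tape, data):
        # Append a zero-padded block at the end of the shared file and
        # record its start offset under the owning tape.
        assert len(data) <= self.BLOCK
        self.blocks[tape].append(len(self.file))
        self.file += data.ljust(self.BLOCK, b"\0")

    def read_tape(self, tape):
        # Replay one tape's blocks in write order, skipping other tapes' blocks.
        for off in self.blocks[tape]:
            yield bytes(self.file[off:off + self.BLOCK]).rstrip(b"\0")

tapes = TapeSet(3)
tapes.write_block(0, b"run0-a")
tapes.write_block(2, b"run2-a")
tapes.write_block(0, b"run0-b")
print(list(tapes.read_tape(0)))  # [b'run0-a', b'run0-b']
```

Reads within one tape are no longer contiguous, which is where the thread's large-read-block argument comes back in.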
Re: [HACKERS] Merge algorithms for large numbers of tapes
This amounts to an assumption that you have infinite work_mem, in which case you hardly need an external sort at all. If your work_mem is in fact finite, then at some point you need more than two passes. I'm not really interested in ripping out support for sort operations that are much larger than work_mem. No it does not. I have explained this before. You can have one million files and merge them all into a final output with a single pass. It does not matter how big they are or how much memory you have. Huh? But if you have too many files your disk access basically becomes random access (since you have thousands of files per spindle). From tests on AIX I have pretty much concluded that if you read 256k blocks at a time, random access does not really hurt that much any more. So, if you can hold 256k per file in memory that should be sufficient. Andreas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Automatic free space map filling
But you could do the indexes first and remember how far you can vacuum the heap later. But the indexes _can't_ be done first; you _first_ need to know which tuples are dead, which requires looking at the table itself. If we already had the all tuples visible bitmap I think we could first scan the bitmap and decide whether we can afford to look at the visibility info for each entry in the index. We only collect the ctids before so we don't have the inefficient lookups, but if we can avoid the lookup in most cases it would again be attractive. Andreas ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Automatic free space map filling
> > But what about index clearing? When do you scan each index?
>
> At the end of each iteration (or earlier, depending on maintenance_work_mem). So for each iteration you would need to scan the indexes. Maybe we could make maintenance_work_mem be the deciding factor; after scanning the indexes, do the release/reacquire locks cycle.

But you could do the indexes first and remember how far you can vacuum the heap later. So you might as well do each index separately first and remember, for each one, how far you can go with the heap. Then do the heap with a special restriction derived from what you remembered from the indexes. You can now split the heap vacuum into arbitrarily large transactions, since the indexes are already taken care of. (You only vacuum the heap up to the point reached by the least-advanced index pass.)

Andreas
Re: [HACKERS] Automatic free space map filling
> I thought we had sufficiently destroyed that reuse a tuple meme yesterday. You can't do that: there are too many aspects of the system design that are predicated on the assumption that dead tuples do not come back to life. You have to do the full vacuuming bit (index entry removal, super-exclusive page locking, etc) before you can remove a dead tuple.

One more idea I would like to throw in. Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead tuple by reducing the tuple to its header info. (If you still wanted to be able to locate index entries fast, you would need to keep the indexed columns, but I think we agreed that there is no real use.) I think that would be achievable at reasonable cost on the page of the currently active tuple (the first page that is considered), since you can avoid one page IO. On this page:

  if free space available -- use it
  elsif free space available after reducing all dead rows -- use the free space with a new slot
  else ...

Of course this only works while we still have free slots, but I think that might not really be an issue.

Andreas
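The branch logic above can be sketched against a toy page model (Python; the dict fields are invented for illustration and do not match PostgreSQL's real page layout):

```python
def place_tuple(page, tuple_len):
    """Decide where a new tuple version goes on the current page,
    following the rule sketched above: use plain free space if there is
    enough; otherwise try to make room by truncating dead rows to bare
    headers; otherwise give up on this page. Toy model only."""
    if page["free"] >= tuple_len:
        page["free"] -= tuple_len
        return "used free space"
    # space recoverable by reducing every dead row to just its header
    reclaimable = sum(r["len"] - r["header"] for r in page["rows"] if r["dead"])
    if page["free"] + reclaimable >= tuple_len:
        for r in page["rows"]:
            if r["dead"]:
                page["free"] += r["len"] - r["header"]
                r["len"] = r["header"]        # keep header info only
        page["free"] -= tuple_len
        return "reduced dead rows"
    return "page full"                        # move on to another page
```

Note the saving being claimed: the second branch avoids fetching a different page when the current one can be made to fit.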
Re: [HACKERS] Automatic free space map filling
> I think you must keep the header because the tuple might be part of an update chain (cf vacuuming bugs we repaired just a few months ago). t_ctid is potentially interesting data even in a certainly-dead tuple.

Yes, I'd still want to keep the full header.

> Andreas' idea is possibly doable but I am not sure that I see the point. It does not reduce the need for vacuum nor the I/O load imposed by vacuum. What it does do is bias the system in the direction of allocating an unreasonably large number of tuple line pointers on a page (ie, more than are useful when the page is fully packed with normal tuples). Since we never reclaim such pointers, over time all the pages in a table would tend to develop line-pointer-bloat. I don't know what the net overhead would be, but it'd definitely impose some aggregate inefficiency.

Ok, to vacuum, the slot would look like any other dead row and thus be a target for removal. Why do we not truncate the line pointer array? Is it that vacuum (not the full version) does not move rows to other pages or slots? Of course VACUUM FULL could do it, but I see your point. Maybe we could impose an upper limit on the number of slots to allow, after which the optimization is turned off. But this starts to sound not so good :-(

Andreas
Re: [HACKERS] GRANT/REVOKE: Allow column-level privileges
> 3) For every privilege descriptor in CPD whose action is INSERT, UPDATE, or REFERENCES without a column name, privilege descriptors are also created and added to CPD for each column C in O for which A holds the corresponding privilege with grant option. For each such column, a privilege descriptor is created that specifies the identical grantee, the identical action, object C, and grantor A.
>
> 4) For every privilege descriptor in CPD whose action is SELECT without a column name or method name, privilege descriptors are also created and added to CPD for each column C in O for which A holds the corresponding privilege with grant option. For each such column, a privilege descriptor is created that specifies the identical grantee, the identical action, object C, and grantor A.
>
> As I read it, granting a table-level privilege is equivalent to repeating the appropriate column-level privilege for all columns. In other words, for this table:
>
>   CREATE TABLE tab (c1 int, c2 int, c3 int);
>
> this statement:
>
>   GRANT SELECT ON tab TO grantee;
>
> also implies:
>
>   GRANT SELECT (c1) ON tab TO grantee;
>   GRANT SELECT (c2) ON tab TO grantee;
>   GRANT SELECT (c3) ON tab TO grantee;
>
> This means that after the following, the grantee should have no privileges on tab.c1 (but should retain them on tab.c2, tab.c3):
>
>   GRANT SELECT ON tab TO grantee;
>   REVOKE SELECT (c1) ON tab FROM grantee;

I don't (do not want to) read that conclusion from the above paragraphs; anyone else? My reasoning is that you can only revoke what has previously been granted. For example,

  grant dba to grantee;

cannot be revoked for that table with:

  revoke select on tab from grantee;

I think the paragraphs have only been added so you can work out what rights you hold on each column.

Andreas
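The contested reading — that a table-level GRANT expands to per-column descriptors which a later column-level REVOKE can then remove individually — can be modeled in a few lines (Python; this encodes only the interpretation under debate, not settled SQL semantics):

```python
def grant_table(privs, grantee, action, columns):
    """Table-level grant: per the spec paragraphs quoted above, one
    column-level privilege descriptor is added for each column."""
    for col in columns:
        privs.add((grantee, action, col))

def revoke_column(privs, grantee, action, column):
    """Column-level revoke removes just that column's descriptor --
    this is exactly the reading being questioned in the reply."""
    privs.discard((grantee, action, column))
```

Under Andreas's counter-reading, the table-level grant would instead remain a single table-level descriptor that a column-level revoke cannot partially remove; the model above makes the difference between the two readings concrete.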
Re: [HACKERS] gprof SELECT COUNT(*) results
> OTOH DB2 and SQLServer take block level read locks, so they can do this too, but at major loss of concurrency and threat of deadlock.

Note that in the usual committed-read isolation, they do not need to read-lock a row! E.g. Informix only verifies that it *could* lock the row (that there is no write lock). Only cursor stability leaves one read lock held until the next fetch; serializable actually leaves all read locks, and SELECT FOR UPDATE takes an intent-update lock. Also, they usually feed a buffer of rows to the client, so when the client does a fetch it gets a row from the client-side buffer; only when the buffer is empty do they get more from the server. I think the statement holds that the optimization is PG-specific and cannot be directly compared to other DBs.

Andreas
Re: [HACKERS] gprof SELECT COUNT(*) results
> > DB2: Uncommitted Read (UR) mode
>
> Dirty read isn't the default, or the recommended lock level for most apps. I was considering Cursor Stability mode (or higher), which is the default.

Sorry, they call it read committed but actually do cursor stability, which does keep one lock on the last fetched row. Keeping that lock would actually not be necessary to conform with ANSI read committed. See table 4 on page 8 of http://www.cs.ndsu.nodak.edu/~yawang/Snapshot.ppt

> SQLServer: READ COMMITTED does take share locks.

But it does not hold them. According to the docs it holds them "while reading", which is not a very detailed description. How long is that really, e.g. with an ODBC forward-cursor fetch?

> There's a NO LOCK hint, true, but its not a default.

That hint is for dirty/uncommitted reads.

Andreas
Re: [HACKERS] Should libedit be preferred to libreadline?
> With AIX 5, the easiest way to get a shared object is to pass -bexpall to the linker. This results in all symbols being exported.

Yes, that is another reason not to use this broken switch. And the last time I checked (AIX 4.3.3), -bexpall did not export all needed symbols (e.g. globals) from the backend either. And the counterpart -bimpall did not work either; dynamic loading did not work without the .imp and .exp files :-(

Andreas

PS: I'd prefer if readline were only linked where it is needed, namely in psql.
Re: [HACKERS] Improving count(*)
> > The instant someone touches a block it would no longer be marked as frozen (vacuum or analyze or other is not required) and count(*) would visit the tuples in the block making the correct decision at that time.
>
> Hmm, so the idea would be that if a block no longer contained any tuples hidden from any active transaction, you could store the count and skip reading that page.

I like the approach of Informix and MaxDB, which can tell the count(*) instantly without looking at index leaf or data pages. Imho we could do that with a central storage of count(*) even with MVCC. The idea is a base value for count(*) plus corrective values per open xid. To tell the count, you add to the base all corrective values whose xid is visible in your snapshot. Each backend is responsible for compacting xid counters below the minimum open xid. Periodically (e.g. at checkpoint time) you compact (aggregate committed xid counters into the base value) and persist the count.

Since that costs, I guess I would make it optional and combine it with materialized views that are automatically used at runtime, and can at the same time answer other aggregates or aggregates for groups:

  create materialized view xx_agg enable query rewrite as
    select count(*), sum(col1) from xx [group by col2];

Your page-flag storage could possibly also be used for btree access, to short-circuit the heap visibility lookup (e.g. for pages where all rows are visible (vacuumed)). I think your proposal is too complex if it is not also used to improve other performance areas.

Andreas
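The base-value-plus-per-xid-deltas idea can be sketched as follows (Python; the class and method names are invented, and snapshot visibility is reduced to a callback, so this models only the bookkeeping, not real MVCC semantics):

```python
class MvccCounter:
    """Toy model of the scheme described above: a persisted base count
    plus per-transaction corrective deltas. A reader adds only the
    deltas of xids visible in its snapshot; committed deltas below the
    oldest open xid are periodically folded into the base."""

    def __init__(self):
        self.base = 0
        self.deltas = {}       # xid -> net inserted-minus-deleted rows
        self.committed = set()

    def record(self, xid, n):
        self.deltas[xid] = self.deltas.get(xid, 0) + n

    def commit(self, xid):
        self.committed.add(xid)

    def count(self, snapshot_visible):
        # snapshot_visible(xid) -> True if that xid's effects are visible
        return self.base + sum(d for xid, d in self.deltas.items()
                               if snapshot_visible(xid))

    def compact(self, oldest_open_xid):
        # fold committed deltas below the oldest open xid into the base
        for xid in [x for x in self.deltas
                    if x in self.committed and x < oldest_open_xid]:
            self.base += self.deltas.pop(xid)
```

The point of `compact` is exactly the checkpoint-time aggregation in the text: after it runs, most readers need only the base plus a handful of recent deltas.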
Re: [HACKERS] MERGE vs REPLACE
> Unless you have a table lock, INSERT has to be before UPDATE, think UPDATE, UPDATE (both fail), INSERT, INSERT.

  update
  if no rows updated: insert
  if duplicate key: update
  if no rows updated: goto insert

That is why you have the loop. This is not a problem with the above code, because only one insert succeeds while the others then do the update.

Andreas
Re: [HACKERS] Improving count(*)
> > Since that costs, I guess I would make it optional and combine it with materialized views that are automatically used at runtime, and can at the same time answer other aggregates or aggregates for groups:
> >
> >   create materialized view xx_agg enable query rewrite as
> >     select count(*), sum(col1) from xx [group by col2];
>
> I wonder how many times you really need a count(*) w/o where clause. If I understand you correctly you are trying to optimize just this one case?

I guess you have not read to the end. A materialized view with a group by, as indicated in the example, is able to answer all sorts of queries with or without where clauses (e.g. ... where col2 = 'x').

Andreas
Re: [HACKERS] tablespaces and non-empty directories
> > This is because lost+found exists. Since lost+found would be a reasonably common directory to find at a mount-point on Unix-like OSs*, would it make sense for CREATE TABLESPACE to ignore it if present?
>
> No. There is no reason to use a volume's root directory as a tablespace; especially so since the root directory ought to be owned by root.

That is not so on AIX. Only the mount point (the dir in the parent) is owned by root; once mounted, the directory can have (and preserves) any permissions you want. But on AIX the workaround is to remove the directory after mounting and before creating the tablespace.

Andreas
Re: [HACKERS] MERGE vs REPLACE
> The problem I try to solve is something along: a bunch of clients try to update a count, and ONE of them must initialize the count if it does not exist... this can't be done with current infrastructure without race conditions.

The solution in other DBs, without MERGE but with a unique key, is:

  update
  if no rows updated: insert
  if duplicate key: update
  if no rows updated: goto insert

Note that the counter updates need to be of the form "set x = x + ? where key = y". Do you see a potential race condition with this? In PG you also need a savepoint before the insert for this to work. Depending on the ratio of inserts vs. updates, you can also start with the insert when the insert succeeds more than 50% of the time (I would use a higher percentage with PG, though):

  insert
  if duplicate key: update
  if no rows updated: goto insert

Andreas
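The retry loop above can be simulated against a dict standing in for a table with a unique key (Python; `DuplicateKey` and the helper are invented for the demo — in a single-threaded model the race can never actually fire, but the control flow matches the loop described):

```python
class DuplicateKey(Exception):
    """Stand-in for a unique-key violation raised by a concurrent insert."""

def upsert_count(table, key, delta, max_tries=10):
    """Update-first variant of the loop sketched above: try the UPDATE,
    fall back to INSERT, and if a racer beat us to the INSERT, go back
    and UPDATE. Toy single-threaded model, not a real driver API."""
    for _ in range(max_tries):
        if key in table:                 # UPDATE ... SET x = x + ? WHERE key = ?
            table[key] += delta
            return table[key]
        try:                             # no row matched: try to INSERT
            if key in table:             # a concurrent racer inserted first
                raise DuplicateKey()     # (cannot happen single-threaded)
            table[key] = delta
            return table[key]
        except DuplicateKey:
            continue                     # "goto update": retry the loop
    raise RuntimeError("gave up after too many retries")
```

The bound on retries is a defensive choice for the demo; the argument in the thread is that the loop terminates quickly in practice because exactly one racer's insert succeeds and the rest fall through to the update.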