Re: [HACKERS] Make subquery alias optional in FROM clause
On 2017-02-24 07:25, Robert Haas wrote: > I don't think it's only Oracle that allows omitting the > alias; I think there are a number of other systems that behave > similarly. SQLite, for example. Making conversions from SQLite to Postgres easier is a Good Thing. "subquery in FROM must have an alias" has caused me inconvenience doing that as recently as last week. -M- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Remove TZ entry from postgres CLI doc page.
Hi,

Please find attached a patch to the postgres command line documentation page doc/src/sgml/ref/postgres-ref.sgml that removes the "TZ" entry from the "Environment" section. If I've understood it correctly, since ca4af308 TZ can be looked at when you run initdb, but is not looked at when the server is started. I am using Test::PostgreSQL to create a test database; it empties the postgresql.conf created by initdb and is therefore not using the correct timezone, and it took me a while to work out what was happening and why I couldn't use TZ when starting the database.

ATB,
Matthew

0001-Remove-TZ-entry-from-postgres-CLI-doc-page.patch
Re: [HACKERS] [GENERAL] 4B row limit for CLOB tables
That's assuming that toasting is evenly spread between tables. In my experience, that's not a great bet... Time to create a test:

SELECT chunk_id::bigint/100000 AS id_range, count(*), count(*)/(100000::float) AS density
FROM (SELECT chunk_id FROM pg_toast.pg_toast_39000165 WHERE chunk_id > 1 AND chunk_seq = 0) f
GROUP BY id_range ORDER BY id_range;

The machine in question was restored in parallel in Sept 2013 as part of an upgrade from 8.4. It has about 2000 tables, so it's definitely not dominated by a couple of tables. Progress towards oid wraparound is about 25.6%. With minimal effort, I found 2 bad examples, and I'm sure I can easily find more. I attached the results for those two. There were runs of 1,100,000+ and 600,000+ chunk_ids where more than 99% of the chunk_ids are taken. After restore completion, oid densities averaged less than 20 per 100,000 and 400 per 100,000 respectively. The only reason those runs are so short is that the tables were much smaller back then. I expect that next time I dump and restore (necessary for upgrading OS versions due to the collation issue), I'm going to have runs closer to 20,000,000.

> ... this fix would actually make things enormously worse. With the single counter feeding all tables, you at least have a reasonable probability that there are not enormously long runs of consecutive OIDs in any one toast table. With a sequence per table, you are nearly guaranteed that there are such runs, because inserts into other tables don't create a break.

It makes each toast table independent (and far less likely to wrap). It would wrap when sum(mods on THIS toast table) > 2^32. Right now the condition looks like: sum(mods on ALL toast tables in cluster) + sum(created normal tables in cluster * k) + sum(created temp tables in cluster * k) + [...] > 2^32, where k = average number of oids consumed for pg_class, pg_type, etc...
In the case of an insert-only table (which is a common use case for partitions), the id would only wrap when the TOAST table was "full". Currently, on the other hand, it would wrap into its pg_restore'd section when the combined oid-consuming operations on the cluster surpassed 4 billion. That being said, I'm certainly not attached to that solution. My real argument is that although it's not a problem today, we are only about 5 years from it being a problem for large installs, and the first time you'll hear about it is after someone has a 5 minute production outage on a database that's been taking traffic for 2 years.

- Matt K.

 id_range | count | density
----------+-------+---------
      390 | 92188 | 0.92188
      391 | 99186 | 0.99186
      392 | 99826 | 0.99826
      393 | 99101 | 0.99101
      394 | 99536 | 0.99536
      395 | 99796 | 0.99796
      396 | 99321 | 0.99321
      397 | 99768 | 0.99768
      398 | 99744 | 0.99744
      399 | 99676 | 0.99676
      400 | 98663 | 0.98663
      401 | 40690 | 0.4069
      403 |    92 | 0.00092
      404 |   491 | 0.00491
      407 |    74 | 0.00074
      408 |    54 | 0.00054
      415 |   152 | 0.00152
      416 |    47 | 0.00047
      419 |    59 | 0.00059
      422 |     2 | 2e-05
      423 |    14 | 0.00014
      424 |     5 | 5e-05
      425 |    11 | 0.00011
      426 |     7 | 7e-05
      427 |     5 | 5e-05
      428 |     6 | 6e-05
      517 |     5 | 5e-05
      518 |     9 | 9e-05
      519 |     6 | 6e-05
      520 |    12 | 0.00012
      521 |    17 | 0.00017
      522 |     5 | 5e-05
      588 |    15 | 0.00015
      589 |    10 | 0.0001
      590 |    19 | 0.00019
      591 |    12 | 0.00012
      592 |    12 | 0.00012
      593 |     2 | 2e-05
      617 |     4 | 4e-05
      618 |     9 | 9e-05
      619 |     7 | 7e-05
      620 |    14 | 0.00014
      621 |     5 | 5e-05
      622 |    11 | 0.00011
      682 |     8 | 8e-05
      683 |    13 | 0.00013
      684 |    17 | 0.00017
      685 |     6 | 6e-05
      686 |    17 | 0.00017
      687 |     4 | 4e-05
      767 |     5 | 5e-05
      768 |    10 | 0.0001
      769 |     9 | 9e-05
      770 |     2 | 2e-05
      771 |    14 | 0.00014
      772 |     2 | 2e-05
      773 |    11 | 0.00011
      774 |    13 | 0.00013
      775 |    10 | 0.0001
      776 |     3 | 3e-05
      914 |     7 | 7e-05
      915 |     7 | 7e-05
      916 |     1 | 1e-05
      917 |     3 | 3e-05
      918 |     3 | 3e-05
      919 |     5 | 5e-05
      920 |     4 | 4e-05
      921 |     9 | 9e-05
      922 |     9 | 9e-05
      923 |     1 | 1e-05
(70 rows)

 id_range | count | density
----------+-------+---------
      402 | 96439 | 0.96439
      403 | 99102 | 0.99102
      404 | 98787 | 0.98787
      405 | 99351
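The grouped query above can be mimicked offline. A small sketch (hypothetical chunk_ids, not the attached data; bucket size of 100,000 ids assumed) of the same density computation:

```python
# Sketch of the density computation: bucket chunk_ids into ranges of
# 100,000 and report the fraction of each range that is occupied.
from collections import Counter

BUCKET = 100_000

def chunk_id_density(chunk_ids):
    """Map bucket index -> fraction of that 100,000-id range in use."""
    counts = Counter(cid // BUCKET for cid in chunk_ids)
    return {b: n / BUCKET for b, n in sorted(counts.items())}

# A pg_restore-style run of consecutive ids shows up as density near 1.0:
dens = chunk_id_density(range(39_000_000, 39_150_000))
```

A fully occupied bucket reports 1.0, matching the ~0.99 densities in the attached results.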
Re: [HACKERS] [GENERAL] 4B row limit for CLOB tables
> Hmm, 2^32 times approx. 2kB (as per usual heuristics, ~4 rows per heap page) is 8796093022208 (~8.8e12) bytes ... which results in 8192 1GB segments :O

8192 1GB segments is just 8TB; it's not _that_ large. At TripAdvisor we've been using a NoSQL solution to do session storage. We are looking to probably swap that out to be Postgres (every other db backing the site is Postgres). Essentially, what I'm building is a system with 1 logical table that maps session id to a 2KB+ grab bag of ever-changing session attributes which is partially normalized, partially json. 315 million uniques a month multiplied by the retention policy means I need to hold 2-4 billion session objects (and somehow expire old ones). Additionally, most http calls can update the session, so between maintenance windows I expect to take around 20 billion 'upserts'. Obviously, I will have to shard and partition the table in practice, but this weekend I ran a test that demonstrated that a single table on a 9.4 server + logical replication + Dell 730xd can handle 4x that workload. Well, it can for 38 hours... until you wrap xids on the toast table. :P I'll be the first to admit that isn't the normal use case though. I'm happy to have found this thread, however, because I'm going to have to build around the global oid counter, explicitly to prevent the problem I explain below regarding clustering.

> Anybody actually reaching this limit out there?

Well, it's not the 4 billion row limit that concerns me; it's the global shared counter in conjunction with pg_restore/clustering that is actually pretty concerning. I just checked through all of TripAdvisor's normal databases, and the max tuples I see in a single toast table is 17,000,000, so that is still a couple of orders of magnitude too small. (However, it is close enough that it'll be a concern in a few years.) However, I do have active databases where the current oid is between 1 billion and 2 billion.
They were last dump-restored for a hardware upgrade a couple of years ago and were a bit more than half the size. I therefore can imagine that I have tables which are keyed by ~8,000,000 consecutive oids. I would argue that when it wraps there will be a single insert that will probably block for 2-5 minutes while it tries to accomplish ~8,000,000 index scans inside of GetNewOidWithIndex. Even partitioning doesn't protect you from this potential problem. What's even more weird is that this issue can be triggered by consuming too many oids in a different database in the same cluster (i.e. creating large amounts of temp tables).

> The problem with changing the id from 32 to 64 bits is that the storage *for everybody else* doubles, making the implementation slower for most though this might be actually not that important.

Well, you aren't doubling the storage. Even if you have to store the key in 4 places, you are adding 16 bytes per TOAST tuple. If we work off the 2KB estimate for each TOAST tuple, then you are only increasing the storage by 0.7%. I'm sure there are more hidden costs, but we are really only talking about a low single digit percent increase. In exchange, you get to drop one index scan per toast insert; an index scan looking in the only hot part of the index. That being said, I'd be perfectly happy merely giving each TOAST table its own sequence, as that almost entirely mitigates the risk of an unexpected lock up on reasonably sized tables/partitions, and provides a functional workaround for those of us with larger than average installs.

- Matt K
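The storage-overhead figure above is easy to sanity-check. A back-of-envelope sketch (assumed numbers, not measurements: +16 bytes per toast tuple for a widened key stored in 4 places, and the usual ~2KB average toast tuple):

```python
# Back-of-envelope check of the "low single digit percent" claim
# (assumed inputs, not measured from a real cluster).
EXTRA_BYTES = 16        # widened 8-byte key stored in 4 places (assumption)
AVG_TUPLE_BYTES = 2048  # ~2KB per toast tuple heuristic

overhead = EXTRA_BYTES / AVG_TUPLE_BYTES  # fraction of extra storage
percent = round(overhead * 100, 2)        # just under 0.8%
```

So the quoted ~0.7% estimate is in the right ballpark for 2KB tuples; larger toasted values dilute the overhead further.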
Re: [HACKERS] UPSERT wiki page, and SQL MERGE syntax
On 2014-10-10 19:44, Kevin Grittner wrote:

> Peter Geoghegan p...@heroku.com wrote: People keep remarking that they don't like that you can (optionally) name a unique index explicitly, [...]
>
> To restate: to do so is conflating the logical definition of the database with a particular implementation detail. As just one reason that is a bad idea: we can look up unique indexes on the specified columns, but if we implement other storage techniques where there is no such thing as a unique index on the columns, yet manage to duplicate the semantics (yes, stranger things have happened), people can't migrate to the new structure without rewriting their queries.

Wouldn't it be good enough to define the 'WITHIN' as expecting a unique-constraint name rather than an index name (even though those happen to be the same strings)? I think constraints are part of the logical definition of the database, and a new storage technique which doesn't use indexes should still have names for its unique constraints.

-M-
Re: [HACKERS] Collations and Replication; Next Steps
Here is where I think the timezone and PostGIS cases are fundamentally different: I can pretty easily make sure that all my servers run in the same timezone. That's just good practice. I'm also going to install the same version of PostGIS everywhere in a cluster. I'll build PostGIS and its dependencies from the exact same source files, regardless of when I build the machine. Timezone is a user level setting; PostGIS is a user level library used by a subset. glibc is a system level library, and text is a core data type, however. Changing versions to something that doesn't match the kernel can lead to system level instability, broken linkers, etc. (I know because I tried.)

Here are some other subtle problems that fall out:
* Upgrading glibc, the kernel, and the linker through the package manager in order to get security updates can cause the corruption.
* A basebackup that is taken in production and placed on a backup server might not be valid on that server, or your desktop machine, or on the spare you keep to do PITR when someone screws up.
* Unless you keep _all_ of your clusters on the same OS, machines from your database spare pool probably won't be the right OS when you add them to the cluster because a member failed. Keep in mind here, by OS I mean CentOS versions. (We're running a mix of late 5.x and 6.x, because of our numerous issues with the 6.x kernel.)

The problem with LC_IDENTIFICATION is that every machine I have seen reports revision 1.0, date 2000-06-24. It doesn't seem like the versioning is being actively maintained. I'm with Martijn here; let's go ICU, if only because it moves sorting to a user level library instead of a system level one. Martijn, do you have a link to the out-of-tree patch? If not I'll find it. I'd like to apply it to a branch and start playing with it.
- Matt K

On Sep 17, 2014, at 7:39 AM, Martijn van Oosterhout klep...@svana.org wrote:

On Tue, Sep 16, 2014 at 02:57:00PM -0700, Peter Geoghegan wrote: On Tue, Sep 16, 2014 at 2:07 PM, Peter Eisentraut pete...@gmx.net wrote: Clearly, this is worth documenting, but I don't think we can completely prevent the problem.

There has been talk of a built-in index integrity checking tool. That would be quite useful. We could at least use the GNU facility for versioning collations where available, LC_IDENTIFICATION [1]. By not versioning collations, we are going against the express advice of the Unicode consortium (they also advise to do a strcmp() tie-breaker, something that I think we independently discovered in 2005 because of a bug report - this is what I like to call the Hungarian issue. They know what our constraints are.). I recognize it's a tricky problem, because of our historic dependence on OS collations, but I think we should definitely do something. That said, I'm not volunteering for the task, because I don't have time. While I'm not sure of what the long term solution should be, it *is not* okay that we don't version collations. I think that even the best possible B-Tree check tool is not a solution.

Personally I think we should just support ICU as an option. FreeBSD has been maintaining an out-of-tree patch for 10 years now, so we know it works. The FreeBSD patch is not optimal though; these days ICU supports UTF-8 directly, so many of the push-ups FreeBSD does are no longer necessary. It is often faster than glibc and the key sizes for strxfrm are more compact [1], which is relevant for the recent optimisation patch. Let's solve this problem once and for all.

[1] http://site.icu-project.org/charts/collation-icu4c48-glibc

--
Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts.
-- Arthur Schopenhauer
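A rough illustration (my sketch, not from the thread) of why C-locale collation sidesteps the versioning problem entirely: in the C locale, strxfrm-style sort keys reduce to plain bytewise order, which never changes across library versions or machines.

```python
# Sketch: in the C locale, locale.strxfrm produces sort keys whose order
# matches plain codepoint comparison, so the result is identical on every
# machine regardless of glibc version.  (ICU-backed collations would need
# explicit versioning instead.)
import locale

locale.setlocale(locale.LC_COLLATE, "C")

words = ["cherry", "Apple", "banana"]
by_sort_key = sorted(words, key=locale.strxfrm)  # collation-aware sort
bytewise = sorted(words)                         # plain codepoint order
```

This is exactly why "collate all of your indexes in C" appears later in the thread as the only currently safe option.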
Re: [HACKERS] Collations and Replication; Next Steps
Let me double check that assertion before we go too far with it. Most of the problems I've seen are across 5 and 6 boundaries. I thought I had a case where it was within a minor release, but I can't find it right now. I'm going to dig. That being said, the sort order changes depending on whether you statically or dynamically link (demonstrated on 4+ machines running different linux flavors), so at this point I have no reason to trust the stability of the sort across any build.

I legitimately question whether strcoll is buggy. For example, I have cases where for three strings a, b and c: a < b, but (a || c) > (b || c). That's right: postfixing doesn't hold. It actually calls into question the index scan optimization that occurs when you do LIKE 'test%' even on a single machine, but I don't want to bite that off at the moment. My mentality has switched to 'don't trust any change until shown otherwise', so that may have bled into my last email.

- Matt K.

On Sep 17, 2014, at 8:17 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Sep 17, 2014 at 9:07 AM, Matthew Kelly mke...@tripadvisor.com wrote: Here is where I think the timezone and PostGIS cases are fundamentally different: I can pretty easily make sure that all my servers run in the same timezone. That's just good practice. I'm also going to install the same version of PostGIS everywhere in a cluster. I'll build PostGIS and its dependencies from the exact same source files, regardless of when I build the machine. Timezone is a user level setting; PostGIS is a user level library used by a subset. glibc is a system level library, and text is a core data type, however. Changing versions to something that doesn't match the kernel can lead to system level instability, broken linkers, etc. (I know because I tried). Here are some subtle other problems that fall out: * Upgrading glibc, the kernel, and linker through the package manager in order to get security updates can cause the corruption.
* A basebackup that is taken in production and placed on a backup server might not be valid on that server, or your desktop machine, or on the spare you keep to do PITR when someone screws up. * Unless you keep _all_ of your clusters on the same OS, machines from your database spare pool probably won't be the right OS when you add them to the cluster because a member failed. Keep in mind here, by OS I mean CentOS versions. (we're running a mix of late 5.x and 6.x, because of our numerous issues with the 6.x kernel) The problem with LC_IDENTIFICATION is that every machine I have seen reports revision 1.0, date 2000-06-24. It doesn't seem like the versioning is being actively maintained. I'm with Martjin here, lets go ICU, if only because it moves sorting to a user level library, instead of a system level. Martjin do you have a link to the out of tree patch? If not I'll find it. I'd like to apply it to a branch and start playing with it.

What I find astonishing is that whoever maintains glibc (or the Red Hat packaging for it) thinks it's OK to change the collation order in a minor release. I'd understand changing it between, say, RHEL 6 and RHEL 7. But the idea that minor release, supposedly safe updates think they can whack this around without breaking applications really kind of blows my mind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
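The non-postfix behaviour described above (a < b yet a || c > b || c) can be probed mechanically. A hypothetical sketch (my strings, not TripAdvisor's data) of such a probe, run here under the C locale where, for strings that differ at a non-prefix position, appending a common suffix preserves order:

```python
# Hypothetical probe for postfix stability of the active collation:
# a < b should imply a+c < b+c.  A buggy strcoll can violate this;
# the sample strings below differ before any prefix boundary, so in
# the C locale the property holds for all of them.
import locale

locale.setlocale(locale.LC_COLLATE, "C")

def postfix_stable(a, b, c):
    """True unless a sorts before b but a+c fails to sort before b+c."""
    if locale.strcoll(a, b) < 0:
        return locale.strcoll(a + c, b + c) < 0
    return True

samples = ["a b", "a-b", "ab"]
ok = all(postfix_stable(a, b, c)
         for a in samples for b in samples for c in ["", "x"])
```

Running the same probe under a glibc UTF-8 locale, over strings harvested from a suspect index, is one way to hunt for the counterexamples Matt describes.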
[HACKERS] Collations and Replication; Next Steps
Hello,

Last month, I brought up the following issue on the general mailing list about how running streaming replication between machines running different versions of glibc can cause corrupt indexes. http://www.postgresql.org/message-id/ba6132ed-1f6b-4a0b-ac22-81278f5ab...@tripadvisor.com

In the month following, we have done further investigation here at TripAdvisor and have found that the scope of this issue is far more troubling than initially thought. Hackers seems like the appropriate place to present this update because it will certainly motivate some discussion about the approach to collation support going forward.

After the initial episode, we thought it was necessary to find the true scope of the problem. We developed a quick smoke test to evaluate the integrity of the indexes on a given machine. We understood that the test was not exhaustive, but it would catch most instances of corrupt indexes given TripAdvisor's normal database usage pattern. The source code with documentation about how it works is available at (https://github.com/mkellycs/postgres_index_integrity_check) for those interested.

What we found with this simple check was simply frightening. In every single streaming replica cluster where one or more machines had been commissioned at a different time, that member was found to be corrupt. When hardware upgrades of the master had been accomplished via streaming replication, the new master was also found to have similar issues. The following numbers are only as small as they are because our adoption of streaming replication has barely just begun. So far we have found:

* 8 internal production databases, and 2 live site database servers affected.
* Up to 3771 rows out of place in a single index (more correctly: 3771 times a row was smaller than the row before it when sorted in ascending order; the actual number of incorrectly placed rows is probably much higher)
* On the worst offender, there were 12806 rows out of place across 26 indexes
* On average, roughly 15% of indexes containing text keys on tables larger than 100MB were found to exhibit this issue
* In at least one case, rebuilding a unique index on a master revealed that the database had allowed 100+ primary key violations.

It sounds like we as a community knew that these issues were theoretically possible, but I now have empirical evidence demonstrating the prevalence of this issue on our corpus of international data. Instances of this issue showed up in indexes of member usernames, location/property names, and even Facebook URLs. I encourage other sufficiently large operations who index internationalized text to run similar tests; it's highly likely they have similar latent issues that they just have not detected yet.

Here is the simple reality. Collation based indexes, streaming replication, and multiple versions of glibc/OS cannot coexist in a sufficiently large operation without causing corrupt indexes. The current options are to collate all of your indexes in C, or to ensure that all of your machines run exactly the same OS version.

The first and immediate TODO is to patch the documentation to add warnings regarding this issue. I can propose a doc patch explaining the issue, if no one has any objections.

The second and far more challenging problem is how do we fix this issue? As of our last discussion, Peter Geoghegan revived the proposal of using ICU as an alternative. (http://www.postgresql.org/message-id/CAEYLb_WvdCzuL=cyf1xyzjwn-1cvo6kzeawmkbxts3jphtj...@mail.gmail.com) I do not feel qualified to compare the value of this library to other options, but I am certainly willing to help with the patch process once a direction has been selected.
I will be at Postgres Open in Chicago this week, and I will be more than willing to further discuss the details of what we have found. Regards, Matt Kelly
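The smoke test itself lives at the GitHub link above; a minimal sketch of the "rows out of place" metric it reports (my simplification: the real tool walks a live btree, here the "index order" is just a list of keys) might look like:

```python
# Minimal sketch of the corruption metric: walk keys in index order and
# count positions where a key sorts before the key preceding it.
def rows_out_of_place(keys_in_index_order):
    """Count adjacent inversions in the supposed sort order."""
    return sum(1 for prev, cur in zip(keys_in_index_order,
                                      keys_in_index_order[1:])
               if cur < prev)

# An index built under one collation but compared under another can
# surface inversions like these (bytewise, "Banana" < "apple"):
suspect = ["apple", "Banana", "cherry", "Date"]
bad = rows_out_of_place(suspect)
clean = rows_out_of_place(sorted(suspect))
```

A nonzero count on a production index is the signal the post describes; zero means only that this particular walk found no adjacent inversions, not that the index is provably clean.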
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
All,

I have successfully isolated this error and created a simple SQL script to reproduce it. Just to recap - this script will cause a server crash with exception 0xC0000409 as described in previous emails. The crux of the problem seems to be my creation / use of the function st_transform_null. My intent with this function is to wrap the st_transform function provided by PostGIS, but account for the situation where the argument to be transformed is NULL. In this situation, st_transform throws an internal_error, which my function catches and returns NULL for. The error / crash is not caused by a NULL argument; rather, it is caused by the final value in the attached script's INSERT statement, which contains a lat/lon pair that is beyond PostGIS's range. I'm not questioning whether this value is actually outside the legal range, but I do not think such an input should cause the server to crash completely.

Here are the steps to reproduce the crash:

1) Create a new instance of a 9.2 server (Windows 64-bit), and a new database (call it test) with the PostGIS extension.
2) Run the script: psql -U postgres -d test -f C:\server_crash.sql
   You should see the following:
   psql:C:/server_crash.sql:31: server closed the connection unexpectedly
   This probably means the server terminated abnormally before or while processing the request.
   psql:C:/server_crash.sql:31: connection to server was lost
3) Check your log for the error.

I hope this helps. It took me quite a while to track down the problem, so I hope someone can figure out what is going on under the hood. It seems to be a pretty significant problem.

Cheers,
Matt

On Sun, Nov 11, 2012 at 9:45 PM, Matthew Gerber gerber.matt...@gmail.com wrote: On Sun, Nov 11, 2012 at 8:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Noah Misch n...@leadboat.com writes: So, I can reproduce the lower threshold, but the exception type does not agree with the one Matthew observed.
I finally got around to looking at the link you provided about error 0xC0000409, and realized that I'd been completely confusing it with stack overflow --- but actually, it's a report that something scribbled past the end of a finite-size local-variable array. So I now think that Matthew's stumbled across two completely independent bugs, and we've fixed only one of them. The 0xC0000409 error is something else, and possibly a lot worse, since it could conceivably be a security issue. It still seems likely that the actual location of the bug is either in PostGIS or in the GIST index code, but without the ability to reproduce the failure it's awfully hard to find it. Matthew, could you try a bit harder to find a self-contained test case that produces that error?

regards, tom lane

Sure, it might take me a while to find time, but I'll keep it on my list.

Matt

server_crash.sql
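The st_transform_null wrapper discussed in this thread isn't included in the archive. A hedged sketch of what such a function might look like (my reconstruction from the description above, not Matthew's actual code; it assumes PostGIS's st_transform and the described catch-internal_error-and-return-NULL behaviour):

```sql
-- Hypothetical reconstruction: wrap st_transform so that a NULL argument
-- (or an internal_error raised by PostGIS for out-of-range coordinates)
-- yields NULL instead of aborting the statement.
CREATE OR REPLACE FUNCTION st_transform_null(geom geometry, srid integer)
RETURNS geometry AS $$
BEGIN
    IF geom IS NULL THEN
        RETURN NULL;
    END IF;
    RETURN st_transform(geom, srid);
EXCEPTION WHEN internal_error THEN
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```

Note that the plpgsql exception block establishes a subtransaction per call; that is the usual cost of this catch-and-null pattern, and it is orthogonal to the server crash being investigated here.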
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 11, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: On Sun, Nov 04, 2012 at 02:30:38PM -0500, Tom Lane wrote: Matthew Gerber gerber.matt...@gmail.com writes: Here is the command that was executing when the 0xC0000409 exception was raised: INSERT INTO places (bounding_box,country,full_name,id,name,type,url) VALUES (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789 32.771786,-97.034085 32.771786))',4326),26918),'United States','Irving, TX','dce44ec49eb788f5','Irving','city','http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'),

Assuming that 0xC0000409 does actually represent a stack-overrun condition,

It signifies scribbling past the end of the frame's local variables: http://msdn.microsoft.com/en-us/library/8dbf701c.aspx

A slight update on this: previously, my insert code involved a long SELECT ... UNION ALL ... SELECT ... UNION ALL ... command. If this command was too long, I would get a stack depth exception thrown back to my client application. I reduced the length of the command to eliminate the client-side exceptions, but on some occasions I would still get the 0xC0000409 crash on the server side. I have eliminated the long SELECT ... UNION ALL ... command, and I now get no errors on either side. So it seems like long commands like this were actually causing the server-side crashes. The proper behavior would seem to be throwing the exception back to the client application instead of crashing the server.

Hope this helps...

Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 11, 2012 at 11:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Matthew Gerber gerber.matt...@gmail.com writes: On Sun, Nov 11, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote: It signifies scribbling past the end of the frame's local variables: http://msdn.microsoft.com/en-us/library/8dbf701c.aspx A slight update on this: previously, my insert code involved a long SELECT ... UNION ALL ... SELECT ... UNION ALL ... command. How long is long? I was seeing queries with around 5000-7000 UNION ALL statements. Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 11, 2012 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Matthew Gerber gerber.matt...@gmail.com writes: On Sun, Nov 11, 2012 at 11:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: How long is long? I was seeing queries with around 5000-7000 UNION ALL statements.

Hm. I experimented with test queries created like so:

perl -e 'print "SELECT 1 a, 2 b, 3 c\n"; print "UNION ALL SELECT 1 a, 2 b, 3 c\n" foreach (1..8200);' | psql

On the machine I tried this on, it works up to about 8200 and then fails in the way I'd expect:

ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

But then when I cranked it up to 8, kaboom:

connection to server was lost

Inspection of the core dump shows transformSetOperationTree is the problem --- it's recursing but lacks a check_stack_depth test. So that's easy to fix, but I wonder why the critical depth limit seems to be so much less on your machine. I get the expected error up to about 65000 UNION ALLs --- why is yours crashing at a tenth of that?

Tom,

Interesting. I really have no idea why mine seemed to fail so much sooner. I recalled my 5k-7k figure from memory, so I might be off on that, but probably not by an order of magnitude. In any case, it sounds like you know how to fix the problem. Should I file this as a bug report or will you take care of it from here?

Best,
Matt
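For those without perl handy, an equivalent generator for the reproducer (my sketch, mirroring the one-liner above) is:

```python
# Build a query with n UNION ALL branches, like the perl one-liner,
# to probe the recursion depth of transformSetOperationTree.
def union_all_query(n):
    branch = "SELECT 1 a, 2 b, 3 c"
    return "\n".join([branch] + ["UNION ALL " + branch] * n)

q = union_all_query(8200)  # pipe the result into psql to reproduce
```

Raising n past the stack-depth threshold turns the expected "stack depth limit exceeded" ERROR into the crash discussed above (on unpatched servers lacking the check_stack_depth test).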
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 11, 2012 at 2:43 PM, Noah Misch n...@leadboat.com wrote: On Sun, Nov 11, 2012 at 10:10:31AM -0500, Matthew Gerber wrote: Matthew Gerber gerber.matt...@gmail.com writes: Here is the command that was executing when the 0xC0000409 exception was raised: INSERT INTO places (bounding_box,country,full_name,id,name,type,url) VALUES (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789 32.771786,-97.034085 32.771786))',4326),26918),'United States','Irving, TX','dce44ec49eb788f5','Irving','city','http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'),

A slight update on this: previously, my insert code involved a long SELECT ... UNION ALL ... SELECT ... UNION ALL ... command. If this command was too long, I would get a stack depth exception thrown back to my client application. I reduced the length of the command to eliminate the client-side exceptions, but on some occasions I would still get the 0xC0000409 crash on the server side. I have eliminated the long SELECT ... UNION ALL ... command, and I now get no errors on either side. So it seems like long commands like this were actually causing the server-side crashes. The proper behavior would seem to be throwing the exception back to the client application instead of crashing the server.

Above, you quoted an INSERT ... VALUES of two rows. Have you observed an exception-0xC0000409 crash with an INSERT ... VALUES query, or only with an INSERT ... SELECT ... thousands of UNION query?

Every time the server crashed with 0xC0000409, the log reported that it was running the simple INSERT command (two rows) that I started this thread with. However, this didn't make any sense to me given the simplicity of the INSERT command and the fact that the error indicated a stack overflow. So I removed the long SELECT ... UNION ALL ... command, since it seemed more relevant to the error, and the process has been running continuously for a few days now.
To answer your question directly: I was seeing the server crash when using the simple INSERT and long SELECT ... UNION ... (these commands are issued independently at different points in the program). Now my program is only using the simple INSERT, and the crashes are gone. Hope this helps... Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 11, 2012 at 8:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Noah Misch n...@leadboat.com writes: So, I can reproduce the lower threshold, but the exception type does not agree with the one Matthew observed.

I finally got around to looking at the link you provided about error 0xC0000409, and realized that I'd been completely confusing it with stack overflow --- but actually, it's a report that something scribbled past the end of a finite-size local-variable array. So I now think that Matthew's stumbled across two completely independent bugs, and we've fixed only one of them. The 0xC0000409 error is something else, and possibly a lot worse, since it could conceivably be a security issue. It still seems likely that the actual location of the bug is either in PostGIS or in the GIST index code, but without the ability to reproduce the failure it's awfully hard to find it. Matthew, could you try a bit harder to find a self-contained test case that produces that error?

regards, tom lane

Sure, it might take me a while to find time, but I'll keep it on my list.

Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 4, 2012 at 3:39 AM, Craig Ringer ring...@ringerc.id.au wrote: On 11/04/2012 08:47 AM, Matthew Gerber wrote: So I attached the VS debugger, but the server died without raising an exception in VS. Not sure what's going on here. Try creating a directory called crashdumps in the data directory, at the same level as pg_xlog and pg_clog etc. Give the postgresql user the full control permission on it. Then run the test again. Do any minidump files appear in the directory? If so, you can examine them with windbg or Visual Studio to see where the crash happened. I did this but nothing appears in crashdumps after the server crashes. The latest test I did included the addition of this directory and the disabling of my antivirus software. Nothing seems to have changed. Following Tom's suggestion, I'll try to get a stack trace again (last time didn't produce anything). The only other thing I've noticed is that the crash always occurs when inserting into the places table (definition in previous email), even though there are two other tables that are also receiving inserts. This is odd to me. Any thoughts? Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Sun, Nov 4, 2012 at 3:39 AM, Craig Ringer ring...@ringerc.id.au wrote: On 11/04/2012 08:47 AM, Matthew Gerber wrote: Here is the command that was executing when the 0xC409 exception was raised: INSERT INTO places (bounding_box,country,full_name,id,name,type,url) VALUES (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789 32.771786,-97.034085 32.771786))',4326),26918),'United States','Irving, TX','dce44ec49eb788f5','Irving','city',' http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'), OK, so you're using PostGIS. What other extensions are loaded? PL/R? Any other PLs? PostGIS is the only extension that I added to the default configuration. I didn't change anything else. Can you show the definition of the table `places`, incuding any associated triggers, etc? Use `\d+` in psql for the table def'n. Here the definition of the places table: twitter=# \d+ places Table public.places Column| Type | Modifiers | Storage | Stats target |Description --+-+---+--+--+- bounding_box | geometry(Polygon,26918) | | main | | country | character varying | | extended | | full_name| character varying | | extended | | id | character varying | not null | extended | | name | character varying | | extended | | type | character varying | | extended | | url | character varying | | extended | | Indexes: places_pkey PRIMARY KEY, btree (id) places_bounding_box_idx gist (bounding_box) places_type_idx btree (type) Referenced by: TABLE tweets CONSTRAINT tweets_place_id_fkey FOREIGN KEY (place_id) REFERENCES places(id) ON DELETE CASCADE Has OIDs: no Please also post the query plan. http://explain.depesz.com/ is useful for this. 
Here is the query plan: QUERY PLAN Insert on public.places (cost=0.00..0.01 rows=1 width=0) (actual time=1.000..1.000 rows=0 loops=1) - Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1) Output: '0103202669010005005E8705A4E32C38C1EE553AE6E95E 4D41086A91990B1B38C11620AF9784874D41FCA5741676E437C19436654287814D41C43E11332BF6 37C17C863746F0584D415E8705A4E32C38C1EE553AE6E95E4D41'::geometry(Polygon,26918), 'United States'::character varying, 'Irving, TX'::character varying, 'dce44ec49e b788f5'::character varying, 'Irving'::character varying, 'city'::character varyi ng, 'http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'::charactervarying Total runtime: 1.157 ms (4 rows) So I attached the VS debugger, but the server died without raising an exception in VS. Not sure what's going on here. Try creating a directory called crashdumps in the data directory, at the same level as pg_xlog and pg_clog etc. Give the postgresql user the full control permission on it. Then run the test again. Running it now. Do any minidump files appear in the directory? If so, you can examine them with windbg or Visual Studio to see where the crash happened. Will try it. Thanks for your help so far, guys. Hopefully we get somewhere on this... Matt
Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server
On Fri, Nov 2, 2012 at 9:00 PM, Noah Misch n...@leadboat.com wrote: On Fri, Nov 02, 2012 at 02:05:47PM -0500, Merlin Moncure wrote: On Fri, Nov 2, 2012 at 1:25 PM, Matthew Gerber gerber.matt...@gmail.com wrote: I am encountering an error on my Postgres installation for Windows Server 64-bit. The error was posted here a couple months ago; however, no solution was found on the pgsql-bugs list, so I am reposting it to pgsql-hackers in the hopes that someone will be able to help. My error message is identical to the one previously posted: 2012-11-01 22:36:26 EDT LOG: 0: server process (PID 7060) was terminated by exception 0xC409 2012-11-01 22:36:26 EDT DETAIL: Failed process was running: INSERT INTO [snipped SQL command] Could you post an anonymized query, post an anonymized query plan, and/or describe the general nature of the query? Does it call functions? About how many rows does it insert? Here is the command that was executing when the 0xC409 exception was raised: INSERT INTO places (bounding_box,country,full_name,id,name,type,url) VALUES (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789 32.771786,-97.034085 32.771786))',4326),26918),'United States','Irving, TX','dce44ec49eb788f5','Irving','city',' http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'), (st_transform_null(ST_GeometryFromText('POLYGON((107.610398 -6.9006302,107.610398 -6.864448,107.637222 -6.864448,107.637222 -6.9006302,107.610398 -6.9006302))',4326),26918),'Indonesia','Coblong, Kota Bandung','2c0294c5eab821c9','Coblong','city',' http://api.twitter.com/1/geo/id/2c0294c5eab821c9.json') The st_transform_null function is simply a wrapper around the PostGIS st_transform function that deals with NULL values. The other fields are all VARCHARs. This insert is only adding two values. In general, the insert commands I'm running insert anywhere up to 100 rows each, so they're not huge. What server settings have you customized? 
https://wiki.postgresql.org/wiki/Server_Configuration I haven't customized any settings. If you could get a stack trace or minidump, that would be most helpful: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows So I attached the VS debugger, but the server died without raising an exception in VS. Not sure what's going on here. Magnus's questions for the reporter of bug #7517 are relevant, too. Does the system have any antivirus software installed? Testing disabling the AV software now. Will post back. Google suggests those unadorned messages originate in R. Do the affected systems use PL/R? If so ... Nope. Really appreciate any help you can provide. Matt
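The actual SQL definition of `st_transform_null` was not posted; all we know from the thread is that it is a wrapper around PostGIS's `st_transform` that deals with NULL values. As a language-neutral sketch of that general shape, here is a Python analogue of a NULL-propagating wrapper (the stand-in function and its behaviour are assumptions, not the poster's code):

```python
def null_safe(func):
    """Return a wrapper that short-circuits to None when any argument is
    None, instead of passing NULLs down to the wrapped function. This
    mirrors what a wrapper like st_transform_null presumably does around
    st_transform; the real definition was not shown in the thread."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper

# Stand-in for st_transform(geom, srid); not PostGIS.
transform = null_safe(lambda geom, srid: (geom, srid))
result_null = transform(None, 26918)
result_ok = transform("POLYGON(...)", 26918)
```

In SQL terms this would typically be a `CASE WHEN $1 IS NULL THEN NULL ELSE st_transform($1, $2) END` style function, but again, that is a guess at its shape.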
Re: [HACKERS] enhanced error fields
Peter Geoghegan pe...@2ndquadrant.com writes: So I took a look at the patch eelog-2012-05-09.diff today. All of the following remarks apply to it alone. I've been trying out this patch for my own interest (I'm very pleased to see work on this feature), and I have a couple of suggestions from a user's point of view. First: if a not null constraint is violated, the error report includes CONSTRAINT NAME 'not_null_violation'. I think I would find it more useful if CONSTRAINT NAME were left unset rather than given a value that doesn't correspond to a real constraint. A client program can tell it's a null constraint violation from the SQLSTATE. Second: in the case where a foreign-key constraint is violated by a change in the primary-key table, the error report gives the following information: TABLE NAME:name of primary-key table SCHEMA NAME: schema of primary-key table CONSTRAINT NAME: name of foreign-key constraint CONSTRAINT SCHEMA: schema of foreign-key table It doesn't include the name of the foreign-key table (except in the human-readable error message). But in principle you need to know that table name to reliably identify the constraint that was violated. I think what's going on is a mismatch between the way the constraint namespace works in the SQL standard and in PostgreSQL: it looks like the standard expects constraint names to be unique within a schema, while PostgreSQL only requires them to be unique within a table. (A similar issue makes information_schema less useful than the pg_ tables for foreign key constraints.) So I think it would be helpful to go beyond the standard in this case and include the foreign-key table name somewhere in the report. Possibly the enhanced-error reports could somehow add the table name to the string in the CONSTRAINT NAME field, so that the interface PostgreSQL provides looks like the one the standard envisages (which ought to make it easier to write cross-database client code). 
Or it might be simpler to just add a new enhanced-error field; I can imagine cases where that table name would be the main thing I'd be interested in. -M-
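The point above is that because PostgreSQL constraint names are only unique per table, a client needs the (schema, table, constraint) triple, not just (schema, constraint), to reliably identify which constraint was violated. A small illustration of why, using hypothetical field names for the error-report values:

```python
def constraint_key(schema, table, constraint):
    """Build an unambiguous identifier for a violated constraint. In
    PostgreSQL, constraint names are only unique within a table, so the
    table name has to be part of the key; the field names here are
    illustrative, not the patch's actual error fields."""
    return "{}.{}.{}".format(schema, table, constraint)

# Two tables in the same schema can both define a constraint named
# "fk_place"; without the table name the two reports would be
# indistinguishable.
a = constraint_key("public", "tweets", "fk_place")
b = constraint_key("public", "checkins", "fk_place")
```

This is exactly the disambiguation the SQL standard's schema-wide constraint namespace would have made unnecessary.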
Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
On 25/01/12 18:37, Hitoshi Harada wrote: I'm still not sure whether to just revise (almost) all the SQL function examples to use parameter names, and declare them the right choice; as it's currently written, named parameters still seem rather second-class. Agreed. I'll try a more comprehensive revision of the examples. The patch seems ok, except an example I've just found. db1=# create function t(a int, t t) returns int as $$ select t.a $$ language sql; CREATE FUNCTION db1=# select t(0, row(1, 2)::t); t --- 1 (1 row) Should we throw an error in such ambiguity? Or did you make it happen intentionally? If latter, we should also mention the rule in the manual. I did consider it, and felt it was the most consistent: # select t.x, t, z from (select 1) t(x), (select 2) z(t); x | t | z ---+---+- 1 | 2 | (2) (1 row) I haven't yet managed to find the above behaviour described in the documentation either, though. To me, it feels like an obscure corner case, whose description would leave the rules seeming more complicated than they generally are. Maybe it'd be better suited to be explicitly discussed in the comments? Thanks, Matthew -- matt...@trebex.net
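The resolution rule the patch implements (an ambiguous unqualified name prefers the column; qualifying with the function name forces the parameter) can be modelled as a simple lookup order. This is only a toy model of the behaviour described in the thread, not the parser's actual hook logic:

```python
def resolve(name, columns, params, func_name=None, qualifier=None):
    """Toy model of the lookup order discussed above: an unqualified
    name prefers a query column over a function parameter, and a name
    qualified with the function's own name resolves to the parameter."""
    if qualifier is None:
        if name in columns:
            return ("column", name)
        if name in params:
            return ("parameter", name)
        return None
    if qualifier == func_name and name in params:
        return ("parameter", name)
    return None

columns = {"balance", "accountno"}   # columns of the queried table
params = {"accountno", "debit"}      # parameters of function tf1
unqualified = resolve("accountno", columns, params)
qualified = resolve("accountno", columns, params,
                    func_name="tf1", qualifier="tf1")
```

Under this order, `accountno` alone means the column, and `tf1.accountno` means the parameter, matching the doc-patch example.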
Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
On 19/01/12 20:28, Hitoshi Harada wrote: (Now it occurred to me that forgetting the #include parse_func.h might hit this breakage..., so I'll fix it here and continue to test, but if you'll fix it yourself, let me know) I fixed it here and it now works with my environment. Well spotted; that's exactly what I'd done. :/ The regression tests pass, the feature seems working as aimed, but it seems to me that it needs more test cases and documentation. For the tests, I believe at least we need ambiguous case given upthread, so that we can ensure to keep compatibility. For the document, it should describe the name resolution rule, as stated in the patch comment. Attached are a new pair of patches, fixing the missing include (and the other warning), plus addressing the above. I'm still not sure whether to just revise (almost) all the SQL function examples to use parameter names, and declare them the right choice; as it's currently written, named parameters still seem rather second-class. Aside from them, I wondered at first what if the function is schema-qualified. Say, CREATE FUNCTION s.f(a int) RETURNS int AS $$ SELECT b FROM t WHERE a = s.f.a $$ LANGUAGE sql; It actually errors out, since function-name-qualified parameter only accepts function name without schema name, but it looked weird to me at first. No better idea from me at the moment, though. By my reading of (a draft of) the spec, Subclause 6.6, identifier chain, Syntax Rules 8.b.i-iii, the current behaviour is correct. But I join you in wondering whether we should permit the function name to be schema-qualified anyway. Matthew -- matt...@trebex.net diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml new file mode 100644 index 7064312..cc5b5ef *** a/doc/src/sgml/xfunc.sgml --- b/doc/src/sgml/xfunc.sgml *** SELECT getname(new_emp()); *** 538,556 programlisting CREATE FUNCTION tf1 (acct_no integer, debit numeric) RETURNS numeric AS $$ UPDATE bank ! SET balance = balance - $2 ! 
WHERE accountno = $1 RETURNING balance; $$ LANGUAGE SQL; /programlisting Here the first parameter has been given the name literalacct_no/, and the second parameter the name literaldebit/. ! So far as the SQL function itself is concerned, these names are just ! decoration; you must still refer to the parameters as literal$1/, ! literal$2/, etc within the function body. (Some procedural ! languages let you use the parameter names instead.) However, ! attaching names to the parameters is useful for documentation purposes. When a function has many parameters, it is also useful to use the names while calling the function, as described in xref linkend=sql-syntax-calling-funcs. --- 538,580 programlisting CREATE FUNCTION tf1 (acct_no integer, debit numeric) RETURNS numeric AS $$ UPDATE bank ! SET balance = balance - debit ! WHERE accountno = acct_no RETURNING balance; $$ LANGUAGE SQL; /programlisting Here the first parameter has been given the name literalacct_no/, and the second parameter the name literaldebit/. ! Named parameters can still be referenced as ! literal$replaceablen//; in this example, the second ! parameter can be referenced as literal$2/, literaldebit/, ! or literaltf1.debit/. ! /para ! ! para ! If a parameter is given the same name as a column that is available ! in the query, the name will refer to the column. To explicitly ! refer to the parameter, you can qualify its name with the name of ! the containing function. For example, ! ! programlisting ! CREATE FUNCTION tf1 (accountno integer, debit numeric) RETURNS numeric AS $$ ! UPDATE bank ! SET balance = balance - debit ! WHERE accountno = tf1.accountno ! RETURNING balance; ! $$ LANGUAGE SQL; ! /programlisting ! ! This time, the first parameter has been given the ambiguous name ! literalaccountno/. ! Notice that inside the function body, literalaccountno/ still ! refers to literalbank.accountno/, so literaltf1.accountno/ ! must be used to refer to the parameter. ! /para ! ! 
para When a function has many parameters, it is also useful to use the names while calling the function, as described in xref linkend=sql-syntax-calling-funcs. diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c new file mode 100644 index 5642687..fe87990 *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** *** 23,28 --- 23,29 #include nodes/makefuncs.h #include nodes/nodeFuncs.h #include parser/parse_coerce.h + #include parser/parse_func.h #include tcop/utility.h #include utils/builtins.h #include utils/datum.h *** typedef SQLFunctionCache *SQLFunctionCac *** 115,121 --- 116,124
[HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
I just remembered to make time to advance this from WIP to proposed patch this week... and then worked out I'm rudely dropping it into the last commitfest at the last minute. :/ Anyway, my interpretation of the previous discussion is a general consensus that permitting ambiguous parameter/column references is somewhat unsavoury, but better than the alternatives: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00433.php http://archives.postgresql.org/pgsql-hackers/2011-04/msg00744.php (and surrounds) The attached patch is essentially unchanged from my WIP version; it's updated to current master (d0dcb31), and fixes a trivial copy/paste error. It passes `make check`. Also attached is a rather light doc patch, which I've separated because I'm hesitant about which approach to take. The attached version just changes the existing mention of naming parameters in: http://www.postgresql.org/docs/9.1/static/xfunc-sql.html#XFUNC-NAMED-PARAMETERS It presumably ought to be clearer about the name resolution priorities... in a quick look, I failed to locate the corresponding discussion of column name references to link to (beyond a terse sentence in 4.2.1). The alternative would be to adjust most of the examples in section 35.4, using parameter names as the preferred option, and thus make $n the other way. I'm happy to do that, but I figure it'd be a bit presumptuous to present such a patch without some discussion; it's effectively rewriting the project's opinion of how to reference function parameters. With regard to the discussion about aliasing the function name when used as a qualifier (http://archives.postgresql.org/pgsql-hackers/2011-04/msg00871.php), my only suggestion would be that using $0 (as in, '$0.paramname') appears safe -- surely any spec change causing it issues would equally affect the existing $1 etc. '$.paramname' seems to look better, but presumably runs into trouble by looking like an operator. That whole discussion seems above my pay grade, however. 
Original WIP post: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01479.php This is an open TODO: http://wiki.postgresql.org/wiki/Todo#SQL-Language_Functions I've just added the above post to the CF app; I'll update it to point to this one once it appears. Thanks! Matthew -- matt...@trebex.net diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c new file mode 100644 index 5642687..74f3e7d *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** typedef SQLFunctionCache *SQLFunctionCac *** 115,121 --- 115,123 */ typedef struct SQLFunctionParseInfo { + char *name; /* function's name */ Oid *argtypes; /* resolved types of input arguments */ + char **argnames; /* names of input arguments */ int nargs; /* number of input arguments */ Oid collation; /* function's input collation, if known */ } SQLFunctionParseInfo; *** typedef struct SQLFunctionParseInfo *** 123,128 --- 125,132 /* non-export function prototypes */ static Node *sql_fn_param_ref(ParseState *pstate, ParamRef *pref); + static Node *sql_fn_post_column_ref(ParseState *pstate, ColumnRef *cref, Node *var); + static Node *sql_fn_param_ref_num(ParseState *pstate, SQLFunctionParseInfoPtr pinfo, int paramno, int location); static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); *** prepare_sql_fn_parse_info(HeapTuple proc *** 162,167 --- 166,172 int nargs; pinfo = (SQLFunctionParseInfoPtr) palloc0(sizeof(SQLFunctionParseInfo)); + pinfo-name = NameStr(procedureStruct-proname); /* Save the function's input collation */ pinfo-collation = inputCollation; *** prepare_sql_fn_parse_info(HeapTuple proc *** 200,205 --- 205,240 pinfo-argtypes = argOidVect; } + if (nargs 0) + { + Datum proargnames; + Datum proargmodes; + int argnum; + int n_arg_names; + bool isNull; + + proargnames = SysCacheGetAttr(PROCNAMEARGSNSP, procedureTuple, + Anum_pg_proc_proargnames, + isNull); + if (isNull) + proargnames = 
PointerGetDatum(NULL); /* just to be sure */ + + proargmodes = SysCacheGetAttr(PROCNAMEARGSNSP, procedureTuple, + Anum_pg_proc_proargmodes, + isNull); + if (isNull) + proargmodes = PointerGetDatum(NULL); /* just to be sure */ + + n_arg_names = get_func_input_arg_names(proargnames, proargmodes, pinfo-argnames); + + if (n_arg_names nargs) + pinfo-argnames = NULL; + } + else + { + pinfo-argnames = NULL; + } + return pinfo; } *** prepare_sql_fn_parse_info(HeapTuple proc *** 209,222 void sql_fn_parser_setup(struct ParseState *pstate, SQLFunctionParseInfoPtr pinfo) { - /* Later we might use these hooks to support parameter names */ pstate-p_pre_columnref_hook = NULL
Re: [HACKERS] Improve lseek scalability v3
On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote: * Benjamin LaHaise (b...@kvack.org) wrote: For such tables, can't Postgres track the size of the file internally? I'm assuming it's keeping file descriptors open on the tables it manages, in which case when it writes to a file to extend it, the internally stored size could be updated. Not making a syscall at all would scale far better than even a modified lseek() will perform. We'd have to have it in shared memory and have a lock around it, it wouldn't be cheap at all. Yep, that makes perfect sense. After all, the kernel does basically the same thing to maintain this information; why should we have userspace duplicating the same infrastructure? I must admit, I'd never heard of this usage of lseek to get the current size of a file before; I'd assumed everybody used fstat. Given this legitimate reason for a high-frequency calling of lseek, I withdraw my earlier objection to the patch series. -- Matthew Wilcox Intel Open Source Technology Centre Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step.
Re: [HACKERS] Improve lseek scalability v3
On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote: I sent an email containing benchmarks from Robert Haas regarding the Subject. Looking at lkml.org I can't see it right now, Will recheck when I am at home. He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times the lseek performance. The workload was 64 clients hammering postgres with a simple readonly workload (pgbench -S). Yay! Data! For reference see the thread in the postgres archives which also links to performance data: http://archives.postgresql.org/message-id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=vkjpknrxk2ktdr+hsa...@mail.gmail.com So both fstat and lseek do more work than postgres wants. lseek modifies the file pointer while fstat copies all kinds of unnecessary information into userspace. I imagine this is the source of the slowdown seen in the 1-client case. There have been various proposals to make the amount of information returned by fstat limited to the 'cheap' (for various definitions of 'cheap') fields. I'd like to dig into the requirement for knowing the file size a little better. According to the blog entry it's used for the query planner. Does the query planner need to know the exact number of bytes in the file, or is it after an order-of-magnitude? Or to-the-nearest-gigabyte? -- Matthew Wilcox Intel Open Source Technology Centre Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step.
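The two file-size idioms being compared above can be sketched directly with the POSIX calls (via Python's `os` module here): `fstat` reads the size without touching the file offset, while `lseek(SEEK_END)` returns the size as a side effect of moving the offset, which is why a caller that cares must save and restore it. This is just an illustration of the two syscalls, not PostgreSQL's code:

```python
import os
import tempfile

def size_via_fstat(fd):
    """File size from the stat structure; the file offset is untouched."""
    return os.fstat(fd).st_size

def size_via_lseek(fd):
    """File size by seeking to the end, the idiom discussed above. It
    clobbers the current offset, so save and restore it around the call."""
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, saved, os.SEEK_SET)
    return size

with tempfile.TemporaryFile() as f:
    f.write(b"x" * 8192)
    f.flush()
    via_fstat = size_via_fstat(f.fileno())
    via_lseek = size_via_lseek(f.fileno())
```

Both return the same number; the performance argument in the thread is about kernel-side locking and how much data each call has to materialize, not about correctness.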
Re: [HACKERS] fsync reliability
On 2011-04-22 21:55, Greg Smith wrote: On 04/22/2011 09:32 AM, Simon Riggs wrote: OK, that's good, but ISTM we still have a hole during RemoveOldXlogFiles() where we don't fsync or open/close the file, just rename it. This is also something that many applications rely upon working as hoped for here, even though it's not technically part of POSIX. Early versions of ext4 broke that, and it caused a giant outcry of complaints. http://www.h-online.com/open/news/item/Ext4-data-loss-explanations-and-workarounds-740671.html has a good summary. This was broken on ext4 from around 2.6.28 to 2.6.30, but the fix for it was so demanded that it's even been ported by the relatively lazy distributions to their 2.6.28/2.6.29 kernels. As far as I can make out, the current situation is that this fix (the auto_da_alloc mount option) doesn't work as advertised, and the ext4 maintainers are not treating this as a bug. See https://bugzilla.kernel.org/show_bug.cgi?id=15910 -M-
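The hole described above (renaming a file without any fsync around it) is usually closed with the well-known pattern: fsync the file's contents, rename, then fsync the containing directory so the directory entry itself is durable. A sketch of that pattern in Python, assuming a POSIX filesystem where a directory can be opened read-only for fsync; this is not PostgreSQL's actual RemoveOldXlogFiles code, and the segment file names are made up:

```python
import os
import tempfile

def durable_rename(src, dst):
    """Rename src to dst so the result survives a crash: flush the file's
    data, perform the rename, then flush the directory entry. Sketch of
    the fsync-then-rename-then-fsync-dir pattern discussed above."""
    fd = os.open(src, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(src, dst)
    # Without syncing the directory, some filesystems may lose the
    # rename itself on power failure even though the data is on disk.
    dirfd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "000000010000000000000001.tmp")
with open(src, "wb") as f:
    f.write(b"segment data")
dst = os.path.join(tmp, "000000010000000000000001")
durable_rename(src, dst)
renamed = os.path.exists(dst) and not os.path.exists(src)
```

Whether the directory fsync is strictly required is exactly the filesystem-dependent question the ext4 discussion above is about.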
[HACKERS] WIP: Allow SQL-language functions to reference parameters by parameter name
Attached is a WIP patch that allows SQL-language functions to reference their parameters by name. It uses p_post_columnref_hook, so potentially ambiguous references prefer the column... that seems to make the most sense, both because it avoids a backwards incompatibility, and it conforms with SQL's usual notion of assuming you mean the nearest name. It allows the parameter name to be qualified with the function name, for when you really mean you want the parameter. This being my first foray into the PostgreSQL source, I expect the code is horribly wrong in a variety of ways. Other than that, the regression tests I've been using are a slight modification of existing queries; I imagine they should look measurably different. And finally, conspicuously absent are the documentation changes that will obviously need to accompany a real patch. (This builds passes `make check` on current HEAD, a4425e3) Thanks! Matthew -- matt...@trebex.net diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c new file mode 100644 index ce3b77b..be71fbb *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** typedef SQLFunctionCache *SQLFunctionCac *** 116,122 --- 116,124 */ typedef struct SQLFunctionParseInfo { + char *name; /* function's name */ Oid *argtypes; /* resolved types of input arguments */ + char **argnames; /* names of input arguments */ int nargs; /* number of input arguments */ Oid collation; /* function's input collation, if known */ } SQLFunctionParseInfo; *** typedef struct SQLFunctionParseInfo *** 124,129 --- 126,133 /* non-export function prototypes */ static Node *sql_fn_param_ref(ParseState *pstate, ParamRef *pref); + static Node *sql_fn_post_column_ref(ParseState *pstate, ColumnRef *cref, Node *var); + static Node *sql_fn_param_ref_num(ParseState *pstate, SQLFunctionParseInfoPtr pinfo, int paramno, int location); static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); *** 
prepare_sql_fn_parse_info(HeapTuple proc *** 163,168 --- 167,173 int nargs; pinfo = (SQLFunctionParseInfoPtr) palloc0(sizeof(SQLFunctionParseInfo)); + pinfo-name = NameStr(procedureStruct-proname); /* Save the function's input collation */ pinfo-collation = inputCollation; *** prepare_sql_fn_parse_info(HeapTuple proc *** 201,206 --- 206,241 pinfo-argtypes = argOidVect; } + if (nargs 0) + { + Datum proargnames; + Datum proargmodes; + int argnum; + int n_arg_names; + bool isNull; + + proargnames = SysCacheGetAttr(PROCNAMEARGSNSP, procedureTuple, + Anum_pg_proc_proargnames, + isNull); + if (isNull) + proargmodes = PointerGetDatum(NULL); /* just to be sure */ + + proargmodes = SysCacheGetAttr(PROCNAMEARGSNSP, procedureTuple, + Anum_pg_proc_proargmodes, + isNull); + if (isNull) + proargmodes = PointerGetDatum(NULL); /* just to be sure */ + + n_arg_names = get_func_input_arg_names(proargnames, proargmodes, pinfo-argnames); + + if (n_arg_names nargs) + pinfo-argnames = NULL; + } + else + { + pinfo-argnames = NULL; + } + return pinfo; } *** prepare_sql_fn_parse_info(HeapTuple proc *** 210,223 void sql_fn_parser_setup(struct ParseState *pstate, SQLFunctionParseInfoPtr pinfo) { - /* Later we might use these hooks to support parameter names */ pstate-p_pre_columnref_hook = NULL; ! pstate-p_post_columnref_hook = NULL; pstate-p_paramref_hook = sql_fn_param_ref; /* no need to use p_coerce_param_hook */ pstate-p_ref_hook_state = (void *) pinfo; } /* * sql_fn_param_ref parser callback for ParamRefs ($n symbols) */ --- 245,354 void sql_fn_parser_setup(struct ParseState *pstate, SQLFunctionParseInfoPtr pinfo) { pstate-p_pre_columnref_hook = NULL; ! 
pstate-p_post_columnref_hook = sql_fn_post_column_ref; pstate-p_paramref_hook = sql_fn_param_ref; /* no need to use p_coerce_param_hook */ pstate-p_ref_hook_state = (void *) pinfo; } + static Node * + sql_fn_resolve_name(ParseState *pstate, SQLFunctionParseInfoPtr pinfo, const char *paramname, int location) + { + int i; + for (i = 0; i pinfo-nargs; i++) + if (pinfo-argnames[i] strcmp(pinfo-argnames[i], paramname) == 0) + return sql_fn_param_ref_num(pstate, pinfo, i + 1, location); + + return NULL; + } + + /* + * sql_fn_post_column_ref parser callback for ColumnRefs + */ + static Node * + sql_fn_post_column_ref(ParseState *pstate, ColumnRef *cref, Node *var) + { + SQLFunctionParseInfoPtr pinfo = (SQLFunctionParseInfoPtr) pstate-p_ref_hook_state; + int names; + Node *field1; + Node *subfield = NULL; + const char *pname; + Node *param; + + if (var != NULL) + return NULL; /* there's a table column, prefer
Re: [HACKERS] compile/install of git
On Sat, Sep 18, 2010 at 02:20:53PM -0400 I heard the voice of David Blewett, and lo! it spake thus: Sorry for top-posting... I was under the impression that git over http was just as efficient since 1.6.6 [1]. That's about talking over HTTP to a git server running as CGI; it doesn't help if you're talking HTTP to just a plain HTTP host. -- Matthew Fuller (MF4839) | fulle...@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ On the Internet, nobody can hear you scream.
Re: [HACKERS] git: uh-oh
On Tue, Aug 17, 2010 at 01:57:02PM -0600 I heard the voice of Alex Hunsaker, and lo! it spake thus: On Tue, Aug 17, 2010 at 13:52, Alex Hunsaker bada...@gmail.com wrote: How sure are we that its not the cvs2svn step that is screwing it up? urp, I jumped to a conclusion while skimming the cvs2git.options file Magnus posted. With all the references to svn and things like GitRevisionRecorder('cvs2svn-tmp/git-blob.dat'). It sure sounded like it converts to svn first and then to git... im not sure what it does. It's not that it converts to svn, but that it's built on (/part of) cvs2svn, so presumably a lot of the figure out changesets and branch membership logic and the get things in the shape svn wants logic are intertwined. -- Matthew Fuller (MF4839) | fulle...@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ On the Internet, nobody can hear you scream.
Re: [HACKERS] [JDBC] Trouble with COPY IN
(Yes, I know I'm not on the hackers list. Most interested parties should get this directly anyway.) Additionally the interface exposed by the JDBC driver lets the user write arbitrary CopyData bytes to the server, so without parsing all of that we don't know whether they've issued CopyData(EOF) or not. Okay, so you can't know with absolute certainty without parsing the data, but the usual case would be handled by holding onto the last-N bytes or so. Enough to fit the EOF and perhaps a little more for paranoia's sake. That's not to say that I'm missing the problem. When (not if, when) the user feeds data past a CopyData(EOF), it's going to get interesting. This is the reason why the patch to the JDBC driver that I sent in is very fragile. In the case where a user provides a binary copy with lots of data after the EOF, the processCopyData method *will* get called after the CommandComplete and ReadyForQuery messages have been received, even if we try to delay processing of the ReadyForQuery message. [Thinking about the logic necessary to handle such a case and avoid network buffer deadlock...] I would think the least invasive way to handle it would be to set the CommandComplete and ReadyForQuery messages aside when they are received if CopyDone hasn't been sent, continue the COPY operation as usual until it is shutdown, send CopyDone and, finally, reinstate CommandComplete and RFQ as if they were just received.. Basically, yes. We need to introduce a little more state into the JDBC driver. Currently, the driver is in one of two states: 1. In the middle of a copy. 2. Not in a copy. These states are recorded in the lock system. We need to introduce a new state, where the copy is still locked, but we know that the CommandComplete and ReadyForQuery messages have been received. 
We can no longer unlock the copy in processCopyData - we need to do that in endCopy instead, after calling processCopyData to ensure that we wait for a valid CommandComplete and ReadyForQuery message first. Matthew -- Terrorists evolve but security is intelligently designed? -- Jake von Slatt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Alias to rollback keyword
Hey,

After dealing with a production fault and having to roll back all the time, I kept typing a different word instead of rollback. So I created a patch to accept this word as an alias for rollback. Obviously it's not part of the SQL standard, but it could be a nice extension for PostgreSQL. See the patch for more details.

Cheers
Matt

diff -U5 -r postgresql-8.4.2.orig/src/include/parser/kwlist.h postgresql-8.4.2/src/include/parser/kwlist.h
--- postgresql-8.4.2.orig/src/include/parser/kwlist.h	2009-04-06 18:12:53.0 +0930
+++ postgresql-8.4.2/src/include/parser/kwlist.h	2010-04-01 10:55:46.0 +1030
@@ -55,10 +55,11 @@
 PG_KEYWORD(begin, BEGIN_P, UNRESERVED_KEYWORD)
 PG_KEYWORD(between, BETWEEN, TYPE_FUNC_NAME_KEYWORD)
 PG_KEYWORD(bigint, BIGINT, COL_NAME_KEYWORD)
 PG_KEYWORD(binary, BINARY, TYPE_FUNC_NAME_KEYWORD)
 PG_KEYWORD(bit, BIT, COL_NAME_KEYWORD)
+PG_KEYWORD(bollocks, ROLLBACK, UNRESERVED_KEYWORD)
 PG_KEYWORD(boolean, BOOLEAN_P, COL_NAME_KEYWORD)
 PG_KEYWORD(both, BOTH, RESERVED_KEYWORD)
 PG_KEYWORD(by, BY, UNRESERVED_KEYWORD)
 PG_KEYWORD(cache, CACHE, UNRESERVED_KEYWORD)
 PG_KEYWORD(called, CALLED, UNRESERVED_KEYWORD)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 8.5 release timetable, again
Alvaro Herrera wrote: This seems a good idea. Possibly pushing the betas more aggressively to current users would make them tested not only by PG hackers ... Isn't this the purpose of the new alpha releases, at least to some extent? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: However, my comment above was too optimistic, because in an insert-only scenario autovac would in fact not trigger VACUUM at all, only ANALYZE. So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. I'm sure I'm missing something, but I thought the point of this was to lessen the impact of VACUUM and now you are suggesting that we have to add vacuums to tables that have never needed one before. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Aidan Van Dyk wrote: * Greg Stark [EMAIL PROTECTED] [081117 03:54]: I thought of saying that too but it doesn't really solve the problem. Think of what happens if someone sets a hint bit on a dirty page. If the page is dirty from a real change, then it has a WAL backup block record already, so the torn page on disk is going to be fixed with the WAL replay ... *because* of the torn-page problem already being solved in PG. You don't get the hint bits back, but that's no different from the current state. But nobody's previously cared if hint bits weren't set on WAL replay. What if all changes to a page (even hint bits) are WAL-logged when running with block-level CRC checks enabled? Does that make things easier? I'm sure it would result in some performance loss, but anyone enabling block-level CRCs is already trading some performance for safety. Thoughts? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
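To make the hint-bit hazard concrete, here is a toy page-checksum sketch. It is not PostgreSQL's actual algorithm or page layout: zlib.crc32 stands in for whatever CRC the proposal would use, and the header offset is invented. The point it demonstrates is the one in the message: any change to a page that is not re-stamped, including a hint bit set without a matching WAL record, makes verification fail.

```python
import zlib

PAGE_SIZE = 8192
CRC_OFF, CRC_LEN = 8, 4  # hypothetical checksum location in the page header

def compute_page_crc(page: bytes) -> int:
    # CRC covers the whole page except the checksum field itself,
    # which is treated as zeroes while computing
    assert len(page) == PAGE_SIZE
    return zlib.crc32(page[:CRC_OFF] + b"\x00" * CRC_LEN + page[CRC_OFF + CRC_LEN:])

def stamp_page(page: bytearray) -> None:
    # called just before the page is written out
    crc = compute_page_crc(bytes(page))
    page[CRC_OFF:CRC_OFF + CRC_LEN] = crc.to_bytes(CRC_LEN, "little")

def verify_page(page: bytes) -> bool:
    # called when the page is read back in
    stored = int.from_bytes(page[CRC_OFF:CRC_OFF + CRC_LEN], "little")
    return stored == compute_page_crc(page)
```

A torn write has the same effect as the flipped bit in the test below: part of the page is old, part is new, and the stored CRC matches neither, which is why WAL-logging every change (or full-page images) becomes necessary once checksums are enabled.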
Re: [HACKERS] [WIP] In-place upgrade
Tom Lane wrote: Decibel! [EMAIL PROTECTED] writes: I think that's pretty seriously un-desirable. It's not at all uncommon for databases to stick around for a very long time and then jump ahead many versions. I don't think we want to tell people they can't do that. Of course they can do that --- they just have to do it one version at a time. Also, people may be less likely to stick with an old outdated version for years and years if the upgrade process is easier. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RAM-only temporary tables
Kevin Grittner wrote: An idea for a possible enhancement to PostgreSQL: allow creation of a temporary table without generating any disk I/O. (Creating and dropping a three-column temporary table within a database transaction currently generates about 150 disk writes). If some circumstances don't make it feasible to always do this as a RAM-only operation, perhaps a clause could be added to the CREATE TEMPORARY TABLE syntax to specify this behavior along with whatever limitations on the temporary table are required for this to work. (For example, maybe this is only feasible if the table will be dropped by the end of the transaction?) As someone else already pointed out you can put temp tables on a RAM disk, but the larger issue is that temp tables still cause system table churn which will always need to be on stable media. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 8.3 .4 + Vista + MingW + initdb = ACCESS_DENIED
Tom Lane wrote: Andrew Chernow [EMAIL PROTECTED] writes: Be careful. From LockFileEx docs: However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them. ROTFL ... so to translate: If your program crashes, please release locks before crashing. Obviously that wasn't the intent of the above, but I guess it is the net effect. Either way, I don't think it's a huge problem; it just means that PG may not be able to restart for a few seconds until the OS has time to clean up the locks. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE DATABASE vs delayed table unlink
The error on createdb happened again this morning. However, this time an abandoned directory was not created. The full error message was: $ createdb -E SQL_ASCII -U flyminebuild -h brian.flymine.org -T production-flyminebuild production-flyminebuild:uniprot createdb: database creation failed: ERROR: could not stat file base/33049747/33269704: No such file or directory However, my colleagues promptly dropped the database that was being copied and restarted the build process, so I can't diagnose anything. Suffice to say that there is no abandoned directory, and the directory 33049747 no longer exists either. I'll try again to get some details next time it happens. Matthew -- $ rm core Segmentation Fault (core dumped) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE DATABASE vs delayed table unlink
On Thu, 9 Oct 2008, Tom Lane wrote: So I'm mystified how Matthew could have seen the expected error and yet had the destination tree (or at least large chunks of it) left behind. Remember I was running 8.3.0, and you mentioned a few changes after that version which would have made sure the destination tree was cleaned up properly. [ thinks for a bit... ] We know there were multiple occurrences. Matthew, is it possible that you had other createdb failures that did *not* report file does not exist? For instance, a createdb interrupted by a fast database shutdown might have left things this way. Well, we didn't have any fast database shutdowns or power failures. I don't think so. Matthew -- Heat is work, and work's a curse. All the heat in the universe, it's going to cool down, because it can't increase, then there'll be no more work, and there'll be perfect peace. -- Michael Flanders -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE DATABASE vs delayed table unlink
Heikki Linnakangas [EMAIL PROTECTED] writes: Another thought is to ignore ENOENT in copydir. On Wed, 8 Oct 2008, Tom Lane wrote: Yeah, I thought about that too, but it seems extremely dangerous ... I agree. If a file randomly goes missing, that's not an error to ignore, even if you think the only way that could happen is safe. I could be wrong - but couldn't other bad things happen too? If you're copying the files before the checkpoint has completed, couldn't the new database end up with some of the recent changes going missing? Or is that prevented by FlushDatabaseBuffers? Matthew -- Isn't Microsoft Works something of a contradiction? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] September CommitFest Closed
Josh Berkus wrote: For the September commitfest, 29 patches were applied (one to pgFoundry) and 18 patches were sent back for more work. More importantly, six *new* reviewers completed reviews of various patches: Abbas Butt, Alex Hunsaker, Markus Wanner, Ibrar Ahmed, Ryan Bradetich and Gianni Colli. Several other new reviewers volunteered, but we ran out of patches to check, and Kenneth Marshall had to quit reviewing because of Hurricane Ike. Yay, reviewers. If nothing else ever came of the Commit Fest approach, if it creates more reviewers, then I think the Commit Fest process would be a success. I think the Commit Fest approach does this by lowering the barrier to entry for becoming a reviewer: it brings more focus to patch review on a regular basis, and it changes patch review from a last-minute dash that only experienced hackers would get involved with into a process with more time before the final deadline, which is friendlier for new reviewers. Good news all around! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] VACUUM Improvements - WIP Patch
Joshua D. Drake wrote: Merlin Moncure wrote: Well, there doesn't seem to be a TODO for partial/restartable vacuums, which were mentioned upthread. This is a really desirable feature for big databases and removes one of the reasons to partition large tables. I would agree that partial vacuums would be very useful. I think everyone agrees that partial vacuums would be useful / *A Good Thing* but it's the implementation that is the issue. I was thinking about Alvaro's recent work to make vacuum deal with TOAST tables separately, which is almost like a partial vacuum since it effectively splits the vacuum work up into multiple independent blocks of work, the limitation obviously being that it can only split the work around TOAST. Is there any way that vacuum could work per relfile, since we already split tables into files that are never greater than 1G? I would think that if vacuum never had more than 1G of work to do at any given moment it would make it much more manageable. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] VACUUM Improvements - WIP Patch
Tom Lane wrote: Matthew T. O'Connor [EMAIL PROTECTED] writes: I think everyone agrees that partial vacuums would be useful / *A Good Thing* but it's the implementation that is the issue. I'm not sure how important it will really be once we have support for dead-space-map-driven vacuum. Is that something we can expect any time soon? I haven't heard much about it really happening for 8.4. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Do we really want to migrate plproxy and citext into PG core distribution?
Tom Lane wrote: Greg Sabino Mullane [EMAIL PROTECTED] writes: Code outside of core, is, in reality, less reviewed, less likely to work well with recent PG versions, and more likely to cause problems. It's also less likely to be found by people, less likely to be used by people, and less likely to be included by distros. Not to say that everything should get shoved into core, of course, but there are strong arguments for both sides. These are all true statements, of course, but ISTM they should be looked on as problems to be solved. Pushing stuff into core instead of solving these problems is not a scalable long-term answer. A few random thoughts... The application that comes to mind first for me when you talk plugins is Firefox. They make it very easy to browse for plugins and to install, update, remove them. Their plug-in system also tries to account for Firefox version and OS platform which we would need to do also. Perhaps one thing that would help PostgreSQL plug-ins is a nice GUI plug-in browser and management application. The logical place to add this IMHO is PGAdmin since it is GUI, already talks to the DB and is cross platform. I'm not saying a GUI should be required to manage plug-ins, a fully CLI option should be made available too. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Concurrent VACUUM and ANALYZE
Jonah H. Harris wrote: On Mon, Jul 21, 2008 at 10:19 PM, Tom Lane [EMAIL PROTECTED] wrote: I don't find this a compelling argument, at least not without proof that the various vacuum-improvement projects already on the radar screen (DSM-driven vacuum, etc) aren't going to fix your problem. Is DSM going to be in 8.4? The last I had heard, DSM+related improvements weren't close to being guaranteed for this release. If it doesn't make it, waiting another year and a half for something easily fixed would be fairly unacceptable. Should I provide a patch in the event that DSM doesn't make it? Can't hurt to submit a patch. Also, could you do something to help mitigate the worst case, something like not updating the stats in pg_class if a vacuum has finished since the current analyze started? Matt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuuming leaked temp tables (once again)
Tom Lane wrote: We might have to rearrange the logic a bit to make that happen (I'm not sure what order things get tested in), but a log message does seem like a good idea. I'd go for logging anytime an orphaned table is seen, and dropping once it's past the anti-wraparound horizon. Is there an easy way for an admin to clean up the lost temp tables that autovacuum is complaining about? It seems like it could be a long time and a lot of log messages between when they are first orphaned and finally dropped due to anti-wraparound protection. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XIDs and big boxes again ...
Hans-Juergen Schoenig wrote: i suggest to introduce a --with-long-xids flag which would give me 62 / 64 bit XIDs per vacuum on the entire database. this should be fairly easy to implement. i am not too concerned about the size of the tuple header here - if we waste 500 gb of storage here i am totally fine. As you say later in the thread, you are on 8.1. A lot of work has gone into reducing the effect, impact and frequency of XID wraparound and vacuuming since then. In 8.3, transactions that don't actually update a table no longer use a real XID, and with autovacuum you no longer need a database-wide vacuum to solve the XID wraparound problem, so I think the answer is to upgrade to 8.3 and see if you still have this problem. Matthew O'Connor -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
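For background on why wraparound is an issue with 32-bit XIDs at all, the circular comparison can be mimicked in a few lines, in the spirit of PostgreSQL's TransactionIdPrecedes (which compares the signed 32-bit difference). The sketch shows why any two normal XIDs must stay within 2^31 of each other, which is what anti-wraparound vacuuming guarantees.

```python
def xid_precedes(a: int, b: int) -> bool:
    """Circular 32-bit XID comparison: a precedes b iff the signed
    32-bit value of (a - b) is negative. This only gives sane answers
    while the two XIDs are less than 2**31 apart, hence the need for
    anti-wraparound (freeze) vacuums."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 2**31  # top bit set => signed difference is negative
```

Widening XIDs to 64 bits, as the original poster suggests, would push that horizon out so far that freezing effectively never triggers, at the cost of a larger tuple header.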
Re: [HACKERS] Posting to hackers and patches lists
Alex Hunsaker wrote: In fact I would argue -patches should go away so we don't have that split. +1. I think the main argument for the split is to keep the large patch emails off the hackers list, but I don't think that limit is so high that it's a problem. People have to gzip their patches to the patches list fairly often anyway. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posting to hackers and patches lists
Alex Hunsaker wrote: A big part of my problem with the split is that if there is a discussion taking place on -hackers I want to be able to reply to the discussion and say "well, here is what I was thinking." Sending it to -patches first, waiting for it to hit the archive so I can link to it in my reply on -hackers, seems pointless and convoluted. But if that's what you want, that's what I'll try to do from now on :) For instance, on the patch of mine Tom reviewed yesterday only -hackers was copied, so I maintained that but also added -patches because I was sending in a patch... I think it will be an ongoing problem though, especially for new people, as they probably won't understand the logical split... Patches are an integral part of the conversation about development; I think trying to split them up is awkward at best. Do people really still think that the potential for larger messages is really a problem? By the way, what is the actual size limit on hackers vs. patches? I would imagine that most patches would already fit in the current hackers limit, especially since you can gzip. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] US VISA CISP PCI comp. needs SHA1
Not sure if I posted in the correct spot.

pg 8.2.6, CentOS 5
Windows-based app.
encrypted pwd = yes
SSL = yes, hostssl with explicit IP w/md5 (no pgcrypto)

We are in the process of VISA CISP PCI compliance for our application. (online cc auth - no stored cc data) [next phase will include stored cc data] We just heard back today that they would like to use SHA1 for pwd auth. Does anyone have any doco that will support md5 vs. SHA1? We also have global customers, so we understand the US vs. non-US export stuff. Any direction is appreciated. Thanks in advance. /matthew wetmore -- Matthew Wetmore, Secom International, Inc, 9610 Bellanca Ave., Los Angeles, CA 90045, 310-641-1290. This e-mail is intended for the addressee shown. It contains information that is confidential and protected from disclosure. Any review, dissemination or use of this transmission or its contents by persons or unauthorized employees of the intended organisations is strictly prohibited. The contents of this email do not necessarily represent the views or policies of Secom International Inc., or its employees. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Visa CISP PCI compliance needs SHA1?
Not sure if I posted in the correct spot, but it seems to be the topic of the day... funny that I hear from Visa on the same day.

pg 8.2.6, CentOS 5
Windows-based app.
encrypted pwd = yes
SSL = yes, hostssl with explicit IP w/md5 (no pgcrypto)

This is just for client/server pwd auth. We are in the process of VISA CISP PCI compliance for our application. (online cc auth - no stored cc data) [next phase will include stored cc data] We just heard back today that they would like to use SHA1, NOT md5, for pwd auth. Does anyone have any doco that will support md5 vs. SHA1? Is pgcrypto in the db (meaning encrypting the md5 hash) still the same as md5 auth? We also have global customers, so we understand the US vs. non-US export stuff. Any direction is appreciated. Thanks in advance. /matthew wetmore -- Matthew Wetmore, Secom International, Inc, 9610 Bellanca Ave., Los Angeles, CA 90045, 310-641-1290. This e-mail is intended for the addressee shown. It contains information that is confidential and protected from disclosure. Any review, dissemination or use of this transmission or its contents by persons or unauthorized employees of the intended organisations is strictly prohibited. The contents of this email do not necessarily represent the views or policies of Secom International Inc., or its employees. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] offtopic, historic, in 8.1, how to find relfrozenxid by table?
In 8.1, how do I find the relfrozenxid of a table? We're running 8.1.9, and have turned off autovacuum for a wee bit too long... thanks to the verbose warnings, we still have a few million transactions left before it locks up, although that's not much time for us. I'd like to focus vacuum on the tables that need it, but 8.1 does not have a relfrozenxid column in pg_class, although it does have datfrozenxid in pg_database. Sorry for skipping the general list, but I'm not subscribed to that, and unsure of the latency there. Thanks, Matt -- Matt Grosso [EMAIL PROTECTED] cell (201)780-9592 http://www.connexuscorp.com/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
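Since 8.1 only exposes the per-database horizon (pg_database.datfrozenxid), the best one can do cluster-side is a back-of-the-envelope headroom estimate. This hypothetical helper (not a catalog query) shows the arithmetic: age is the circular distance from the frozen XID to the current XID, and comparisons stop being safe near the 2^31 horizon.

```python
XID_HORIZON = 2**31 - 1  # comparison horizon for 32-bit XIDs

def wraparound_headroom(current_xid: int, datfrozenxid: int) -> int:
    """Rough number of transactions left before the ~2-billion XID
    horizon, given a database's datfrozenxid. XID arithmetic is
    circular, so mask the difference to 32 bits."""
    age = (current_xid - datfrozenxid) & 0xFFFFFFFF
    return XID_HORIZON - age
```

In practice the server starts emitting the verbose warnings the message mentions well before the headroom reaches zero, so the real deadline is earlier than this estimate.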
[HACKERS] Built PostgreSQL 8.3 BETA on FreeBSD/SPARC64
Machine: Sun Microsystems Ultra 5 (SPARC IIi). OS: FreeBSD 6.2/SPARC64 Compiler: gcc v.3.4.6 PostgreSQL 8.3 Beta builds on my system with no trouble. Matthew Alton
Re: [HACKERS] First steps with 8.3 and autovacuum launcher
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: This is an interesting idea, but I think it's attacking the wrong problem. To me, the problem here is that an ANALYZE should not block CREATE INDEX or certain forms of ALTER TABLE. I doubt that that will work; in particular I'm pretty dubious that you can safely make CREATE INDEX and VACUUM run together. Since they'd be unlikely to be using the identical OldestXmin horizon, you'd likely end up with dangling index entries (ie, CREATE INDEX indexes a tuple that the VACUUM removes shortly afterward). I think the main issue is ANALYZE not VACUUM (at least in this thread) since it's DB load times that are in question. Can CREATE INDEX and ANALYZE be made to run concurrently? ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] First steps with 8.3 and autovacuum launcher
Tom Lane wrote: If you insist on crafting a solution that only fixes this problem for pg_restore's narrow usage, you'll be back revisiting it before beta1 has been out a month. I don't know much about what is involved in crafting these solutions, but it seems we're close to beta and probably don't want to make drastic changes to anything. As such it seems to me that solving the problem for analyze is a nice piece of low-hanging fruit that solves an immediate problem that has been reported. I would think that reducing the locking implications of VACUUM is much more involved, no? Also, I would think that the fix for ANALYZE will be totally different than the fix for VACUUM no? Are you proposing that we solve the VACUUM locking problem before we release 8.3? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] More logging for autovacuum
Gregory Stark wrote: I'm having trouble following what's going on with autovacuum and I'm finding the existing logging insufficient. In particular that it's only logging vacuum runs *after* the vacuum finishes makes it hard to see what vacuums are running at any given time. Also, I want to see what is making autovacuum decide to forgo vacuuming a table and the log with that information is at DEBUG2. So I would like to suggest adding two options: log_autovacuum_jobs - output every time a vacuum or analyze *starts* log_autovacuum_level - set the log level for the autovacuum process I would also suggest raising the level of the DEBUG2 message indicating why tables were chosen or not. At least to DEBUG1 if not to INFO. Am I missing anything? Are there ways to get this info already that I'm missing? I imagine it would be pretty simple to add these and I'll be happy to do it and send the patch to -patches assuming others (Alvaro? :) agree. I think this sounds good. There was talk a while ago about needing a special log-level setting just for autovacuum, but nobody did the legwork. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Still recommending daily vacuum...
Alvaro Herrera wrote: Matthew T. O'Connor wrote: Well, if a table has 10 rows, and we keep the current threshold of 1000 rows, then this table must have 1002 dead tuples (99% dead tuples, 1002 dead + 10 live) before being vacuumed. This seems wasteful because there are 500 dead tuples on it and only 10 live tuples. So each scan must wade through all the dead tuples. Another small table with 100 tuples will be vacuumed on every iteration as well, even if there are just two dead tuples. So you are right -- maybe dropping it all the way to 0 is too much. But a small value of 10 is reasonable? That will make the 10 tuple table be vacuumed when there are 10 dead tuples (50% of dead tuples), and the 100 tuple table when there are 11 (11% of dead tuples). It decreases quickly to the scale factor (2%, or do we want to decrease it to 1%?) I think it's probably fine. I think that the optimal number for the base_threshold is probably dependent on the width of the row: for a very narrow row, where you might have many on the same page, 20 or 50 might be right, but for a very wide table a smaller number might be optimal; however, I think it probably doesn't matter much anyway. Reducing the default to 10 seems fine, and perhaps even removing it as a tuning knob. I think there are too many autovacuum knobs and it confuses people. Is it too late to possibly remove this GUC altogether? ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
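For reference, the numbers being traded back and forth combine as dead-tuple count = base_threshold + scale_factor * reltuples. A quick sketch reproduces the arithmetic in the message, using base 10 with the 1% scale factor mentioned there, versus the old base of 1000 with a 20% scale factor:

```python
def vacuum_threshold(reltuples: int, base: int = 10, scale: float = 0.01) -> int:
    """Dead tuples needed before autovacuum triggers a VACUUM:
    base_threshold + scale_factor * reltuples. Defaults follow the
    example in the message (base 10, 1% scale factor)."""
    return round(base + scale * reltuples)
```

With these values a 10-row table is vacuumed at 10 dead tuples (about 50% dead) and a 100-row table at 11 (about 11% dead), while the old base of 1000 forces a 10-row table to accumulate 1002 dead tuples (about 99% dead) first, which is the waste the message complains about.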
Re: [HACKERS] Still recommending daily vacuum...
Alvaro Herrera wrote: Jim C. Nasby wrote: FWIW, I normally go with the 8.2 defaults, though I could see dropping vacuum_scale_factor down to 0.1 or 0.15. I also think the thresholds could be decreased further, maybe divide by 10. How about pushing thresholds all the way down to 0? As long as it handles small (or zero row) tables ok then yes. The base_threshold in the original contrib autovacuum was just an easy way to not vacuum really small tables too often. If a table has only 10 rows, it's going to get vacuumed every time one row is updated. I guess that's not a big problem with a table that small, but it still seems excessive. If you think this isn't a problem with the current autovacuum, then sure, turn it down to zero, and perhaps we can even get rid of it altogether in another release or two. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Michael Paesold wrote: Matthew T. O'Connor schrieb: Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system? A PostgreSQL database on my laptop for testing. It should use as few resources as possible while being idle. That would be a scenario for naptime greater than 60 seconds, wouldn't it? Perhaps, but that isn't the use case PostgreSQL is being designed for. If that is what you really need, then you should probably disable autovacuum. Also, a very long naptime means that autovacuum will still wake up at random times to do the work. At least with a short naptime, it will do the work shortly after you update your tables. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] COPYable logs status
Andrew Dunstan wrote: The situation with this patch is that I now have it in a state where I think it could be applied, but there is one blocker, namely that we do not have a way of preventing the interleaving of log messages from different backends, which leads to garbled logs. This is an existing issue about which we have had complaints, but it becomes critical for a facility the whole purpose of which is to provide logs in a format guaranteed to work with our COPY command. Unfortunately, there is no solution in sight for this problem, certainly not one which I think can be devised and implemented simply at this stage of the cycle. The solution we'd like to use, LWLocks, is not workable in this context. In consequence, I don't think we have any option but to shelve this item for the time being. I think this will get shot down, but here goes anyway... How about creating a log-writing process? The postmaster could write to the log files directly until the log-writer is up and running, and then all processes could send their log output through the log-writer. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] COPYable logs status
Tom Lane wrote: Matthew T. O'Connor [EMAIL PROTECTED] writes: How about creating a log-writing-process? Postmaster could write to the log files directly until the log-writer is up and running, then all processes can send their log output through the log-writer. We *have* a log-writing process. The problem is in getting the data to it. By that I assume you mean the bgwriter, I thought that was for WAL data, I didn't think it could or perhaps should be used for normal log file writing, but I also know I'm way outside my comfort area in talking about this, so excuse the noise if this is way off base. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Alvaro Herrera wrote: Jim C. Nasby escribió: There *is* reason to allow setting the naptime smaller, though (or at least there was; perhaps Alvaro's recent changes negate this need): clusters that have a large number of databases. I've worked with folks who are in a hosted environment and give each customer their own database; it's not hard to get a couple hundred databases that way. Setting the naptime higher than a second in such an environment would mean it could be hours before a database is checked for vacuuming. Yes, the code in HEAD is different -- each database will be considered separately. So the huge database taking all day to vacuum will not stop the tiny databases from being vacuumed in a timely manner. And the very huge table in that database will not stop the other tables in the database from being vacuumed either. There can be more than one worker in a single database. Ok, but I think the question posed is that in, say, a virtual hosting environment there might be 1,000 databases in the cluster. Am I still going to have to wait a long time for my database to get vacuumed? I don't think this has changed much, no? (If the default naptime is 1 minute, then autovacuum won't look at a given database but once every 1,000 minutes (16.67 hours), assuming that there isn't enough work to keep all the workers busy.) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
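The 16.67-hour figure in the parenthetical is just naptime times the number of databases. A one-line helper makes the arithmetic explicit; this is a worst-case estimate that ignores multiple workers and any adaptive scheduling in HEAD:

```python
def worst_case_revisit_hours(naptime_minutes: float, n_databases: int) -> float:
    """If the launcher visits one database per naptime interval, a
    given database waits roughly naptime * n_databases between visits
    (worst case: one worker, round-robin over all databases)."""
    return naptime_minutes * n_databases / 60

# 1-minute naptime, 1,000 databases => roughly 16.7 hours between visits
```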
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Alvaro Herrera wrote: Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. Can you make 0 and -1 both valid disabled values? That way it will be compatible with previous releases.
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Tom Lane wrote: Andrew Hammond [EMAIL PROTECTED] writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system?
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Florian G. Pflug wrote: Work done so far: - .) Don't start autovacuum and bgwriter. Do table stats used by the planner get replicated on a PITR slave? I assume so, but if not, you would need autovac to do analyzes.
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Alvaro Herrera wrote: Simon Riggs wrote: On Wed, 2007-06-06 at 12:17 -0400, Matthew T. O'Connor wrote: Florian G. Pflug wrote: Work done so far: - .) Don't start autovacuum and bgwriter. Do table stats used by the planner get replicated on a PITR slave? I assume so, but if not, you would need autovac to do analyzes. The replication is an exact block-level replication of the master. We can't write very much at all on the slave. Hmm, something to keep in mind is forcing cache invals when the master causes them (for example relation cache, catalog caches and plan caches). Perhaps if you are a PITR master and you have active readonly slaves then there should be a WAL record to note plan invalidations, etc?
Re: [HACKERS] Autovacuum versus rolled-back transactions
Tom Lane wrote: ITAGAKI Takahiro [EMAIL PROTECTED] writes: Our documentation says | analyze threshold = analyze base threshold | + analyze scale factor * number of tuples | is compared to the total number of tuples inserted, updated, or deleted | since the last ANALYZE. but deleted tuples are not considered in the total number, because the delta of {n_live_tuples + n_dead_tuples} is not changed by DELETE. We add the number of DELETEs into n_live_tuples and subtract it from n_dead_tuples. Yeah, I was concerned about that when I was making the patch, but didn't see any simple fix. A large number of DELETEs (without any inserts or updates) would trigger a VACUUM but not an ANALYZE, which in the worst case would be bad because the stats could have shifted. We could fix this at the cost of carrying another per-table counter in the stats info, but I'm not sure it's worth it. I believe that whenever autovacuum performs a VACUUM it actually performs a VACUUM ANALYZE; at least the old contrib version did, and I think Alvaro copied that.
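The documented ANALYZE test being discussed can be sketched directly. A minimal sketch of the equation from the docs quoted above (the default values here are illustrative, not necessarily the exact defaults of the release under discussion); the bug is that, in the stats as then collected, DELETEs did not move the counters this test sums:

```python
def needs_analyze(reltuples: int, n_ins: int, n_upd: int, n_del: int,
                  base_threshold: int = 50, scale_factor: float = 0.1) -> bool:
    """analyze threshold = base threshold + scale factor * number of tuples,
    compared to tuples inserted + updated + deleted since the last ANALYZE."""
    threshold = base_threshold + scale_factor * reltuples
    return (n_ins + n_upd + n_del) > threshold

# 200 deletes against a 1,000-tuple table should trigger an ANALYZE
# (200 > 50 + 0.1 * 1000 = 150) -- but only if deletes are counted.
print(needs_analyze(1000, 0, 0, 200))  # True
print(needs_analyze(1000, 0, 100, 0))  # False
```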
Re: [HACKERS] Autovacuum versus rolled-back transactions
Tom Lane wrote: This means that a table could easily be full of dead tuples from failed transactions, and yet autovacuum won't do a thing because it doesn't know there are any. Perhaps this explains some of the reports we've heard of tables bloating despite having autovac on. I think this is only a problem for failed inserts, as failed updates will be accounted for correctly by autovac, and as you said, failed deletes really do nothing. So is there a way for rollback to just add the number of rolled-back inserts to the n_tup_del counter? Then we would be ok, no? I think it's fairly obvious how n_live_tup and n_dead_tup ought to change in response to a failed xact, but maybe not so obvious for the other counters. I suggest that the scan/fetch counters (seq_scan, seq_tup_read, idx_scan, idx_tup_fetch) as well as all the block I/O counters should increment the same for committed and failed xacts, since they are meant to count work done regardless of whether the work was in vain. I am much less sure how we want n_tup_ins, n_tup_upd, n_tup_del to act though. Should they be advanced as normal by a failed xact? That's what the code is doing now, and if you think they are counters for work done, it's not so unreasonable. I think autovac only considers n_tup_(upd|ins|del), so while it might be correct to fix those other counters, I don't know that they are must-fix items.
Re: [HACKERS] What X86/X64 OS's do we need coverage for?
Devrim Gündüz wrote: Hi, On Fri, 2007-04-06 at 01:23 -0400, Matthew T. O'Connor wrote: The other thing to consider is that CentOS 5 has Xen built right in, so you should be able to run VMs without VMWare on it. ... if the kernel of the OS has Xen support, there will be no performance penalty (only 2%-3%) (Para-virtualization). Otherwise, there will be full-virtualization, and we should expect a performance loss of about 30% for each guest OS (like Windows). I may be wrong, but I thought that the guest OS kernel only needs special support if the underlying CPU doesn't have virtualization support, which pretty much all the new Intel and AMD chips have. No?
Re: [HACKERS] What X86/X64 OS's do we need coverage for?
Larry Rosenman wrote: I might use that as the base then, since the hardware finishes getting here tomorrow. The other thing to consider is that CentOS 5 has Xen built right in, so you should be able to run VMs without VMWare on it.
Re: [HACKERS] Documentation access problems.
Joshua D. Drake wrote: The big thing for me is a single document, zero clicks, that is searchable. PDF and plain text are the only things that give me that. If you are really zealous you can even use Beagle (which I don't) to preindex the PDF for you for easy searching. Lots of projects publish their HTML docs in two formats: one big HTML file with everything; broken up into many HTML files that link to each other. This would allow you to have one big searchable document.
Re: [HACKERS] Documentation access problems.
Bruce Momjian wrote: Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: Lots of projects publish their HTML docs in two formats: one big HTML file with everything; broken up into many HTML files that link to each other. This would allow you to have one big searchable document. The key word there being big ;-) ... I don't have any problem with making such a version available on the website, but I don't think shipping two versions of the HTML docs in our tarballs is reasonable. I think having the single HTML file version available on our web site is enough. Agreed.
Re: [HACKERS] autovacuum next steps, take 3
My initial reaction is that this looks good to me, but still a few comments below. Alvaro Herrera wrote: Here is a low-level, very detailed description of the implementation of the autovacuum ideas we have so far. launcher's dealing with databases - [ Snip ] launcher and worker interactions [Snip] worker to-do list - When each worker starts, it determines which tables to process in the usual fashion: get pg_autovacuum and pgstat data and compute the equations. The worker then takes a snapshot of what's currently going on in the database, by storing worker PIDs, the corresponding table OID that's being currently worked, and the to-do list for each worker. Does a new worker really care about the PID of other workers or what table they are currently working on? It removes from its to-do list the tables being processed. Finally, it writes the list to disk. Just to be clear, the new worker removes from its to-do list all the tables mentioned in the to-do lists of all the other workers? The table list will be written to a file in PGDATA/vacuum/database-oid/todo.worker-pid The file will consist of table OIDs, in the order in which they are going to be vacuumed. At this point, vacuuming can begin. This all sounds good to me so far. Before processing each table, it scans the WorkerInfos to see if there's a new worker, in which case it reads its to-do list to memory. It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by all worker to-do lists. Then it again fetches the tables being processed by other workers in the same database, and for each other worker, removes from its own in-memory to-do all those tables mentioned in the other lists that appear earlier than the current table being processed (inclusive). Then it picks the next non-removed table in the list.
All of this must be done with the Autovacuum LWLock grabbed in exclusive mode, so that no other worker can pick the same table (no IO takes place here, because the whole lists were saved in memory at the start.) Again, it's not clear to me what this is gaining us. It seems to me that when a worker starts up and writes out its to-do list, it should just do it; I don't see the value in workers constantly updating their to-do lists. Maybe I'm just missing something; can you enlighten me? other things to consider This proposal doesn't deal with the hot tables stuff at all, but that is very easy to bolt on later: just change the first phase, where the initial to-do list is determined, to exclude cold tables. That way, the vacuuming will be fast. Determining what is a cold table is still an exercise for the reader ... I think we can make this algorithm naturally favor small / hot tables with one small change: having workers remove tables that they just vacuumed from their to-do lists and re-write their to-do lists to disk. Assuming the to-do lists are ordered by size ascending, smaller tables will be made available for inspection by newer workers sooner rather than later.
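The core of the scheme being discussed -- a new worker's to-do list is whatever the threshold equations produce, minus everything already claimed in the other workers' lists -- is a simple set subtraction. A minimal sketch (the function name and list-of-lists shape are mine; in the proposal the other lists would be read back from the PGDATA/vacuum/database-oid/todo.worker-pid files):

```python
def build_todo(candidate_oids, other_workers_todos):
    """candidate_oids: table OIDs this worker computed from pgstats,
    in the order it would vacuum them.  other_workers_todos: the to-do
    lists of the workers already running in this database.  The new
    worker drops every table already claimed elsewhere, preserving its
    own ordering."""
    claimed = set()
    for todo in other_workers_todos:
        claimed.update(todo)
    return [oid for oid in candidate_oids if oid not in claimed]

print(build_todo([101, 102, 103, 104], [[102], [104, 105]]))  # [101, 103]
```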
Re: [HACKERS] autovacuum next steps, take 3
Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by all worker todo lists. That seems wrong to me, since it means that new workers will ignore tables that are scheduled for processing by an existing worker, no matter how far in the future that schedule extends. As an example, suppose you have half a dozen large tables in need of vacuuming. The first worker in will queue them all up, and subsequent workers will do nothing useful, at least not till the first worker is done with the first table. Having the first worker update its todo list file after each table allows the earlier tables to be exposed for reconsideration, but that's expensive and it does nothing for later tables. Well the big problem that we have is not that large tables are being starved, so this doesn't bother me too much, plus there is only so much IO, so one worker working sequentially through the big tables seems OK to me. I suggest that maybe we don't need exposed TODO lists at all. Rather the workers could have internal TODO lists that are priority-sorted in some way, and expose only their current table OID in shared memory. Then the algorithm for processing each table in your list is 1. Grab the AutovacSchedule LWLock exclusively. 2. Check to see if another worker is currently processing that table; if so drop LWLock and go to next list entry. 3. Recompute whether table needs vacuuming; if not, drop LWLock and go to next entry. (This test covers the case where someone vacuumed the table since you made your list.) 4. Put table OID into shared memory, drop LWLock, then vacuum table. 5. Clear current-table OID from shared memory, then repeat for next list entry. This creates a behavior of whoever gets to it first rather than allowing workers to claim tables that they actually won't be able to service any time soon. 
Right, but you could wind up with as many workers working concurrently as you have tables in a database which doesn't seem like a good idea either. One thing I like about the todo list setup Alvaro had is that new workers will be assigned fewer tables to work on and hence exit sooner. We are going to fire off a new worker every autovac_naptime so availability of new workers isn't going to be a problem.
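Tom's five-step claim protocol above can be sketched with a mutex standing in for the AutovacSchedule LWLock and a shared dict standing in for the per-worker current-table slots in shared memory (all the names here are mine, for illustration only):

```python
import threading

autovac_schedule_lock = threading.Lock()  # stands in for the AutovacSchedule LWLock
current_table = {}                        # worker pid -> table OID being vacuumed

def process_table(pid, oid, still_needs_vacuum, vacuum):
    """Return True if this worker vacuumed the table, False if it skipped
    it (another worker had claimed it, or it no longer needs work)."""
    with autovac_schedule_lock:                   # step 1: grab the lock
        if oid in current_table.values():         # step 2: claimed by someone else?
            return False
        if not still_needs_vacuum(oid):           # step 3: recheck the threshold
            return False
        current_table[pid] = oid                  # step 4: advertise the claim
    try:
        vacuum(oid)                               # lock dropped during the vacuum
    finally:
        with autovac_schedule_lock:               # step 5: clear the claim
            del current_table[pid]
    return True

done = []
process_table(1, 42, lambda o: True, done.append)
print(done)  # [42]
```

The point of step 3 is exactly the case Tom flags: someone may have vacuumed the table since this worker built its list, and pgstat is too laggy to be trusted for that check.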
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. BTW, another issue that I don't think we can ignore: we actually need to do this on a per-tablespace level, or at least have the ability to disable or somehow limit it. While it's not common, there are users that run a hundred or more databases in a single cluster; it would be ugly if we suddenly had 100 vacuums trying to run on the same set of drives concurrently. I think we all agree that autovacuum needs to become tablespace aware at some point, but I think that is further down the line, we're having enough trouble figuring things out without that additional complication.
Re: [HACKERS] Resumable vacuum proposal and design overview
Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2007-02-27 at 10:37 -0600, Jim C. Nasby wrote: ... The idea would be to give vacuum a target run time, and it would monitor how much time it had remaining, taking into account how long it should take to scan the indexes based on how long it's been taking to scan the heap. When the amount of time left becomes less than the estimate of the amount of time required to scan the indexes (and clean the heap), you stop the heap scan and start scanning indexes. I do like this idea, but it also seems easy to calculate that bit yourself. Run VACUUM, after X minutes issue stop_vacuum() and see how long it takes to finish. Adjust X until you have it right. One problem with it is that a too-small target would result in vacuum proceeding to scan indexes after having accumulated only a few dead tuples, resulting in increases (potentially enormous ones) in the total work needed to vacuum the table completely. I think it's sufficient to have two cases: abort now, and restart from the last cycle-completion point next time (this would basically just be SIGINT); or set a flag to stop at the next cycle-completion point. It occurs to me that we may be thinking about this the wrong way entirely. Perhaps a more useful answer to the problem of using a defined maintenance window is to allow VACUUM to respond to changes in the vacuum cost delay settings on-the-fly. So when your window closes, you don't abandon your work so far, you just throttle your I/O rate back to whatever's considered acceptable for daytime vacuuming. I thought we already did that? Which BTW was part of my plan on how to deal with a vacuum that is still running after its maintenance window has expired.
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Jim C. Nasby wrote: That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. Ok, I think we may be actually getting somewhere. Me too. I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not. The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the plain worker mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and go about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do. Right, I like this. The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. How can you determine what tables can be vacuumed within autovacuum_naptime?
I agree that large tables should be excluded, but I don't know how we can do that calculation based on autovacuum_naptime. So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX; worker2 determines all tables that need to be vacuumed; worker2 excludes tables that are too big from its to-do list; worker2 gets started working; worker2 exits when it either: a) finishes its entire to-do list, or b) catches up to worker1.
I think the questions are: 1) What is the exact math you are planning on using to determine which tables are too big? 2) Do we want worker2 to exit when it catches worker1, or does the fact that we have excluded tables that are too big mean that we don't have to worry about this?
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Matthew T. O'Connor wrote: How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). I'm not sure how pg_class.relpages is maintained but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while), now that small 100 row table has 1000 pages in it most of which are just bloat, will we miss this table? Perhaps basing this on reltuples would be better? A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Thoughts?
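Alvaro's rough estimate above is easy to check numerically. A sketch of exactly the formula from the message, deliberately keeping its simplification (every page is a cache miss and each miss "costs" one delay; he concedes it is not reality, and it ignores vacuum_cost_limit batching):

```python
def estimated_vacuum_ms(relpages, cost_page_miss=10, cost_delay_ms=10):
    """Rough vacuum-time estimate: relpages * vacuum_cost_page_miss
    * vacuum_cost_delay, as proposed in the message above."""
    return relpages * cost_page_miss * cost_delay_ms

def max_hot_relpages(naptime_secs=60, cost_page_miss=10, cost_delay_ms=10):
    """Largest table a hot-table worker would consider under that estimate."""
    return (naptime_secs * 1000) // (cost_page_miss * cost_delay_ms)

print(max_hot_relpages())           # 600 pages, i.e. 4800 kB at 8 kB/page
print(estimated_vacuum_ms(600))     # 60000 ms = one naptime
```

This reproduces the 600-page / 4800 kB cutoff quoted in the thread for the default settings.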
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Matthew T. O'Connor wrote: I'm not sure how pg_class.relpages is maintained but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while), now that small 100 row table has 1000 pages in it most of which are just bloat, will we miss this table? Perhaps basing this on reltuples would be better? Well, this would only happen the first time, until the plain worker processed the table; next time it would be picked up by the hot table worker. But yeah, we can build a better estimate using the same trick the planner uses: estimate tuple density as reltuples/relpages times the actual number of blocks on disk. Well even skipping it the first time isn't good, anything that further delays a hot table from getting vacuumed is bad. Also, I'm not sure it would just be the first time, since plain VACUUM isn't going to reclaim most of the space, just mark it as reusable. This is moot, however, if we use a good metric; I have no idea if what you suggest above would be good enough. A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. Simple is better, at least until proven otherwise.
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Matthew T. O'Connor wrote: I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me as you this, how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety). So far we are only talking about one parameter, the hot_table_size_threshold, which I agree would be a guess by an admin, but if we went in this direction, I would also advocate adding a column to the pg_autovacuum table that allows an admin to explicitly define a table as hot or not. Also I think each worker should be mostly independent, the only caveat being that (assuming each worker works in size order) if we catch up to an older worker (get to the table they are currently working on) we exit. Personally I think this is all we need, but others felt the additional threshold was needed. What do you think? Or what do you think might be better? I'm not certain exactly what key off needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them. My initial design didn't have any threshold at all, but others felt this would/could result in too many worker working concurrently in the same DB. 
I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing. Maybe we never made that clear, I was always working on the assumption that two workers would never try to work on the same table at the same time. BTW, it's probably necessary to treat shared catalogs specially ... Certainly.
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: I'm not sure how pg_class.relpages is maintained but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while), now that small 100 row table has 1000 pages in it most of which are just bloat, will we miss this table? Perhaps basing this on reltuples would be better? The entire point of this is to ensure that the second daemon will only vacuum tables that it can finish very quickly. If you let a table bloat so it's too big, then you just can't vacuum it very frequently without risking all your other hot tables bloating because they're no longer getting vacuumed. The reality is that you can actually vacuum a pretty good-sized table in 60 seconds with typical cost-delay settings (ie: defaults except cost_delay set to 10). That means you can do 9 pages ~100 times a second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, that's still 27k pages per minute. At the risk of sounding like a broken record, I still think the size limit threshold is unnecessary. Since all workers will be working on tables in size order, younger workers will typically catch older workers fairly quickly since the tables will be either small, or recently vacuumed and not need work. And since younger workers exit when they catch up to an older worker, there is some inherent stability in the number of workers. Here is a worst case example: A DB with 6 tables all of which are highly active and will need to be vacuumed constantly. While this is totally hypothetical, it is how I envision things working (without the threshold).
table1: 10 rows
table2: 100 rows
table3: 1,000 rows
table4: 10,000 rows
table5: 100,000 rows
table6: 1,000,000 rows
time=0*naptime: No workers in the DB
time=1*naptime: worker1 starts on table1
time=2*naptime: worker1 has finished table1, table2 and table3; it's now working on table4; worker2 starts on table1.
time=3*naptime: worker1 is on table5; worker2 is working on table4; worker3 starts on table1.
time=4*naptime: worker1 is still on table5; worker2 has caught up to worker1 and exits; worker3 also catches up to worker1 since tables 2-4 didn't require vacuum at this time, so it exits; worker4 starts on table1.
time=5*naptime: worker1 is working on table6; worker4 is up to table4; worker5 starts on table1.
time=6*naptime: worker1 is working on table6; worker4 catches up to worker1 and exits; worker5 finds no additional work to be done and exits; worker6 starts at table1.
time=7*naptime: worker1 still working on table6; worker6 is up to table4; worker7 starts at table1.
time=8*naptime: worker1 still working on table6; worker6 still working on table4; worker7 working on table3; worker8 starting on table1.
time=9*naptime: worker1 still working on table6; worker6 working on table5; worker7 catches worker6 and exits; worker8 finds nothing more to do and exits; worker9 starts on table1.
time=10*naptime: worker1 still working on table6; worker9 working on table4; worker10 starts on table1.
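Jim's page-rate figures follow from the cost-based delay mechanics: vacuum sleeps for vacuum_cost_delay milliseconds once per cost cycle, so there are 1000/cost_delay cycles per second. A sketch of that arithmetic (the 9-pages-per-cycle figure is his, for the defaults he cites; it depends on the hit/miss/dirty cost mix, which is not spelled out in the message):

```python
def pages_per_minute(pages_per_cost_cycle=9, cost_delay_ms=10):
    """Throughput of a cost-limited vacuum: one sleep of cost_delay_ms
    per cost cycle, pages_per_cost_cycle pages processed per cycle."""
    cycles_per_second = 1000 / cost_delay_ms
    return pages_per_cost_cycle * cycles_per_second * 60

print(pages_per_minute())                  # 54000.0 -- the "54k pages a minute"
print(pages_per_minute(cost_delay_ms=20))  # 27000.0 -- the "27k pages per minute"
```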
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: The real problem is trying to set that up in such a fashion that keeps hot tables frequently vacuumed; Are we assuming that no single worker instance will vacuum a given table more than once? (That's not a necessary assumption, certainly, but without it there are so many degrees of freedom that I'm not sure how it should act.) Given that assumption, the maximum vacuuming rate for any table is once per autovacuum_naptime, and most of the magic lies in the launcher's algorithm for deciding which databases to launch workers into. Yes, I have been working under the assumption that a worker goes through the list of tables once and exits, and yes the maximum vacuuming rate for any table would be once per autovacuum_naptime. We can lower the default if necessary; as far as I'm concerned it's (or should be) fairly cheap to fire off a worker and have it find that there isn't anything to do and exit. I'm inclined to propose an even simpler algorithm in which every worker acts alike; its behavior is 1. On startup, generate a to-do list of tables to process, sorted in priority order. 2. For each table in the list, if the table is still around and has not been vacuumed by someone else since you started (including the case of a vacuum-in-progress), then vacuum it. That is what I'm proposing except for one difference: when you catch up to an older worker, exit. This has the benefit of reducing the number of workers concurrently working on big tables, which I think is a good thing. Detecting already vacuumed since you started is a bit tricky; you can't really rely on the stats collector since its info isn't very up-to-date. That's why I was thinking of exposing the to-do lists explicitly; comparing those with an advertised current-table would allow accurate determination of what had just gotten done.
Sounds good, but I have very little insight into how we would implement already vacuumed since you started or have I caught up to another worker. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote:
> BTW, to what extent might this whole problem be simplified if we adopt
> chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
> If the unit of work has a reasonable upper bound regardless of table
> size, maybe the problem of big tables starving small ones goes away.

So if we adopted chunk-at-a-time, then perhaps each worker processes the list of tables in OID order (or some unique and stable order) and does one chunk per table that needs vacuuming. This way an equal amount of bandwidth is given to all tables. That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?

Matt

---(end of broadcast)---
TIP 6: explain analyze is your friend
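The round-robin chunk idea in this message can be expressed in a few lines. The sketch below is an illustrative model only (the CHUNK constant and the pages-left bookkeeping are invented, not PostgreSQL's): each pass does at most one bounded chunk per table that needs vacuuming, walking tables in OID order, so no table can monopolize the worker.

```python
CHUNK = 100  # max heap pages processed per table per pass (arbitrary)

def one_pass(pages_left):
    """pages_left: {oid: dirty pages still to scan}. Mutates it in place
    and returns {oid: pages processed this pass}."""
    done = {}
    for oid in sorted(pages_left):     # OID order: unique and stable
        if pages_left[oid] == 0:
            continue                   # this table doesn't need vacuum
        work = min(CHUNK, pages_left[oid])
        pages_left[oid] -= work        # one chunk, then move to next table
        done[oid] = work
    return done
```

Repeated passes give every needy table the same per-pass bandwidth; a big table simply shows up in more passes rather than blocking the small ones.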
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote:
> Matthew T. O'Connor matthew@zeut.net writes:
>> Tom Lane wrote:
>>> I'm inclined to propose an even simpler algorithm in which every
>>> worker acts alike;
>> That is what I'm proposing except for one difference, when you catch
>> up to an older worker, exit.
>
> No, that's a bad idea, because it means that any large table starves
> even-larger tables.

True, but the assumption I'm making is that there is a finite amount of bandwidth available, and more concurrent activity will have a net negative effect on the time it takes to vacuum all tables. I'm willing to pay that price to prevent small hot tables from getting starved, but less willing to pay the same price for large tables, where the percentage of bloat will be much smaller.

> (Note: in all this I assume we're all using "size" as a shorthand for
> some sort of priority metric that considers number of dirty tuples and
> not only size. We don't want every worker insisting on passing over
> every small read-only table every time, for instance.)

I was using size to mean reltuples. The whole concept of sorting by size was designed to ensure that smaller (more susceptible to bloat) tables got priority. It might be useful for workers to sort their to-do lists by some other metric, but I don't have a clear vision of what that might be.

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at
http://www.postgresql.org/about/donate
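For the "some sort of priority metric" mentioned here, one plausible sketch (field layout and numbers invented for illustration, not anyone's concrete proposal) ranks tables by dead-tuple fraction rather than raw size, so a small hot table sorts ahead of a big mostly-clean one and read-only tables drop out of the to-do list entirely:

```python
def todo_list(tables):
    """tables: list of (name, reltuples, dead_tuples) triples.
    Returns the worklist, most bloated (by fraction) first."""
    needs_work = [t for t in tables if t[2] > 0]   # skip clean tables
    return sorted(needs_work,
                  key=lambda t: t[2] / t[1],       # dead-tuple fraction
                  reverse=True)
```

With this key, a 1,000-row table carrying 500 dead tuples outranks a million-row table carrying 1,000, which is the starvation-avoiding behavior being argued for.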
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote:
> Matthew T. O'Connor matthew@zeut.net writes:
>> That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?
>
> It seems fairly trivial to me to have a scheme where you do one
> fill-workmem-and-scan-indexes cycle per invocation, and store the
> next-heap-page-to-scan in some handy place (new pg_class column updated
> along with relpages/reltuples, likely). Galy is off in left field with
> some far more complex ideas :-( but I don't see that there's all that
> much needed to support this behavior ... especially if we don't expose
> it to the SQL level but only support it for autovac's use. Then we're
> not making any big commitment to support the behavior forever.

Well, if we can make it happen soon, it might be the best thing for autovacuum.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
>> Here is a worst case example: A DB with 6 tables, all of which are
>> highly active and will need to be vacuumed constantly. While this is
>> totally hypothetical, it is how I envision things working (without the
>> threshold).
>
> I fail to see how a simple 6 table case is 'worst case'. It's common to
> see hundreds of tables, and I've run across more than one database with
> thousands of tables (think partitioning). In cases like those it's
> certainly possible, perhaps even likely, that you would get many daemons
> running in the database at one time just from different tables suddenly
> needing vacuuming and appearing at a higher point in the list than other
> tables. With 100 ~1G tables getting updates it certainly wouldn't be
> hard to end up with 10 of those being vacuumed all at the same time.

Yes, 6 tables is small; the worst-case part of the example was that all the tables would need to be vacuumed constantly. Most databases only have a few hot tables. Most tables only need to be vacuumed every once in a while.

> I do like the idea since it should be easier to tune, but I think we
> still need some limit on it. Perhaps as a first-pass we could just have
> a hard limit and log a message and/or set a flag any time we hit it.
> That would hopefully allow us to get information about how big a problem
> it really is. We could go one step further and say that the last daemon
> that can start in a database will only vacuum tables that can be done
> quickly; that's essentially what we've been talking about, except the
> limit we've been discussing would be hard-coded at 2.

I'm confused, what limit would be set at 2? The number of concurrent workers? I've never said that.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
>> The proposal to save enough state to be able to resume a vacuum at
>> pretty much any point in its cycle might work; we'd have to benchmark
>> it. With the default maintenance_work_mem of 128M it would mean
>> writing out 64M of state every minute on average, which is likely to
>> take several seconds to fsync (though, maybe we wouldn't need to fsync
>> it...)
>
> Which is exactly why we needn't bother benchmarking it. Even if it
> weren't complex and unsafe, it will be a net loss when you consider the
> fact that it adds I/O instead of removing it.

I'm not sure what you are saying here: are you now saying that partial vacuum won't work for autovac? Or are you saying that saving state as Jim is describing above won't work?

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote:
> Matthew T. O'Connor matthew@zeut.net writes:
>> I'm not sure what you are saying here: are you now saying that partial
>> vacuum won't work for autovac? Or are you saying that saving state as
>> Jim is describing above won't work?
>
> I'm saying that I don't like the idea of trying to stop on a dime by
> saving the current contents of vacuum's dead-TID array to disk with the
> idea that we can trust those values 100% later. Saving the array is
> expensive both in runtime and code complexity, and I don't believe we
> can trust it later --- at least not without even more
> expensive-and-complex measures, such as WAL-logging every such save :-(
> I'm for stopping only after completing an index-cleaning pass, at the
> point where we empty the dead-TID array anyway. If you really have to
> have stop on a dime, just "kill -INT" the process, accepting that you
> will have to redo your heap scan since the last restart point.

OK, so if I understand correctly, a vacuum of a table with 10 indexes on it can be interrupted 10 times, once after each index-cleaning pass? That might have some value, especially breaking up the work required to vacuum a large table. Or am I still not getting it?

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
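Reading Tom's description literally, the safe stop points come one per fill-workmem-and-scan-indexes cycle (each such cycle cleans all of the table's indexes and then empties the dead-TID array), so their count depends on dead tuples versus maintenance_work_mem rather than on the table having 10 indexes. A back-of-envelope sketch; the 6-bytes-per-TID figure is the usual rule of thumb and is treated here as an assumption:

```python
def safe_stop_points(dead_tuples, maintenance_work_mem_bytes,
                     bytes_per_tid=6):
    """Roughly how many times a single-table vacuum could pause, if it may
    pause only when the dead-TID array has just been emptied."""
    capacity = maintenance_work_mem_bytes // bytes_per_tid
    # one index-cleaning pass per array fill; round up for the final,
    # partial fill at the end of the heap scan
    return -(-dead_tuples // capacity)
```

With the default 128MB of maintenance_work_mem the array holds on the order of 22 million TIDs, so a table with 50 million dead tuples offers about three pause points, and a lightly bloated table offers exactly one, regardless of its index count.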
Re: [HACKERS] [Monotone-devel] Re: SCMS question
On Sat, Feb 24, 2007 at 10:27:38PM -0500 I heard the voice of Andrew Dunstan, and lo! it spake thus:
> This decision really belongs to the handful of people who do most of
> the maintenance and live with most of any CVS pain that exists: such as
> Tom, Bruce, Peter, Neil, Alvaro. Other people have a right to voice an
> opinion, but nobody should be pushing on it.

One thing that the DVCS crowd pushes is that that's _not_ the whole story. With CVS (or other centralized systems), the VCS is a development tool for the few core people, and a glorified FTP/snapshotting system for everyone else. With a DVCS, _everybody_ gets a development tool out of it.

ObBias: After much resistance, I drank the distributed Kool-Aid. My poison of choice is bzr, which is very probably not ready performance-wise for Pg. So, I also look forward to a switch happening not now, but in a year or two, when the performance failings are historical and bzr can be chosen 8-}

--
Matthew Fuller (MF4839) | [EMAIL PROTECTED]
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] [Monotone-devel] Re: SCMS question
On Sun, Feb 25, 2007 at 06:28:20PM -0500 I heard the voice of Andrew Dunstan, and lo! it spake thus:
> I don't really drink this koolaid, at least not to the extent of
> disavowing what I said above.

Oh, don't take my message as "You're wrong, you're not taking into account [...]". It was meant more as a "This is a convenient place to make [...] explicit".

It seems that there are really 3 sequential questions here:

1) Do we switch VCS's? The averaged answer to this is pretty much "Probably, but not right now, and not in the very near future". Given that, the rest of the discussion is probably somewhat pointless; at the least it should be carried out with this answer kept firmly in mind.

2) Do we go the DVCS route? and only after THAT is resolved do we go on to:

3) Which VCS? The feature/capability lists of the various DVCS's contain a mix of those features which are inherent in (or at least pretty much universal among) DVCS's as a class, and those which are more particular to the given system. But in a discussion of which VCS to (hypothetically) use, you really want to separate them out so you can know when you're arguing for/against $SYSTEM, and when you're arguing for/against $CLASS_OF_SYSTEMS.

--
Matthew Fuller (MF4839) | [EMAIL PROTECTED]
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] SCMS question
On Sun, Feb 25, 2007 at 06:06:57PM -0500 I heard the voice of Neil Conway, and lo! it spake thus:
> The ability to do history-sensitive merges actually results in a
> significant reduction in the need for manual conflict resolution.

I would say that a far greater contributor in practice would simply be frequency. If you diverge on your significant feature for 6 months, then try to merge in upstream changes from the main dev, you will be in hell no matter what merge algorithm you use. If you merge in upstream changes every few days, however, you will have many fewer and much simpler conflicts to deal with.

A VCS that makes frequent merges easy results in easier conflict handling, not by some magical auto-resolution, but just by letting you do it in ongoing regular and small bites.

--
Matthew Fuller (MF4839) | [EMAIL PROTECTED]
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote:
> On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
>> My Proposal: If we require admins to identify hot tables, then:
>> 1) Launcher fires off a worker1 into database X.
>> 2) worker1 deals with hot tables first, then regular tables.
>> 3) Launcher continues to launch workers to DB X every autovac naptime.
>> 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X; if so it acts
>> as worker1 did above. If worker1 is still working in DB X then worker2
>> looks for hot tables that are being starved because worker1 got busy.
>> If worker2 finds no hot tables that need work, then worker2 exits.
>
> Rather than requiring people to manually identify hot tables, what if
> we just prioritize based on table size? So if a second autovac process
> hits a specific database, it would find the smallest table in need of
> vacuuming that it should be able to complete before the next naptime
> and vacuum that. It could even continue picking tables until it can't
> find one that it could finish within the naptime. Granted, it would
> have to make some assumptions about how many pages it would dirty.
> ISTM that's a lot easier than forcing admins to mark specific tables.

So the heuristic would be:
* Launcher fires off workers into a database at a given interval (perhaps configurable?)
* Each worker works on tables in size order.
* If a worker ever catches up to an older worker, then the younger worker exits.

This sounds simple and workable to me. Perhaps we can later modify this to include some max_workers variable, so that a worker would only exit if it catches an older worker and there are max_workers currently active. Thoughts?

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote:
> On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
>> So the heuristic would be:
>> * Launcher fires off workers into a database at a given interval
>> (perhaps configurable?)
>> * Each worker works on tables in size order.
>> * If a worker ever catches up to an older worker, then the younger
>> worker exits.
>> This sounds simple and workable to me, perhaps we can later modify
>> this to include some max_workers variable so that a worker would only
>> exit if it catches an older worker and there are max_workers currently
>> active.
>
> That would likely result in a number of workers running in one
> database, unless you limited how many workers per database. And if you
> did that, you wouldn't be addressing the frequently updated table
> problem. A second vacuum in a database *must* exit after a fairly short
> time so that we can go back in and vacuum the important tables again
> (well, or the 2nd vacuum has to periodically re-evaluate what tables
> need to be vacuumed).

I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly, since small tables will vacuum fairly quickly.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote:
> Ok, scratch that :-) Another round of braindumping below.

I still think this is a solution in search of a problem. The main problem we have right now is that hot tables can be starved from vacuum. Most of this proposal doesn't touch that. I would like to see that problem solved first, then we can talk about adding multiple workers per database or per tablespace etc...

> (This idea can be complemented by having another GUC var,
> autovacuum_hot_workers, which allows the DBA to have more than one
> worker on hot tables (just for the case where there are too many hot
> tables). This may be overkill.)

I think this is more along the lines of what we need first.

> Ron Mayer expressed the thought that we're complicating needlessly the
> UI for vacuum_delay, naptime, etc. He proposes that instead of having
> cost_delay etc, we have a mbytes_per_second parameter of some sort.
> This strikes me a good idea, but I think we could make that after this
> proposal is implemented. So this take 2 could be implemented, and then
> we could switch the cost_delay stuff to using a MB/s kind of
> measurement somehow (he says waving his hands wildly).

Agree this is probably a good idea in the long run, but I agree this is lower on the priority list and should come next.

> Greg Stark and Matthew O'Connor say that we're misdirected in having
> more than one worker per tablespace. I say we're not :-) If we
> consider Ron Mayer's idea of measuring MB/s, but we do it per
> tablespace, then we would inflict the correct amount of vacuum pain to
> each tablespace, sleeping as appropriate. I think this would require
> workers of different databases to communicate what tablespaces they
> are using, so that all of them can utilize the correct amount of
> bandwidth.

I agree that in the long run it might be better to have multiple workers with a MB/s throttle and tablespace awareness, but we don't have any of that infrastructure right now. I think the piece of low-hanging fruit that your launcher concept can solve is the hot table starvation.

My Proposal: If we require admins to identify hot tables, then:
1) Launcher fires off a worker1 into database X.
2) worker1 deals with hot tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovac naptime.
4) worker2 (or 3 or 4 etc...) sees it is alone in DB X; if so it acts as worker1 did above. If worker1 is still working in DB X, then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits.

This seems a very simple solution (given your launcher work) that can solve the starvation problem. Thoughts?

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
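The four-step proposal above reduces to a small per-worker decision rule. This sketch is purely illustrative (all names are invented, and "claimed" stands in for whatever mechanism would advertise which hot tables a busy worker has already handled):

```python
def worker_plan(hot, regular, alone, already_claimed):
    """Return the list of tables this worker should vacuum, in order.

    hot/regular: table names in their respective priority order.
    alone: True if no other worker is active in this database.
    already_claimed: hot tables an older worker has handled or claimed."""
    if alone:
        return hot + regular        # worker1: everything, hot tables first
    # a later worker only rescues hot tables the busy worker hasn't
    # reached; if none are starved, it has nothing to do and exits
    return [t for t in hot if t not in already_claimed]
```

So a lone worker sweeps the whole database hot-first, while each follow-on worker either rescues starved hot tables or returns an empty plan and exits immediately, which is exactly the anti-starvation behavior being proposed.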
Re: [HACKERS] GiST Comparing IndexTuples/Datums
Good news: I think we've got GiST working (somewhat, anyway), as we found gistKeyIsEQ(giststate, 0, datum, currdatum) in gistutil.c does the trick of comparing two datums. I swear most of our trouble is just finding our way around the postgres codebase, but we're getting there little by little. We've gone back to revisit hash to see if we can figure it out now that we understand a little bit about GiST, but we can't find an equivalent function in hash for the KeyIsEQ(). So two questions really. The first is if such a function exists for hash. The second is that nbtree and Neil Conway's work a few years ago (http://archives.postgresql.org/pgsql-patches/2003-09/msg00252.php) use the scan and scankey stuff, but we're having trouble understanding how these work. Is there some documentation on using these correctly (outside of just looking at nbtree code)?

Thanks so much for the help folks!
-Matt

On 2/13/07, Teodor Sigaev [EMAIL PROTECTED] wrote:
>> indexes, then it must use operator number so-and-so for equality. But
>> there are lots of GiST opclasses that don't include equality at all;
>> we can't break that case.
>
> There is a GiST support function for equality of keys; in btree_gist
> it's named as gbt_*_same. The equality function has support number 7
> and is used for stored keys. But the real issue in a unique GiST index
> is "unique" :). First, the algorithm of insertion doesn't compare
> indexed keys on a leaf page at all. Values on the same page are
> compared only when the page is splitting (picksplit support method).
> Second, the GiST implementation supports only unordered trees
> (btree_gist is some kind of emulation) and it cannot guarantee that
> equal keys will be close in the index. That's related to the picksplit
> and gistpenalty method problem/optimization and the data set.
>
> --
> Teodor Sigaev E-mail: [EMAIL PROTECTED]
> WWW: http://www.sigaev.ru/
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote:
> After staring at my previous notes for autovac scheduling, it has
> become clear that the basics of it are not really going to work as
> specified. So here is a more realistic plan:
> [Snip Detailed Description]
> How does this sound?

On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed. VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces). In general the only time it's a good idea to have multiple vacuums running at the same time is when a big table is starving a small hot table and causing bloat.

I think we can extend the current autovacuum stats to add one more column that specifies "is hot" or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While working on hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if the worker doesn't find any hot tables that need work, then it exits, leaving the original worker to continue plodding along.

Thoughts?

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> On first blush, I'm not sure I like this, as it doesn't directly
>> attack the table starvation problem, and I think it could be a net
>> loss of speed. VACUUM is I/O bound; as such, just sending multiple
>> vacuum commands at a DB isn't going to make things faster, you are now
>> going to have multiple processes reading from multiple tables at the
>> same time. I think in general this is a bad thing (unless we someday
>> account for I/O made available from multiple tablespaces).
>
> Yeah, I understand that. However, I think that can be remedied by
> using a reasonable autovacuum_vacuum_cost_delay setting, so that each
> worker uses less than the total I/O available. The main point of the
> proposal is to allow multiple workers on a DB while also allowing
> multiple databases to be processed in parallel.

So you are telling people to choose an autovacuum_delay so high that they need to run multiple autovacuums at once to keep up? I'm probably being too dramatic, but it seems inconsistent.

>> I think we can extend the current autovacuum stats to add one more
>> column that specifies "is hot" or something to that effect. [...]
>
> How would you define what's a hot table?

I wasn't clear: I would have the admin specify it, and we can store it as an additional column in the pg_autovacuum_settings table. Or perhaps a table is hot if it is below some size threshold and autovacuum sees that it needs to be vacuumed every time it checks it, 10 times in a row or something like that.
---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] GiST Comparing IndexTuples/Datums
Hey folks: I posted this to the pgsql-novice mailing list but was told that it'd probably be better to repost here. I've been working with a group trying to implement UNIQUE index functionality in GiST (we started with hash, and have branched out to try and understand some of the other indexing structures as well). We installed the btree_gist contrib module and have gist indexes working fine and understanding when it's supposed to be unique (regardless of which contrib module is installed).

We currently can walk over all the IndexTuples in a page, and we were hoping to compare the IndexTuple passed into the gistdoinsert() function (the new itup being added to the index) to each of the IndexTuples in said page. The idea we've been playing with right now is using 'datum = index_getattr(itup, 1, RelationGetDescr(r), isnull);' as it's used in hashinsert.c, but we can't figure out how to compare the two datums then. The actual data type of the value being inserted would be different depending on the type of column you created the index on. Since you provide an opclass when creating a gist index, are we supposed to use one of the user-defined functions to compare items? (One of them is 'same', so in btree_gist the function is gbt_text_same() for indexes on char and text columns.) Is there a general way to access these functions without knowing specifically what contrib module is specified, something like OpClassFunction7()?

Thanks so much for humoring me, and I apologize as I get myself familiar with PostgreSQL and database concepts in general.
-Matt
Re: [HACKERS] autovacuum process handling
Alvaro Herrera wrote:
> This is how I think autovacuum should change with an eye towards being
> able to run multiple vacuums simultaneously:
> [snip details]
> Does this raise some red flags? It seems straightforward enough to me;
> I'll submit a patch implementing this, so that scheduling will continue
> to be as it is today. Thus the scheduling discussions are being
> deferred until they can be actually useful and implementable.

I can't really speak to the PostgreSQL signaling innards, but this sounds logical to me. I think having the worker processes be children of the postmaster and having them be single-minded (or single-tasked) also makes a lot of sense.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Autovacuum Improvements
Alvaro Herrera wrote:
> I'd like to hear other people's opinions on Darcy Buskermolen's
> proposal to have a log table, on which we'd register what we ran, at
> what time, how long it lasted, how many tuples it cleaned, etc. I feel
> having it on the regular text log is useful but it's not good enough.
> Keep in mind that in the future we may want to peek at that collected
> information to be able to take better scheduling decisions (or at
> least inform the DBA that he sucks).

I'm not familiar with his proposal, but I'm not sure what I think of logging vacuum (and perhaps analyze) commands to a table. We have never logged anything to tables inside PG, and I would be worried about this eating a lot of space in some situations. I think most people would just be happy if we could get autovacuum to log its actions at a much higher log level. I think that "autovacuum vacuumed table x" is important and shouldn't be all the way down at the debug level. The other (more involved) solution to this problem that was proposed was to create a separate set of logging control params for autovacuum, so that you can turn it up or down independent of the general server logging.

> Now, I'd like this to be a VACUUM thing, not autovacuum. That means
> that manually-run vacuums would be logged as well.

+1

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
Re: [HACKERS] Autovacuum improvements
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> This still seems ambiguous to me; how would I handle a maintenance
>> window of "weekends from Friday at 8PM through Monday morning at 6AM"?
>> My guess from what you said is:
>>
>>  mon  dom  dow  starttime  endtime
>>  null null 6    20:00      null
>>  null null 1    null       06:00
>>
>> So how do we know to vacuum on Saturday or Sunday? I think clearly
>> defined intervals with explicit start and stop times is cleaner.
>>
>>  mon  dom  dow  start  end
>>  null null 5    20:00  23:59:59
>>  null null 6    00:00  23:59:59
>>  null null 7    00:00  23:59:59
>>  null null 1    00:00  06:00
>>
>> (1 = monday, 5 = friday)

So it takes 4 lines to handle one logical interval; I don't really like that. I know that your concept of interval groups will help mask this, but still.

> Now I'm starting to wonder what will happen between 23:59:59 of day X
> and 00:00:00 of day (X+1) ... Maybe what we should do is not specify an
> end time, but a duration as an interval:
>
>  month     int
>  dom       int
>  dow       int
>  start     time
>  duration  interval
>
> That way you can specify the above as
>
>  mon  dom  dow  start  duration
>  null null 5    20:00  (4 hours + 2 days + 6 hours)
>
> Now, if a DST boundary happens to fall in that interval you'll be an
> hour short, or it'll last an hour too long :-)

I certainly like this better than the first proposal, but I still don't see how it's better than a full set of columns for start and end times. Can you tell me why you are trying to avoid that design?

>> Hmm... this seems like "queue" is nearly a synonym for "group". Can't
>> we just add a num_workers property to table groups? That seems to
>> accomplish the same thing. And yes, a GUC variable to limit the total
>> number of concurrent autovacuums is probably a good idea.
>
> queue = group of groups. But I'm not sure about this at all, which is
> why I took it away from the proposal.

I think we can live without the groups of groups, at least for now.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
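For what it's worth, the start-plus-duration representation being debated can be checked with plain timestamp arithmetic, which also sidesteps the 23:59:59/00:00:00 gap (the end is simply exclusive). The DST caveat from the mail still applies if this is done in local wall-clock time. A sketch, with dates chosen so the window runs Friday 20:00 through Monday 06:00:

```python
from datetime import datetime, timedelta

def in_window(now, start, duration):
    """True if 'now' falls inside [start, start + duration)."""
    return start <= now < start + duration

# Friday 2007-03-02 20:00 plus "4 hours + 2 days + 6 hours" ends
# exactly at Monday 2007-03-05 06:00 (exclusive).
start = datetime(2007, 3, 2, 20, 0)
dur = timedelta(hours=4) + timedelta(days=2) + timedelta(hours=6)
```

The end-exclusive comparison means Monday 06:00:00 itself is outside the window, so an adjacent window starting at 06:00 neither overlaps nor leaves a one-second hole.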
Re: [HACKERS] Autovacuum improvements
First, thanks for working on this. I hope to be helpful with the design discussion and possibly some coding if I can find the time. My initial reaction to this proposal is that it seems overly complex; however, I don't see a more elegant solution. I'm a bit concerned that most users won't figure out all the knobs.

Alvaro Herrera wrote:
> I've been thinking how to improve autovacuum so that we can convince
> more people that it can be enabled by default.

I would like to see it enabled by default too; however, the reason it isn't already enabled by default is that it caused failures in the regression test when we tried to turn it on during the 8.2 dev cycle, and it was too close to beta to fix everything. All this new machinery is great, but it doesn't address that problem.

> Here are my thoughts. There are two areas of improvement: 1.
> scheduling, and 2. process handling, i.e., how to have multiple vacuum
> processes running at any time.

Fair enough, but I would say the two biggest areas for improvement are scheduling and preventing hot tables from becoming vacuum starved (essentially what you said, but with a different emphasis).

[snip]

> Process Handling
>
> My idea here is to morph the current autovacuum daemon from an agent
> that itself runs a vacuum command, into something that launches other
> processes to run those vacuum commands. I'll call this the autovacuum
> launcher process, or the launcher for short. The idea here is that the
> launcher can take care of the scheduling while the worker processes do
> their work. If the launcher then determines that a particular instant
> there should be two vacuums running, then it simply starts two worker
> processes.

How about calling it the autovacuum_master process?

[snip autovacuum launcher process description]

That all sounds reasonable to me. I think the harder part is what you are getting at below (how to get the launcher to figure out what to vacuum when).

> Scheduling
> ==========
>
> We introduce the following concepts:
>
> 1. table groups.
> We'll have a system catalog for storing OID and group name, and
> another catalog for membership, linking relid to group OID.
>
> pg_av_tablegroup
>   tgrname  name
>
> pg_av_tgroupmembers
>   groupid  oid
>   relid    oid
>
> 2. interval groups. We'll have a catalog for storing igroup name and
> OID, and another catalog for membership. We identify an interval by:
> - month of year
> - day of month
> - day of week
> - start time of day
> - end time of day
> This is modelled after crontabs.
>
> pg_av_intervalgroup
>   igrname  name
>
> pg_av_igroupmembers
>   groupid    oid
>   month      int
>   dom        int
>   dow        int
>   starttime  timetz
>   endtime    timetz

This seems to assume that the start and end time for an interval will be on the same day; you probably need to specify a start month, dom, dow and time, and an end month, dom, dow and time. Since this is modeled after cron, do we allow wild-cards, or any of the other cron tricks like */20 or 1-3,5,7,9-11? Also, your notation above is ambiguous; it took me a while to realize that pg_av_igroupmembers.groupid wasn't referencing the id from pg_av_tablegroup.

> Additionally, we'll have another catalog on which we'll store table
> groups to interval groups relationships. On that catalog we'll also
> store those autovacuum settings that we want to be able to override:
> whether to disable it for this interval group, or the values for the
> vacuum/analyze equations.
>
> pg_av_schedule
>   tgroup            oid
>   igroup            oid
>   enabled           bool
>   queue             int
>   vac_base_thresh   int
>   vac_scale_factor  float
>   anl_base_thresh   int
>   anl_scale_factor  float
>   vac_cost_delay    int
>   vac_cost_limit    int
>   freeze_min_age    int
>   freeze_max_age    int

What is queue for?

> So the scheduler, at startup, loads the whole schedule in memory, and
> then wakes up at reasonable intervals and checks whether these
> equations hold for some of the tables it's monitoring. If they do,
> then launch a new worker process to do the job.
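The "vacuum/analyze equations" these columns feed are presumably the standard autovacuum threshold tests; as a sketch (the defaults shown are illustrative 8.2-era values, and a per-schedule row would simply override the two knobs):

```python
def needs_vacuum(dead_tuples, reltuples,
                 vac_base_thresh=500, vac_scale_factor=0.2):
    """Standard autovacuum-style trigger: vacuum once dead tuples exceed
    a flat base threshold plus a fraction of the table's size."""
    return dead_tuples > vac_base_thresh + vac_scale_factor * reltuples
```

The base threshold keeps tiny tables from being vacuumed on every check, while the scale factor lets the allowed bloat grow with table size; an analogous pair (anl_base_thresh, anl_scale_factor) would gate analyze.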
> We need a mechanism for having the scheduler rescan the schedule when
> a user modifies the catalog -- maybe having a trigger that sends a
> signal to the process is good enough (implementation detail: the
> signal must be routed via the postmaster, since the backend cannot
> hope to know the scheduler's PID. This is easy enough to do.)

This all looks reasonable, if not a bit complex. Question: what happens to the current pg_autovacuum relation? Also, what about system defaults? Will we have a hard-coded default interval of "always on", and one default table group that contains all the tables, with one default entry in pg_av_schedule? I think we need more discussion on scheduling; we need to make sure this solves the vacuum starvation problem. Does the launcher process consider each row in pg_av_schedule that