Re: [HACKERS] modules
Joshua D. Drake wrote: On Thu, 3 Apr 2008 13:54:11 - Greg Sabino Mullane [EMAIL PROTECTED] wrote: Right now contrib is a real catch-all of various things; it would be nice to categorize them somehow. And by categorize, I emphatically do NOT mean move to pgfoundry, which is pretty much a kiss of death.

Pgfoundry is not a kiss of death, except that you spread falsehoods like that. Pgfoundry is a very alive project that is constantly adding content and has continuing and very active projects.

+1. This modules/packages concept seems 100% orthogonal to pgfoundry to me. Pgfoundry seems like bug-tracking / development infrastructure, somewhat like SourceForge. Some modules might want to use it, some might not (no doubt PostGIS would stay with Refractions, etc.). This hypothetical modules project is more focused on installers, and perhaps ways to find and run the module installers, whether from pgfoundry or elsewhere.

Ron

PS: Regarding pgfoundry and credibility: it seems the stature and image of pgfoundry would go up a lot if PostgreSQL itself were hosted there. But no, I'm not advocating that.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] SHA1 on postgres 8.3
Sam Mason wrote: On Thu, Apr 03, 2008 at 07:07:56PM +0200, Svenne Krap wrote: ID serial Username varchar Password_md5 varchar Password_sha1 varchar ... Why not just use SHA-512? You get many more quality bits that way.

Or if he just wanted to use built-in tools and reduce accidental collisions almost exactly as much as he proposes, he could use

password_md5_with_salt_xxx varchar
password_md5_with_salt_yyy varchar

but I also can't see the point. Won't he have more collisions from cosmic rays turning the results of his comparisons from false to true anyway?

I would drop md5 totally and use sha1 and ripemd-160 if possible.. but currently I use only md5 as it is the only available one.. Loading pgcrypto is overkill for something as simple as hash functions.

Sounds like a good reason for moving the current md5 function out into pgcrypto as well! :)

I'd rephrase that as a good reason for making it less intimidating to install modules. +1 to all the approaches that make this less scary. For Perl, Digest-SHA1 is on CPAN: http://search.cpan.org/dist/Digest-SHA1/
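As a sketch of the salted-hash idea outside the database (plain Python hashlib here, not pgcrypto; the column names are taken from the proposal above):

```python
import hashlib

def salted_hash(password, salt, algorithm="md5"):
    """Hex digest of salt || password, using any algorithm hashlib knows."""
    h = hashlib.new(algorithm)
    h.update((salt + password).encode("utf-8"))
    return h.hexdigest()

# Two independently salted md5 columns (the password_md5_with_salt_xxx /
# _yyy idea): an accidental collision would have to happen in both at once.
col_xxx = salted_hash("secret", "xxx")
col_yyy = salted_hash("secret", "yyy")

# The "just use SHA-512" suggestion is a one-liner as well:
col_sha512 = salted_hash("secret", "xxx", "sha512")
```

Inside the database, pgcrypto's digest() function covers all of these (e.g. digest(password || salt, 'sha1')), which is why the real issue is the friction of installing the module at all.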
[HACKERS] modules
Andrew Dunstan wrote: Tom Lane wrote: ...as having better system support for packages or modules or whatever you want to call them; and maybe we also need some marketing-type ...re-raise the question of getting rid of contrib... The PostgreSQL Standard Modules.

While renaming, could we go one step further and come up with a clear definition of what it takes for something to qualify as a module? In particular, I think standardizing the installation would go a long way to letting packagers automate the installation of modules from pgfoundry.

I think it'd be especially cool if one could one day have a command

pg_install_module [modulename] -d [databasename]

and it would magically get (or verify that it had) the latest version from pgfoundry, compile it (if needed), and install it in the specified database. The closest analogy to what I'm thinking of is Perl's CPAN or Ruby gems.
Re: [HACKERS] modules
Andrew Dunstan wrote: I think it'd be especially cool if one could one day have a command pg_install_module [modulename] -d [databasename]

Yes - the CPAN analogy has been in several minds, but it only goes so far. Perl and Ruby are languages - Postgres is a very different animal.

Sure - but the benefits of standardizing installers for optional components seem to apply the same to both.

We do in fact have some support for building / installing some modules in a standard way. It's called pgxs and it is used by quite a number of existing modules.

Cool. It seems to handle at least quite a bit of the building part of standardized modules.

One thing that might be worth looking at is an install command at the SQL level, so that INSTALL foo would run the install script for the foo module in the current database, assuming it's in the standard location.

I'm guessing it would be harder to add the various options (install/ignore dependencies; specify a different source web site) that a standard installer would like to have.

We don't have a central repository of non-standard modules, like CPAN, and so of course no facility for fetching / building / installing them.

It seems that could easily be improved in a number of ways.

* The installer could specify the source. For example pg_install_module postgis -source http://www.refractions.net in exactly the same way Ruby uses gem install rails --source http://gems.rubyonrails.org

* pgfoundry could provide a repository of installable modules for projects hosted there.

* Perhaps pgfoundry could even have a section where it points to installers on third-party sites?

Not all modules fit a single pattern, either. There are addon languages, types, and function libraries, as well as utilities that are not installed in the database at all.

Agreed. Such a mechanism would only really apply to things that are installed in the database.
But from an end user's point of view, installing functions, index types, languages, data types, etc. all seem to fit the pg_install postgis -d mydb, pg_install pl_ruby -d mydb, etc. pattern pretty well.

Finally, setting up modules so they can be built for Windows, especially using MSVC, will probably be quite a challenge.

Indeed. It seems Ruby gems give you the option of installing a Ruby version or a Windows version that I'm guessing has pre-compiled object files.

So if someone wants to make a start on any of this I'm sure we would all listen up.

I'm happy to try, though I might need pointing in the right directions.
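For reference, the pgxs support mentioned above already standardizes most of the build step: a C module's Makefile typically reduces to a few lines ("mymodule" is a placeholder name, not a real project):

```make
# Minimal pgxs Makefile for a hypothetical C module "mymodule".
MODULES = mymodule          # builds mymodule.so from mymodule.c
DATA_built = mymodule.sql   # SQL install script to run in the target database

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
```

So a hypothetical pg_install_module would mostly need to add the fetch step and the run-the-SQL-script-in-a-database step on top of what pgxs already does.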
Re: [HACKERS] pg_dump additional options for performance
Magnus Hagander wrote: On Tue, Feb 26, 2008 at 08:28:11AM -0500, Andrew Dunstan wrote: Simon Riggs wrote: Separate files seems much simpler... Yes, we need to stick to the KISS principle. ISTM that we could simply invent a new archive format of "d" for directory. Yeah, you can always ZIP (or whatever) the resulting directory when you're done.

I could imagine this being very rsync-friendly too.

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Permanent settings
Magnus Hagander wrote: If they don't have an actual database, it's fairly common to use SQLite or similar just to get proper database storage for it.

With all the concern about parsing in this thread, perhaps it's best if this config-overrides file is not of the same syntax as postgresql.conf at all. If the interactive form of these overrides will be SET PERMANENTLY work_mem TO 65MB;, why not make the override config file use the same syntax, since a parser for it will have to exist anyway? Maybe some XML bloat. Or, since you mentioned it, perhaps SQLite itself, since some people on the thread seem to want SQL-like syntaxes to maintain it?

[Personally, we maintain Perl scripts that apply patches to the default postgresql.conf and check those in to source control. I don't think I'd use this override file feature.]
Re: [HACKERS] Feature Freeze Date for Next Release
Wouldn't seeing which patches are trickling in during the first months of 8.4 development give a better indication of when it should be freezable? I'm all in favor of having lots of advance notice and predictable schedules --- but it seems in the next month or so we'll have a lot more insight into whether 8.4'll be getting big features or little ones, based on what work-in-progress patches start coming in.

Simon Riggs wrote: Can I ask when the Feature Freeze for next release will be?

This gives me flashbacks to this earlier thread: http://archives.postgresql.org/pgsql-hackers/2006-09/msg01979.php

Simon Riggs in 2006 wrote: David Page wrote: Following the recent discussion on this list and another on pgsql-core, we have decided that we would like to aim to meet the following schedule for the release of PostgreSQL 8.3: April 1st 2007 - Feature freeze; June 1st 2007 - Release. I'm very happy to have clearly stated dates. That means we can all plan what we'll be able to achieve in that time, which is important when some of the largest or most complex features are being considered.

Quoting Simon in 2008 again: We've long expressed the wish to move development onto a cycle that ends in the Spring, so the next alternative would appear to be end-March 2009, which is 14 months away now.

In light of the 2006 thread, I'd say April 1st, like the previous schedule, would be better than end-March for those with a sense of irony.
Re: [HACKERS] configurability of OOM killer
Decibel! wrote: Yes, this problem goes way beyond OOM. Just try and configure work_memory aggressively on a server that might see 50 database connections, and do it in such a way that you won't swap. Good luck.

That sounds like an even broader and more difficult problem than managing memory. If you have 50 connections that all want to perform large sorts, what do you want to have happen?

a) They each do their sorts in parallel with small amounts of memory each, probably all spilling to disk.
b) They each get a big chunk of memory, but some have to wait for each other.
c) Something else?

Seems (a)'s already possible today by setting a small work_mem. Seems (b)'s already possible today with a larger work_mem and pg_pool. Stepping back from the technical details, what do you think should happen? (Though perhaps this should be taken to a different thread.)
Re: [HACKERS] Feature Freeze Date for Next Release
Simon Riggs wrote: Can I ask when the Feature Freeze for next release will be?

Also, from http://www.postgresql.org/about/press/faq - Q: When will 8.4 come out? A: Historically, PostgreSQL has released approximately every 12 months and there is no desire in the community to change from that pattern. So expect 8.4 sometime in the fourth quarter of 2008.

So you can count back from then to guess a freeze.
Re: [HACKERS] configurability of OOM killer
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: ... OOM_Killer. Egad. Whoever thought *this* was a good idea should be taken out and shot:

If I read this right, http://lkml.org/lkml/2007/2/9/275 , even the shared memory is counted many times (once per child) for the parent process - even though it's (obviously) not copy-on-write, so the shared memory's unlikely to contribute to problems.

I wonder if postgres startup should write something (a warning? at least a log line?) to the log file if the OOM killer is enabled. I assume most people who care deeply about their database dying would notice a warning in the log files, while most people who don't mind the OOM killer also wouldn't be too bothered by extra noise in the file.
Re: [HACKERS] configurability of OOM killer
Alvaro Herrera wrote: Yeah, the only way to improve the OOM problem would be to harass the Linux developers to tweak badness() so that it considers the postmaster to be an essential process rather than the one to preferentially kill.

Wouldn't the more general rule that Jeff Davis pointed out upthread make more sense? That shared memory of the children should not be added to the size of the parent process multiple times, regardless of whether something's an essential process or not. Since those bytes are shared, it seems they should only be added to the badness once, no? (Assuming I understood Jeff correctly.)
Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUC variable
Tom Lane wrote: Kevin Grittner [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote: Or is someone prepared to argue that there are no applications out there that will be broken if the same query, against the same unchanging table, yields different results from one trial to the next?

Won't even an autovacuum analyze cause this too, if the new stats change the plan?
Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUC variable
Jeff Davis wrote: On Mon, 2008-01-28 at 23:13 +, Heikki Linnakangas wrote: clusteredness didn't get screwed up by a table that looks like this: 5 6 7 8 9 1 2 3 4 ...test table with a similar distribution to your example, and it shows a correlation of about -0.5, but it should be as good as something near -1 or +1. I am not a statistics expert, but it seems like a better measurement would be: what is the chance that, if the tuples are close together in index order, the corresponding heap tuples are close together?

The same applies to data clustered by zip code. All rows for any state or city or county or school zone are close together on the same pages, yet Postgres's stats think they're totally unclustered.

The answer to that question in your example is very likely, so there would be no problem. Is there a reason we don't do this?

I've been tempted to do things like

update pg_statistic set stanumbers3='{1.0}' where starelid=2617 and staattnum=7;

after every analyze when I have data like this from tables clustered by zip. Seems it'd help more plans than it'd hurt, but I haven't been brave enough to try it in production.
Re: [HACKERS] Postgresql Materialized views
Mark Mielke wrote: Mark Mielke wrote: Counts, because as we all know, PostgreSQL count(*) is slow, and in any case, my count(*) is not on the whole table, but on a subset. Doing this in a general way seems complex to me, as it would need to be able to evaluate whether a given INSERT or UPDATE on one of the dependent tables would impact the WHERE clause for the materialized view, and it still wouldn't know which rows to add/update/remove without detailed analysis, so it would either be throwing out the entire materialized view and recreating it on INSERT or UPDATE (or deferring until the next query?), in which case it may be very slow, or it may be very complex. Bah. I forgot to add: The feature I've been wondering about (and not necessarily looking for somebody else to do, although I don't think I know the code well enough to do it at this point): Web applications often make the same queries over and over. While memcache can be used to cache results, the memcache interface is different from the web application interface, requiring complex code; as well, one loses the transaction guarantees, as the memcache results are not guaranteed to be up to date with the database.

Regarding up-to-dateness, note that there is a pgfoundry project that helps there: http://pgfoundry.org/projects/pgmemcache/

The other advantages of doing the caching outside the database are that (a) the memory for the cached results doesn't have to sit on the database machine, and (b) you can cache post-processed (rendered into HTML or GIFs) fragments rather than raw data.
Re: [HACKERS] Declarative partitioning grammar
Gavin Sherry wrote: CREATE TABLE is modified to accept a PARTITION BY clause. This clause contains one or more partition declarations. The syntax is as follows:

PARTITION BY {partition_type} (column_name[, column_name...]) [PARTITIONS number] ( partition_declaration[, partition_declaration...] )

The partition type can be one of HASH, RANGE or LIST.

What would be the drawbacks of

CREATE TABLE tablename(...) PARTITION BY function_taking_row_returning_partition_name

instead of the explicit types? It seems that with my own function I could pretty easily emulate the HASH, RANGE, or LIST types. It seems a function returning a partition name would largely avoid the need for the sub-partition stuff too - at least for the cases when the only reason you wanted sub-partitions was for composite partition support. I'm not sure if a function would give more flexibility, but it sure seems it'd be easier for me to remember than the various

PARTITION BY LIST (a) ( VALUES ('L') SUBPARTITION BY RANGE (b) (VALUES('x'),VALUES('y')), VALUES ('M') SUBPARTITION BY RANGE (b) (VALUES('u'),VALUES('t')))

or however it'd look.
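To make the function-based idea concrete, here's a sketch in entirely hypothetical syntax (the PARTITION BY FUNCTION clause is invented, as are the function and partition names; only the CASE body is ordinary SQL) of how a LIST-plus-SUBPARTITION layout might collapse into one routing function:

```sql
-- Hypothetical routing function: maps a row's key columns to a partition name.
CREATE FUNCTION my_partition(a text, b text) RETURNS text AS $$
  SELECT CASE
    WHEN a = 'L' AND b < 'y' THEN 'part_l_low'
    WHEN a = 'L'             THEN 'part_l_high'
    WHEN a = 'M' AND b < 't' THEN 'part_m_low'
    ELSE                          'part_m_high'
  END;
$$ LANGUAGE sql IMMUTABLE;

-- Invented grammar, not a real PostgreSQL clause:
CREATE TABLE mytable (a text, b text)
  PARTITION BY FUNCTION my_partition(a, b);
```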
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes.

Short summary: with the appropriate clustering, ISTM Segment Exclusion can be useful on all columns in a table. Cluster the table by one-bit-of-column1 || one-bit-of-column2 || ... That way any col2=value query could exclude at least about half the table, regardless of whether the column is monotonically increasing or even totally random.

It seems one could make Segment Exclusion useful even for multiple unrelated columns in a table. Consider a large table of people where one might want segment exclusion to help with both first name and last name. One could cluster the table by first-letter-of-last-name || first-letter-of-first-name. Then a query for last name Smith could exclude all but the consecutive segments of S's, while a query for first name John would only have to look in the 26 runs of segments with AJ, BJ, CJ, ...

Perhaps better - hash each column and interleave the bits: col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit2. If I understand segment exclusion right, that way on any table bigger than 8 segments any query of col1=val, col2=val, or col3=val would scan at most half the table; on a table bigger than 64 segments any such query would scan at most 1/4 of the table.

Obviously this only benefits the rarely changing parts of tables; new and updated values would be in a very hot segment at the end of the table that wouldn't be segment-excluded much.
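A sketch of the bit-interleaved cluster key (the hash choice and bit widths here are arbitrary illustrations, not a proposal for how Postgres would implement it):

```python
import hashlib

def col_hash(value, bits):
    """A stable `bits`-bit hash of one column value."""
    h = int.from_bytes(hashlib.md5(str(value).encode()).digest()[:4], "big")
    return h & ((1 << bits) - 1)

def interleaved_key(row, bits_per_col=8):
    """Cluster key laid out as col1bit1, col2bit1, ..., col1bit2, col2bit2, ...
    Rows sharing a value in ANY one column agree on that column's bit
    positions, so each such bit can halve the segments a scan must visit."""
    hashes = [col_hash(v, bits_per_col) for v in row]
    key = 0
    for bit in range(bits_per_col - 1, -1, -1):  # most significant bits first
        for h in hashes:
            key = (key << 1) | ((h >> bit) & 1)
    return key
```

Clustering a three-column table by interleaved_key gives the property described above: after the first three key bits the table splits into 8 runs, each homogeneous in one hash bit of every column.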
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes.

Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal code - which means that while the City, State, and Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments, and all queries for Texas to be able to come from only a fraction of the whole.

If the segment sizes are configurable, I imagine this would even be useful for other data - like a people table organized by last_name, first_name. Johns may be scattered throughout the table - but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something?
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: ...the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: ...you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically.

It seems to me the idea discussed elsewhere in the thread[1] about configurable segment sizes would make this limitation much less problematic for some types of data.

Some of my biggest tables are clustered by zip code and are insert-mostly. Common queries are where state_provence='TX' or where city='Dallas'. While I doubt I have enough data to fill a 1GB segment for any but the largest cities, I certainly have runs of many consecutive blocks - since clustering by zip tends to cluster cities as well. Even though the table's not monotonically increasing or decreasing, like values for cities and states are clustered together. Is my understanding right that these Segment Visibility Maps could help this case as well?

[1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php
Re: [HACKERS] Sorting Improvements for 8.4
Gregory Stark wrote: Note that speeding up a query from 20s to 5s isn't terribly useful.

I disagree totally with that. That is the difference between no chance of someone waiting for a web page to load, versus a good chance they'd wait. And 2s vs 0.5s is the difference between a web site that feels responsive and one that doesn't.

If it's OLTP you can't be using all your cores for each user anyway.

Even so, I'd much rather keep each response time lower. If web page requests are coming in at one a second, it's much nicer to respond to each of them in 1 second than in 4 seconds - even if the overall throughput is identical.
Re: [HACKERS] Sorting Improvements for 8.4
Mark Mielke wrote: I am curious - what algorithms exist to efficiently do a parallel sort? Do you mean if sorting 1 million items, it is possible to separate this into 2 sets of 500 thousand each, execute them in separate threads (with task administration and synchronization overhead), combine the results, and complete the task in significantly less time than doing it in one thread? I am skeptical that this is possible...

The links at the beginning of the thread point to articles that describe one such algorithm, along with benchmarks (http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m). The improvements were pretty consistent across set sizes ranging from very small (hundreds) to quite large (hundreds of thousands). Interestingly, even multi-threading on a single core helped a lot:

"Our tests correlate well with previous research that showed Intel's implementation of SMT (Hyper-Threading) to be adept at hiding this latency [6, 20, 12]. Table 4 shows that by having two threads access memory at the same time, performance improved over 80% when compared to the single-threaded version."

It uses both quicksort phases and merge phases; for the merge phase, using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed improvement - apparently because it could overlap memory access with CPU work.

Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem. My guess is that the best way to use multiple threads in one backend would be to find specific algorithms like sorting that would be easier to isolate.
Re: [HACKERS] Sorting Improvements for 8.4
Tom Lane wrote: ...I can believe that suitable test cases would show 2X improvement for 2 threads,

One other thing I found interesting is that their test case showed a near-2X improvement from hyperthreading, whereas I haven't heard of many other ways to get hyperthreading to show improvements for PostgreSQL.

but it doesn't follow that you will get 10X improvement with 10 threads, or even 4X with 4.

Yeah - unless those 10 cores have additional I/O bandwidth to memory compared to a 1-core system (which I'd hope would be the case, or else I'd expect many apps to run into memory bottlenecks on such systems, no?).
Re: [HACKERS] Sorting Improvements for 8.4
Has anyone looked into sorting algorithms that could use more than one CPU or core at a time? Benchmarks I see[1][2] suggest that sorting is an area that improves greatly with multiple processors, and even with multi-threading on a single-core processor.

"For 1-processor and 2-threads (1p2t), the algorithm sorts the relation about 48% faster than the single-threaded version with a speedup of 31% during the quicksort and 58% during the mergesort. The dual-processor (2p2t) version provides an even faster total speedup of 86% over the single-threaded version with a speedup of 60% during the quicksort and 100% during the merge sort." [from the ACM paper at link 2 below]

PS: Yeah, I know multi-threading is a hot button on these lists; but sorting seems a relatively isolated part of the code, and I'd wonder if it'd be isolatable enough that multiple CPUs could be used there.

[1] http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2] http://delivery.acm.org/10.1145/112/1114254/DaMoN_103.pdf?key1=1114254key2=5713023711coll=dl=ACMCFID=15151515CFTOKEN=6184618
Re: [HACKERS] [GENERAL] Stored procedure issue
On 12/01/07 20:40, Dragan Zubac wrote: Hello. I have a stored procedure which does the billing stuff in our system. It works OK, but if I put it in production, where there are some 5-10 billing events per second, the whole database slows down. It won't even drop some test table, reindex, vacuum - things which were done before in the blink of an eye. If I stop the application which calls the procedure, all is back to normal. We didn't implement any special locking mechanism in the procedure; all is default. The procedure is updating users' balances in table 'users'. On the other hand, a couple of 'heavy load' tables have foreign keys pointing to table 'users'. Is it a matter of concurrency and some locking issue, or maybe the existence of all those foreign keys pointing to table 'users', or maybe something else we're not aware of at the moment?

Are you using transactions? Are the tables properly indexed? Are the queries in the SP using the indexes properly? Did you do all the testing on a tiny database? Is the SP written as efficiently as it could be? (Think of ways to refactor it in order to get the same results with less effort.)

- -- Ron Johnson, Jr. Jefferson LA USA %SYSTEM-F-FISH, my hovercraft is full of eels
Re: [HACKERS] PG 7.3 is five years old today
Robert Treat wrote: On Tuesday 27 November 2007 15:07, Simon Riggs wrote: On Tue, 2007-11-27 at 14:02 -0500, Tom Lane wrote: There has been some discussion of making a project policy of dropping support for old releases after five years. Should we consider formally instituting that? ... Perhaps we should ask for volunteers to maintain that branch? ...

+1 to seeing if anyone else wants to take over management of the branch. I also think we should be a bit more generous with the EOL notice. One thing that could soften the blow is if the EOL notice mentioned which commercial organizations will provide paid support for longer than the community does. I assume that's one of the benefits of going with a commercial support organization?
Re: [HACKERS] How to keep a table in memory?
Heikki Linnakangas wrote: Luke Lonergan wrote: Vacuum is a better thing to run, much less CPU usage. Vacuum is actually not good for this purpose, because it's been special-cased to not bump the usage count.

Though the OS's page cache will still see it as accesses, no?
Re: [HACKERS] Feature Freeze date for 8.4
Joshua D. Drake wrote: We develop and commit like normal *until* the community feels there is enough for release. Then we announce a feature freeze.

I think you just described what will happen in reality, regardless of whatever is decided to be the official plan. :) I don't think that's necessarily a bad thing. IMHO 8.3 is looking like a better product because it's late, and DBAs concerned about stability will likely stay on 8.2 for a while anyway.
Re: [HACKERS] Release notes introductory text
Tom Lane wrote: Joshua D. Drake [EMAIL PROTECTED] writes: With respect to you Kevin, your managers should wait. You don't install .0 releases of any software into production without months of testing. At which point, normally a .1 release has come out anyway. How exactly do you expect the software to get from a .0 to a .1 release, or to have addressed the bugs that might bite you when it does get to .1, if you aren't helping to test it?

In most environments I've seen, developer and QA systems don't hesitate to move to .0 releases (or even betas). I agree with Joshua that it's nerve-wracking to move _production_ systems to .0 releases from most software vendors.

Now I realize that you did say "test" above, but way too often I see this sort of argument as a justification for doing nothing and expecting somebody else to fix it.
Re: [HACKERS] Use of postmaster
Brendan Jurd wrote: Seems it would be best to apply this nomenclature consistently, and simply drop the name postmaster from use.

+1. I agree the postmaster references in the docs, etc. should go away - with perhaps the exception of one FAQ entry that says postmaster is a deprecated name, in case anyone encounters it on other web sites or in old docs.

PS: Oh, it's so hard to resist saying it should be renamed postmaSQLter.
Re: [HACKERS] Why is there a tsquery data type?
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Why does text search need a tsquery data type? I realize it needs tsvector so it can create indexes and updated trigger columns, but it seems tsquery could instead just be a simple text string. By that logic, we don't need any data types other than text. Could similar logic argue that we'd want special types for regular expressions too? That seems quite parallel to the tsquery type to me.
Re: [HACKERS] SQL feature requests
Tom Lane wrote: Part of the reason for being conservative about changing here is that we've got a mix of standard and nonstandard behaviors A lot of this is legacy behavior that would never have passed muster if it had been newly proposed in the last few years --- we have gotten *far* stricter about SQL compliance than we used to be. But at this point backwards compatibility also has to weigh heavily. Has there been any thought to eventually phasing them out? Perhaps a GUC to give warnings in the log file when they're encountered. I guess we'd need 3 levels of warnings: off, reasonable, and pedantic. When set to the reasonable level it would only give smart warning messages like Warning: Use of frivolous nonstandard behavior XXX. Hint: Use the standard YYY instead. When set to pedantic it would point out every non-standard SQL statement - useful only for someone who wants to be aware of how much PostgreSQL-dependent behavior they might have. Then a future release could deprecate the frivolous non-standard pieces, presumably leading to simpler code in the long run.
Re: [HACKERS] tsearch2 patch status report
Merlin Moncure wrote: On 8/21/07, Magnus Hagander [EMAIL PROTECTED] wrote: OTOH, if we do it as a compat package, we need to set a firm end-date on it, so we don't have to maintain it forever. I would suggest making a pgfoundry project...that's what was done with userlocks. I'm pretty certain no one besides me has ever used the wrappers I created...a lot more people use tsearch2 than userlocks though. Hmm... In that case I'd think people should ask if anyone would use the tsearch2 compatibility layer before even creating a pgfoundry project. Speaking for myself, I expect we'd migrate to the core text search APIs as part of our 8.3 migration even if such a compatibility layer existed.
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Joshua D. Drake wrote: Tom Lane wrote: Robert Treat [EMAIL PROTECTED] writes: What exactly does default_text_search_config buy us? I think it is supposed to simplify things, but it sounds like it adds a bunch of corner cases, Well, the main thing we'd lose if we remove it is all trace of upward compatibility from the contrib version of tsearch. I don't think this is all that big of a deal. In fact I would expect it going from contrib to core and never had any illusion to the effect that I would be able to just upgrade from 8.2 (8.1) Tsearch2 to 8.3. FWIW, I also would _not_ have expected compatibility between contrib and core. In fact, I would have expected contrib tsearch to be a place where experimental APIs existed and that the single biggest difference between contrib vs core was that the core APIs removed any cruft that might have been in contrib. If default_text_search_config makes things more confusing or more fragile, I'd rather see it gone than kept around for backward-compatibility-to-pre-core reasons.
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Magnus Hagander wrote: I don't use the functional index part, but for new users I can see how that's certainly a *lot* easier. Can someone with modern hardware test whether it's still quite a bit slower than the extra column? I had tried it too, years ago, and found the functional index to be quite a bit slower: http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php but it'd be interesting to see if faster CPUs changed this.
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Oleg Bartunov wrote: What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I think I asked about this kind of usage a couple years back; and Oleg pointed out other reasons why it wasn't as good an idea too. http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php The particular question I had asked was why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc.) for re-checking the data made it inadvisable to avoid the extra column anyway. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. Are you saying the majority of users have a separate column with a trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them.
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Ron Mayer wrote: Bruce Momjian wrote: Oleg Bartunov wrote: What is a basis of your assumption ? I think I asked about this kind of usage a couple years back;... http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php ...why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc) for re-checking the data ... ... Are you saying the majority of users have a separate column with a trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. OK, it is good you measured it. I wonder how GIN would behave because it is not lossy. Too bad I don't have the same database around anymore. It seems the re-parsing for re-checking for the lossy index was very expensive, though. In the end, I suspect it depends greatly on what fraction of rows match. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. Sure, but you have to make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it.
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Ron Mayer wrote: I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. That's fair. Either the docs need to make it totally obvious or the software should force people to do something safe.
[HACKERS] Idea that might inspire more patch reviewing.
Bruce Momjian wrote: In talking to people who are assigned to review patches or could review patches, I often get the reply, Oh, yea, I need to do that. Would it inspire more people to learn enough to become patch reviewers if patch authors scheduled walkthroughs of their patches with question-and-answer sessions over IRC or maybe even some voice conferencing system like Skype? While it might not be of immediate value, I imagine a number of aspiring hackers might find such walkthroughs enlightening, and if actual qualified reviewers participate in the Q&A during those walkthroughs, seeing the kinds of questions raised would be quite educational as well. I don't know if this would help - I guess it needs 3 things: patch authors willing to do such walkthroughs, qualified people willing to participate in them, and enough wannabe hackers wanting to listen in. Do people think this would help, or is it just a clunkier way of doing what's already done via email on the patches list?
Re: [HACKERS] Not ready for 8.3
Andrew Dunstan wrote: Josh Berkus wrote: I think that may be where we're heading. In that case, we may need to talk about branching earlier so that developers can work on the new version who are frozen out of the in-process one. I've argued this in the past. But be aware that it will make more work for committers. It is very far from cost-free. I hate to bring up the SCM flamewar again, but IMHO the one advantage the distributed systems (Git, Monotone, Darcs, etc.) have over CVS/Subversion is that they push this branch-management cost down to the developer who wants a branch - rather than onto the people maintaining the official repository.
Re: [HACKERS] My honours project - databases using dynamically attached entity-properties
Josh Berkus wrote: And then what? dynamically construct all your SQL queries? Sure, sounds like a simple solution to me... Not to mention DB security issues. How do you secure your database when your web client has DDL access? So, Edward, the really *interesting* idea would be to come up with a secure, normalized way to do UDFs *without* EAV tables. People would be very impressed. I have a system with many essentially user-defined fields, and was thinking of creating something similar to an Array type and writing some GiST indexes for it. My current workaround is to store them as a YAML document and use tsearch to index it (with application logic to further refine the results) - but an EAV datatype that could be put in tables and effectively indexed would be of quite a bit of interest here. And yes, a better way to do UDFs would be even cooler.
Re: [HACKERS] SCMS question
Bruce Momjian wrote: My typical cycle is to take the patch, apply it to my tree, then cvs diff and look at the diff, adjust the source, and rerun until I like the diff and apply. How do I do that with this setup? The most similar to what you're doing would be to merge the patch's branch into yours. It's about exactly the same amount of work as applying a patch (a one-liner if there are no conflicts). From that point you could continue exactly as you are now - with the additional benefit(?) that the checkin history of the branch should (I hope) be preserved through the merge process, so the SCM's history would let you see which changes from the patch came from the submitter and which changes came from the modifications in your tree. (I think this SCM requirements list http://changelog.complete.org/posts/528-Whose-Distributed-VCS-Is-The-Most-Distributed.html is one of the more interesting ones. The two features I like about the distributed systems are # 5. Branching preserves full history # 6. Merging preserves full history. so the history of the branch (including which changes came from the submitter and which were modifications in your tree) is preserved when they're eventually committed to head.)
Re: [HACKERS] SCMS question
Alvaro Herrera wrote: Yes, it's nice. Consider this: Andrew develops some changes to PL/perl in his branch. Neil doesn't like something in those changes, so he commits a fix there. If I understand right, another advantage is that the SCM will keep track of which of those changes came from Neil and which from Andrew and that this history will be preserved when eventually merged with head. In contrast, with CVS I think detailed history of changes within Andrew's branch would be lost and only recorded as one large checkin in CVS. Right?
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote: Once autovacuum_naptime... autovacuum_max_workers... How does this sound? The knobs exposed on autovacuum feel kinda tangential to what I think I'd really want to control. IMHO vacuum_mbytes_per_second would be quite a bit more intuitive than cost_delay, naptime, etc. ISTM I can relatively easily estimate and/or spec out how much extra I/O bandwidth I have per device for vacuum; and would pretty much want vacuum to be constantly running on whichever table needs it the most, so long as it can stay under that bandwidth limit. Could vacuum have a tunable that says X MBytes/second (perhaps per device) and have it measure how much I/O it's actually doing and try to stay under that limit? For more fine-grained control a cron job could go around setting different MBytes/second limits during peak times vs idle times. If people are concerned about CPU-intensive vacuums instead of I/O-intensive ones (does anyone experience that?), another tunable vacuum_percent_of_cpu would be more straightforward than delay_cost, cost_page_hit, etc. But I'd be a bit surprised if CPU-intensive vacuums are common.
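A minimal sketch of how such a hypothetical vacuum_mbytes_per_second knob could work (the tunable name and the read/write callbacks are assumptions for illustration, not real PostgreSQL APIs): after each block of I/O, compare actual progress against the target rate and sleep off any surplus.

```python
import time

def throttled_copy(read_block, write_block, nblocks, block_size=8192,
                   mbytes_per_second=4.0):
    """Process nblocks of I/O, sleeping as needed to stay under a
    bandwidth cap expressed directly in MB/s, instead of indirectly
    via cost_delay/naptime-style knobs."""
    budget_bytes_per_sec = mbytes_per_second * 1024 * 1024
    start = time.monotonic()
    done = 0
    for i in range(nblocks):
        write_block(read_block(i))
        done += block_size
        # Time we *should* have taken so far at the target rate;
        # if we're ahead of schedule, sleep off the surplus.
        target_elapsed = done / budget_bytes_per_sec
        surplus = target_elapsed - (time.monotonic() - start)
        if surplus > 0:
            time.sleep(surplus)
    return done
```

A cron job adjusting the rate argument for peak vs. idle hours would then be a one-line change, which is the appeal of the bandwidth-style knob.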
Re: [HACKERS] A more general approach (Re: Data archiving/warehousing idea)
Hannu Krosing wrote: ...is storing all tuple visibility info in a separate file. At first it seems to just add overhead, but for lots (most ? ) usecases the separately stored visibility should be highly compressible, so for example for bulk-loaded tables you could end up with one bit per page saying that all tuples on this page are visible. Seems you could do one bit per page compression even with visibility data stored in the pages themselves. I could imagine a table re-writing vacuum freeze that finds pages with all data visible to everyone and packs them with a single bit saying everything here's frozen. The penalty would be an expensive splitting of the page (and who knows what evil locks would be needed) if an update is ever done on those wholly frozen pages -- but we're talking about read-mostly tables here so that tradeoff might not be bad.
Re: [HACKERS] tsearch in core patch, for inclusion
Andrew Dunstan wrote: Joshua D. Drake wrote: Where on the website can I see what plugins are included with PostgreSQL? YES! That's IMHO a more fundamental problem. The specific question about Text Search seems more like a symptom. While I don't mind Text Search in core, it seems an even bigger deal that it's hard to find information on extensions (whether from contrib or from gborg or from external places like PostGIS). A web page with a table easily visible on the postgresql web site that had Extension (e.g. tsearch2, postgis) Project Maturity (e.g. alpha/beta/stable) Compatibility (e.g. extension 1.0 works with postgresql 8.2) Description (e.g. full text search) URL would be a partial fix. contrib is a horrible misnomer. Can we maybe bite the bullet and call it something else? +1 How about plugins or extensions or optional libraries.
Re: [HACKERS] [GENERAL] Autovacuum Improvements
Gregory Stark wrote: Actually no. A while back I did experiments to see how fast reading a file sequentially was compared to reading the same file sequentially but skipping x% of the blocks randomly. The results were surprising (to me) and depressing. The breakeven point was about 7%. [...] The theory online was that as long as you're reading one page from each disk track you're going to pay the same seek overhead as reading the entire track. Could one take advantage of this observation in designing the DSM? Instead of a separate bit representing every page, having each bit represent 20 or so pages might be a more useful unit. It sounds like the time spent reading would be similar; while the bitmap would be significantly smaller.
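To put rough numbers on that trade-off, here is a back-of-the-envelope sketch (the page size and table size are illustrative assumptions): coarsening the map to one bit per 20 pages shrinks it about 20x, while per the ~7% breakeven above the read time for the covered ranges stays roughly the same.

```python
def dsm_bitmap_bytes(table_bytes, page_size=8192, pages_per_bit=1):
    """Bytes needed for a DSM bitmap with one bit per pages_per_bit pages."""
    pages = -(-table_bytes // page_size)      # ceil division
    bits = -(-pages // pages_per_bit)
    return -(-bits // 8)

one_gib = 1 << 30
fine = dsm_bitmap_bytes(one_gib)                      # one bit per page
coarse = dsm_bitmap_bytes(one_gib, pages_per_bit=20)  # one bit per 20 pages
```

For a 1 GiB table that works out to 16384 bytes at one bit per page versus 820 bytes at one bit per 20 pages.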
Re: [HACKERS] [GENERAL] Autovacuum Improvements
Matthew T. O'Connor wrote: Alvaro Herrera wrote: I'd like to hear other people's opinions on Darcy Buskermolen proposal to have a log table, on which we'd register what did we run, at what time, how long did it last, [...] I think most people would just be happy if we could get autovacuum to log its actions at a much higher log level. I think that autovacuum vacuumed table x is important and shouldn't be all the way down at the debug level. +1 here. Even more than autovacuum vacuumed table x, I'd like to see vacuum starting table x and vacuum done table x. The reason I say that is because the speculation autovacuum might have been running is now a frequent phrase I hear when performance questions are asked. If vacuum start and end times were logged at a more prominent level, that feature plus log_min_duration_statement could easily disprove the vacuum might have been running hypothesis.
Re: [HACKERS] Function execution costs 'n all that
Tom Lane wrote: BTW, I'm thinking that a cost constant probably ought to be measured in units of cpu_operator_cost. The default for built-in functions would thus be 1, at least till such time as someone wants to refine the estimates. We'd probably want the default for PL and SQL functions to be 10 or 100 or so. Any chance that costs could eventually change to real-world units? It's hard for me to guess how many cpu_operator_cost units something might take; but relatively easy for me to measure or estimate in fractions of a second how long something takes. I could imagine having the other planner costs be measured in seconds too - perhaps with the goal of eventually writing some auto-config code that tries to measure values like cpu_tuple_cost on a given piece of hardware.
Re: [HACKERS] InitPostgres and flatfiles question
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: What value is allowing multiple queries via PQexec() The only argument I can think of is that it allows applications to be sloppy about parsing a SQL script into individual commands before they send it. (I think initdb may be guilty of exactly that BTW...) At the same time you could argue that such sloppiness is inherently a Bad Idea. Doesn't it also avoid some network overhead when you have a large number of small inserts or updates? I seem to recall a previous company where we got a major performance gain by concatenating a bunch of updates with ;s in between and sending them to postgresql as a single command.
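The round-trip saving can be sketched with a simple model (the timings below are made-up illustrative numbers, not measurements): sending statements one per PQexec() pays one network round trip each, while concatenating them with semicolons amortizes that cost across the batch.

```python
import math

def total_time_ms(n_stmts, per_stmt_ms, round_trip_ms, batch_size):
    """Client-observed time: execution cost plus one round trip per batch."""
    batches = math.ceil(n_stmts / batch_size)
    return n_stmts * per_stmt_ms + batches * round_trip_ms

one_by_one = total_time_ms(1000, 0.1, 0.5, 1)    # 1000 round trips
batched = total_time_ms(1000, 0.1, 0.5, 100)     # 10 round trips
```

With these assumed numbers the batched version takes roughly 105 ms instead of 600 ms; the larger the round-trip time relative to per-statement execution time, the bigger the win.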
Re: [HACKERS] Load distributed checkpoint
Gregory Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: I have a new idea. ...the BSD kernel...similar issue...to smooth writes: Linux has a more complex solution to this (of course) which has undergone a few generations over time. Older kernels had a user space daemon called bdflush which called an undocumented syscall every 5s. More recent ones have a kernel thread called pdflush. I think both have various mostly undocumented tuning knobs but neither makes any sort of guarantee about the amount of time a dirty buffer might live before being synced. Earlier in this thread (around the 7th) was a discussion of /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs, which seem to be the tunables that matter here. Googling suggests that dirty_expire_centisecs specifies that data which has been dirty in memory for longer than this interval will be written out next time a pdflush daemon wakes up, and dirty_writeback_centisecs expresses the interval between those wakeups. It seems to me that the sum of the two times determines the maximum time before the kernel will start syncing a dirtied page. Bottom line, though, is that it seems both postgresql and the OSes are trying to delay writes in the hopes of collapsing them; and that the actual delay is the sum of the OS's delay and postgresql's delay. I think Kevin Grittner's experimentation earlier in the thread did indeed suggest that getting writes to the OS faster and letting it handle the collapsing of the writes was an effective method of reducing painful checkpoints.
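Under that reading of the tunables, the worst-case kernel-side delay is just the sum of the two intervals. A sketch with common kernel defaults (the default values are illustrative assumptions - check /proc/sys/vm/ on the machine in question):

```python
def max_writeback_delay_s(dirty_expire_centisecs=3000,
                          dirty_writeback_centisecs=500):
    """Upper bound, in seconds, before pdflush starts writing a dirtied
    page: the page must first age past the expiry interval, then wait
    for the next pdflush wakeup."""
    return (dirty_expire_centisecs + dirty_writeback_centisecs) / 100.0
```

With those defaults that's 35 seconds of kernel-side delay, and PostgreSQL's own write-delaying behavior stacks on top of it.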
Re: [HACKERS] [PERFORM] EXPLAIN ANALYZE on 8.2
At 10:45 AM 12/15/2006, Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: There are various attempts at providing better timing infrastructure at low overhead but I'm not sure what's out there currently. I expect to do this what we'll have to do is invent a pg_* abstraction that has various implementations on different architectures. You've got to be kidding. Surely it's glibc's responsibility, not ours, to implement gettimeofday correctly for the hardware. regards, tom lane I agree with Tom on this. Perhaps the best compromise is for the pg community to make thoughtful suggestions to the glibc community? Ron Peacetree
Re: [HACKERS] Load distributed checkpoint
ITAGAKI Takahiro wrote: Kevin Grittner [EMAIL PROTECTED] wrote: ...the file system cache will collapse repeated writes to the same location until things settle ... If we just push dirty pages out to the OS as soon as possible, and let the file system do its job, I think we're in better shape... Maybe we have two entirely different tuning approaches: 1. Retain dirty buffers in database, and keep OS buffers clean. 2. Keep database clean, and entrust OS to manage dirty buffers. I suggested the 1st one, and you did the 2nd. Bottle-neck in checkpoints vary in the approaches; write() will be worse in 1st, fsync() in 2nd. The fsync() won't necessarily be worse in the second approach. OSes have quite a few tunable parameters that can encourage the system to physically write the pending write()s before the fsync() - either in the background or by the process doing the write() itself when there are too many dirty pages. For checkpoints, I think the main question is whether postgresql's background writer is smarter or less smart than pdflush or the equivalent on your system for database workloads. Also, database has own access-frequency information for its buffers, so I think 1st approach behaves better in handling re-dirty of buffers. I'm curious what access-frequency info the OS and the database have. One thing I do worry about is if both postgresql and the OS are delaying write()s in the hopes of collapsing them at the same time. If so, this would cause both to experience bigger delays than expected, and make checkpoints worse. I'm guessing if you use approach 1 you might be better off turning down the amount of buffering that the OS does with dirty pages - and if you use approach 2, you might be better off turning down the amount of delay that postgresql adds.
Re: [HACKERS] Grouped Index Tuples
Jim C. Nasby wrote: On usage, ISTM it would be better to turn on GIT only for a clustered index and not the PK? I'm guessing your automatic case is intended for SERIAL PKs, but maybe it would be better to just make that explicit. Not necessarily; since often (in my tables at least) the data for some columns has some local grouping of similar values even though it's not the clustered index. The obvious example is addresses (city, state, street, etc. may not appear clustered - but if you cluster your table by zip code, GIT will work well for them). Other examples would be tables containing dates and product names - even though there's no total monotonic ordering, if your product families get refreshed every couple of years you'll find that one range of product names shows up mostly in old years, and others in new years - so GIT could prove useful there, despite not being a clustered index.
Re: [HACKERS] Proposal: syntax of operation with tsearch's configuration
Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: I don't see any comparable arguments about this full-text search stuff. In particular I don't see any arguments why a change would necessary at all, including why moving to core would be necessary in the first place. AFAIR the only argument in favor of that is basically a marketing one: users perceive a feature as more real, or more supported, if it's in core. I don't find this argument especially compelling myself. On the flip side of that argument - the more non-SQL-standard pieces are in core, the less real the pieces outside core appear. People seem to have little doubt regarding CPAN or Ruby Gems; I believe that's in large part because a lot of very important and well-supported functionality exists outside of their core distributions. The less that's pre-baked into core, I think the more people will be aware of the rich set of extensions postgresql enables. From a marketing point of view (should I have moved this to .advocacy?), it seems to me the biggest problem is the name contrib. If it were called optional or advanced or extra I think it'd be seen less suspiciously by hosting companies (who seem to have the biggest problem with contrib) and we wouldn't need as many discussions of which contribs to move into core. Ron M
Re: [HACKERS] [SPAM?] Re: Asynchronous I/O Support
Zeugswetter Andreas ADI SD wrote: POSIX_FADV_WILLNEED definitely sounds very interesting, but: I think this interface was intended to hint larger areas (megabytes). But the wishful thinking was not to hint seq scans, but to advise single 8k pages. Surely POSIX_FADV_SEQUENTIAL is the one intended to hint seq scans, and POSIX_FADV_RANDOM to hint random access. No? ISTM, _WILLNEED seems just right for small random-access blocks. Anyway, for those who want to see what they do in Linux, http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c Pretty scary that Bruce said it could make older linuxes dump core - there isn't a lot of code there.
Re: [HACKERS] Hints WAS: Index Tuning Features
Mark Woodward wrote: Exactly. IMHO, it is a frustrating environment. PostgreSQL is a great system, and while I completely respect the individuals involved, I think the management for lack of a better term, is difficult. 'course you're welcome to fork the project as well if your style and/or priorities are different than the postgresql core team's. If your approach is that much less frustrating, your project would gain that much more momentum from developers joining you. If more developers like your style and/or priorities, they'll migrate to your project. I think Bizgres, Mammoth, EnterpriseDB, Red Hat DB, and Gentoo's occasional bizarro patches are proof both that it can work and that it's difficult.
Re: [HACKERS] Index Tuning Features
Andrew Sullivan wrote: Just because I'm one of those statistics true believers, what sort of information do you think it is possible for the DBA to take into consideration, when building a hint, that could not in principle be gathered efficiently by a statistics system? It seems to me that you're claiming that DBAs can have magic knowledge. One example is a table of addresses clustered by zip code, with indexes on State, City, County, etc. The current statistics system sees no correlation between these fields (since the alphabetical ordering of cities and the numbering of postal codes are quite different). This makes the planner under-use the indexes, because it sees no correlation and overestimates the number of pages read and the random accesses needed. However, since San Francisco, CA data happens to be tightly packed on a few pages (since it shares the same few zip codes), few pages are needed and mostly sequential access could be used when querying SF data -- though the optimizer guesses most pages in the table may be hit, so it often ignores the indexes. Now I'm not saying that a more advanced statistics system couldn't one day be written that sees these patterns in the data -- but it doesn't seem likely in the near term. DBA-based hints could be a useful interim work-around.
Re: [HACKERS] Fwd: Is the fsync() fake on FreeBSD6.1?
Andrew - Supernews wrote: Whether the underlying device lies about the write completion is another matter. All current SCSI disks have WCE enabled by default, which means that they will lie about write completion if FUA was not set in the request, which FreeBSD never sets. (It's not possible to get correct results by having fsync() somehow selectively set FUA, because that would leave previously-completed requests in the cache.) WCE can be disabled on either a temporary or permanent basis by changing the appropriate modepage. It's possible that Linux does this automatically, or sets FUA on all writes, though that would surprise me considerably; however I disclaim any knowledge of Linux internals. The Linux SATA driver author Jeff Garzik suggests [note 1] that The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE command to be generated has only been present in the most recent [as of mid 2005] 2.6.x kernels. See the write barrier stuff that people have been discussing. Furthermore, read-after-write implies nothing at all. The only way you can be assured that your data has hit the platter is (1) issuing [FLUSH|SYNC] CACHE, or (2) using FUA-style disk commands It sounds like your test (or reasoning) is invalid. Before those mid-2005 2.6.x kernels, apparently fsync on Linux didn't really try to flush drive caches even when the drives supported it (which apparently most actually do, if the requests are actually sent). [note 1] http://lkml.org/lkml/2005/5/15/82
Re: [HACKERS] Optimizer improvements: to do or not to do?
Simon Riggs wrote: On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote: That's what I want to do: 1. Replace not very useful indexCorrelation with indexClustering. An opinion such as not very useful isn't considered sufficient explanation or justification for a change around here. Not sufficient for some types of data would have been more fair. I speculate that a new additional stat of average # of unique values for a column within a block would go a long way to helping my worst queries. It's common here for queries to vastly overestimate the number of pages that would need to be read, because postgresql's guess at the correlation is practically 0 despite the fact that the distinct values for any given column are closely packed on a few pages. Our biggest tables (180G or so) are mostly spatial data with columns like City, State, Zip, County, Street, School District, Police Beat, lat/long, etc; and we cluster the table on zip,street. Note that practically all the rows for any single value of any of the columns will lie in the same few blocks. However, the calculated correlation is low because the total ordering of the other values doesn't match that of zip codes. This makes the optimizer vastly overestimate the cost of index scans, because it guesses that most of the table will need to be read, even though in reality just a few pages are needed. If someone does look at the correlation calculations, I hope this type of data gets considered as well. I speculate that a new stat of average # of unique values for a column within a block could be useful here in addition to correlation. For almost all columns in my big table, this stat would be 1 or 2; which I think would be a useful hint that despite a low correlation, the distinct values are indeed packed together in blocks. That way the optimizer can see that a smaller number of pages would need to be accessed than correlation alone would suggest. Does this make sense, or am I missing something?
---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Optimizer improvements: to do or not to do?
Gregory Stark wrote: Ron Mayer [EMAIL PROTECTED] writes: ...vastly overestimate the number of pages .. because postgresql's guess at the correlation being practically 0 despite the fact that the distinct values for any given column are closely packed on a few pages. I think we need a serious statistics jock to pipe up with some standard metrics that do what we need. Otherwise we'll never have a solid footing for the predictions we make and will never know how much we can trust them. Do we know if any such people participate/lurk on this list, or if the conversation should go elsewhere? That said, I'm now going to do exactly what I just said we should stop doing and brainstorm about an ad-hoc metric that might help: I wonder if what we need is something like: sort the sampled values by value and count up the average number of distinct blocks per value. That might let us predict how many pages a fetch of a specific value would retrieve. Or perhaps we need a second histogram where the quantities are of distinct pages rather than total records. Either of these sounds like it might be an improvement over correlation itself for estimating the number of pages it'd need to read. Would it be relatively easy or hard for a programmer not too familiar with the code to experiment with these ideas? Where would be a good place to look? We might also need a separate average number of n-block spans per value metric to predict how sequential the i/o will be, in addition to how many pages will be fetched. I'm wildly guessing that the # of pages itself is a bigger factor than the sequential/random nature: for example, if I do a query for data from a particular small city I'd only need dozens of pages, not many thousands. OTOH, it'd be neat to know if this were true. Is there any good way to make something like explain analyze show both the expected and actual # of pages and # of seeks? ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
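The ad-hoc metric brainstormed above -- average number of distinct blocks per sampled value -- is simple to sketch. The data sets below are invented for illustration; a value near 1 means each value's rows are packed into few pages even when the correlation statistic is near zero:

```python
# Sketch of the proposed metric: from a sample of (value, block_number)
# pairs, compute the average number of distinct blocks per value.
from collections import defaultdict

def avg_distinct_blocks_per_value(sample):
    """sample: iterable of (value, block_number) pairs."""
    blocks = defaultdict(set)
    for value, block in sample:
        blocks[value].add(block)
    return sum(len(b) for b in blocks.values()) / len(blocks)

# Clustered data: each value's rows all fall in one block -> metric = 1.0
clustered = [(v, v // 10) for v in range(100) for _ in range(5)]

# Scattered data: each value appears once per block -> metric = 10.0
scattered = [(i % 10, i // 10) for i in range(100)]
```

The first case is the poster's 180 GB spatial table (metric 1 or 2); the second is data where a low correlation really does mean many pages per value.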
Re: [HACKERS] 8.2 features status
Tom Lane wrote: But a quick troll through the CVS logs shows ... multi-row VALUES, not only for INSERT but everywhere SELECT is allowed ... multi-argument aggregates, including SQL2003-standard statistical aggregates ... standard_conforming_strings can be turned on (HUGE deal for some people) ... support SQL-compliant row comparisons; they can be indexscan quals ISTM this could be spun as a standards-focused release as well (at least partial implementations of a number of optional(?) SQL2003 features). ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] 8.2 features status
Tom Lane wrote: I tend to agree --- I don't see much value in trying to institute a formalized process. One more problem: a formalized process of claiming features in advance may stop what I suspect is a significant source of contributions -- people who add features/patches for internal work in their company and only after the fact find that they are something they'd contribute back. The small contribution I made (to help admins know when FSM settings were too low by monitoring log files instead of manual checks [1]) was done because we wanted it internally. Only after it proved useful to us was it mentioned to the lists. Thanks in part to the BSD nature of postgresql, I suspect there are many internal-and-not-yet-released useful patches lurking around in industry. If I'm right, I'd wonder what the advocacy guys could do to get corporations to volunteer to contribute back changes they've found useful internally. We have not had that many cases where lack of communication was a problem. One could say too much communication was the problem this time. I get the impression people implied they'd do something on a TODO and didn't. Arguably the project would have been better off if no one had claimed the TODO, so that if another company/team/whatever needed the feature badly, they could have worked on it themselves rather than waiting in hope of the feature. Of course they could have done this anyway - but if they see it on an implied roadmap document for the next release they're more likely to wait. Ron [1] http://archives.postgresql.org/pgsql-patches/2005-02/msg00171.php ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] 8.2 features status
[EMAIL PROTECTED] wrote: Ron Mayer wrote: We have not had that many cases where lack of communication was a problem. One could say too much communication was the problem this time. I get the impression people implied they'd do something on a TODO and didn't. Arguably the project would have been better off if no one had claimed the TODO, so that if another company/team/whatever needed the feature badly, they could have worked on it themselves rather than waiting in hope of the feature. This is just perverse. Surely you are not seriously suggesting that we should all develop in secret and then spring miracles fully grown on the community? Of course not. What I'm suggesting is two things. (1) That misleading information is worse than no information; and that speculative information next to TODOs can do as much harm discouraging others as the good it does for communication. Perhaps a name/assignment/claim on a todo might be nice if someone wanted a private conversation with someone who knows about a feature; but even there, wouldn't a public discussion on the lists likely be better? (2) That much corporate development on BSD projects is indeed developed in secret, though it may be contributed later, either because the company decides it's no longer a trade secret or gets tired of maintaining its own fork. Sure, such patches might need even more discussion and revision than if they were designed with core - but I think it's a reality that such work exists. We have bumped patches before because they have done things without discussing them, and were found not to be acceptable. The more complex features get, the more communication is needed. Agreed, of course. This makes me think that ongoing discussion of patches on hackers is the only way to judge progress on a todo; and anything like assigned names / estimated dates / releases is less likely to be meaningful than what one could infer from discussions on the lists.
---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] O_NOATIME
Tom Lane wrote: Ron Mayer [EMAIL PROTECTED] writes: Would people be interested in a trivial patch that adds O_NOATIME to open() for platforms that support it? (apparently Linux 2.6.8 and better). Isn't that usually, and more portably, handled in the filesystem mount options? Yes to both. I could imagine that for small systems/workstations you might have some files that want access time, and others that want NOATIME -- it seems the new flag lets you choose on a file-by-file basis. That's why I asked. I imagine it won't help on any well-administered production server since they'd probably mount the whole filesystem that way; but it might help a bit on out-of-the-box-default-config benchmarks done by naive users who don't tweak filesystem settings. Don't know if we'd care about such an audience or not. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
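The file-by-file nature of the flag is easy to illustrate. The sketch below (an illustration, not the proposed patch) opens one file with O_NOATIME where the platform provides it -- Python only defines `os.O_NOATIME` on Linux, which mirrors the portability concern raised above:

```python
# Per-file O_NOATIME, in contrast to a filesystem-wide "noatime" mount
# option.  The flag exists only on Linux >= 2.6.8; elsewhere we fall
# back to a plain open, which is exactly the portability question here.
import os, tempfile

fd_, path = tempfile.mkstemp()
os.write(fd_, b"data")
os.close(fd_)

flags = os.O_RDONLY
if hasattr(os, "O_NOATIME"):      # Linux-only; skip atime updates
    flags |= os.O_NOATIME
fd = os.open(path, flags)
data = os.read(fd, 16)
os.close(fd)
os.unlink(path)
```

(O_NOATIME is only permitted for the file's owner, which is also why an admin who wants reliable access times might dislike it less than it first appears -- a non-owner can't use it.)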
Re: [HACKERS] O_NOATIME
Tom Lane wrote: Ron Mayer [EMAIL PROTECTED] writes: Tom Lane wrote: Isn't that usually, and more portably, handled in the filesystem mount options? Yes to both. I could imagine that for small systems/workstations you might have some files that want access time, and others that want NOATIME -- it seems the new flag lets you choose on a file-by-file basis. Personally, if I were an admin who wanted access times, I'd regard the existence of such a flag as a security hole. I'm not sure I see that. I'd have thought that since postgresql already caches stuff in shared buffers, the atime of a postgresql file isn't reliable anyway; and outside of postgresql, O_NOATIME doesn't seem to me to affect admins any worse than the existence of utime(). OTOH, I'm not going to argue for the patch either. I think it'd be fair to say adding a linuxism that only benefits novice/casual users isn't that exciting. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] On-disk bitmap index patch
Tom Lane wrote: Both of these pages say up front that they are considering read-only data. Can I assume read-mostly partitions could use the read-I/O-efficient indexes while update-intensive partitions of the same table could use b-tree indexes? All of my larger (90GB+) tables can have partitions that are either almost read-only (spatial data updated one state/country at a time, about quarterly) or totally read-only (previous months' and years' historical data). So one of the questions that has to be answered (and the submitters have been entirely mum about) is exactly how bad is the update performance? If it's really awful that's going to constrain the use cases quite a lot, whereas merely a bit slower than btree wouldn't be such a problem. Once we have easy-to-use partitioning, would it be the case that many larger tables will have near-read-only partitions that could use I/O-friendlier indexes (GIN? Hash? Bitmap?)? Any examples of a large data set that can't be partitioned into hot areas and read-mostly areas? ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] GUC with units, details
Peter Eisentraut wrote: Jim Nasby wrote: The truth is, virtually no one, even highly technical people, ever picks nits between kB vs KiB vs KB. The question isn't so much whether to allow KiB and such -- that would obviously be trivial. The question is whether we want to have kB mean 1000 bytes instead of 1024 bytes. Would it satisfy everyone if the documentation states that specifying a value of N kB means that *at least* N × 1000 bytes are allocated; and perhaps even documenting that in the current implementation it happens to be 24 extra bytes per kB? In my mind, that goes against current practice. The only argument raised in favor was that international standards require such use. I'm as much a fan of measurement standards as anyone, but I'm also a practitioner of current practice. With the spec reading at least N kB, even the most pedantic spec reader can't complain, because it is true. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] On-disk bitmap index patch
Tom Lane wrote: [EMAIL PROTECTED] writes: Reading 1/4, for a larger table, has a good chance of being faster than reading 4/4 of the table. :-) Really? If you have to hit one tuple out of four, it's pretty much guaranteed that you will need to fetch every heap page. I think it's not uncommon to have data that is more clustered than expected. In my case it shows up often in GIS data; where often a query that only accesses 1/4 of the rows in the table really will only access about 1/4 of the pages in the table -- for example when analyzing Texas or the Southwest US. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Units in postgresql.conf
Peter Eisentraut wrote: I think it would be useful to allow units to be added to these settings, for example... shared_buffers = 512MB which is a bit cumbersome to calculate right now (you'd need shared_buffers = 65536). I haven't thought yet how to parse or implement this, but would people find this useful? Would this extend to things like random_page_cost and similar? If random_page_cost were specifiable in seconds or ms, it might be easier to someday write a program to measure such values on particular hardware platforms. (Though I guess for that to work, the config file would also need to add the reference cost -- i.e., a non-random page access -- as well...) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
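The arithmetic being called cumbersome -- 512MB of shared_buffers is 65536 pages of 8 kB each -- is what a unit-aware parser would hide. A sketch, assuming binary (1024-based) units and the default 8 kB page size:

```python
# Sketch of parsing "shared_buffers = 512MB" into 8 kB buffer pages.
# Unit meanings are an assumption (binary 1024-based, one side of the
# kB-vs-KiB debate elsewhere in this digest).
import re

UNITS = {"kB": 1024, "MB": 1024**2, "GB": 1024**3}
BLCKSZ = 8192  # default PostgreSQL page size

def parse_buffers(setting):
    m = re.fullmatch(r"\s*(\d+)\s*(kB|MB|GB)?\s*", setting)
    if not m:
        raise ValueError(setting)
    n, unit = int(m.group(1)), m.group(2)
    if unit is None:
        return n                      # bare number = buffer pages, as today
    return n * UNITS[unit] // BLCKSZ

print(parse_buffers("512MB"))  # → 65536
```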
Re: [HACKERS] plPHP and plRuby
Peter Eisentraut wrote: Hannu Krosing wrote: So we would have src/pl/pljava/README.TXT and anybody looking for pl-s would find the info in a logical place Right. When was the last time any user looked under src/pl in the first place? Or even under src? If you're looking for pljava, it's the first hit in Google. The difference is that I will have reasonable confidence that the README.TXT under src/pl will give instructions that match the version of PostgreSQL that I have. I assume that README will call out the version of PL/R or PL/Ruby that I want, tested with the release of PostgreSQL I have. The first hit on Google will probably give me the most recently blogged-about version; which does nothing to help me find what I need. The organization of the source code is controlled by exactly two factors: 2. convenience of development I thought convenience of development included addressing the problem that PLs are annoyingly deeply tied to specific versions of Core. I would imagine that with this README.TXT proposal, it's the responsibility of the PL/XXX developer to port their PL to PostgreSQL during the beta; and if they did so and tested it, the release will point to the version of the PL supported by the PL maintainer for that version. If they don't do this testing during the beta, the README.TXT may merely say the PL/Haskell team did not complete testing during the 8.2 beta; so good luck. This adds to the convenience of development of PostgreSQL and the PLs because it defines the process and responsibility for integration-testing the PLs with the Core releases; and puts some pressure on synchronizing the releases. Anything else is between you and your packager. And if that didn't convince you, I still got PL/sh in the wait ... With which versions of PostgreSQL is this version of PL/sh supported? I see that PL/sh on http://pgfoundry.org/projects/plsh/ is version 1.1? Does that mean it goes with PostgreSQL 1.1?
The Projects page for PL/SH (http://plsh.projects.postgresql.org/) suggests it was last modified in May 2005 - does that mean I need the May 2005 backend of PostgreSQL to compile it? Oh. The download page says older releases are also supported. Does that include 7.1? Putting this info in the README.TXT is one way to help users know what's supported. If I saw a README.TXT for pl/sh I'd have some confidence I'd be directly pointed to the version I need. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] plPHP and plRuby
Tom Lane wrote: The difference is that I will have reasonable confidence that the README.TXT under src/pl will give instructions that match the version of PostgreSQL that I have. I assume that README will call out the version of PL/R or PL/Ruby that I want that was tested with the release of PostgreSQL I have. On what do you base that assumption? A README file laying about in an otherwise unused part of the source tree is the very definition of out of sight, out of mind. I was hoping it would say something like PostgreSQL 8.2.0 has been tested with version XX.YY.01 of PL/Whatever; you can install it by getting that release and doing the following -- with specific version numbers rather than links to URLs that would change. If that wasn't the intent of the README.TXT, I'm not sure what is. I can pretty much guarantee you that it will NOT get updated, especially not during minor releases. Even if it is up to date at the instant we put out a release, it'll be obsolete as soon as the external project makes an update release. ISTM links like this are far better kept on project websites ... I was hoping that this README.TXT would point to the specific old version that was tested, in much the same way that the old 8.0.0 source tree continues to have the same PL/pgSQL that has always been there. If the external project updates their release and breaks compatibility, I think they should be encouraged to update the README.TXT to say something like PostgreSQL 8.2.1 has been tested with PL/Whatever version XX.YY.99. If they don't make that update, the README would still say PostgreSQL 8.2.0 has been tested with PL/Whatever version XX.YY.00. I would imagine with this README.TXT proposal, it's the responsibility of the PL/XXX developer to port their PL to PostgreSQL during the beta; and if they did and tested it, the release will point to the version of the PL supported by the PL maintainer for that version. And if they didn't?
I was just noticing that the current release of plruby contains installation instructions that appear to date to 7.3. If he can't be bothered to update his own docs, what are the odds that he'll submit timely updates for a README in the main source tree? Yeah. Good point. I guess the alternatives are that the README still say PostgreSQL 7.3.0 has been tested with PL/Ruby version X.Y.Z, or We are unaware of up-to-date instructions for PL/Ruby. Good luck. Though if you'd welcome people in the community submitting patches to that README, I suspect it'd be updated even more regularly than 2002 or whenever 7.3 came out. If I spent time figuring it out, I wouldn't mind submitting a patch for such a README; and I suspect the other guys who blogged about PL/Ruby installation instructions in late 2005 would be happy to submit such patches as well. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] Running a query twice to ensure cached results.
Tom Lane wrote: -- do it again to ensure fully cached bench=# select count(*) from accounts; Short summary: Does running a query only twice really ensure that a result is cached? It seems not to be the case for seq-scans on Linux. I think this matters to the discussions about a readahead thread/process that come up on this list occasionally. Experimental results here suggest that for larger tables Linux seems to detect a seq-scan and not bother caching. It's very reproducible for me here to do a reboot and not see the full speedup on a seq-scan until the third time I run a query. An example shown below [1] shows that the third run of a query is faster than the second run. The output of a 'vmstat 5' [2] while these queries were happening agrees that significant I/O was still happening on the second run, but no I/O happened the third time. The table comfortably fits in memory (700MB table on a 2GB ram machine) and the machine was otherwise idle, so no one else wiped out the cache between the first and second runs. Why do I think this is worth mentioning here? * I think it impacts the occasional thread about wanting to include logic in postgresql for readahead [3], or the threads suggesting hinting to the OS through madvise or similar to avoid caching seq-scans. It seems that Linux is detecting and at least somewhat reacting to seq scans even with no hinting. Anything added to postgresql might end up being a duplicated effort. I think Bruce suggested that Solaris does this free-behind automatically [4], but this is the first I've noticed that Linux seems to do similar. * I think it matters to people who post explain analyze twice without running it so often that they get stable results. (I note that this was not a problem for Tom, since the timing of his first and second runs were the same; so I assume he was just saying that he observed that the query was cached, rather than that the first run forced the second run to be cached.)
Ron = == [note 1] the repeated queries showing the speedup after 3 runs. == Running the same select count(*) 4 times after a clean reboot. == Seems the OS's caching logic decided that the first seq_scan == wasn't 'interesting' enough = fli=# select count(*) from facets_s; count -- 15976558 (1 row) Time: 29788.047 ms fli=# select count(*) from facets_s; count -- 15976558 (1 row) Time: 19344.573 ms fli=# select count(*) from facets_s; count -- 15976558 (1 row) Time: 13411.272 ms fli=# select count(*) from facets_s; count -- 15976558 (1 row) Time: 13107.856 ms # [note 2] vmstat 5 while the above queries were being run procs ---memory-- ---swap-- -io --system-- cpu r b swpd free buff cache si sobibo incs us sy id wa 1 1140 62140 71256 713360004731 9284 7 1 92 0 *** the first time 1 0140 50860 31912 80830402 2521529 1147 2612 49 15 0 36 1 0360 54420 2 85524000 23934 7 1139 2553 47 14 0 39 0 1360 54008 11100 87870800 2370425 1149 2467 46 12 0 41 0 1360 52512 11140 89659200 24062 6 1135 2460 47 11 0 41 *** the second time 0 0360 52688 11172 90691600 1335719 1085 1989 31 7 38 24 1 0360 53976 11076 9125400 44 1427357 1113 2102 32 7 29 32 2 0360 54788 10908 92378800 2450954 1171 2474 46 12 0 42 1 0360 54944 3096 93994800 1118039 1093 1976 65 13 0 22 *** the third time 3 0360 54280 3872 94050800 26414 1041 1560 85 15 0 0 1 0360 53852 3904 940940008829 1022 1505 53 9 36 2 2 0360 51616 4052 94306800 44354 1037 1552 82 15 0 4 1 0360 51488 4060 9431800022 2 1013 1522 84 16 0 0 # [3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01449.php [4] http://archives.postgresql.org/pgsql-performance/2003-10/msg00188.php ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
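The OS-hinting idea mentioned in the message above (madvise/fadvise for seq-scans) can be sketched with the POSIX fadvise interface. This is an illustration of the mechanism only -- the 1 MB file is a stand-in for a table file, and whether such hints would actually help PostgreSQL is exactly the open question in the thread:

```python
# posix_fadvise hints around a sequential scan: SEQUENTIAL asks the
# kernel for aggressive readahead; DONTNEED drops the cached pages
# afterwards, approximating Solaris-style "free-behind".
import os, tempfile

fd_, path = tempfile.mkstemp()
os.write(fd_, b"\0" * (1 << 20))     # 1 MB stand-in for a table file
os.close(fd_)

fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # hint: seq scan
total = 0
while chunk := os.read(fd, 65536):
    total += len(chunk)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)    # drop from cache
os.close(fd)
os.unlink(path)
```

As the post notes, recent Linux kernels already detect sequential scans on their own, so explicit hints like these might duplicate effort.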
Re: [HACKERS] More thoughts about planner's cost estimates
Tom Lane wrote: One objection to this is that after moving off the gold standard of 1.0 = one page fetch, there is no longer any clear meaning to the cost estimate units; you're faced with the fact that they're just an arbitrary scale. I'm not sure that's such a bad thing, though. It seems to me the appropriate gold standard is Time, in microseconds or milliseconds. The default postgresql.conf can come with a set of hardcoded values that reasonably approximate some real-world system; and if that's documented in the file someone reading it can say hey, my CPU's about the same but my disk subsystem is much faster, so I know in which direction to change things. And another person may say ooh, now I know that my 4GHz machines should have about twice the number here as my 2GHz box. For people who *really* care a lot (HW vendors?), they could eventually make measurements on their systems. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
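The "gold standard is time" idea can be made concrete: express the planner constants as measured milliseconds and derive today's relative units from them. The sample timings below are invented for illustration, not measurements:

```python
# Planner constants expressed in measured time, with the familiar
# relative costs derived from them.  All numbers here are made up.
measured_ms = {
    "seq_page": 0.05,     # sequential 8 kB page read
    "random_page": 0.20,  # random 8 kB page read
    "cpu_tuple": 0.0005,  # per-tuple CPU cost
}

# Today's convention: everything relative to one sequential page fetch.
baseline = measured_ms["seq_page"]
relative = {k: v / baseline for k, v in measured_ms.items()}

print(relative["random_page"])  # → 4.0, i.e. random_page_cost
```

An admin (or a hardware vendor's measurement tool) would only update `measured_ms`; the relative scale falls out automatically.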
Re: [HACKERS] Compression and on-disk sorting
Jim C. Nasby wrote: There's an fadvise that tells the OS to compress the data if it actually makes it to disk? Compressed-filesystem extension (like e2compr, and I think either Fat or NTFS) can do that. I think the reasons against adding this feature to postgresql are largely the same as the reasons why compressed filesystems aren't very popular. Has anyone tried running postgresql on a compressing file-system? I'd expect the penalties to outweigh the benefits (or they'd be more common); but if it gives impressive results, it might add weight to this feature idea. Ron M I think the real reason Oracle and others practically re-wrote their own VM-system and filesystems is that at the time it was important for them to run under Windows98; where it was rather easy to write better filesystems than your customer's OS was bundled with. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
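The feature being debated -- compressing data that spills to disk -- can be sketched with zlib on a sorted "run". Whether the CPU cost is worth the I/O saved is exactly the thread's open question; this only shows the mechanics:

```python
# Compressing a sorted spill "run" before it hits disk.  A fast
# compression level is the usual choice for this kind of transient data.
import zlib

run = b"".join(b"%08d" % i for i in range(10000))  # a sorted run
packed = zlib.compress(run, level=1)               # fast level for spill
ratio = len(packed) / len(run)
assert zlib.decompress(packed) == run              # round-trips losslessly
```

Sorted runs compress unusually well (neighboring keys share prefixes), which is part of why the idea keeps coming up for on-disk sorting in particular.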
Re: [HACKERS] Automatic free space map filling
Jim C. Nasby wrote: ... how many pages per bit ... Are we trying to set up a complex solution to a problem that'll be mostly moot once partitioning is easier and partitioned tables are common? In many cases I can think of the bulk of the data would be in old partitions that are practically never written to (so would need no vacuuming and could always use index-only lookups); while the hot parts of large tables would be on partitions that would need frequent vacuuming and wouldn't benefit from index-only lookups. In these cases, 1 bit per partition would work well, and seems a lot easier to keep track of than bits-per-page. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Looking for a tool to * pg tables as ERDs
Where * == {print | save to PDF | save to mumble format | display on screen} Anyone know of one? TiA Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 08:37 PM 2/15/2006, Dann Corbit wrote: Adding some randomness to the selection of the pivot is a known technique to fix the oddball partitions problem. True, but it makes QuickSort slower than say MergeSort because of the expense of the PRNG being called ~O(lgN) times during a sort. However, Bentley and Sedgewick proved that every quick sort algorithm has some input set that makes it go quadratic Yep. OTOH, that input set can be so specific and so unusual as to require astronomically unlikely bad luck or hostile hacking in order for it to actually occur. (hence the recent popularity of introspective sort, which switches to heapsort if quadratic behavior is detected. The C++ template I submitted was an example of introspective sort, but PostgreSQL does not use C++ so it was not helpful). ...and there are other QuickSort+Other hybrids that address the issue as well. MergeSort, RadixExchangeSort, and BucketSort all come to mind. See Gonnet and Baeza-Yates, etc. Here are some cases known to make qsort go quadratic: 1. Data already sorted Only if one element is used to choose the pivot; _and_ only if the pivot is the first or last element of each pass. Even just always using the middle element as the pivot avoids this problem. See Sedgewick or Knuth. 2. Data reverse sorted Ditto above. 3. Data organ-pipe sorted or ramp Not sure what this means? Regardless, median of n partitioning that includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of the data is usually enough to guarantee O(NlgN) behavior unless the _specific_ distribution known to be pessimal to that sampling algorithm is encountered. The only times I've ever seen it ITRW was as a result of hostile activity: purposely arranging the data in such a manner is essentially a DoS attack. 4. Almost all data of the same value Well known fixes to inner loop available to avoid this problem. There are probably other cases. 
Randomizing the pivot helps some, as does checking for in-order or reverse-order partitions. Randomizing the choice of pivot essentially guarantees O(NlgN) behavior no matter what the distribution of the data, at the price of increasing the cost of each pass by a constant factor (the generation of a random number or numbers). In sum, QuickSort gets all sorts of bad press that is far more FUD than fact ITRW. Ron. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
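The pivot-selection point above can be illustrated: a quicksort using a median-of-three pivot handles the already-sorted and reverse-sorted inputs from the "known quadratic cases" list without degrading, unlike a naive first-element pivot. A sketch (recursing into the smaller partition first also bounds stack depth, a standard refinement not specific to this thread):

```python
# Quicksort with median-of-three pivot selection (Hoare partitioning).
# Sorted and reverse-sorted inputs, quadratic for a first-element pivot,
# stay O(N lg N) here.
def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # median-of-three avoids the sorted/reverse-sorted traps
        pivot = sorted((a[lo], a[mid], a[hi]))[1]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        # recurse into the smaller half; loop on the larger (bounds depth)
        if j - lo < hi - i:
            quicksort(a, lo, j)
            lo = i
        else:
            quicksort(a, i, hi)
            hi = j
    return a

data = list(range(1000, 0, -1))       # reverse sorted: a classic bad case
assert quicksort(data) == list(range(1, 1001))
```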
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 04:24 AM 2/17/2006, Ragnar wrote: On fös, 2006-02-17 at 01:20 -0500, Ron wrote: OK, so here's _a_ way (there are others) to obtain a mapping such that if a > b then f(a) > f(b), and if a == b then f(a) == f(b). By scanning the table once, we can map say 001h (Hex used to ease typing) to the row with the minimum value and 111h to the row with the maximum value, as well as mapping everything in between to their appropriate keys. That same scan can be used to assign a pointer to each record's location. This step is just as expensive as the original sort you want to replace/improve. Why do you think that? External sorts involve the equivalent of multiple scans of the table to be sorted, sometimes more than lgN (where N is the number of items in the table to be sorted). Since this is physical IO we are talking about, each scan is very expensive, and therefore 1 scan is going to take considerably less time than >= lgN scans will. If you want to keep this mapping saved as a sort of an index, or as part of each row's data, this will make the cost of inserts and updates enormous. Not sure you've got this right either. Looks to me like we are adding a <= 32b quantity to each row. Once we know the mapping, incrementally updating it upon insert or update would seem to be a simple matter of a fast search for the correct ranking [interpolation search, which we have all the needed data for, is O(lglgN); hash-based search is O(1)]; plus an increment/decrement of the key values greater/less than the key value of the row being inserted/updated. Given that we are updating all the keys in a specific range within a tree structure, that update can be done in O(lgm) (where m is the number of records affected). We can now sort the key+pointer pairs instead of the actual data and use an optional final pass to rearrange the actual rows if we wish. How are you suggesting this mapping be accessed?
If the mapping is kept separate from the tuple data, as in an index, then how will you look up the key? ??? We've effectively created a data set where each record is a pointer to a DB row plus its key. We can now sort the data set by key and then do an optional final pass to rearrange the actual DB rows if we so wish. Since that final pass is very expensive, it is good that not all use scenarios will need it. The amount of storage required to sort this representation of the table, rather than the actual table, is so much less that it turns an external sorting problem into an internal sorting problem with an optional final pass that is =1= scan (albeit one scan with a lot of seeks and data movement). This is a big win. It is a variation of a well-known technique. See Sedgewick, Knuth, etc. That initial scan to set up the keys is expensive, but if we wish, that cost can be amortized over the life of the table so we don't have to pay it all at once. In addition, once we have created those keys, they can be saved for later searches and sorts. What is the use case where this would work better than a regular btree index? Again, ??? btree indexes address different issues. They do not in any way help create a compact data representation of the original data that saves enough space so as to turn an external ranking or sorting problem into an internal one. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
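The key+pointer technique being described reduces to a small sketch: instead of sorting wide rows, sort compact (key, position) pairs, then optionally permute the actual rows in one final pass. The row contents below are invented for illustration:

```python
# Sort compact (key, pointer) pairs instead of the wide rows themselves;
# the expensive rearrangement of real rows is an optional final pass.
rows = [("row%04d" % k, "x" * 50) for k in (3, 0, 2, 1)]  # wide rows

# Compact representation: (sort key, original position).
pairs = sorted((row[0], pos) for pos, row in enumerate(rows))

# Many use cases can stop here and just follow the sorted pointers;
# the optional (seek-heavy) final pass physically reorders the rows.
rows = [rows[pos] for _, pos in pairs]
```

Since the pairs are a fraction of the row width, a sort that would have spilled to disk may fit in memory, which is the claimed win.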
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 05:19 AM 2/17/2006, Markus Schaber wrote: Hi, Ron, Ron schrieb: OK, so here's _a_ way (there are others) to obtain a mapping such that if a < b then f(a) < f(b) and if a == b then f(a) == f(b) Pretend each row is an integer of row size (so a 2KB row becomes a 16Kb integer; a 4KB row becomes a 32Kb integer; etc) Since even a 1TB table made of such rows can only have 256M - 512M possible values even if each row is unique, a 28b or 29b key is large enough to represent each row's value and relative rank compared to all of the others even if all row values are unique. By scanning the table once, we can map say 001h (Hex used to ease typing) to the row with the minimum value and 111h to the row with the maximum value as well as mapping everything in between to their appropriate keys. That same scan can be used to assign a pointer to each record's location. But with a single linear scan, this cannot be accomplished, as the table contents are neither sorted nor distributed linearly between the minimum and the maximum. So what? We are talking about key assignment here, not anything that requires physically manipulating the actual DB rows. One physical IO pass should be all that's needed. For this mapping, you need a full table sort. One physical IO pass should be all that's needed. However, let's pretend you are correct and that we do need to sort the table to get the key mapping. Even so, we would only need to do it =once= and then we would be able to use and incrementally update the results forever afterward. Even under this assumption, one external sort to save all subsequent such sorts seems well worth it. IOW, even if I'm wrong about the initial cost to do this; it is still worth doing ;-) That initial scan to set up the keys is expensive, but if we wish that cost can be amortized over the life of the table so we don't have to pay it all at once. In addition, once we have created those keys, they can be saved for later searches and sorts. 
But for every update or insert, you have to resort the keys, which is _very_ expensive as it basically needs to update a huge part of the table. ??? You do not need to resort already ordered data to insert a new element into it such that the data stays ordered! Once we have done the key ordering operation once, we should not ever need to do it again on the original data. Else most sorting algorithms wouldn't work ;-) Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote: On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote: For this mapping, you need a full table sort. One physical IO pass should be all that's needed. However, let's pretend you are correct and that we do need to sort the table to get the key mapping. Even so, we would only need to do it =once= and then we would be able to use and incrementally update the results forever afterward. Even under this assumption, one external sort to save all subsequent such sorts seems well worth it. IOW, even if I'm wrong about the initial cost to do this; it is still worth doing ;-) I think you're talking about something different here. You're thinking of having the whole table sorted and when you add a new value you add it in such a way to keep it sorted. The problem is, what do you sort it by? If you've sorted the table by col1, then when the user does ORDER BY col2 it's useless. No, I'm thinking about how to represent DB row data in such a way that a= we use a compact enough representation that we can sort internally rather than externally. b= we do the sort once and avoid most of the overhead upon subsequent similar requests. I used the example of sorting on the entire row to show that the approach works even when the original record being sorted by is very large. All my previous comments on this topic hold for the case where we are sorting on only part of a row as well. If all you are doing is sorting on a column or a few columns, what I'm discussing is even easier since treating the columns actually being used as sort criteria as integers rather than the whole row as an atomic unit eats less resources during the key creation and mapping process. If the row is 2-4KB in size, but we only care about some combination of columns that only takes on <= 2^8 or <= 2^16 different values, then what I've described will be even better than the original example I gave. 
Basically, the range of a key is going to be restricted by how a= big the field is that represents the key (columns and such are usually kept narrow for performance reasons) or b= big each row is (the more space each row takes, the fewer rows fit into any given amount of storage) c= many rows there are in the table Between these conditions, the range of a key tends to be severely restricted and therefore use much less space than sorting the actual DB records would. ...and that gives us something we can take advantage of. Indeed, this is what btrees do, you store the order of the table separate from the data. And you can store multiple orders. But even then, when someone does ORDER BY lower(col1), it's still useless. And you're right, we still need to do the single mass sort in the beginning, which is precisely what we're trying to optimise here. Sigh. My points were: 1= we have information available to us that allows us to map the rows in such a way as to turn most external sorts into internal sorts, thereby avoiding the entire external sorting problem in those cases. This is a huge performance improvement. 2= that an external sort is =NOT= required for initial key assignment, but even if it was it would be worth it. 3= that this is a variation of a well known technique so I'm not suggesting heresy here. Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Need pointers to standard pg database(s) for testing
I assume we have such? Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index
At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote: On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote: Even better (and more easily scaled as the number of GPR's in the CPU changes) is to use the set {L; L+1; L+2; t>>1; R-2; R-1; R} This means that instead of 7 random memory accesses, we have 3; two of which result in a burst access for three elements each. Isn't that improvement going to disappear completely if you choose a bad pivot? Only if you _consistently_ (read: the vast majority of the time: quicksort is actually darn robust) choose a _pessimal_, not just bad, pivot will quicksort degenerate to the O(N^2) behavior everyone worries about. See Cormen & Rivest for a proof of this. Even then, doing things as above has benefits: 1= The worst case is less bad since the guaranteed O(lg(s!)) pivot choosing algorithm puts s elements into final position. Worst case becomes better than O(N^2/(s-1)). 2= The overhead of pivot choosing can overshadow the benefits when using more traditional methods for even moderate values of s. See discussions on the quicksort variant known as samplesort and Sedgewick's PhD thesis for details. Using a pivot choosing algorithm that actually does some of the partitioning (and does it more efficiently than the usual partitioning algorithm does) plus using partition-in-place (rather than Lomuto's method) reduces overhead very effectively (at the cost of more complicated / delicate to get right partitioning code). The above reduces the number of moves used in a quicksort pass considerably regardless of the number of compares used. 3= Especially in modern systems where the gap between internal CPU bandwidth and memory bandwidth is so great, the overhead of memory accesses for comparisons and moves is the majority of the overhead for both the pivot choosing and the partitioning algorithms within quicksort. Particularly random memory accesses. 
The reason (#GPRs - 1) is a magic constant is that it's the most you can compare and move using only register-to-register operations. In addition, replacing as many as possible of the memory accesses you must do with sequential rather than random memory accesses is a big deal: sequential memory access is measured in 10's of CPU cycles while random memory access is measured in hundreds of CPU cycles. It's no accident that the advances in Gray's sorting contest have involved algorithms that are both register and cache friendly, minimizing overall memory access and using sequential memory access as much as possible when said access cannot be avoided. As caches grow larger and memory accesses more expensive, it's often worth it to use a BucketSort+QuickSort hybrid rather than just QuickSort. ...and of course if you know enough about the data to be sorted so as to constrain it appropriately, one should use a non comparison based O(N) sorting algorithm rather than any of the general comparison based O(NlgN) methods. SIDE NOTE: IIRC glibc's qsort is actually merge sort. Merge sort performance is insensitive to all inputs, and there are ways to optimize it as well. glibc-2.3.5/stdlib/qsort.c: /* Order size using quicksort. This implementation incorporates four optimizations discussed in Sedgewick: I can't see any references to merge sort in there at all. Well, then I'm not the only person on the lists whose memory is faulty ;-) The up side of MergeSort is that its performance is always O(NlgN). The down sides are that it is far more memory hungry than QuickSort and slower. Ron ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] [HACKERS] qsort again
At 07:10 AM 2/16/2006, Florian Weimer wrote: * Neil Conway: On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote: It seems clear that our qsort.c is doing a pretty awful job of picking qsort pivots, while glibc is mostly managing not to make that mistake. I haven't looked at the glibc code yet to see what they are doing differently. glibc qsort is actually merge sort, so I'm not surprised it avoids this problem. qsort also performs twice as many key comparisons as the theoretical minimum. The theoretical minimum number of comparisons for a general purpose comparison based sort is O(lg(N!)). QuickSort uses 2NlnN ~= 1.38NlgN or ~1.38x the optimum without tuning (see Knuth, Sedgewick, Cormen, ... etc) OTOH, QuickSort uses ~2x as many =moves= as the theoretical minimum unless tuned, and moves are more expensive than compares in modern systems. See my other posts for QuickSort tuning methods that attempt to directly address both issues. Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote: On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote: 3= Especially in modern systems where the gap between internal CPU bandwidth and memory bandwidth is so great, the overhead of memory accesses for comparisons and moves is the majority of the overhead for both the pivot choosing and the partitioning algorithms within quicksort. Particularly random memory accesses. The reason (#GPRs - 1) is a magic constant is that it's the most you can compare and move using only register-to-register operations. But how much of this applies to us? We're not sorting arrays of integers, we're sorting pointers to tuples. So while moves cost very little, a comparison costs hundreds, maybe thousands of cycles. A tuple can easily be two or three cachelines and you're probably going to access all of it, not to mention the Fmgr structures and the Datums themselves. Pointers are simply fixed size 32b or 64b quantities. They are essentially integers. Comparing and moving pointers or fixed size keys to those pointers is exactly the same problem as comparing and moving integers. Comparing =or= moving the actual data structures is a much more expensive and variable cost proposition. I'm sure that pg's sort functionality is written intelligently enough that the only real data moves are done in a final pass after the exact desired order has been found using pointer compares and (re)assignments during the sorting process. That's a standard technique for sorting data whose key or pointer is much smaller than a datum. Your cost comment basically agrees with mine regarding the cost of random memory accesses. The good news is that the number of datums to be examined during the pivot choosing process is small enough that the datums can fit into CPU cache while the pointers to them can be assigned to registers: making pivot choosing +very+ fast when done correctly. 
As you've noted, actual partitioning is going to be more expensive since it involves accessing enough actual datums that they can't all fit into CPU cache. The good news is that QuickSort has a very sequential access pattern within its inner loop. So while we must go to memory for compares, we are at least keeping the cost for it down to a minimum. In addition, said access is nice enough to be very prefetch and CPU cache hierarchy friendly. None of this is cache friendly. The actual tuples themselves could be spread all over memory (I don't think any particular effort is expended trying to minimize fragmentation). It probably would be worth it to spend some effort on memory layout just as we do for HD layout. Do these algorithms discuss the case where a comparison is more than 1000 times the cost of a move? A move is always more expensive than a compare when the datum is larger than its pointer or key. A move is always more expensive than a compare when it involves memory to memory movement rather than CPU location to CPU location movement. A move is especially more expensive than a compare when it involves both factors. Most moves do involve both. What I suspect you meant is that a key comparison that involves accessing the data in memory is more expensive than reassigning the pointers associated with those keys. That is certainly true. Yes. The problem has been extensively studied. ;-) Where this does become interesting is where we can convert a datum to an integer such that if f(A) < f(B) then A < B. Then we can sort on f(X) first with just integer comparisons and then do a full tuple comparison only if f(A) == f(B). This would be much more cache-coherent and make these algorithms much more applicable in my mind. In fact we can do better. 
Using hash codes or what-not to map datums to keys and then sorting just the keys and the pointers to those datums followed by an optional final pass where we do the actual data movement is also a standard technique for handling large data structures. Regardless of what tweaks beyond the basic algorithms we use, the algorithms themselves have been well studied and their performance well established. QuickSort is the best performing of the O(nlgn) comparison based sorts and it uses less resources than HeapSort or MergeSort. Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 10:52 AM 2/16/2006, Ron wrote: At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote: Where this does become interesting is where we can convert a datum to an integer such that if f(A) < f(B) then A < B. Then we can sort on f(X) first with just integer comparisons and then do a full tuple comparison only if f(A) == f(B). This would be much more cache-coherent and make these algorithms much more applicable in my mind. In fact we can do better. Using hash codes or what-not to map datums to keys and then sorting just the keys and the pointers to those datums followed by an optional final pass where we do the actual data movement is also a standard technique for handling large data structures. I thought some follow up might be in order here. Let's pretend that we have the typical DB table where rows are ~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in such a table. A 32b hash code can be assigned to each row value such that only exactly equal rows will have the same hash code. A 32b pointer can locate any of the 256M-512M rows. Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an optional pass to rearrange the actual rows if we so wish. We get the same result while only examining and manipulating 1/50 to 1/25 as much data during the sort. If we want to spend more CPU time in order to save more space, we can compress the key+pointer representation. That usually reduces the amount of data to be manipulated to ~1/4 the original key+pointer representation, reducing things to ~512M-1GB worth of compressed pointers+keys. Or ~1/200 - ~1/100 the original amount of data we were discussing. Either representation is small enough to fit within RAM rather than requiring HD IO, so we solve the HD IO bottleneck in the best possible way: we avoid ever doing it. Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 12:19 PM 2/16/2006, Scott Lamb wrote: On Feb 16, 2006, at 8:32 AM, Ron wrote: Let's pretend that we have the typical DB table where rows are ~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in such a table. A 32b hash code can be assigned to each row value such that only exactly equal rows will have the same hash code. A 32b pointer can locate any of the 256M-512M rows. Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an optional pass to rearrange the actual rows if we so wish. I don't understand this. This is a true statement: (H(x) != H(y)) => (x != y) This is not: (H(x) < H(y)) => (x < y) Hash keys can tell you there's an inequality, but they can't tell you how the values compare. If you want 32-bit keys that compare in the same order as the original values, here's how you have to get them: For most hash codes, you are correct. There is a class of hash or hash-like codes that maintains the mapping to support that second statement. More later when I can get more time. Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
At 01:47 PM 2/16/2006, Ron wrote: At 12:19 PM 2/16/2006, Scott Lamb wrote: On Feb 16, 2006, at 8:32 AM, Ron wrote: Let's pretend that we have the typical DB table where rows are ~2-4KB apiece. 1TB of storage will let us have 256M-512M rows in such a table. A 32b hash code can be assigned to each row value such that only exactly equal rows will have the same hash code. A 32b pointer can locate any of the 256M-512M rows. Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 64b*(2^28 to 2^29)= 2-4GB of pointers+keys followed by an optional pass to rearrange the actual rows if we so wish. I don't understand this. This is a true statement: (H(x) != H(y)) => (x != y) This is not: (H(x) < H(y)) => (x < y) Hash keys can tell you there's an inequality, but they can't tell you how the values compare. If you want 32-bit keys that compare in the same order as the original values, here's how you have to get them: For most hash codes, you are correct. There is a class of hash or hash-like codes that maintains the mapping to support that second statement. More later when I can get more time. Ron OK, so here's _a_ way (there are others) to obtain a mapping such that if a < b then f(a) < f(b) and if a == b then f(a) == f(b) Pretend each row is an integer of row size (so a 2KB row becomes a 16Kb integer; a 4KB row becomes a 32Kb integer; etc) Since even a 1TB table made of such rows can only have 256M - 512M possible values even if each row is unique, a 28b or 29b key is large enough to represent each row's value and relative rank compared to all of the others even if all row values are unique. By scanning the table once, we can map say 001h (Hex used to ease typing) to the row with the minimum value and 111h to the row with the maximum value as well as mapping everything in between to their appropriate keys. That same scan can be used to assign a pointer to each record's location. 
We can now sort the key+pointer pairs instead of the actual data and use an optional final pass to rearrange the actual rows if we wish. That initial scan to set up the keys is expensive, but if we wish that cost can be amortized over the life of the table so we don't have to pay it all at once. In addition, once we have created those keys, they can be saved for later searches and sorts. Further space savings can be obtained whenever there are duplicate keys and/or when compression methods are used on the Key+pointer pairs. Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index
This behavior is consistent with the pivot choosing algorithm assuming certain distribution(s) for the data. For instance, median-of-three partitioning is known to be pessimal when the data is geometrically or hyper-geometrically distributed. Also, care must be taken when there are many equal values in the data. Even pseudo random number generator based pivot choosing algorithms are not immune if the PRNG is flawed in some way. How are we choosing our pivots? At 06:28 PM 2/15/2006, Tom Lane wrote: I did some experimentation comparing the qsort from Fedora Core 4 (glibc-2.3.5-10.3) with our src/port/qsort.c. For those who weren't following the pgsql-performance thread, the test case is just this repeated a lot of times: create table atest(i int4, r int4); insert into atest (i,r) select generate_series(1,10), 0; insert into atest (i,r) select generate_series(1,10), random()*10; \timing create index idx on atest(r); \timing drop table atest; I did this 100 times and sorted the reported runtimes. (Investigation with trace_sort = on confirms that the runtime is almost entirely spent in qsort() called from our performsort --- the Postgres overhead is about 100msec on this machine.) Results are below. It seems clear that our qsort.c is doing a pretty awful job of picking qsort pivots, while glibc is mostly managing not to make that mistake. I haven't looked at the glibc code yet to see what they are doing differently. I'd say this puts a considerable damper on my enthusiasm for using our qsort all the time, as was recently debated in this thread: http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php We need to fix our qsort.c before pushing ahead with that idea. 
regards, tom lane 100 runtimes for glibc qsort, sorted ascending: Time: 459.860 ms snip Time: 488.852 ms Time: 514.639 ms Time: 529.287 ms Time: 612.185 ms Time: 660.748 ms Time: 742.227 ms Time: 866.814 ms Time: 1234.848 ms Time: 1267.398 ms 100 runtimes for port/qsort.c, sorted ascending: Time: 418.905 ms snip Time: 20865.979 ms Time: 21000.907 ms Time: 21297.585 ms Time: 21714.518 ms Time: 25423.235 ms Time: 27543.052 ms Time: 28314.182 ms Time: 29400.278 ms Time: 34142.534 ms ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index
At 08:21 PM 2/15/2006, Tom Lane wrote: Ron [EMAIL PROTECTED] writes: How are we choosing our pivots? See qsort.c: it looks like median of nine equally spaced inputs (ie, the 1/8th points of the initial input array, plus the end points), implemented as two rounds of median-of-three choices. OK, this is a bad way to do median-of-n partitioning for a few reasons. See Sedgewick's PhD thesis for details. Basically, if one is using median-of-n partitioning to choose a pivot, one should do it in =one= pass, and n for that pass should be <= the number of registers in the CPU. Since the x86 ISA has 8 GPR's, n should be <= 8; 7, for instance. Special purposing the median-of-n code so that the minimal number of comparisons and moves is used to sort the sample and then partitioning in place is the best way to do it. In addition, care must be taken to deal with the possibility that many of the keys may be equal. The (pseudo) code looks something like this: qs(a[],L,R){ if((R-L) > SAMPLE_SIZE){ // Not worth using qs for too few elements SortSample(SAMPLE_SIZE,a[],L,R); // Sorts SAMPLE_SIZE = n elements and does median-of-n partitioning for small n // using the minimal number of comparisons and moves. // In the process it ends up partitioning the first n/2 and last n/2 elements // SAMPLE_SIZE is a constant chosen to work best for a given CPU. // #GPRs - 1 is a good initial guess. // For the x86 ISA, #GPRs - 1 = 7. For native x86-64, it's 15. // For most RISC CPUs it's 31 or 63. For Itanium, it's 127 (!) 
pivot= a[(L+R)>>1]; i= L+(SAMPLE_SIZE>>1); j= R-(SAMPLE_SIZE>>1); for(;;){ while(a[++i] < pivot); while(a[--j] > pivot); if(i >= j) break; if(a[i] > a[j]) swap(a[i],a[j]); } if((i-R) <= (j-L)){qs(a,L,i-1);} else{qs(a,i,R);} else{OofN^2_Sort(a,L,R);} // SelectSort may be better than InsertSort if KeySize in bits << RecordSize in bits } // End of qs Given that the most common CPU ISA in existence has 8 GPRs, SAMPLE_SIZE= 7 is probably optimal: t= (L+R); the set would be {L; t/8; t/4; t/2; 3*t/4; 7*t/8; R;} == {L; t>>3; t>>2; t>>1; (3*t)>>2; (7*t)>>3; R} as the locations. Even better (and more easily scaled as the number of GPR's in the CPU changes) is to use the set {L; L+1; L+2; t>>1; R-2; R-1; R} This means that instead of 7 random memory accesses, we have 3; two of which result in a burst access for three elements each. That's much faster; _and_ using a sample of 9, 15, 31, 63, etc (to max of ~GPRs -1) elements is more easily done. It also means that the work we do sorting the sample can be taken advantage of when starting the inner loop of quicksort: items L..L+2, t>>1, and R-2..R are already partitioned by SortSample(). Ensuring that the minimum number of comparisons and moves is done in SortSample can be done by using a code generator to create a comparison tree that identifies which permutation(s) of n we are dealing with and then moving them into place with the minimal number of moves. SIDE NOTE: IIRC glibc's qsort is actually merge sort. Merge sort performance is insensitive to all inputs, and there are ways to optimize it as well. I'll leave the actual coding to someone who knows the pg source better than I do. Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] What do the Windows pg hackers out there like for dev tools?
Subject line says it all. I'm going to be testing changes under both Linux and WinXP, so I'm hoping those of you that do M$ hacking will pass along your list of suggestions and/or favorite (and hated so I know what to avoid) tools. TiA, Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] ISO 8601 Intervals
Larry Rosenman wrote: Michael Glaesemann wrote: On Jan 8, 2006, at 12:12 , Larry Rosenman wrote: I was thinking of handling the TODO for ISO8601 Interval output. Just to be clear, you're talking about the ISO8601 duration syntax (PnYnMnDTnHnMnS), correct? (The SQL standard made the unfortunate choice to call durations, i.e., lengths of time, intervals.) Back in 2003 I submitted such a patch [1,1b] that resulted in a fair amount of discussion including some still (AFAIK) open issues about the naming of the datestyle settings to control it [2,3,4]. There was also some discussion of the range of ISO 8601 durations to support (ISO 8601 Basic Format, ISO 8601 Alternative Format, and ISO 8601 Extended Format (which is more human-readable)) [5]. Finally, there is a similar, but different syntax currently supported by postgresql (where '1Y1M' means 1 year 1 minute, while ISO 'P1Y1M' would mean 1 year 1 month) and Tom recommended ripping that code out[7] and at one point said my patch was looking cleaner than the existing code [8]. My patch does not yet rip that out. I still use the patch myself, but don't have it updated to CVS tip. I'd be happy to do so if people want that as a starting point. 
Ron [1] http://archives.postgresql.org/pgsql-patches/2003-09/msg00103.php [1b] http://archives.postgresql.org/pgsql-patches/2003-09/msg00286.php [2] http://archives.postgresql.org/pgsql-patches/2003-09/msg00122.php [3] http://archives.postgresql.org/pgsql-patches/2003-09/msg00129.php [4] http://archives.postgresql.org/pgsql-patches/2003-09/msg00130.php [5] http://archives.postgresql.org/pgsql-patches/2003-09/msg00133.php [6] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php [7] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php [8] http://archives.postgresql.org/pgsql-patches/2003-09/msg00121.php ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ISO 8601 Intervals
One more link... this http://archives.postgresql.org/pgsql-patches/2003-12/msg00049.php was the final draft of the patch I submitted, with docs patches, that did not break backward compatibility (did not rip out the old syntax) and supported both input and output of ISO-8601 compliant intervals by setting the datestyle to iso8601basic as discussed in the thread linked in the quoted article below. It was applied http://archives.postgresql.org/pgsql-patches/2003-12/msg00253.php and then debated http://archives.postgresql.org/pgsql-patches/2003-12/msg00202.php and then unapplied http://archives.postgresql.org/pgsql-patches/2003-12/msg00030.php on Peter Eisentraut's recommendation to implement SQL standard intervals first. Ron Mayer wrote: Larry Rosenman wrote: Michael Glaesemann wrote: On Jan 8, 2006, at 12:12 , Larry Rosenman wrote: I was thinking of handling the TODO for ISO8601 Interval output. Just to be clear, you're talking about the ISO8601 duration syntax (PnYnMnDTnHnMnS), correct? (The SQL standard made the unfortunate choice to call durations, i.e., lengths of time, intervals.) Back in 2003 I submitted such a patch [1,1b] that resulted in a fair amount of discussion including some still (AFAIK) open issues about the naming of the datestyle settings to control it [2,3,4]. There was also some discussion of the range of ISO 8601 durations to support (ISO 8601 Basic Format, ISO 8601 Alternative Format, and ISO 8601 Extended Format (which is more human-readable)) [5]. Finally, there is a similar, but different syntax currently supported by postgresql (where '1Y1M' means 1 year 1 minute, while ISO 'P1Y1M' would mean 1 year 1 month) and Tom recommended ripping that code out[7] and at one point said my patch was looking cleaner than the existing code [8]. My patch does not yet rip that out. I still use the patch myself, but don't have it updated to CVS tip. I'd be happy to do so if people want that as a starting point. 
Ron [1] http://archives.postgresql.org/pgsql-patches/2003-09/msg00103.php [1b] http://archives.postgresql.org/pgsql-patches/2003-09/msg00286.php [2] http://archives.postgresql.org/pgsql-patches/2003-09/msg00122.php [3] http://archives.postgresql.org/pgsql-patches/2003-09/msg00129.php [4] http://archives.postgresql.org/pgsql-patches/2003-09/msg00130.php [5] http://archives.postgresql.org/pgsql-patches/2003-09/msg00133.php [6] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php [7] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php [8] http://archives.postgresql.org/pgsql-patches/2003-09/msg00121.php ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Supporting NULL elements in arrays
Joe Conway wrote: Last time I thought about this problem, that's what I concluded. I don't think there is a reasonable and backward compatible solution. I also think the best non-compatible solution is to require non-numeric elements to be delimited (double quotes, configurable?), and use NULL unadorned to represent NULL. If we're going non-compatible, would something that's a superset of the SQL Standard's array value constructor be useful; or are the standard's arrays so limited as to not even address the things that cause issues for postgresql arrays? If I read the confusing thing right, I think the standard does use NULL for nulls in arrays, single-quotes for strings, etc. like ARRAY['FOO',null,'BAR'] and unadorned numbers for numbers in arrays. That's similar to what I think Joe suggested, but with single rather than double quotes? ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PERFORM] A Better External Sort?
I've now gotten verification from multiple working DBA's that DB2, Oracle, and SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in setups akin to Oracle RAC) when attached to a decent (not outrageous, but decent) HD subsystem... I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is attainable. Cache based bursts that high, yes. ASTR, no. The DBA's in question run RW installations that include Solaris, M$, and Linux OS's for companies that just about everyone on these lists is likely to recognize. Also, the implication of these pg IO limits is that money spent on even moderately priced 300MBps SATA II based RAID HW is wasted $'s. In total, this situation is a recipe for driving potential pg users to other DBMSs. 25MBps in and 15MBps out is =BAD=. Have we instrumented the code in enough detail that we can tell _exactly_ where the performance drainage is? We have to fix this. Ron -Original Message- From: Luke Lonergan [EMAIL PROTECTED] Sent: Oct 5, 2005 11:24 AM To: Michael Stone [EMAIL PROTECTED], Martijn van Oosterhout kleptog@svana.org Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Nope - it would be disk wait. COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about 15 MB/s (out). - Luke -Original Message- From: Michael Stone [mailto:[EMAIL PROTECTED] Sent: Wed Oct 05 09:58:41 2005 To: Martijn van Oosterhout Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote: COPY TO /dev/null WITH binary 13MB/s 55% user 45% system (ergo, CPU bound) [snip] the most expensive. But it does point out that the whole process is probably CPU bound more than anything else. Note that 45% of that cpu usage is system--which is where IO overhead would end up being counted. 
Until you profile where your system time is going it's premature to say it isn't an IO problem. Mike Stone
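A crude version of the instrumentation being asked for above: time an I/O loop and split the cost into wall-clock versus user+system CPU time. If CPU time roughly equals wall time the loop is CPU bound; a large gap means it is sitting in I/O wait. (The function name, paths, and chunk size are illustrative, not from the thread.)

```python
import os
import time

# Copy src to dst in fixed-size chunks and report (wall_seconds, cpu_seconds).
# cpu_seconds counts user + system time, so kernel-side I/O overhead is
# included -- the distinction Mike Stone raises above.
def profile_copy(src_path, dst_path, chunk=1 << 20):
    t0, c0 = time.time(), os.times()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(chunk)
            if not buf:
                break
            dst.write(buf)
    t1, c1 = time.time(), os.times()
    wall = t1 - t0
    cpu = (c1.user - c0.user) + (c1.system - c0.system)
    return wall, cpu
```

Comparing the two numbers across chunk sizes and file systems is a quick way to see where the 45% system time reported above is actually going.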
Re: [HACKERS] [PERFORM] A Better External Sort?
First I wanted to verify that pg's IO rates were inferior to The Competition. Now there's at least an indication that someone else has solved similar problems. Existence proofs make some things easier ;-) Is there any detailed programmer level architectural doc set for pg? I know the best doc is the code, but the code in isolation is often the Slow Path to understanding with systems as complex as a DBMS IO layer. Ron -Original Message- From: Joshua D. Drake [EMAIL PROTECTED] Sent: Oct 5, 2005 1:18 PM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? The source is freely available for your perusal. Please feel free to point us in specific directions in the code where you may see some benefit. I am positive all of us that can, would put resources into fixing the issue had we a specific direction to attack. Sincerely, Joshua D. Drake
Re: [HACKERS] [PERFORM] A Better External Sort?
I'm putting in as much time as I can afford thinking about pg related performance issues. I'm doing it because of a sincere desire to help understand and solve them, not to annoy people. If I didn't believe in pg, I wouldn't be posting thoughts about how to make it better. It's probably worth some review (suggestions marked with a +): +I came to the table with a possibly better way to deal with external sorts (that now has branched into 2 efforts: short term improvements to the existing code, and the original from-the-ground-up idea). That suggestion was based on a great deal of prior thought and research, despite what some others might think. Then we were told that our IO limit was lower than I thought. +I suggested that as a Quick Fix we try making sure we do IO transfers in large enough chunks based on the average access time of the physical device in question so as to achieve the device's ASTR (ie at least 600KB per access for a 50MBps ASTR device with a 12ms average access time) whenever circumstances allowed us. As far as I know, this experiment hasn't been tried yet. I asked some questions about physical layout and format translation overhead being possibly suboptimal that seemed to be agreed to, but specifics as to where we are taking the hit don't seem to have been made explicit yet. +I made the from-left-field suggestion that perhaps a pg native fs format would be worth consideration. This is a major project, so the suggestion was to at least some extent tongue-in-cheek. +I then made some suggestions about better code instrumentation so that we can more accurately characterize where the bottlenecks are. We were also told that evidently we are CPU bound far before one would naively expect to be based on the performance specifications of the components involved. Double checking among the pg developer community led to some differing opinions as to what the actual figures were and under what circumstances they were achieved. 
Further discussion seems to have converged on both accurate values and a better understanding as to the HW and SW needed; _and_ we've gotten some RW confirmation as to what current reasonable expectations are within this problem domain from outside the pg community. +Others have made some good suggestions in this thread as well. Since I seem to need to defend my tone here, I'm not detailing them here. That should not be construed as a lack of appreciation of them. Now I've asked for the quickest path to detailed understanding of the pg IO subsystem. The goal being to get more up to speed on its coding details. Certainly not to annoy you or anyone else. At least from my perspective, this for the most part seems to have been a useful and reasonable engineering discussion that has exposed a number of important things. Regards, Ron
Re: [HACKERS] [PERFORM] A Better External Sort?
The constants related to inlining involve pcode, not actual assembly instructions, and are compiler version dependent as well as subject to change without notice by the GNU folks... from: http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/Optimize-Options.html#Optimize-Options -finline-limit=n By default, gcc limits the size of functions that can be inlined. This flag allows the control of this limit for functions that are explicitly marked as inline (i.e., marked with the inline keyword or defined within the class definition in c++). n is the size of functions that can be inlined in number of pseudo instructions (not counting parameter handling). The default value of n is 600. Increasing this value can result in more inlined code at the cost of compilation time and memory consumption. Decreasing usually makes the compilation faster and less code will be inlined (which presumably means slower programs). This option is particularly useful for programs that use inlining heavily such as those based on recursive templates with C++. Inlining is actually controlled by a number of parameters, which may be specified individually by using --param name=value. The -finline-limit=n option sets some of these parameters as follows: max-inline-insns is set to n. max-inline-insns-single is set to n/2. max-inline-insns-auto is set to n/2. min-inline-insns is set to 130 or n/4, whichever is smaller. max-inline-insns-rtl is set to n. Using -finline-limit=600 thus results in the default settings for these parameters. See below for a documentation of the individual parameters controlling inlining. Note: pseudo instruction represents, in this particular context, an abstract measurement of function's size. In no way, it represents a count of assembly instructions and as such its exact meaning might change from one release to an another. Further Down It Says... --param name=value In some places, GCC uses various constants to control the amount of optimization that is done. 
For example, GCC will not inline functions that contain more than a certain number of instructions. You can control some of these constants on the command-line using the --param option. The names of specific parameters, and the meaning of the values, are tied to the internals of the compiler, and are subject to change without notice in future releases. In each case, the value is an integer. The allowable choices for name are given in the following table: snip max-inline-insns-single Several parameters control the tree inliner used in gcc. This number sets the maximum number of instructions (counted in gcc's internal representation) in a single function that the tree inliner will consider for inlining. This only affects functions declared inline and methods implemented in a class declaration (C++). The default value is 300. max-inline-insns-auto When you use -finline-functions (included in -O3), a lot of functions that would otherwise not be considered for inlining by the compiler will be investigated. To those functions, a different (more restrictive) limit compared to functions declared inline can be applied. The default value is 300. max-inline-insns The tree inliner does decrease the allowable size for single functions to be inlined after we already inlined the number of instructions given here by repeated inlining. This number should be a factor of two or more larger than the single function limit. Higher numbers result in better runtime performance, but incur higher compile-time resource (CPU time, memory) requirements and result in larger binaries. Very high values are not advisable, as too large binaries may adversely affect runtime performance. The default value is 600. max-inline-slope After exceeding the maximum number of inlined instructions by repeated inlining, a linear function is used to decrease the allowable size for single functions. The slope of that function is the negative reciprocal of the number specified here. The default value is 32. 
min-inline-insns The repeated inlining is throttled more and more by the linear function after exceeding the limit. To avoid too much throttling, a minimum for this function is specified here to allow repeated inlining for very small functions even when a lot of repeated inlining already has been done. The default value is 130. max-inline-insns-rtl For languages that use the RTL inliner (this happens at a later stage than tree inlining), you can set the maximum allowable size (counted in RTL instructions) for the RTL inliner with this parameter. The default value is 600. -Original Message- From: Martijn van Oosterhout kleptog@svana.org Sent: Oct 4, 2005 8:24 AM To: Simon Riggs [EMAIL PROTECTED] Cc: Tom Lane [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort
Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff, are those _burst_ rates from HD buffer or _sustained_ rates from actual HD media? Rates from IO subsystem buffer or cache are usually considerably higher than Average Sustained Transfer Rate. Also, are you measuring _raw_ HD IO (bits straight off the platters, no FS or other overhead) or _cooked_ HD IO (actual FS or pg IO)? BTW, it would seem useful to measure all of raw HD IO, FS HD IO, and pg HD IO as this would give us an idea of just how much overhead each layer is imposing on the process. We may be able to get better IO than we currently are for things like sorts by the simple expedient of making sure we read enough data per seek. For instance, a HD with a 12ms average access time and an ASTR of 50MBps should always read _at least_ 600KB/access or it is impossible for it to achieve its rated ASTR. This number will vary according to the average access time and the ASTR of your physical IO subsystem, but the concept is valid for _any_ physical IO subsystem. -Original Message- From: Jeffrey W. Baker [EMAIL PROTECTED] Sent: Oct 3, 2005 4:42 PM To: josh@agliodbs.com Cc: Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote: Michael, Realistically, you can't do better than about 25MB/s on a single-threaded I/O on current Linux machines, What on earth gives you that idea? Did you drop a zero? Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A Big-Name Proprietary Database doesn't get much more than that either. I find this claim very suspicious. I get single-threaded reads in excess of 1GB/sec with XFS and 250MB/sec with ext3. -jwb
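The arithmetic behind the 600KB figure is worth spelling out, along with a caveat: 600KB per access is a necessary condition, not a sufficient one, since at exactly that transfer size the seek still consumes half of each access.

```python
# Rule of thumb from the message above: to approach a drive's sustained
# rate, each access must transfer at least ASTR * average_access_time.
def min_bytes_per_access(astr_bytes_per_sec, avg_access_sec):
    return astr_bytes_per_sec * avg_access_sec

# Effective throughput for a given transfer size: one seek, then transfer.
def effective_rate(chunk_bytes, astr_bytes_per_sec, avg_access_sec):
    return chunk_bytes / (avg_access_sec + chunk_bytes / astr_bytes_per_sec)

# For a 50MBps ASTR drive with 12ms average access time, the minimum is
# 600KB per access -- but at exactly 600KB the drive only delivers half
# its rated ASTR; transfers several times larger are needed to get close.
```

For instance, 600KB reads on such a drive yield 25MBps, while 6MB reads yield over 45MBps, which is why reading "enough data per seek" matters so much for sorts.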
Re: [HACKERS] [PERFORM] A Better External Sort?
Let's pretend we get a 24HD HW RAID solution like the one J. Baker says he has access to and set it up as a RAID 10. Assuming it uses two 64b 133MHz PCI-X busses and has the fastest HDs available on it, Jeff says he can hit ~1GBps of XFS FS IO rate with that setup (12*83.3MBps = 1GBps). Josh says that pg can't do more than 25MBps of DB level IO regardless of how fast the physical IO subsystem is because at 25MBps, pg is CPU bound. Just how bad is this CPU bound condition? How powerful a CPU is needed to attain a DB IO rate of 25MBps? If we replace said CPU with one 2x, 10x, etc faster than that, do we see any performance increase? If a modest CPU can drive a DB IO rate of 25MBps, but that rate does not go up regardless of how much extra CPU we throw at it... Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Oct 3, 2005 6:03 PM To: Jeffrey W. Baker [EMAIL PROTECTED] Cc: Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Jeffrey, I guess database reads are different, but I remain unconvinced that they are *fundamentally* different. After all, a tab-delimited file (my sort workload) is a kind of database. Unfortunately, they are ... because of CPU overheads. I'm basing what's reasonable for data writes on the rates which other high-end DBs can make. From that, 25mb/s or even 40mb/s for sorts should be achievable but doing 120mb/s would require some kind of breakthrough. On a single disk you wouldn't notice, but XFS scales much better when you throw disks at it. I get a 50MB/sec boost from the 24th disk, whereas ext3 stops scaling after 16 disks. For writes both XFS and ext3 top out around 8 disks, but in this case XFS tops out at 500MB/sec while ext3 can't break 350MB/sec. That would explain it. I seldom get more than 6 disks (and 2 channels) to test with. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] A Better External Sort?
OK, change "performance" to "single thread performance" and we still have a valid starting point for a discussion. Ron -Original Message- From: Gregory Maxwell [EMAIL PROTECTED] Sent: Oct 3, 2005 8:19 PM To: Ron Peacetree [EMAIL PROTECTED] Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote: [snip] Just how bad is this CPU bound condition? How powerful a CPU is needed to attain a DB IO rate of 25MBps? If we replace said CPU with one 2x, 10x, etc faster than that, do we see any performance increase? If a modest CPU can drive a DB IO rate of 25MBps, but that rate does not go up regardless of how much extra CPU we throw at it... Single threaded was mentioned. Plus even if it's purely cpu bound, it's seldom as trivial as throwing CPU at it, consider the locking in both the application, in the filesystem, and elsewhere in the kernel.
Re: [HACKERS] [PERFORM] A Better External Sort?
*blink* Tapes?! I thought that was a typo... If our sort code is based on sorting tapes, we've made a mistake. HDs are not tapes, and Polyphase Merge Sort and its brethren are not the best choices for HD based sorts. Useful references to this point: Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed) Tharp, ISBN 0-471-60521-2, starting p352 Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289) The winners of the Daytona version of Jim Gray's sorting contest, for general purpose external sorting algorithms that are of high enough quality to be offered commercially, also demonstrate a number of better ways to attack external sorting using HDs. The big takeaways from all this are: 1= As in Polyphase Merge Sort, optimum External HD Merge Sort performance is obtained by using Replacement Selection and creating runs of different lengths for later merging. The values are different. 2= Using multiple HDs split into different functions, IOW _not_ simply as RAIDs, is a big win. A big enough win that we should probably consider having a config option to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary work area(s). 3= If the Key is small compared to record size, Radix or Distribution Counting based algorithms are worth considering. The good news is all this means it's easy to demonstrate that we can improve the performance of our sorting functionality. Assuming we get the abysmal physical IO performance fixed... (because until we do, _nothing_ is going to help us as much) Ron -Original Message- From: Tom Lane [EMAIL PROTECTED] Sent: Oct 1, 2005 2:01 AM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. 
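A minimal sketch of the replacement selection mentioned in point 1 (illustrative Python, not tuplesort.c; assumes records are comparable and never None): with room for M records in memory, it emits sorted runs that average about 2*M records on randomly ordered input, which is what makes the technique attractive for run formation.

```python
import heapq
import itertools

def replacement_selection(records, memory_slots):
    it = iter(records)
    # Heap keys are (run_number, record): a record too small to extend the
    # run currently being written is tagged for the next run and sinks
    # below everything still belonging to the current run.
    heap = [(0, rec) for rec in itertools.islice(it, memory_slots)]
    heapq.heapify(heap)
    runs, run, current = [], [], 0
    while heap:
        run_no, rec = heapq.heappop(heap)
        if run_no != current:        # current run exhausted; start the next
            runs.append(run)
            run, current = [], run_no
        run.append(rec)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (current if nxt >= rec else current + 1, nxt))
    runs.append(run)
    return runs
```

For example, replacement_selection([5, 1, 4, 2, 3], 2) yields two sorted runs, [1, 4, 5] and [2, 3]: with only two slots of "memory", the first run still grows to three records because 4 arrived while it could yet be extended.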
It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. I don't believe we can simply legislate that there be only one merge pass. That would mean that, if we end up with N runs after the initial run-forming phase, we need to fit N tuples in memory --- no matter how large N is, or how small work_mem is. But it seems like a good idea to try to use an N-way merge where N is as large as work_mem will allow. We'd not have to decide on the value of N until after we've completed the run-forming phase, at which time we've already seen every tuple once, and so we can compute a safe value for N as work_mem divided by largest_tuple_size. (Tape I/O buffers would have to be counted too of course.) It's been a good while since I looked at the sort code, and so I don't recall if there are any fundamental reasons for having a compile-time-constant value of the merge order rather than choosing it at runtime. My guess is that any inefficiencies added by making it variable would be well repaid by the potential savings in I/O.
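The "compute N after run formation" rule can be expressed as a back-of-the-envelope calculation (the function name and the buffer figure are illustrative, not from tuplesort.c):

```python
# A safe merge order: each input tape must keep one tuple plus one I/O
# buffer resident in work_mem; a merge below 2-way is meaningless.
def safe_merge_order(work_mem_bytes, largest_tuple_bytes, tape_buffer_bytes):
    return max(2, work_mem_bytes // (largest_tuple_bytes + tape_buffer_bytes))
```

With 1GB of sort memory, 1KB tuples, and a hypothetical 256KB buffer per tape, this allows roughly a 4000-way merge, comfortably enough to finish the quoted 60GB sort in a single merge pass rather than 13.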