Re: [HACKERS] bitmap scans, btree scans, and tid order
Tom Lane wrote:

Jeffrey W. Baker [EMAIL PROTECTED] writes:

I see that Tom has already done the infrastructure work by adding getmulti, but getmulti isn't used by nodeIndexscan.c, only nodeBitmapIndexscan.c. Will btree index scans be executed by creating in-memory bitmaps in 8.1, or will some scans still be executed the usual way?

We aren't going to remove the existing indexscan behavior, because bitmap scans lose the ordering of the underlying index. There are many situations where that ordering is important. (See for instance the recent changes to make MAX/MIN use that behavior.)

Would you take a patch that retained the optimized execution of plans returning 1 tuple and also fixed the random-heap problem?

-jwb

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] bitmap scans, btree scans, and tid order
Jeffrey Baker wrote:

Would you take a patch that retained the optimized execution of plans returning 1 tuple and also fixed the random heap problem?

Can you elaborate on what you're proposing? Obviously sorted b+-tree output is important for a lot more than just min()/max(). I don't see an obvious way to produce sorted output from a bitmap index scan without requiring an additional sort step (which would be rather pointless -- the whole point of the optimization is to avoid an additional sort).

-Neil
Re: [HACKERS] bitmap scans, btree scans, and tid order
Neil Conway wrote:

Jeffrey Baker wrote:

Would you take a patch that retained the optimized execution of plans returning 1 tuple and also fixed the random heap problem?

Can you elaborate on what you're proposing? Obviously sorted b+-tree output is important for a lot more than just min()/max(). I don't see an obvious way to produce sorted output from a bitmap index scan without requiring an additional sort step (which would be rather pointless -- the whole point of the optimization is to avoid an additional sort).

I understand the importance of returning tuples in index order for many plans (although I probably haven't thought of all the cases; min/max is the most obvious, order by + limit is another). The only problem I'm trying to solve is when an indexscan returns a large result, causing the heap to be visited in index order, which is to say random order, from the disk's perspective. When I investigated this last year, sorting the intermediate result of the index scan in disk order was good for a reduction by two-thirds in actual execution time, and sorting the scan result in chunks of 1000 tuples was enough to reduce the time by half.

I'm considering one of the following courses of action:

Change nodeIndexscan.c to call index_getmulti, and to handle multiple tuples returned. That code would sort the tuple array and store the tuples in the result in disk order.

-or-

Change the planner/executor to use the bitmap scan in all cases where index order is unimportant. From my reading of the current code, the bitmap scan is only used in case of a join.

-jwb
Re: [HACKERS] Cost of XLogInsert CRC calculations
-----Original Message-----
From: Simon Riggs [mailto:[EMAIL PROTECTED]]
Sent: 12 May 2005 16:52
To: Mark Cave-Ayland (External)
Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian'; pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

It might be possible to speed things up by a factor of two using a two-byte lookup table rather than a one-byte lookup. This would take up 65536 table entries, which is only 256KB. If we held that in shared memory we could calculate the entries at startup, or read them in from a file. It would only use the same space as if we had 255 connections, so not a great increase in memory usage overall. Need to look at the effect of retrieving everything from memory rather than from cache, so it probably wouldn't be twice as fast.

Best Regards, Simon Riggs

Hi everyone,

As per this thread, I decided to spend a little time over the weekend looking at the performance of the existing CRC64 code in order to determine whether we could code the algorithm in a more efficient manner. In order to do this, I devised a couple of test programs using code cut/pasted from pg_crc.h (and pg_resetxlog) in order to perform some timings. These tests were done on two systems: a P4 1.8GHz laptop with MingW (Win32) and gcc 3.2.3, and a Xeon 2.8GHz server running FC1 and gcc 3.3.2.

The program used to time the CRC64 calculation simply did the following:

1) Create an 8k buffer of data

2) Fill the buffer with some repeatable values; in this case the algorithm used was the following:

    for (i = 0; i < 8192; i++)
    {
        *data = (char) (i % 256);
    }

3) Calculate the CRC64 of the buffer 10,000 times and display the result, making a note of the start and end times. Using this information an estimate of the cost of a single CRC calculation can easily be found.

I noticed looking at the code that there were two versions of the CRC code, one using 32-bit arguments wrapped within a check for INT64_IS_BUSTED and another using 64-bit unsigned integers. Since I wasn't sure initially which one was being used, I decided to test both of them :) Both programs were compiled using gcc's -O2 flag for optimisation.

The first thing to notice about the results was that I obtained different CRC values using both algorithms on Win32, which were both different to the results obtained under Linux:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): 0x541d4a27 (crc1) 0x8dfa57ae (crc0)
2) 64-bit code: 0x817f722b1f17b253 (crc0)

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): 0x21d064a2 (crc1) 0xe7c74332 (crc0)
2) 64-bit code: 0x21d064a2e7c74332 (crc0)

The second interesting thing to notice was the comparative speeds:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): ~57us per calculation
2) 64-bit code: ~130us per calculation

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): ~29us per calculation
2) 64-bit code: ~76us per calculation

Looking at the flags in pg_config.h from both source builds, it would appear that the 64-bit code is being used, since the compiler defines HAVE_LONG_LONG_INT_64. So that means there could be a potential 100% performance increase by just using the existing INT64_IS_BUSTED code on x86 32-bit processors. I don't know, given the rate of xlog records being written, whether or not this translates to more than a couple of minutes in an insert of a million records or so.

I did have a look at the assembly code produced by the INT64_IS_BUSTED code and it appears fairly tight; I'm no x86 expert, but I think it would be hard to optimise what was produced by the compiler. If anyone else is interested in looking at these results then I will happily send the test programs off-list - however my main concern is that the Win32 CRC64 results just don't seem to be right :(

Kind regards,

Mark.
WebBased Ltd
17 Research Way
Plymouth PL6 8BT
T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
[HACKERS] postgreSQL as deductive DBMS
Hello all,

I have some ideas how to increase the expressive power of the PostgreSQL query language. It is no secret that SQL is very poor at expressing many important queries, and we have to use the procedural extensions of SQL to realize them. However it is not a good idea to split the query language into two parts (declarative and procedural), at least because the query language must ultimately operate with the objects of the data domain, not with the bits the objects consist of. Thus another alternative to increase the expressive power of the query language is to develop its declarative (i.e. nonprocedural) part. And here we come to the deductive database (DDB) with its logic language Datalog. Every logic query is a set of inverse implications, which describe what to find (i.e. exactly the declarative approach) and not how to find it (i.e. exactly the procedural approach). We can translate a logic query into a set of SQL commands and then run them to get the result. Some samples with the DLQ compiler can be downloaded from www.datalab.kharkov.ua (DLQ is our original version of Datalog which was developed with the purpose of tying RDB and DDB closely together).

Now some words about what must be done to realize the described feature. The simplest and quickest way, but a way without a future, is to write a language handler. The other, more correct, way is to slightly extend the DML part of SQL and to extend the DDL more substantially. For example, we have a relation Inheritance with two attributes ClassID and ParentID. Now we want to define all descendants or all ancestors. For this goal we define the predicate inheritance_all with the following two rules (i.e. inverse implications):

inheritance_all(ClassID, ParentID) :- inheritance(ClassID, ParentID);
inheritance_all(ClassID, ParentID) :- inheritance(ClassID, X), inheritance_all(X, ParentID).

We put these rules into the database and call, for example, the following SQL commands:

-- find all descendants
SELECT * FROM ddb_name.inheritance_all(_, _)

-- find all descendants from ParentID = 1
SELECT * FROM ddb_name.inheritance_all(_, 1)

where ddb_name is the name of the deductive database where our rules are kept, and _ designates an anonymous variable (see Prolog notation for details).

Regards, Dmitriy
Re: [HACKERS] Cost of XLogInsert CRC calculations
-----Original Message-----
From: Mark Cave-Ayland [mailto:[EMAIL PROTECTED]]
Sent: 16 May 2005 09:04
To: 'Simon Riggs'
Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian'; 'pgsql-hackers@postgresql.org'
Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

The program used to time the CRC64 calculation simply did the following:

(cut)

2) Fill the buffer with some repeatable values; in this case the algorithm used was the following:

    for (i = 0; i < 8192; i++)
    {
        *data = (char) (i % 256);
    }

Sigh, so it would help if I had added the offset to the data pointer... ;)

    for (i = 0; i < 8192; i++)
    {
        *(data + i) = (char) (i % 256);
    }

This now gives the following (correct) result on both platforms:

Win32: 1.8GHz P4, WinXP
Linux: 2.8GHz Xeon, FC1

Win32 UINT64: 0x782104059a01660 (crc0)                  ~158us
Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0)      ~58us
FC1 UINT64:   0x782104059a01660 (crc0)                  ~76us
FC1 UINT32:   0x78210405 (crc1), 0x59a01660 (crc0)      ~29us

Note that we still find that using the INT64_IS_BUSTED code is over 100% quicker than the UINT64 code for CRC64 calculation, and I believe it is not being used by default under Linux or Win32 for 32-bit platforms. Of course, Tom's suggestion of going for CRC32 across the board would hopefully solve this entirely and bring the times down a little further too.

Kind regards,

Mark.

WebBased Ltd
17 Research Way
Plymouth PL6 8BT
T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
[HACKERS] SO_KEEPALIVE
How come we don't set SO_KEEPALIVE in libpq? Is there any reason why we wouldn't want it on? -- /Dennis Björklund ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[HACKERS] farsi faq has not been added yet in website,
Would you please fix it? Why is the Farsi FAQ not on the web site?

With Regards,
--taghi
Re: [HACKERS] postgreSQL as deductive DBMS
Having written my thesis about deductive DBs I cannot resist giving my 2 cents.

On Mon, May 16, 2005 at 01:42:24PM +0400, Dmitriy Letuchy wrote:

Now some words about what must be done to realize described feature. The simple quickest way but the way without future is to write language handler. Other more correct way is to slightly extend DML part of SQL and more essentially extend DDL. For example, we have relation Inheritance with two attributes ClassID and ParentID. Now we want to define all descendants or all ancestors. For this goal we define predicate inheritance_all with the next two rules (i.e. inverse implications):

inheritance_all(ClassID, ParentID) :- inheritance(ClassID, ParentID);
inheritance_all(ClassID, ParentID) :- inheritance(ClassID, X), inheritance_all(X, ParentID).

We put this rules into database and call, for example, the next SQL commands:

-- find all descendents
SELECT * FROM ddb_name.inheritance_all(_, _)

How do you plan to execute this statement? As you mentioned above, all logic queries can be rewritten into SQL ones, but you have to find a way to handle recursion. I would think the best way is to add recursion to SQL and then completely rewrite the statements. Also, and that's where it starts to become interesting, how do you plan to handle negation inside recursion?

Michael
--
Michael Meskes
Email: Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: [EMAIL PROTECTED]
Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!
Re: [HACKERS] postgreSQL as deductive DBMS
Dimitry, Thus another alternative to increase expressive power of query language is to develop its declarative (i.e. nonprocedural) part. And here we come to deductive database (DDB) with its logic language Datalog. You may want to look at the work of Rada Chirkova, who has already written a PostgreSQL-parse-tree-to-DataLog converter: http://research.csc.ncsu.edu/selftune/Report_031005.pdf -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Cost of XLogInsert CRC calculations
Mark Cave-Ayland [EMAIL PROTECTED] writes:

I didn't post the sources to the list originally as I wasn't sure if the topic were of enough interest to warrant a larger email. I've attached the two corrected programs as a .tar.gz - crctest.c uses uint32, whereas crctest64.c uses uint64.

I did some experimentation and concluded that gcc is screwing up big-time on optimizing the CRC64 code for 32-bit Intel. It does much better on every other architecture though. Here are some numbers with gcc 3.2.3 on an Intel Xeon machine. (I'm showing the median of three trials in each case, but the numbers were pretty repeatable. I also tried gcc 4.0.0 on this machine and got similar numbers.)

gcc -O1 crctest.c	0.328571 s
gcc -O2 crctest.c	0.297978 s
gcc -O3 crctest.c	0.306894 s
gcc -O1 crctest64.c	0.358263 s
gcc -O2 crctest64.c	0.773544 s
gcc -O3 crctest64.c	0.770945 s

When -O2 is slower than -O1, you know the compiler is blowing it :-(. I fooled around with non-default -march settings but didn't see much change.

Similar tests on a several-year-old Pentium 4 machine, this time with gcc version 3.4.3:

gcc -O1 -march=pentium4 crctest.c	0.486266 s
gcc -O2 -march=pentium4 crctest.c	0.520237 s
gcc -O3 -march=pentium4 crctest.c	0.520299 s
gcc -O1 -march=pentium4 crctest64.c	0.928107 s
gcc -O2 -march=pentium4 crctest64.c	1.247673 s
gcc -O3 -march=pentium4 crctest64.c	1.654102 s

Here are some comparisons showing that the performance difference is not inherent:

IA64 (Itanium 2), gcc 3.2.3:
gcc -O1 crctest.c	0.898595 s
gcc -O2 crctest.c	0.599005 s
gcc -O3 crctest.c	0.598824 s
gcc -O1 crctest64.c	0.524257 s
gcc -O2 crctest64.c	0.524168 s
gcc -O3 crctest64.c	0.524140 s

X86_64 (Opteron), gcc 3.2.3:
gcc -O1 crctest.c	0.46 s
gcc -O2 crctest.c	0.46 s
gcc -O3 crctest.c	0.46 s
gcc -O1 crctest64.c	0.41 s
gcc -O2 crctest64.c	0.41 s
gcc -O3 crctest64.c	0.41 s

PPC64 (IBM POWER4+), gcc 3.2.3:
gcc -O1 crctest.c	0.819492 s
gcc -O2 crctest.c	0.819427 s
gcc -O3 crctest.c	0.820616 s
gcc -O1 crctest64.c	0.751639 s
gcc -O2 crctest64.c	0.894250 s
gcc -O3 crctest64.c	0.888959 s

PPC (Mac G4), gcc 3.3:
gcc -O1 crctest.c	0.949094 s
gcc -O2 crctest.c	1.011220 s
gcc -O3 crctest.c	1.013847 s
gcc -O1 crctest64.c	1.314093 s
gcc -O2 crctest64.c	1.015367 s
gcc -O3 crctest64.c	1.011468 s

HPPA, gcc 2.95.3:
gcc -O1 crctest.c	1.796604 s
gcc -O2 crctest.c	1.676023 s
gcc -O3 crctest.c	1.676476 s
gcc -O1 crctest64.c	2.022798 s
gcc -O2 crctest64.c	1.916185 s
gcc -O3 crctest64.c	1.904094 s

Given the lack of impressive advantage to the 64-bit code even on 64-bit architectures, it might be best to go with the 32-bit code everywhere, but I also think we have grounds to file a gcc bug report. Anyone want to try it with non-gcc compilers?

I attach a slightly cleaned-up version of Mark's original (doesn't draw compiler warnings or errors on what I tried it on).
regards, tom lane

[attachment: crctest.tar.gz]
Re: [HACKERS] SO_KEEPALIVE
Dennis Bjorklund [EMAIL PROTECTED] writes: How come we don't set SO_KEEPALIVE in libpq? Is there any reason why we wouldn't want it on? Is there any reason we *would* want it on? The server-side keepalive should be sufficient to get whatever useful impact it might have. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Best way to scan on-disk bitmaps
Tom Lane [EMAIL PROTECTED] writes:

I think it would be easy to change the planner and btree to handle this (where easy means I remember where all the skeletons are buried). But I don't know the gist code hardly at all. Can anyone offer an informed opinion on whether gist can handle this now, and if not what it would take to handle it?

Currently gist indexes only use the first column for page splits, making multi-key gist indexes basically useless. The problem is it's hard to imagine an API for a pickSplit function that could handle multi-key indexes with disparate data types and operator classes.

I had an idea of a new way to deal with gist indexes that simplified the API and side-stepped the whole issue, but you raised concerns that it might be too limiting. Unfortunately the mailing list archive seems to have missed this discussion. I've attached three messages from the discussion at the time.

---BeginMessage---

Oleg, Teodor,

If I understand the code correctly, GiST will only pass the first attribute of each index tuple to the user-defined PickSplit method when it wants to split a node (see circa line 1269 of gist.c). Is this a wise design decision? Granted, in many situations the first attribute in the index is sufficient to make a reasonable decision about how to divide the node into two halves, but I don't think that is universally true. For example, consider a two-column index whose first attribute has a small number of distinct values. It could well be that all the first-attribute values in a node-to-be-split would be the same. Only passing the first attribute to PickSplit would result in an essentially random distribution of tuples among the split nodes, rather than allowing the GiST extension to make use of the second attribute to partition the nodes.
That's an extreme example, but it is easy to construct more realistic scenarios (basically, any situation in which the cardinality of the first index attribute is low -- a reasonably common occurrence with a multi-column index, I believe). I'm not sure whether this would be a problem in practice. Speculation: repeated invocations of PickSplit are one of the main factors in deriving the ultimate shape of the GiST tree. Poor distribution of keys resulting from PickSplit would eventually result in unnecessarily loose bounding predicates in internal nodes, which would hurt performance. Comments welcome, Neil ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---End Message--- ---BeginMessage--- Greg Stark [EMAIL PROTECTED] writes: I'm not sure that GiST indexes behave the same way as btree indexes for the multi-column case. In a btree index the second column is entirely subordinate to the first column. In a GiST index the data is multi-dimensional, and all dimensions are equally important. In fact on further consideration I do have a proposal. If you look at the GiST implementations for various index types you'll see that many (all?) take the same approach for PickSplit. In fact they pretty much all have the same code copy/pasted to handle it. The approach they take is to have a function which calculates an abstract distance between any two entries. There's an algorithm that they use to pick the split based on this distance function. If you abandoned PickSplit and instead exposed this distance function as the external API then the behaviour for multi-column indexes is clear. You calculate the distance along all the axes and calculate the diagonal distance. I think abandoning PickSplit for the distance function might also mean you don't need a separate function for Penalty either, but I'm not sure on that. 
-- greg ---End Message--- ---BeginMessage--- Greg Stark [EMAIL PROTECTED] writes: The approach they take is to have a function which calculates an abstract distance between any two entries. There's an algorithm that they use to pick the split based on this distance function. If you abandoned PickSplit and instead exposed this distance function as the external API then the behaviour for multi-column indexes is clear. You calculate the distance along all the axes and calculate the diagonal distance. Hmm ... the problem with that is the assumption that different opclasses will compute similarly-scaled distances. If opclass A generates distances in the range (0,1e6) while B generates in the range (0,1), combining them with Euclidean distance won't work well at all. OTOH you can't blindly normalize, because in some cases maybe the data is such that a massive difference in distances is truly appropriate. I'm also a bit leery of the assumption that every GiST application can reduce its PickSplit logic to Euclidean distances. regards, tom lane ---End Message--- (hash and rtree are not at issue since they don't support multi-key indexes.) It seems like it would be trivial to make
Re: [HACKERS] pgFoundry
On Saturday 07 May 2005 16:23, Joshua D. Drake wrote: What does it mean to track the status of something? How would the status change except by discussion? What would be the point of announcing the status of something without allowing people to comment? No one said anything about not letting people comment or discuss. What I am suggesting is a better public presentation of what the heck is going on with PostgreSQL development. This has been a perennial problem and has existed at least since version 6.1.0, as that's when I first noticed it. Looking at the website alone doesn't show the dynamic development that is actually happening, and has never shown it. I very nearly passed PostgreSQL by because it looked unmaintained at the time from looking at the website. That cannot be said at this juncture, but at the same time that is not real indication of how dynamic things really are. From a sidelines point of view, a developer status summary page would allow one to follow development without having to read every message in HACKERS. At this point in my work, I am unable to follow development like I once did (one reason I stepped down as RPM maintainer) and have no real idea of the direction for 8.1 as a result. To put it much more bluntly: PostgreSQL development (both the process and the codebase) has one of the steepest learning curves around, steeper even than Plone, which is acknowledged by many to have an incredibly steep learning curve. The only way in my mind to get this dynamism on the website is to make the website part of the process at some level. If someone has to go through and digest for the website (hacker-bits, a la general bits) then that takes away developer resources (because someone has to be fairly close to the development process to do that sort of digestion). 
Rather, if developers are using a system that automatically pushes to a web interface (or even uses a web interface with a cli component) then the status is automatically generated and the work of updating status is distributed.

I think you have a severely flawed idea of how free software development proceeds.

Then you obviously aren't paying attention. Look at other major OSS projects. They have these things in place. Even the Linux kernel has a bugzilla (although I am not advocating bugzilla). Not to mention KDE, Gnome, Debian... These projects also have reasonably defined milestones for particular releases and show status of those milestones during the release. Virtually all OSS projects I am involved with publish a generalized road map online. Some are more organized than others.

PostgreSQL has a different culture, this is true. But it is somewhat of an intimidating culture for an outsider; once 'inside' (which takes 6 months or more unless you're another Tom Lane (I love going back and reading the way he just 'jumped in' to the project)) this is one of the friendliest development communities around. One might say the high bar of entry and the steep learning curve is partly the reason for that; culture changes take careful thought to implement, and a web-published development status might easily change the whole culture of the project.

The biggest flamewars I have seen here involve one of the following topics:
1.) GPL vs BSD
2.) MySQL
3.) Multithreaded versus multiprocess.

Most other communities (with the notable exception of the GNUradio community, which is even more polite than this one, something I thought was not possible) have many more hot-button topics than three. Like any other management process (sorry, those of you who think OSS means 'no management' - there is management here) the PostgreSQL process is unique due to the unique collection of members, and what works for one community won't necessarily work for another.
Having said that, I'd love an 'at-a-glance' development status showing, possibly in graphical terms, where each subproject of the PostgreSQL core is at right now, updated frequently, as I've changed into more of a user role than a developer role; I can see the forest now, instead of all the RPM bushes and prairie grass. -- Lamar Owen Director of Information Technology Pisgah Astronomical Research Institute 1 PARI Drive Rosman, NC 28772 (828)862-5554 www.pari.edu ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] SO_KEEPALIVE
On Mon, 16 May 2005, Tom Lane wrote:

How come we don't set SO_KEEPALIVE in libpq? Is there any reason why we wouldn't want it on?

Is there any reason we *would* want it on? The server-side keepalive should be sufficient to get whatever useful impact it might have.

Wouldn't the client also want to know that the server is not there anymore? I talked to Gaetano Mendola (I think, but you never know on irc :-) and he had some clients that had been hanging around for 3 days after the server had been down and later up again (stuck in recv). Server-side keepalive is enough for the server to clean up when clients disappear, but it does nothing to help clients detect that the server is gone. So I don't see what server-side keepalive has to do with it.

--
/Dennis Björklund
Re: [HACKERS] alternate regression dbs?
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Try attached ... season to taste. The bulk of it is changes for dblink which has the dbname hardcoded. Joe, any objections here? Hmm, I can't find the message with the attachment, in my inbox or in the list archives. Can anyone point me to it? Joe ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] pgFoundry
Lamar Owen wrote: Look at other major OSS projects. They have these things in place. Even the Linux kernel has a bugzilla (although I am not advocating bugzilla). Not to mention KDE, Gnome, Debian.. These projects also have reasonably defined milestones for particular releases and show status of those milestones during the release. Virtually all OSS projects I am involved with publish a generalized road map online. Some are more organized than others. PostgreSQL has a different culture, this is true. I don't think anybody is arguing for a radical change in culture - certainly I would not be so presumptuous after only a couple of years :-) But a roadmap could be useful in many ways. It need not tie anybody down, if positioned right, but can help people to see where things are going, and where the gaps are. This could in a sense be as simple as prioritising the TODO list. Right now anybody who wants to contribute and looks at the list has no idea if the item is considered important or even if it is still thought to be desirable. There are many changes that can be rung on this theme - you would probably want to keep the roadmap process as light as possible for the cultural reasons you mention. cheers andrew ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Cost of XLogInsert CRC calculations
Hi Tom, I didn't post the sources to the list originally as I wasn't sure if the topic were of enough interest to warrant a larger email. I've attached the two corrected programs as a .tar.gz - crctest.c uses uint32, whereas crctest64.c uses uint64. Kind regards, Mark. WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: 16 May 2005 15:01 To: Mark Cave-Ayland (External) Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations Mark Cave-Ayland [EMAIL PROTECTED] writes: Sigh, so it would help if I had added the offset to the data pointer... ;) Would you post the corrected program so people can try it on a few other architectures? No point in reinventing the wheel, even if it is a pretty trivial wheel. regards, tom lane crctest.tar.gz Description: Binary data ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] bitmap scans, btree scans, and tid order
On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:

Jeffrey Baker [EMAIL PROTECTED] writes: Change the planner/executor to use the bitmap scan in all cases where index order is unimportant. From my reading of the current code, the bitmap scan is only used in case of a join.

This is a fallacy, and I think your concern is largely mistaken. Have you experimented with the cases you are worried about?

Perhaps I have not stated the problem clearly. Believe me, I have experimented extensively with this problem. So here's the statement of the problem: during an index scan, the heap is visited in index order, which can be and frequently is random order on the disk. An index scan that currently takes 15 minutes on my database takes only 5 when the tuples are fetched in strict disk order, and takes 8 minutes if the tuples are fetched in disk order in groups of 1000. The problem exists when the scan is expected to return a lot of tuples, but the planner chooses an index scan anyway. In many cases, sequential scan would be faster than index scan + random heap visits.

It's entirely possible that the current cost model needs adjustment to make the planner pick the bitmap scan in more cases. However, it is easy to demonstrate (either by thought-experiment or actual trial) that a bitmap scan isn't necessarily a win.

I agree. There's certainly a threshold below which the bitmap would not make sense. It's also possible that changing the btree scan to work in groups of tuples instead of single tuples would make more sense, which is why I ventured two different solutions to the problem.

The overhead of maintaining the bitmap is far from zero, and you don't get anything unless the resulting pattern of heap page fetches is significantly cleaner than before.

Bitmap scans naturally come back in TID order. I realize this isn't a 1:1 correspondence with disk order, but it's a good start. I also like the way the index scan and heap scan are decoupled in the bitmap code. It leaves room for more serious optimization of disk access, which is relatively difficult in the synchronous, one-at-a-time btree code.

Therefore, a patch that eliminates cost-estimation in favor of trying to force one choice or the other is definitely not likely to be received favorably ...

That's not the idea.

-jwb
Re: [HACKERS] pgFoundry
Andrew, It need not tie anybody down, if positioned right, but can help people to see where things are going, and where the gaps are. This could in a sense be as simple as prioritising the TODO list. Right now anybody who wants to contribute and looks at the list has no idea if the item is considered important or even if it is still thought to be desirable. There are many changes that can be rung on this theme - you would probably want to keep the roadmap process as light as possible for the cultural reasons you mention. The substantial problem here is that nobody *wants* to create a roadmap-type document. If you can find a volunteer, it'd be worth discussing -- I can see a way we can make a roadmap without being deceptive about how we get features. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pgFoundry
Andrew Dunstan [EMAIL PROTECTED] writes: I don't think anybody is arguing for a radical change in culture - certainly I would not be so presumptuous after only a couple of years :-) But a roadmap could be useful in many ways. It need not tie anybody down, if positioned right, but can help people to see where things are going, and where the gaps are. This could in a sense be as simple as prioritising the TODO list. I think that even getting that done would turn into a flamew^H^H^H^Hhuge distraction. The way things really work around here is that individual developers have their own priorities and they work on what seems most important to them at the time. (In some cases those priorities may be set by their companies more than by the individuals, but that's irrelevant from the community's perspective.) ISTM any sort of project-wide prioritization would be either (1) meaningless or (2) a guaranteed-to-fail attempt to assert control over other contributors. But the TODO list could certainly be made more informative without getting into that swamp. I don't think it does very well at conveying the relative sizes of the work items, nor the extent to which there is consensus about how particular problems ought to be solved (the fact that something is on TODO does not necessarily mean that all the major contributors have bought into it...). And of course you're right that it tells nothing at all about whether progress is currently being made on a given item. The markers indicating that someone has expressed interest in an item don't mean they are actively doing anything with it. The real difficulty here is exactly what Lamar noted: who's going to do the work? Bruce seems to be swamped already, so we'd need a new volunteer to maintain a more useful TODO list, and there doesn't seem to be anyone who wants to do it and has the depth of familiarity with the project to do a good job of it. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pgFoundry
On Mon, 2005-05-16 at 12:50, Lamar Owen wrote: From a sidelines point of view, a developer status summary page would allow one to follow development without having to read every message in HACKERS. At this point in my work, I am unable to follow development like I once did (one reason I stepped down as RPM maintainer) and have no real idea of the direction for 8.1 as a result. I don't think anyone is against this idea, but as of yet no one has stepped forward with the time to keep such a thing updated. The only way in my mind to get this dynamism on the website is to make the website part of the process at some level. If someone has to go through and digest for the website (hacker-bits, a la general bits) then that takes away developer resources (because someone has to be fairly close to the development process to do that sort of digestion). Rather, if developers are using a system that automatically pushes to a web interface (or even uses a web interface with a cli component) then the status is automatically generated and the work of updating status is distributed. One idea I've tossed around is requiring patches to include release notes, and then display the release notes on the web site as a "done so far" type of list. It doesn't get you what is under active development, but would get you a more up-to-date picture of changes as a release evolves. Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] alternate regression dbs?
Joe Conway wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Try attached ... season to taste. The bulk of it is changes for dblink which has the dbname hardcoded. Joe, any objections here? Hmm, I can't find the message with the attachment, in my inbox or in the list archives. Can anyone point me to it? my fault - I sent the original to the wrong list - meanwhile, Tom, who was copied on the original, replied to that :-) Anyway, see http://archives.postgresql.org/pgsql-patches/2005-05/msg00179.php cheers andrew ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[HACKERS] Returning the name of a primary key
Hello all, I need to write a function that retrieves the name of at least one table primary key, if it exists. The only argument passed to the function is the table name. I have thought of something like this:

char *
give_pkey(char *table_char)
{
    TupleDesc tupdesc;
    Form_pg_attribute att;
    Form_pg_index ind;
    int i, c = 0, temp = -1;

    tupdesc = (TupleDesc) RelationNameGetTupleDesc(table_char);
    ind = /* something that indicates which index is for this table */;

    for (i = 0; i < tupdesc->natts; i++)
    {
        att = tupdesc->attrs[i];
        c = c + 1;
        /* Something that can compare each attribute to determine
         * if it is a primary key? */
        if (ind->indisprimary && temp == -1)
        {
            temp = c;
            att = tupdesc->attrs[temp];
        }
    }
    return pstrdup(NameStr(att->attname));
}

Sorry, I don't have much experience in C programming; thanks in advance for any suggestions. Regards, Juan P. Espino ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] bitmap scans, btree scans, and tid order
On P, 2005-05-15 at 23:58 -0700, Jeffrey Baker wrote: I'm considering one of the following courses of action: Change nodeIndexscan.c to call index_getmulti, and to handle multiple tuples returned. That code would sort the tuple array and store the tuples in the result in disk order. -or- Change the planner/executor to use the bitmap scan in all cases where index order is unimportant. From my reading of the current code, the bitmap scan is only used in case of a join. I think the 2nd option is probably the way to go. Probably not _all_ cases - it's probably cheaper to not build a bitmap for cases where the index returns only a few (or just one) rows. OTOH, some scheme where you fill a sort_mem-sized memory structure with tuples, which are fetched in heap order but stored in that structure in index order, could also be an interesting optimisation for sort-order-preserving btree-index scans. This could also be used for bigger sorts as a first step, by saving each chunk to disk and then merging them back into the result. In pseudocode, one iteration of this could go like this:

TIDARRAY = ARRAY[1000]
I = 0
FOR TID IN FETCH_FROM_BTREE(0, 1000):
    TIDARRAY[I] := (TID, I)
    I := I + 1
SORT_ARRAY_ON_TID()   # this must be much faster than sorting
                      # whole tuples for this method to be a win
                      # over bitmap_scan + sort.
                      # using some self-ordering structure like
                      # btree/rbtree for TIDARRAY is also an option
OUTARRAY = ARRAY[1000]
FOR (TID, I) IN TIDARRAY:
    OUTARRAY[I] = FETCH_FROM_HEAP(TID)
RETURN OUTARRAY

This can probably not use the bitmap scan code, but has the nice property of doing the disk accesses in file order while still returning the result in index order. -- Hannu Krosing [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] bitmap scans, btree scans, and tid order
Jeffrey W. Baker [EMAIL PROTECTED] writes: On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote: This is a fallacy, and I think your concern is largely mistaken. Have you experimented with the cases you are worried about? Perhaps I have not stated the problem clearly. Believe me, I have experimented extensively with this problem. Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*? The current code is certainly capable of choosing either seqscan, bitmap scan, or traditional index scan depending on the given query. For example,

regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1)
   Recheck Cond: ((unique1 >= 100) AND (unique1 <= 1000))
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0) (actual time=4.522..4.522 rows=901 loops=1)
         Index Cond: ((unique1 >= 100) AND (unique1 <= 1000))
 Total runtime: 23.784 ms
(5 rows)

regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901 loops=1)
   Index Cond: ((unique2 >= 100) AND (unique2 <= 1000))
 Total runtime: 20.331 ms
(3 rows)

(The reason these apparently-identical queries get different plans is that the unique2 column is physically ordered, so the plain indexscan gets a very high correlation discount.) There are enable_foo switches to let you force selection of any one of the three ways for testing purposes. It's also possible that changing the btree scan to work in groups of tuples instead of single tuples would make more sense, which is why I ventured two different solutions to the problem. My feeling is that that would add a lot of complexity for dubious win. 
The reason it's dubious is that it would penalize some cases, for instance LIMIT-type queries where you aren't going to fetch many tuples anyway. I think that at least for the time being (8.1 time frame) we should leave traditional indexscans alone and concentrate on being sure we are getting the most we can out of the new bitmap scan code. Only after that dust has settled will we have hard facts with which to argue about whether compromise behaviors might be wins. I think the work that's most needed at the moment is to test the bitmap-scan cost model to see if it has much to do with reality... regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Returning the name of a primary key
On Mon, May 16, 2005 at 01:25:46PM -0500, Juan Pablo Espino wrote: I need to write a function that retrieve the name of at least one table primary key, if it exists. The only argument passed to the function is the table name. I have thought something like this: Why mess around with a C function, if you can do it in straight SQL? You can write a SQL function if need be. -- Alvaro Herrera (alvherre[a]surnet.cl) Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos / con todos los humanos acabaré (Bender) ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
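For what it's worth, the "straight SQL" approach could look something like the following catalog query (a sketch only: it is written against the pg_index/pg_attribute catalogs as documented, and 'mytable' is a placeholder table name, not anything from the original message):

```sql
-- Name(s) of the primary key column(s) of 'mytable', if any.
SELECT a.attname
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum = ANY (i.indkey)
WHERE i.indrelid = 'mytable'::regclass
  AND i.indisprimary;
```

This returns one row per primary key column, or no rows when the table has no primary key.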
Re: [HACKERS] bitmap scans, btree scans, and tid order
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote: Jeffrey W. Baker [EMAIL PROTECTED] writes: It's also possible that changing the btree scan to work in groups of tuples instead of single tuples would make more sense, which is why I ventured two different solutions to the problem. My feeling is that that would add a lot of complexity for dubious win. The reason it's dubious is that it would penalize some cases, for instance LIMIT-type queries where you aren't going to fetch many tuples anyway. I think that at least for the time being (8.1 time frame) we should leave traditional indexscans alone and concentrate on being sure we are getting the most we can out of the new bitmap scan code. Only after that dust has settled will we have hard facts with which to argue about whether compromise behaviors might be wins. I agree. I'll look at how my workload behaves with CVS code. I wasn't proposing this for 8.1 inclusion, and the TODO isn't marked for 8.1. I think the work that's most needed at the moment is to test the bitmap-scan cost model to see if it has much to do with reality... Alright. -jwb ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] pgFoundry
On Monday 16 May 2005 14:12, Robert Treat wrote: On Mon, 2005-05-16 at 12:50, Lamar Owen wrote: The only way in my mind to get this dynamism on the website is to make the website part of the process at some level. If someone has to go One idea I've tossed around is requiring patches to include release notes, and then display the release notes on the web site as a "done so far" type of list. It doesn't get you what is under active development, but would get you a more up-to-date picture of changes as a release evolves. Well, there is always the '-committers' list; if all CVS commits had/have meaningful commit changelog notices, then that could drive something. There are two things being talked about here: 1.) A forward-looking road map; 2.) A status indication of where development is happening, and a history of past development. In my mind 2 is more important than 1, for all the reasons Tom already mentioned. -- Lamar Owen Director of Information Technology Pisgah Astronomical Research Institute 1 PARI Drive Rosman, NC 28772 (828)862-5554 www.pari.edu ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pgFoundry
Robert Treat [EMAIL PROTECTED] writes: One idea I've tossed around is requiring patches to include release notes, and then display the release notes on the web site as a done so far type of list. It doesn't get you what is under active development, but would get you a more up-to-date picture of changes as a release evolves. We did do that (not very rigorously) during the 7.4 release cycle. I'm not sure why we fell out of the habit again for 8.0. It seems like a reasonable idea to me. I don't think I'd necessarily do it exactly the way it was done before though --- rather than keep the info in release.sgml, which is big and hard to edit already, maybe a separate plain-text file would be easier to work with. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Returning the name of a primary key
Juan Pablo Espino [EMAIL PROTECTED] writes: I need to write a function that retrieve the name of at least one table primary key, if it exists. The only argument passed to the function is the table name. I have thought something like this: You need to be searching the list of indexes, not the attributes per se. ATExecDropNotNull() might be a useful example. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pgFoundry
On Monday 16 May 2005 14:07, Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: This could in a sense be as simple as prioritising the TODO list. But the TODO list could certainly be made more informative without getting into that swamp. We need to prune the TODO list to make it more useful. This will be hard, this is true, but if a pruning discussion for each item on the list is held then the real interest in the item would stick out like a sore thumb. It's just too big and not really hierarchical. Are we too close to freeze and beta on 8.1 to have this sort of discussion? It doesn't need to be discussed close to a beta cycle, IMO, or it could easily turn into the huge distraction Tom speaks of. -- Lamar Owen Director of Information Technology Pisgah Astronomical Research Institute 1 PARI Drive Rosman, NC 28772 (828)862-5554 www.pari.edu ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Cost of XLogInsert CRC calculations
On Mon, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: This now gives the following (correct) result on both platforms (Win32: 1.8GHz P4, WinXP; Linux: 2.8GHz Xeon, FC1):

Win32 UINT64: 0x782104059a01660 (crc0)                ~158us
Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0)    ~58us
FC1   UINT64: 0x782104059a01660 (crc0)                ~76us
FC1   UINT32: 0x78210405 (crc1), 0x59a01660 (crc0)    ~29us

Note that we still find that using the INT64_IS_BUSTED code is over 100% quicker than the UINT64 code for CRC64 calculation, and I believe it is not being used by default under Linux or Win32 for 32-bit platforms. Of course, Tom's suggestion of going for CRC32 across the board would hopefully solve this entirely and bring the times down a little further too. I think perhaps that the difference in hardware is the reason for the difference in elapsed time, not the OS. The performance gain is disturbing. I think it's supposed to be the other way around, isn't it? Like having INT64 is supposed to be good... Perhaps the BIOS on your systems don't correctly support 64-bit, so mimicking it costs more. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] keepalive
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all, this weekend our DB server crashed ( 7.4.x ) and the DB engine was relocated on another server. So far so well. Unfortunately most of our clients are still on a recv() trying to receive data from the old DB engine instance:

# netstat -anp | grep ESTABLISHED | grep 54497
tcp   0   0   193.251.135.46:54497   10.10.50.47:5432   ESTABLISHED   27331/psql

# ps -eafwww | grep 27331
apache   27331 27327  0 May13 ?   00:00:00 psql -t -c SELECT is_authorized_flag FROM v_ts_quota WHERE login='*' -h x.y.z.w yyy

as you can see this connection has been there for 3 days now. Digging on it I see these sockets are not using the keepalive options:

# netstat -anop | grep ESTABLISHED | grep 54497
tcp   0   0   193.251.135.46:54497   10.10.50.47:5432   ESTABLISHED   27331/psql   off (0.00/0/0)

normally on other connections I see:

tcp   0   0   127.0.0.1:199   127.0.0.1:32784   ESTABLISHED   1255/snmpd   keepalive (7102.52/0/0)

The same happens with JDBC connections ( the keepalive missing ). Is there a reason for this, or a way to instruct psql to use the keepalive option? Regards Gaetano Mendola

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCiJNF7UpzwH2SGd4RAh8aAJ0WOBEzzjYf1gj1OaFGsFBE9mr4hgCfUhna
D1F420Pa94lvrA04xA73tiE=
=muzn
-----END PGP SIGNATURE-----
---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Cost of XLogInsert CRC calculations
On E, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: -Original Message- From: Mark Cave-Ayland [mailto:[EMAIL PROTECTED] Sent: 16 May 2005 09:04 To: 'Simon Riggs' Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian'; 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations Win32 UINT64: 0x782104059a01660 (crc0) ~158us Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~58us FC1 UINT64: 0x782104059a01660 (crc0) ~76us FC1 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~29us It would be interesting to see how much of the time taken is the actual algorithm and how much is getting at the data (i.e. the fact that the page has to go through the CPU at all, instead of DMA'ing it to disk). In your test setup you probably have the whole thing in CPU cache already, but still it would be interesting to time the same thing with some simple algorithm, like XOR over the whole 8k page, for comparison. -- Hannu Krosing [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Best way to scan on-disk bitmaps
About the page splitting algorithm in GiST in the multikey case. To begin with, the page is split by calling the pickSplit method of the key of the first column (the pickSplit method is defined by the opclass and is a user function); then it tries to find equal values of the first column in the left and right pages ( gist.c lines 1264-1901 ). If there are any, the GiST core will try to re-sort the tuples with equal first keys between the left and right pages, using the penalty method for the second and higher columns' keys. If not, it leaves the pages untouched. But unions of the second and higher columns' keys for the parent page will be formed in any case. So, if an index is defined as 'using gist (a,b,c)' then, in principle, the GiST index can speed up queries like 'a=V1 and c=V2'. But it will not be usable for queries like ( b=V3 and c=V2 ). By the way, instead of the '=' operation other operations may be used. The number of operations supported by GiST is indefinite, unlike, for example, btree, which supports only five: <, <=, =, >=, >. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] bitmap scans, btree scans, and tid order
Tom Lane [EMAIL PROTECTED] writes: Jeffrey W. Baker [EMAIL PROTECTED] writes: I see that Tom has already done the infrastructure work by adding getmulti, but getmulti isn't used by nodeIndexscan.c, only nodeBitmapIndexscan.c. Will btree index scans be executed by creating in-memory bitmaps in 8.1, or will some scans still be executed the usual way? We aren't going to remove the existing indexscan behavior, because bitmap scans lose the ordering of the underlying index. There are many situations where that ordering is important. (See for instance the recent changes to make MAX/MIN use that behavior.) Hm. There are other circumstances where the ordering doesn't matter. When there's another unrelated ORDER BY clause or merge join wrapped around the index scan, for example. This suggests one new 8.1 optimization strategy may be to add strategic no-op OR clauses to cause 8.1 to use a bitmapOr node. For example, something like this, where flag isn't very selective (say 25%), might run more slowly than a sequential scan because of the random access pattern:

SELECT id, name, flag FROM tab WHERE flag ORDER BY name

But adding a no-op bitmapOr node like:

SELECT id, name, flag FROM tab WHERE flag OR indexed_never_true_flag ORDER BY name

might run faster, perhaps even more quickly than the sequential scan, because the bitmap avoids the random access pattern but doesn't have to read the whole table. -- greg ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] bitmap scans, btree scans, and tid order
On 5/16/05, Tom Lane [EMAIL PROTECTED] wrote:

regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1)
   Recheck Cond: ((unique1 >= 100) AND (unique1 <= 1000))
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0) (actual time=4.522..4.522 rows=901 loops=1)
         Index Cond: ((unique1 >= 100) AND (unique1 <= 1000))
 Total runtime: 23.784 ms
(5 rows)

regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901 loops=1)
   Index Cond: ((unique2 >= 100) AND (unique2 <= 1000))
 Total runtime: 20.331 ms
(3 rows)

Tom (or anyone with some round tuits and CVS-tip savvy), if you have a chance at some point would you post the non-bitmap version of the query for tenk2 from above? I'd be very interested to see if the bitmap-forced TID-order fetch actually does help, or if it's outweighed by the bitmap setup overhead. TIA -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[HACKERS] SQL Request Size
What is the maximum SQL statement length that the server will accept? The libpq message length identifier is 4 bytes, which indicates that the max length is 4GB. But that's not exactly the same thing... Most other systems have a SQL request size limit much smaller than this, though I can't find a reference to this. Thanks, Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Cost of XLogInsert CRC calculations
Mark Cave-Ayland [EMAIL PROTECTED] writes: Sigh, so it would help if I had added the offset to the data pointer... ;) Would you post the corrected program so people can try it on a few other architectures? No point in reinventing the wheel, even if it is a pretty trivial wheel. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] bitmap scans, btree scans, and tid order
Jeffrey Baker [EMAIL PROTECTED] writes: Change the planner/executor to use the bitmap scan in all cases where index order is unimportant. From my reading of the current code, the bitmap scan is only used in case of a join. This is a fallacy, and I think your concern is largely mistaken. Have you experimented with the cases you are worried about? It's entirely possible that the current cost model needs adjustment to make the planner pick the bitmap scan in more cases. However, it is easy to demonstrate (either by thought-experiment or actual trial) that a bitmap scan isn't necessarily a win. The overhead of maintaining the bitmap is far from zero, and you don't get anything unless the resulting pattern of heap page fetches is significantly cleaner than before. Therefore, a patch that eliminates cost-estimation in favor of trying to force one choice or the other is definitely not likely to be received favorably ... regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] SQL Request Size
On E, 2005-05-16 at 21:18 +0100, Simon Riggs wrote: What is the maximum SQL statement length that the server will accept? The libpq message length identifier is 4 bytes, which indicates that the max length is 4GB. But that's not exactly the same thing... Most other systems have a SQL request size limit much smaller than this, though I can't find reference to this. I've had problems with a query generated by slony that was a few hundred kilobytes in size. It gave me a "query too complex" error, probably an overflow somewhere in the planner/optimiser. -- Hannu Krosing [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] performance of bitmap scans in nested loop joins
I wrote: ... Where we are losing is mostly on the actual manipulation of the bitmaps (particularly hash_seq_search which is done in tbm_begin_iterate; and it looks like memory allocation for the bitmap hashtables is nontrivial too). I had already had a TODO item to look into speeding up hash_seq_search ... will see what I can find. I got around to looking at this some more. It seems that the basic problem is that dynahash.c isn't optimized for very small hashtables. The test case I'm looking at (which may be an oversimplification of Sergey's problem) never has more than one entry in the hashtables it creates for in-memory bitmaps, and so the hashtable overhead is pretty significant. Particularly the overhead of creating and deleting 'em. The other uses of dynahash.c that are at all performance-critical seem to deal with hashtables that are (a) fairly large and (b) long-lived. So I'm thinking that hacking dynahash.c itself to improve this situation would probably be counterproductive overall. An idea that comes to mind is to allow tidbitmap.c to handle the cases of zero or one page entry directly, storing the page entry in a simple field of the TIDBitmap struct. Only when the bitmap needs entries for more than one heap page will we create a subsidiary hash table. (Note that we could store bits for more than one tuple, if they happen to be on the same heap page.) This would uglify the logic in tidbitmap.c a bit, but it doesn't seem completely preposterous. The main objection I can see is that it would only optimize the zero-or-one-page cases, which might be insufficient. Anyone have a better idea for speeding up operations on small bitmaps? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] SQL Request Size
Simon Riggs [EMAIL PROTECTED] writes: What is the maximum SQL statement length that the server will accept? There is no fixed limit, short of where you start to overflow memory and/or stack depth. http://archives.postgresql.org/pgsql-general/2001-02/msg00776.php regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] SO_KEEPALIVE
On E, 2005-05-16 at 19:22 +0200, Dennis Bjorklund wrote: On Mon, 16 May 2005, Tom Lane wrote: How come we don't set SO_KEEPALIVE in libpq? Is there any reason why we wouldn't want it on? Is there any reason we *would* want it on? The server-side keepalive should be sufficient to get whatever useful impact it might have. Wouldn't the client also want to know that the server is not there anymore? I talked to Gaetano Mendola (I think, but you never know on irc :-) and he had some clients that had been hanging around for 3 days after the server had been down and later up again (stuck in recv). Being stuck in recv is a symptom of a reconnect bug where libpq first tries to test for an SSL connection but the connection has already gone away. (search for [HACKERS] oldish libpq bug still in RC2 in the lists) Tom fixed it in no time once I showed him where to look and provided a test case. It should be fixed in 8.0. I don't know if the fix was backported to older libpq versions as well. Server-side keepalive is enough for the server to clean up when a client disappears, but it does nothing to help clients detect that the server is gone. So I don't see what server-side keepalive has to do with it. -- Hannu Krosing [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Best way to scan on-disk bitmaps
If people have GIST TODOs, please post them. --- Teodor Sigaev wrote: About page splitting algorithm in GiST in multikey case. For the beginning, page is splitted by calling pickSplit method of key of first column (pickSplit method is defined for opclass and it is a user function), then it try to find equal values of first column in left and right pages ( gist.c lines 1264-1901 ). If it is, then GiST core will try to resort tuples with first equal keys between left and right pages using penalty method for second and higher column's key. If it's not, it leave pages untouched. But unions for parent page of second and higher column's keys will be formed. So, if index is defined as 'using gist (a,b,c)' then, in principle, GiST index can speed up queries like 'a=V1 and c=V2'. But it will not usable for queries ( b=V3 and c=V2 ). By the way, instead of '=' operation may be used other operations. Number of supported operations by GiST is indefinite unlike, for example, btree which supported only five: <, <=, =, >=, >. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pgFoundry
What projects have big roadmaps? The only one I can think of is Mozilla, and we all know how well that worked (aka Firefox). In fact, you could argue that the Mozilla focus on the roadmap blinded them to focusing on user needs, and made them obsolete. I have modified the TODO HTML so the completed items are in italics. --- Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: I don't think anybody is arguing for a radical change in culture - certainly I would not be so presumptuous after only a couple of years :-) But a roadmap could be useful in many ways. It need not tie anybody down, if positioned right, but can help people to see where things are going, and where the gaps are. This could in a sense be as simple as prioritising the TODO list. I think that even getting that done would turn into a flamew^H^H^H^Hhuge distraction. The way things really work around here is that individual developers have their own priorities and they work on what seems most important to them at the time. (In some cases those priorities may be set by their companies more than by the individuals, but that's irrelevant from the community's perspective.) ISTM any sort of project-wide prioritization would be either (1) meaningless or (2) a guaranteed-to-fail attempt to assert control over other contributors. But the TODO list could certainly be made more informative without getting into that swamp. I don't think it does very well at conveying the relative sizes of the work items, nor the extent to which there is consensus about how particular problems ought to be solved (the fact that something is on TODO does not necessarily mean that all the major contributors have bought into it...). And of course you're right that it tells nothing at all about whether progress is currently being made on a given item. The markers indicating that someone has expressed interest in an item don't mean they are actively doing anything with it. 
The real difficulty here is exactly what Lamar noted: who's going to do the work? Bruce seems to be swamped already, so we'd need a new volunteer to maintain a more useful TODO list, and there doesn't seem to be anyone who wants to do it and has the depth of familiarity with the project to do a good job of it. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Best way to scan on-disk bitmaps
Bruce Momjian wrote: If people have GIST TODOs, please post them. Concurrency :) ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] SQL99 hierarchical queries stalled
On Mon, May 16, 2005 at 03:09:18PM -0700, [EMAIL PROTECTED] wrote: David-- My boss has given me approval to put up to 8 hours per week of SourceLabs' time in on the SQL99 hierarchical query implementation. That's great! :) (I'm free, of course, to supplement this with whatever of my own time I can spare.) I'm willing to take on the work. What's the next step? I suppose the first thing would be to look over the patches I mentioned and the SQL:2003 specification, then put together a preliminary patch and send it to -hackers. Cheers, David. -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 510 893 6100 mobile: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
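For concreteness, this is the kind of SQL:2003 recursive query the implementation would target (the `employees` table and its columns are illustrative, not from the thread):

```sql
-- Find employee 1 and everyone below them in the reporting chain.
WITH RECURSIVE subordinates AS (
    SELECT emp_id, manager_id         -- anchor: the starting employee
      FROM employees
     WHERE emp_id = 1
    UNION ALL
    SELECT e.emp_id, e.manager_id     -- recursive step: direct reports
      FROM employees e
      JOIN subordinates s ON e.manager_id = s.emp_id
)
SELECT * FROM subordinates;
```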
Re: [HACKERS] SO_KEEPALIVE
Hannu Krosing [EMAIL PROTECTED] writes: On E, 2005-05-16 at 19:22 +0200, Dennis Bjorklund wrote: Wouldn't the client also want to know that the server is not there anymore? I talked to Gaetano Mendola (I think, but you never know on irc :-) and he had some clients that had been hanging around for 3 days after the server had gone down and later come up again (stuck in recv). Stuck in recv is a symptom of a reconnect bug where libpq first tries to test for an SSL connection but the connection has already gone away. (Search for [HACKERS] oldish libpq bug still in RC2 in the lists.) Tom fixed it in no time once I showed him where to look and provided a test case. It should be fixed in 8.0; I don't know if the fix was backported to older libpq versions as well. It was not ... but I'm not convinced that that bug explains Gaetano's problem. If you'll recall, that bug caused libpq to get into a tight loop chewing CPU. It should be pretty easy to tell the difference between that and sitting idle because there is nothing happening. On the other hand, it seems to me a client-side SO_KEEPALIVE would only be interesting for completely passive clients (perhaps one that sits waiting for NOTIFY messages?) A normal client will try to issue some kind of database command once in a while, and as soon as that happens, there is a reasonably short timeout before connection failure is reported. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
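For reference, enabling keepalive on the client side is a one-line setsockopt() call. This is not libpq code, just a minimal POSIX sketch of what a hypothetical client-side option would do:

```c
/* Minimal sketch: enable TCP keepalive on a connected client socket.
 * With keepalive on, the kernel probes an idle connection periodically,
 * so a dead peer eventually surfaces as a socket error instead of the
 * client hanging forever in recv(). */
#include <sys/socket.h>
#include <sys/types.h>

int
enable_keepalive(int sockfd)
{
    int on = 1;

    return setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE,
                      &on, sizeof(on));
}
```

Note that the default probe interval is typically around two hours, and tuning it (TCP_KEEPIDLE and friends) is platform-specific, so this alone would not have detected Gaetano's dead server quickly.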
Re: [HACKERS] Best way to scan on-disk bitmaps
On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote: Bruce Momjian wrote: If people have GIST TODOs, please post them. Concurrency :) And WAL support. -- Alvaro Herrera (alvherre[a]surnet.cl) No necesitamos banderas No reconocemos fronteras (Jorge González) ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Best way to scan on-disk bitmaps
Alvaro Herrera wrote: On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote: Bruce Momjian wrote: If people have GIST TODOs, please post them. Concurrency :) And WAL support. Already there: * Add WAL index reliability improvement to non-btree indexes and this too: * Add concurrency to GIST -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] SQL99 hierarchical queries stalled
David Fetter [EMAIL PROTECTED] writes: What's the next step? I suppose the first thing would be to look over the patches I mentioned and the SQL:2003 specification, then put together a preliminary patch and send it to -hackers. You can get useful feedback long before having anything that looks like a patch --- and I'd encourage you to do so. Send us design notes, rough data structures, etc. Quite honestly, hierarchical queries aren't the easiest thing in the world and wouldn't be my recommendation of the first ... or even the second ... backend hacking project for a new Postgres hacker to tackle. If that's where you feel you must start, OK, but try to get as much feedback as you can, sooner rather than later. I seem to recall some discussion of how to do this in the past; have you trolled the pghackers archives? regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] SO_KEEPALIVE
On Mon, 16 May 2005, Tom Lane wrote: On the other hand, it seems to me a client-side SO_KEEPALIVE would only be interesting for completely passive clients (perhaps one that sits waiting for NOTIFY messages?) A normal client will try to issue some kind of database command once in a while At least some of the clients were psql. -- /Dennis Björklund ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Returning the name of a primary key
On Mon, May 16, 2005 at 02:35:03PM -0400, Alvaro Herrera wrote: On Mon, May 16, 2005 at 01:25:46PM -0500, Juan Pablo Espino wrote: I need to write a function that retrieves the name of a table's primary key, if it exists. The only argument passed to the function is the table name. I have thought something like this: Why mess around with a C function, if you can do it in straight SQL? You can write a SQL function if need be. http://lnk.nu/cvs.pgfoundry.org/2op.sql is an example of how to get the info in pure SQL. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
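As a sketch of the pure-SQL approach (the linked script is the authoritative example; `mytable` here is a placeholder), the primary key's constraint name can be read straight from the system catalogs:

```sql
-- Name of the primary key constraint for a given table, if any.
SELECT c.conname AS pkey_name
  FROM pg_constraint c
  JOIN pg_class t ON t.oid = c.conrelid
 WHERE c.contype = 'p'        -- 'p' = primary key constraint
   AND t.relname = 'mytable';
```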
Learning curves and such (was Re: [HACKERS] pgFoundry)
Lamar Owen [EMAIL PROTECTED] writes: To put it much more bluntly: PostgreSQL development (both the process and the codebase) has one of the steepest learning curves around, The backend hacking curve is certainly steep, but I wonder whether the problem isn't largely one of people biting off more than they can chew. I got my start by hacking some things in libpq, which is way more self-contained than any aspect of the backend; and then started hacking relatively small backend stuff. I think the reason I know as much as I do today is that I've always been willing to investigate minor bugs. No one of them was all *that* exciting, but over time I've had the opportunity to study a lot of the backend code. Nearby you can watch Neil Conway bootstrapping himself by doing minor code beautification projects --- again not that exciting in itself, but useful, and in any case the *real* reason he's doing it is to learn the backend code in general. (Right, Neil?) As against that I notice some new arrivals proposing to add deductive reasoning to Postgres: http://archives.postgresql.org/pgsql-hackers/2005-05/msg01045.php or implement SQL99 hierarchical queries: http://archives.postgresql.org/pgsql-hackers/2005-05/msg01089.php I might be wrong, but I'll bet lunch that neither of those projects will come to anything. You can't run before you learn to crawl. Maybe what we need is some documentation about how to get started as a Postgres hacker --- what to read, what sort of things to tackle for your first hack, etc. I think the people who have been successful around here are the ones who have managed to figure out the syllabus by themselves ... but surely we could try to teach those who come after. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: Learning curves and such (was Re: [HACKERS] pgFoundry)
Lamar, To put it much more bluntly: PostgreSQL development (both the process and the codebase) has one of the steepest learning curves around, You haven't looked at the OpenOffice.org code. wince -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: Learning curves and such (was Re: [HACKERS] pgFoundry)
On Tue, May 17, 2005 at 01:32:03AM -0400, Tom Lane wrote: Maybe what we need is some documentation about how to get started as a Postgres hacker --- what to read, what sort of things to tackle for your first hack, etc. I think the people who have been successful around here are the ones who have managed to figure out the syllabus by themselves ... but surely we could try to teach those who come after. Didn't such a document exist at one point? I seem to remember reading about it on the old website... I think it might also be valuable to have a list of things that are good 'starter projects'. Maybe some indication of TODO items that are simpler. Possibly a list of bugs, too. It would probably also be useful to point out ways people can help that don't involve hacking C code (or at least not much). Things like docs, the website, advocacy, performance testing, etc. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly