Re: [HACKERS] Inconsistent behavior on Array Is Null?
Greg Stark [EMAIL PROTECTED] writes:
> array_lower() and array_upper() are returning NULL for a non-null input, the empty array, even though lower and upper bounds are known just as well as they are for any other sized array. They are behaving as if there's something unknown about the empty array that makes it hard to provide a lower bound or upper bound.

Hm. My idea of the index lower bound is the smallest index for which there is an array member --- so I agree with Joe that it's not very well defined for an empty array. We could pick some arbitrary value, such as zero, for the LB and UB to be reported for an empty array, but I can't see any way that you could justify them except as arbitrary choices.

I think that maybe we ought to question these two properties:
* empty array is different from NULL ... really? Why?
* storing a value into an element of a NULL array yields a NULL array instead of a singleton array.

IIRC the latter is presently true because we couldn't figure out just what dimensionality to assign, but it might be easier to agree on that than to resolve these other arguments...

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pre-loading a user table.
Why do you think it useful to preload something during InitPostgres, anyway? Any heavily used table will certainly be present in shared buffers already, and even more surely present in kernel buffers. And if you really want it preloaded you can issue dummy selects with a client right after startup. I really think locking a table into memory is a worthless feature if a good buffer manager is at work.

Andreas
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Tom Lane [EMAIL PROTECTED] writes:
> Hm. My idea of the index lower bound is the smallest index for which there is an array member --- so I agree with Joe that it's not very well defined for an empty array.

Hm. The problem is that they have other implications. Like the upper bound is one less than the index an element added to the upper end will get, and similarly the lower bound is one more than the index that would be assigned to an element added on the low end. Currently there is a lower bound and upper bound in the implementation even for empty arrays. I have empty arrays in my table that have a lower bound of 0, and they behave slightly differently than arrays with lower bounds of 1.

> I think that maybe we ought to question these two properties:
> * empty array is different from NULL ... really? Why?
> * storing a value into an element of a NULL array yields a NULL array instead of a singleton array.

Well that breaks other things. Then lots of functions have to become non-strict to work properly because they should have valid output when passed null values. Ick.

I'm leaning towards suggesting that postgres should follow sql-99 here and normalize all array indexes to have a lower bound of 1. Then array_lower and array_upper become entirely unnecessary. Instead we just have array_length which is exactly equivalent to my idea of array_upper.

-- greg
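[Editorial aside] The bound semantics Greg describes can be restated as a toy Python model. This is purely illustrative; the class and method names are invented, not PostgreSQL code:

```python
# Toy model of the semantics above: appending assigns index upper+1,
# prepending assigns index lower-1, and length == upper - lower + 1.

class PgArray:
    def __init__(self, elems, lower=1):
        self.elems = list(elems)
        self._lower = lower

    def lower(self):
        return self._lower

    def upper(self):
        return self._lower + len(self.elems) - 1

    def append(self, x):
        # the new element lands at index upper() + 1
        self.elems.append(x)

    def prepend(self, x):
        # the new element lands at index lower() - 1
        self.elems.insert(0, x)
        self._lower -= 1

a = PgArray([10, 20], lower=1)   # indexes 1..2
a.append(30)                     # now indexes 1..3
a.prepend(5)                     # now indexes 0..3, like Greg's lower-bound-0 arrays
```

Note that in this model an empty array still reports bounds (upper comes out one less than lower), which is exactly the irregularity being debated: PostgreSQL instead returns NULL for both.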
[HACKERS] GiST future
Hi everyone. I've played a bit with contrib/intarray and very much liked being able to index a whole array. Now I wonder why 7.4 array queries using SOME/ANY/ALL don't make use of this kind of indexes. Is there any plan to integrate Oleg's work into the core as GiST indexes were? It would be nice to generalize things like SELECT col1 from table where col2 contains (select col3 from table2 where ...) and use those indexes. Or did I miss something in the docs? What does the core team think?

Best regards
-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work) 6, Chemin d'Harraud Turrou +33-5-61-50-97-01 (Fax) 31190 AUTERIVE +33-6-07-63-80-64 (GSM) FRANCE Email: [EMAIL PROTECTED]
-- Make your life a dream, make your dream a reality. (St Exupery)
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Greg Stark [EMAIL PROTECTED] writes:
> Well that breaks other things. Then lots of functions have to become non-strict to work properly because they should have valid output when passed null values. Ick. ... I'm leaning towards suggesting that postgres should follow sql-99 here and normalize all array indexes to have a lower bound of 1.

That would break even more things, no? On the other hand, it'd get rid of the problem that we presently face with dump/restore of arrays that don't have lower bound 1. Because pg_dump doesn't do anything to mark such values, they'll end up with lower bound 1 after reload anyway. The fact that we haven't heard lots of squawks about that suggests to me that not many people are using such arrays at present ...

regards, tom lane
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Greg Stark wrote:
> array_lower() and array_upper() are returning NULL for a non-null input, the empty array, even though lower and upper bounds are known just as well as they are for any other sized array. They are behaving as if there's something unknown about the empty array that makes it hard to provide a lower bound or upper bound.

Sorry, but I still disagree. There *is* something unknown about the lower and upper bound of an empty array because there are no bounds.

> I know it's possible to work around this special case, but I'm saying it's odd to have an irregularity in the interface. What justification is there for breaking the invariant length = upper-lower+1 ?

I don't see the spec-defined CARDINALITY as a workaround. It defines length as the number of elements in the array. When the array is empty, that value is clearly 0. Nothing strange about it.

> Yes I read the examples you gave, but you a) had to work around the nit with a special case in your function and b) still have corner cases where one of the invariants I named fails, namely:
>
> test=# select array_upper(a||b, 1), array_upper(a,1)+array_length(b)
>        from (select '{}'::int[] as a, array[1,2] as b) as x;
>  array_upper | ?column?
> -------------+----------
>            2 |
> (1 row)

OK, you got me with this corner case. But using what you described as the result int_aggregate would give you in this case (-1), you would get an even stranger answer (-1 + 2 = 1) that would still need to be worked around.

Joe
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Tom Lane wrote:
> I think that maybe we ought to question these two properties:
> * empty array is different from NULL ... really? Why?

I think this makes sense, similar to the difference between '' and NULL.

> * storing a value into an element of a NULL array yields a NULL array instead of a singleton array.

Same argument. If you think of text as an array of characters, you get this analogy (sort of):

regression=# create table s1(f1 int, f2 text);
CREATE TABLE
regression=# insert into s1 values(1, null);
INSERT 164679 1
regression=# select f1, substr(f2, 1, 1) is null from s1;
 f1 | ?column?
----+----------
  1 | t
(1 row)

regression=# update s1 set f2 = 'a' || substr(f2, 2);
UPDATE 1
regression=# select f1, substr(f2, 1, 1) is null from s1;
 f1 | ?column?
----+----------
  1 | t
(1 row)

Joe
Re: [HACKERS] Problems Vacuum'ing
I had written:
> [snip] The problem is that attempts to vacuum these tables resulted in "NNN dead row versions cannot be removed yet." Went through a lot of analysis (e.g.: Any hanging txns?) and trying different things with folks on the #PostgreSQL IRC channel, all to no avail. [snip]

Okay, the mystery is *partially* solved. In IRC, when this was brought up again this morning due to my post to -hackers, two things happened: I did a vacuumdb on one of the tables and, much to my surprise, what wouldn't vacuum before I left work last nite *did* work this morning. And... JanniCash discovered this in utils/time/tqual.c (I believe):

 * OldestXmin is a cutoff XID (obtained from GetOldestXmin()). Tuples
 * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
 * still be visible to some open transaction, so we can't remove them,
 * even if we see that the deleting transaction has committed.

So the next thing I did was run a bunch of updates, quit the script, then ran a while loop from the (Unix) command-line, trying to vacuum the one table every 30 seconds. To summarize...

Fri Apr 2 08:54:54 EST 2004
INFO: ethers: found 0 removable, 1834 nonremovable row versions in 93 pages
DETAIL: 1466 dead row versions cannot be removed yet.
...
Fri Apr 2 08:58:56 EST 2004
INFO: ethers: found 1466 removable, 368 nonremovable row versions in 93 pages
DETAIL: 0 dead row versions cannot be removed yet.

Which is all well-and-good (tho, my ignorance readily conceded, four minutes seems a mite... long), *except*: If I shut down the WebObjects application which, again, never accesses the db in question, much less any of its tables, this time-out doesn't seem to apply. (I tried it.) Any explanation for this behaviour?
Thanks, Jim
-- 
Jim Seymour | PGP Public Key available at: [EMAIL PROTECTED] | http://www.uk.pgp.net/pgpnet/pks-commands.html http://jimsun.LinxNet.com
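[Editorial aside] The tqual.c rule quoted in Jim's message can be restated as a small Python predicate. Hypothetical names only, not the PostgreSQL source:

```python
# Toy restatement of the rule: a deleted tuple is reclaimable only if
# its deleting transaction committed AND its XID is older than
# OldestXmin (the oldest XID any open transaction might still see).

def removable(deleter_xid, deleter_committed, oldest_xmin):
    """Return True if VACUUM may reclaim this dead tuple."""
    if deleter_xid is None or not deleter_committed:
        return False            # never deleted, or deleter aborted/running
    # XIDs >= OldestXmin are "recently dead": some open snapshot
    # might still see the tuple, so it must be kept.
    return deleter_xid < oldest_xmin

# A long-lived open transaction pins OldestXmin at its own XID, which
# is why rows deleted after it started stay nonremovable until that
# backend commits or rolls back.
```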
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Tom Lane wrote:
> Greg Stark [EMAIL PROTECTED] writes:
>> I'm leaning towards suggesting that postgres should follow sql-99 here and normalize all array indexes to have a lower bound of 1.
>
> That would break even more things, no? On the other hand, it'd get rid of the problem that we presently face with dump/restore of arrays that don't have lower bound 1. Because pg_dump doesn't do anything to mark such values, they'll end up with lower bound 1 after reload anyway. The fact that we haven't heard lots of squawks about that suggests to me that not many people are using such arrays at present ...

The more I think about it, the more I like it. Does everyone else agree that a lower bound of 1 complies with the spec?

Joe
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Greg Stark wrote:
> I'm leaning towards suggesting that postgres should follow sql-99 here and normalize all array indexes to have a lower bound of 1. Then array_lower and array_upper become entirely unnecessary. Instead we just have array_length which is exactly equivalent to my idea of array_upper.

Now we finally have something to agree on ;-) I do think this is the way to go, but it is a significant hit to backward compatibility. Same is true for supporting NULL elements of arrays -- maybe we should bite the bullet and make both changes at the same time?

Joe
Re: [HACKERS] PITR for replication?
Andrew,

> I may be completely missing the point here, but it looks to me as though the PITR archival mechanism is also most of a native replication facility. Is there any reason this couldn't be extended to replication, and if so, is anyone planning on using it as such?

I believe that Command Prompt's Mammoth Replicator works on something like log shipping. There are both advantages and disadvantages to the approach, meaning that we will still need other replication strategies, like Slony and DBMirror. Also, the jump from PITR -> Log Shipping is not a trivial one.

-- Josh Berkus Aglio Database Solutions San Francisco
[HACKERS] Prepared select
How can I use a prepared select statement as mentioned in the documentation on SQL PREPARE? Preparing the statement is easy; the problem is using the plan to get a cursor. My assumption is that the SQL OPEN command is not documented, or there is some other libpq API to make this happen. Thanks
Re: [HACKERS] Better support for whole-row operations and composite
Tom Lane wrote:
> We have a number of issues revolving around the fact that composite types (row types) aren't first-class objects. I think it's past time to fix that. Here are some notes about doing it. I am not sure all these ideas are fully-baked ... comments appreciated.

[Sorry for the delay in responding] Nice work, and in general it makes sense to me. A few comments below.

> We will be able to make generic I/O routines for composite types, comparable to those used now for arrays. Not sure what a convenient external format would look like. (Possibly use the same conventions as for a 1-D array?)

So you mean like an array, but with possibly mixed datatypes? '{1 , abc def, 2.3}' Seems to make sense. Another option might be to use the ROW keyword, something like: ROW[1 , 'abc', 2.3]

> We could also think about allowing functions that are declared as accepting RECORD (ie, polymorphic-across-row-types functions). They would use the same methods already used by polymorphic functions to find out the true types of their inputs. (Might be best to invent a separate pseudotype, say ANYRECORD, rather than overloading RECORD for this purpose.)

Check. I really like this idea.

> TupleDescGetSlot: no-op, returns NULL
> TupleGetDatum: ignore slot, return tuple t_data pointer as datum
>
> This will work because heap_formtuple and BuildTupleFromCStrings can return a HeapTuple whose t_data part is already a valid row Datum, simply by setting the appropriate length and type fields in it. (If the tuple is ever stored to disk as a regular table row, these fields will be overwritten with xmin/cmin info at that time.)

Is this the way you did things in your recent commit?

> To convert a row Datum into something that can be passed to heap_getattr, one could use a local variable of type HeapTupleData and set its t_data field to the datum's pointer value. t_len is copied from the datum contents, while the other fields of HeapTupleData can just be set to zeroes.
I think I understand this, but an example would help.

> * We have to be able to re-use an already-existing cache entry if it matches a requested TupleDesc.

For anonymous record types, how will that lookup be done efficiently? Can the hash key be an array of attribute oids?

> If an ALTER TABLE command does something that requires examining or changing every row of a table, it would presumably have to do the same to all entries in any composite-type column of the table's rowtype. To avoid surprises and interesting debates about who has permissions to do this, it might be wise to restrict on-disk composite columns to be only of standalone composite types (ie, those made with CREATE TYPE AS). This restriction would also avoid debates about whether table constraints apply to composite-type columns.

I agree. As an aside, it would be quite useful to have support for arrays of tuples. Any idea on how to do that without needing to define an explicit array type for each tuple type?

Joe
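[Editorial aside] The lookup Joe asks about (caching anonymous record types keyed on their attribute type OIDs) can be sketched in Python. All names here are invented for illustration; the real typcache must also verify full TupleDesc equality (typmods and so on) rather than trust the key alone:

```python
# Sketch: cache anonymous record types keyed on the ordered tuple of
# attribute type OIDs, handing out a fresh "anonymous typmod" for each
# never-before-seen shape.

record_type_cache = {}

def lookup_record_type(attr_type_oids):
    key = tuple(attr_type_oids)            # hashable and order-sensitive
    entry = record_type_cache.get(key)
    if entry is None:
        entry = {"typmod": len(record_type_cache), "atttypids": key}
        record_type_cache[key] = entry
    return entry

t1 = lookup_record_type([23, 25, 701])     # e.g. (int4, text, float8)
t2 = lookup_record_type([23, 25, 701])     # same shape reuses the entry
t3 = lookup_record_type([25, 23, 701])     # different order -> new entry
```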
Re: [HACKERS] PITR for replication?
On Fri, Apr 02, 2004 at 08:47:09AM -0800, Josh Berkus wrote:
> I believe that Command Prompt's Mammoth Replicator works on something like log shipping. There are both advantages and disadvantages to the approach, meaning that we will still need other replication strategies, like Slony and DBMirror.

I wonder what would you use DBMirror for once Slony-I is complete?

-- Alvaro Herrera (alvherre[a]dcc.uchile.cl) "El destino baraja y nosotros jugamos" (A. Schopenhauer)
Re: [HACKERS] PITR for replication?
Greg Stark wrote:
>> J. Andrew Rogers [EMAIL PROTECTED] writes: I may be completely missing the point here, but it looks to me as though the PITR archival mechanism is also most of a native replication facility. Is there any reason this couldn't be extended to replication, and if so, is anyone planning on using it as such?
>
> My memory is fuzzy on this point, but I seem to recall that this is (was?) how replication is more-or-less done for many of the big commercial RDBMS.

You're right... it is the basis of a log shipping replication facility. I'm keen not to get too far ahead of ourselves. Let's eat the mammoth one bite at a time... or one patch at a time.

> Well, replication is one of those things that means different things to different people. IMHO, this is one of the simpler, more reliable, mechanisms and would be nice to have. And you're right that it shouldn't require much more work, in fact it's probably easier than a lot of other cases that PITR requires.

I agree. PITR is intended initially as a recovery mechanism. Replication has other purposes as well, such as locating data close to where it is required (in master-multi-slave replication scenarios), which is certainly not an objective or even a likely end point of the PITR work. The PostgreSQL community is lucky enough to have some very competent people working on those other approaches and I would recommend everybody checks out that work.

> For a long time Oracle has supported this mechanism for running warm standby databases. Basically you maintain a second database and every time an archived log is finished you ship it over and immediately restore it on the standby machine. Whenever you have a failure you can quickly fail over to the standby database which is all ready to go and up-to-date within minutes of your failure.

This facility is one of the intended features, if all goes well. But it is an advanced feature, not the bread and butter.
> Since 8i Oracle has also supported a more advanced version where you can open the standby database in read-only mode.

This mode requires more thought, but is certainly possible, in time.

Best Regards
Simon Riggs
Re: [HACKERS] GiST future
[EMAIL PROTECTED] wrote:
> Now I wonder why 7.4 array queries using SOME/ANY/ALL don't make use of this kind of indexes. Is there any plan to integrate Oleg's work into the core as GiST indexes were?

I hope to get to it someday, but have higher priorities if I can find some time between now and the 7.5 release. If you want to pitch in and implement it, by all means, go for it.

Joe
Re: [HACKERS] Problems Vacuum'ing
On Fri, Apr 02, 2004 at 12:02:22PM -0500, Jim Seymour wrote:
> [...] Which is all well-and-good (tho, my ignorance readily conceded, four minutes seems a mite... long), *except*: If I shut down the WebObjects application which, again, never accesses the db in question, much less any of its tables, this time-out doesn't seem to apply. (I tried it.)

Ok, so the WebObjects app keeps an idle open transaction? (i.e. it issues a BEGIN as soon as the previous transaction is finished.) I'm not sure I read the code correctly -- ISTM it would only matter when you try to vacuum a shared table, which this is not ...

-- Alvaro Herrera (alvherre[a]dcc.uchile.cl) "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Re: [HACKERS] GiST future
Joe Conway [EMAIL PROTECTED] writes:
> [EMAIL PROTECTED] wrote:
>> Now I wonder why 7.4 array queries using SOME/ANY/ALL don't make use of this kind of indexes. Is there any plan to integrate Oleg's work into the core as GiST indexes were?
>
> I hope to get to it someday, but have higher priorities if I can find some time between now and the 7.5 release. If you want to pitch in and implement it, by all means, go for it.

GiST indexes aren't really ready for prime time (no WAL logging and poor concurrent behavior being the main gripes). I'm hesitant to make any of the core SQL features depend on them until those issues are fixed. Oleg and Teodor have made noises about doing something about those problems, but I'm not sure it's at the top of their to-do lists ...

regards, tom lane
Re: [HACKERS] GiST future
On Fri, 2 Apr 2004, Joe Conway wrote:
> [EMAIL PROTECTED] wrote:
>> Now I wonder why 7.4 array queries using SOME/ANY/ALL don't make use of this kind of indexes. Is there any plan to integrate Oleg's work into the core as GiST indexes were?
>
> I hope to get to it someday, but have higher priorities if I can find some time between now and the 7.5 release. If you want to pitch in and implement it, by all means, go for it.

Current GiST lacks concurrency and recovery support. It's doable and there are some plans (me and Teodor) to work on it, but we still have no time. That means it's too early to integrate GiST-based contribs into core. Also, the current GiST interface should be enhanced to be able to support some very interesting data types like digital trees, similarity trees, etc.

Regards, Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Re: [HACKERS] [GENERAL] Large DB
Manfred Koizar [EMAIL PROTECTED] writes:
> What I have in mind is a kind of "Double Vitter" algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks.

That assumption is faulty, though --- consider wholly-empty pages. A bigger problem is that this makes the sampling quite nonuniform, because rows that are on relatively low-density pages would be more likely to become part of the final sample than rows that are on pages with lots of tuples. Thus for example your sample would tend to favor rows with wide values of variable-width columns and exclude narrower values. (I am not certain that the existing algorithm completely avoids this trap, but at least it tries.)

regards, tom lane
Re: [HACKERS] Problems Vacuum'ing
> On Fri, Apr 02, 2004 at 12:02:22PM -0500, Jim Seymour wrote:
> > [...] Which is all well-and-good (tho, my ignorance readily conceded, four minutes seems a mite... long), *except*: If I shut down the WebObjects application which, again, never accesses the db in question, much less any of its tables, this time-out doesn't seem to apply. (I tried it.)
>
> Ok, so the WebObjects app keeps an idle open transaction? (i.e. it issues a BEGIN as soon as the previous transaction is finished.) I'm not sure I read the code correctly --

I really couldn't say. I don't know what the WebObjects app is doing. I know, or *believe* I know, it's only doing queries. (It may be doing temp tables internally, or some-such.) Its interface to pgsql is via the JDBC that comes with pgsql. I don't know what the Java code it generated looks like.

> ISTM it would only matter when you try to vacuum a shared table, which this is not ...

That's what I would've thought. Thanks for the follow-up. I was beginning to wonder if anybody'd noticed ;).

Jim
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Tom Lane [EMAIL PROTECTED] writes:
> That would break even more things, no? On the other hand, it'd get rid of the problem that we presently face with dump/restore of arrays that don't have lower bound 1. Because pg_dump doesn't do anything to mark such values, they'll end up with lower bound 1 after reload anyway. The fact that we haven't heard lots of squawks about that suggests to me that not many people are using such arrays at present ...

You have to be using not only arrays, but the new 7.4 functions provided to manipulate them. In fact I think you have to be using array_prepend specifically. But even there, since it's not a mutator, it's really not that surprising that the elements of the brand new array it's returning should have new indexes.

In fact I suspect there are more people with hidden bugs where they depend on arrays starting at 1. This type of bug is insidious since it's hard to test for; your application might never generate an array with a lower bound other than 1 until someone adds some new code using array_prepend somewhere, and all of a sudden you get strange behaviours from unrelated code.

I can have the honour of being the first squawker like you describe, but my problem was only evidence that having such non-normalized arrays at all was surprising. I was using int_aggregate.c, which generates non-standard arrays with lower bounds of 0. My code assumed array_upper()+1 == length. After I dumped and restored, all my counts were off by one.

-- greg
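[Editorial aside] Greg's off-by-one can be modeled in a few lines of Python (hypothetical helpers, not PostgreSQL code): code that equates length with array_upper()+1 is correct only while the lower bound is 0, and silently overcounts once pg_dump restores the array with lower bound 1.

```python
# Model of the dump/restore off-by-one described above.

def array_upper(elems, lower):
    # upper bound = lower bound + number of elements - 1
    return lower + len(elems) - 1

def assumed_length(elems, lower):
    # Greg's hidden assumption: length == array_upper() + 1,
    # which is only true when the lower bound is 0.
    return array_upper(elems, lower) + 1

before_dump   = ([10, 20, 30], 0)  # int_aggregate array: lower bound 0
after_restore = ([10, 20, 30], 1)  # pg_dump reloads with lower bound 1
```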
Re: [HACKERS] Better support for whole-row operations and composite types
Joe Conway [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> We will be able to make generic I/O routines for composite types, comparable to those used now for arrays. Not sure what a convenient external format would look like. (Possibly use the same conventions as for a 1-D array?)
>
> So you mean like an array, but with possibly mixed datatypes? '{1 , abc def, 2.3}' Seems to make sense.

The unresolved question in my mind is how to represent NULL elements. However, we have to solve that sooner or later for arrays too. Any thoughts?

> Another option might be to use the ROW keyword, something like: ROW[1 , 'abc', 2.3]

This is a separate issue, just as the ARRAY[] constructor has different uses from the array I/O representation. I do want some kind of runtime constructor, but ROW[...] doesn't get the job done because it doesn't provide any place to specify the rowtype name. Maybe we could combine ROW[...] with some sort of cast notation?

ROW[1 , 'abc', 2.3] :: composite_type_name
CAST(ROW[1 , 'abc', 2.3] AS composite_type_name)

Does SQL99 provide any guidance here?

>> TupleDescGetSlot: no-op, returns NULL
>> TupleGetDatum: ignore slot, return tuple t_data pointer as datum
>>
>> This will work because heap_formtuple and BuildTupleFromCStrings can return a HeapTuple whose t_data part is already a valid row Datum, simply by setting the appropriate length and type fields in it. (If the tuple is ever stored to disk as a regular table row, these fields will be overwritten with xmin/cmin info at that time.)
>
> Is this the way you did things in your recent commit?

Almost. I ended up keeping TupleDescGetSlot as a live function, but its true purpose is only to ensure that the tupledesc gets registered with the type cache (see BlessTupleDesc() in CVS tip). The slot per se never gets used. I believe that CVS tip is source-code-compatible with existing SRFs, even though I adjusted all the ones in the distribution to stop using the TupleTableSlot stuff.
The main point though is that row Datums now contain sufficient info embedded in them to allow runtime type lookup the same as we do for arrays.

>> To convert a row Datum into something that can be passed to heap_getattr, one could use a local variable of type HeapTupleData and set its t_data field to the datum's pointer value. t_len is copied from the datum contents, while the other fields of HeapTupleData can just be set to zeroes.
>
> I think I understand this, but an example would help.

There are several in the PL sources now; for instance plpgsql does this with an incoming rowtype argument:

    if (!fcinfo->argnull[i])
    {
        HeapTupleHeader td;
        Oid             tupType;
        int32           tupTypmod;
        TupleDesc       tupdesc;
        HeapTupleData   tmptup;

        td = DatumGetHeapTupleHeader(fcinfo->arg[i]);
        /* Extract rowtype info and find a tupdesc */
        tupType = HeapTupleHeaderGetTypeId(td);
        tupTypmod = HeapTupleHeaderGetTypMod(td);
        tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
        /* Build a temporary HeapTuple control structure */
        tmptup.t_len = HeapTupleHeaderGetDatumLength(td);
        ItemPointerSetInvalid(&(tmptup.t_self));
        tmptup.t_tableOid = InvalidOid;
        tmptup.t_data = td;

        exec_move_row(estate, NULL, row, &tmptup, tupdesc);
    }

This is okay because the HeapTupleData is not needed after the call to exec_move_row.

>> * We have to be able to re-use an already-existing cache entry if it matches a requested TupleDesc.
>
> For anonymous record types, how will that lookup be done efficiently? Can the hash key be an array of attribute oids?

Right, that's the way I did it. See src/backend/utils/cache/typcache.c

> As an aside, it would be quite useful to have support for arrays of tuples. Any idea on how to do that without needing to define an explicit array type for each tuple type?

Hmm, messy ... I wonder now whether we still really need a separate pg_type entry for every array type. The original motivation for doing that has been at least partly subsumed by storing element type OIDs inside the arrays themselves.
I wonder if we could go over to a scheme where, say, atttypid is the base type ID and attndims being nonzero is what you check to find out it's really an array of atttypid. Not sure how we could map that idea into function and expression args/results, though. Plan B would be to go ahead and create array types. Not sure I would want to do this for table rowtypes, but if we did it only for CREATE TYPE AS then it doesn't sound like an unreasonable amount of overhead.

regards, tom lane
Re: [HACKERS] Problems Vacuum'ing
On Fri, Apr 02, 2004 at 02:51:30PM -0500, Jim Seymour wrote:
> > Ok, so the WebObjects app keeps an idle open transaction? (i.e. it issues a BEGIN as soon as the previous transaction is finished.) I'm not sure I read the code correctly --
>
> I really couldn't say. I don't know what the WebObjects app is doing. I know, or *believe* I know, it's only doing queries. (It may be doing temp tables internally, or some-such.) Its interface to pgsql is via the JDBC that comes with pgsql. I don't know what the Java code it generated looks like.

Turn on query logging and see if the BEGIN is issued right after the COMMIT/ROLLBACK, or whether it waits and issues it right before SELECT/CREATE TEMP TABLE. It doesn't matter if it's only doing queries; if it does them inside a transaction, it would be enough to keep VACUUM from working properly.

-- Alvaro Herrera (alvherre[a]dcc.uchile.cl) "La gente vulgar solo piensa en pasar el tiempo; el que tiene talento, en aprovecharlo"
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: What I have in mind is a kind of Double Vitter algorithm. [...] random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks. That assumption is faulty, though --- consider wholly-empty pages. A bigger problem is that this makes the sampling quite nonuniform, because rows that are on relatively low-density pages would be more likely to become part of the final sample than rows that are on pages with lots of tuples. This sounds like you are assuming that I want to take exactly one tuple out of each block of the block sample. This is not the case. In the second round I plan to apply the same (or a better) Vitter method as it is done now. The main difference is that blocks will be addressed indirectly through the array of block numbers obtained in the first round. Thus for example your sample would tend to favor rows with wide values of variable-width columns and exclude narrower values. (I am not certain that the existing algorithm completely avoids this trap, but at least it tries.) I'm reading 7.4 source code and I fail to see how it does this. If the relation starts with an atypical distribution of wide/narrow or dead/alive tuples, a wrong value for tuplesperpage is used for the rest of the sampling. Tuples immediately following one or more dead tuples have a better chance of being selected. This may be called as random as anything else and not favouring a special property. OTOH after long runs of dead tuples consecutive tuples are likely to be selected. Your comment about nonuniformity above exactly describes the current algorithm: Once the initial sample is fetched and tuplesperpage is determined, targpos is computed without any further feedback. 
If targpos points to a sparsely populated area (with wide tuples or with many dead tuples) tuples in this area are more likely to get into the sample than tuples in densely populated areas (with many small active tuples). I think that cutting down the number of blocks to be looked at does not affect these problems. Servus Manfred ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Function to kill backend
Hi! When debugging on win32, I've created myself a little function that I feel should be added to the backend proper. While it adds a lot of value on win32, I think it adds quite a bit of value on non-win32 platforms as well... The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. The advantage over using the kill command from a shell account is, well, you don't need shell access to the db server. On win32, that's going to be more common than on Unix - plus, if you want to signal a specific backend, you need a special tool (can't do from task manager/service manager etc - service manager can only do the postmaster, and task manager can only do kill -9). I also think a function like this could be good to have for e.g. pgadmin, to implement some more management functionality. For example, in MSSQL I can go into a view called current activity, pick a bad user, right-click and cancel query or terminate session. To do this remote, a function like this is required. pg_stat_activity can be used to get a list of sessions and their pids. The function should probably be complemented with a pg_get_postmasterpid or something along that way, to be able to send signals to the postmaster itself. So, would such a function be accepted into the backend code? And if so, any preferences on where you want it put? //Magnus ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Function to kill backend
Magnus Hagander wrote: Hi! When debugging on win32, I've created myself a little function that I feel should be added to the backend proper. While it adds a lot of value on win32, I think it adds quite a bit of value on non-win32 platforms as well... The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. If we're going to have this shouldn't it be a proper command? And maybe an internal shutdown command to go with it? cheers andrew ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Function to kill backend
Andrew Dunstan [EMAIL PROTECTED] writes: Magnus Hagander wrote: The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. If we're going to have this shouldn't it be a proper command? And maybe an internal shutdown command to go with it? I don't like the idea at all, but if we were to have something it would definitely need to be a lot more constrained than send-any-signal-to-any-postgres-process ... even for a superuser, that's a mighty fat-gauge foot-gun. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Function to kill backend
Hi! When debugging on win32, I've created myself a little function that I feel should be added to the backend proper. While it adds a lot of value on win32, I think it adds quite a bit of value on non-win32 platforms as well... The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. If we're going to have this shouldn't it be a proper command? And maybe an internal shutdown command to go with it? I guess it could be. I guess: 1) I don't know how to do one of those ;-) Which is why I didn't even think of it. 2) Won't that clutter up the namespace more, by introducing more keywords that you can't use for other things? 3) Will it still be possible to do the kind of things Rod mentioned, e.g. SELECT pg_kill_backend(procpid, 'TERM') FROM pg_stat_activity WHERE current_query LIKE '<IDLE>%'; ? //Magnus ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Function to kill backend
-Original Message- From: Magnus Hagander [mailto:[EMAIL PROTECTED] Sent: Friday, April 02, 2004 2:34 PM To: Andrew Dunstan; [EMAIL PROTECTED] Subject: Re: [HACKERS] Function to kill backend Hi! When debugging on win32, I've created myself a little function that I feel should be added to the backend proper. While it adds a lot of value on win32, I think it adds quite a bit of value on non-win32 platforms as well... The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. If we're going to have this shouldn't it be a proper command? And maybe an internal shutdown command to go with it? I guess it could be. I guess: 1) I don't know how to do one of those ;-) Which is why I didn't even think of it. 2) Won't that clutter up the namespace more, by introducing more keywords that you can't use for other things? 3) Will it still be possible to do the kind of things Rod mentioned, e.g. SELECT pg_kill_backend(procpid, 'TERM') FROM pg_stat_activity WHERE current_query LIKE '<IDLE>%'; ? Sybase had something like that: Syb_kill pid to kill a dangling process. And the undocumented: Syb_terminate pid to absolutely, positively kill it (Syb_kill only worked sometimes). In general, I think this approach is a bit worrisome. It reminds one of the famous tagline: Tip: Don't kill -9 the postmaster. ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Large DB
Manfred Koizar [EMAIL PROTECTED] writes: On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: A bigger problem is that this makes the sampling quite nonuniform, because rows that are on relatively low-density pages would be more likely to become part of the final sample than rows that are on pages with lots of tuples. This sounds like you are assuming that I want to take exactly one tuple out of each block of the block sample. This is not the case. No, I understood that you wanted to resample, but [ ... thinks for awhile ... ] hmm, now I can't construct a failure case either. I must have done the math wrong before. There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters. You should not need to use the Vitter algorithm for the block-level selection, since you can know the number of blocks in the table in advance. You can just use the traditional method of choosing each block or not with probability (k/K), where k = number of sample blocks still needed, K = number of blocks from here to the end. You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read. I do like this, since it eliminates the current method's bias towards estimating the number of live rows from the density found near the start of the table only. At the end you'd know the total number of live rows on all the pages you read, and it's reasonable to extrapolate that total to the full table size. Question: if the table size is less than N blocks, are you going to read every block or try to reduce the number of blocks sampled? If you don't adjust the sample size then I think this would perform worse for intermediate-size tables than the current method does ... 
perhaps not so much at sample size = 3000, but at larger sizes it would hurt. A lot of people are setting the stats target to 100 which means a sample size of 30000 --- how do the page-access counts look in that case? regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Joe Conway [EMAIL PROTECTED] writes: Greg Stark wrote: I'm leaning towards suggesting that postgres should follow sql-99 here and normalize all array indexes to have a lower bound of 1. Then array_lower and array_upper become entirely unnecessary. Instead we just have array_length which is exactly equivalent to my idea of array_upper. Now we finally have something to agree on ;-) I do think this is the way to go, but it is a significant hit to backward compatibility. Same is true for supporting NULL elements of arrays -- maybe we should bite the bullet and make both changes at the same time? In fact on further thought I think they *have* to be done together. I forgot that your code did something else cool allowing updates to extend arrays by directly updating elements outside the current bounds. Ie:

slo=> update test set a = '{}';
UPDATE 1
slo=> update test set a[1] = 1;
UPDATE 1
slo=> select * from test;
  a
-----
 {1}
(1 row)

But if we normalize array indexes to start at 1 then this makes it hard to fill in the array starting at higher values. For example:

slo=> update test set a = '{}';
UPDATE 1
slo=> update test set a[5] = 5;
UPDATE 1
slo=> select a[5] from test;
 a
---
 5
(1 row)

This really ought to work, it obviously shouldn't allow you to set a[5] and then surreptitiously move it to a[1]. But nor should it generate an error, since I may well have a specific meaning for a[5] and may be planning to fill in a[1]..a[4] later. The logical thing to do, I think, is pre-fill the entries a[1]..a[4] with null. This could be implemented by actually storing the NULLs or else storing some notation that's used to adjust the base of the index to save space. 
One thing that can't be made to work like it does now is extending the array on the low end indefinitely:

slo=> update test set a[1] = 1;
UPDATE 1
slo=> update test set a[0] = 0;
UPDATE 1
slo=> update test set a[-1] = -1;
UPDATE 1
slo=> select * from test;
    a
----------
 {-1,0,1}
(1 row)

If this all looks familiar it's because Perl, and other languages, also behave this way:

bash-2.05b$ perl -e '@a = (); $a[10]=10; print join(",",@a),"\n"'
,,,,,,,,,,10
bash-2.05b$ perl -e '@a = (); $a[-1]=-1; print join(",",@a),"\n"'
Modification of non-creatable array value attempted, subscript -1 at -e line 1.

-- greg ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Joe Conway [EMAIL PROTECTED] writes: Greg Stark wrote: array_lower() and array_upper() are returning NULL for a non-null input, the empty array, even though lower and upper bounds are known just as well as they are for any other sized array. They are behaving as if there's something unknown about the empty array that makes it hard to provide a lower bound or upper bound. Sorry, but I still disagree. There *is* something unknown about the lower and upper bound of an empty array because there are no bounds. So I understand your point of view now. But I think the intuitive meaning here for lower/upper bound as the lowest/highest index where an element is present is only a part of the picture. lower and upper bound are also related to other properties like where array_prepend and array_append place things. And of course the array length. So to give a practical example, say I was implementing a stack using an array. I push things on by extending the array by storing in arr[array_upper(arr)+1]. (Incidentally I don't think we actually have any way to shorten an array, do we?) As soon as I pop off the last element I lose my stack pointer. I need a special case in my code to handle pushing elements in when the array is empty. In reality array_append() would work fine. It's only array_upper() that's going out of its way to make things weird. There's still an upper bound, array_append() knows it, array_upper() just hides this value from the user. I don't see the spec defined CARDINALITY as a workaround. It defines length as the number of elements in the array. When the array is empty, that value is clearly 0. Nothing strange about it. The workaround is having to have that case handled with a special case if statement. If array_lower(), array_upper(), array_length()/CARDINALITY are all defined in a consistent way it doesn't seem like there ought to be any special cases in the implementations. There should be a simple rigid mathematical relationship between them. 
namely upper-lower+1 = length

test=# select array_upper(a||b, 1), array_upper(a,1)+array_length(b)
       from (select '{}'::int[] as a, array[1,2] as b) as x;
 array_upper | ?column?
-------------+----------
           2 |
(1 row)

OK, you got me with this corner case. But using what you described as the result int_aggregate would give you in this case (-1), you would get an even stranger answer (-1 + 2 = 1) that would still need to be worked around. No actually, 1 would be the correct answer, the original array would have indexes ranging from [0,-1] and the new array would have indexes ranging from [0,1], ie, two elements. The only strangeness is the unusual lower bound which isn't the default for postgres arrays constructed from string literals. Personally I prefer the zero-based indexes but as neither SQL-foo nor backwards compatibility agree with me here I'll give that up as a lost cause :) -- greg ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Problems Vacuum'ing
Alvaro Herrera [EMAIL PROTECTED] wrote: [snip] Turn on query logging and see if the BEGIN is issued right after the COMMIT/ROLLBACK, or whether it waits and issues it right before SELECT/CREATE TEMP TABLE. It doesn't matter if it's only doing queries; if it does them inside a transaction, it would be enough to keep VACUUM from working properly. Will the following do as well? (Thanks to Jochem for the pointer for how to determine open transactions and the pg_stat_activity hint.) Logged into work. WebObjects application is running. Database I'm working with partly populated from earlier work.

postgres=# select * from pg_locks where transaction is not null;
 relation | database | transaction |  pid  |     mode      | granted
----------+----------+-------------+-------+---------------+---------
          |          |     1245358 | 18020 | ExclusiveLock | t
          |          |     1245364 |   267 | ExclusiveLock | t
(2 rows)

postgres=# select * from pg_stat_activity;
 datid | datname  | procpid | usesysid |  usename   | current_query | query_start
-------+----------+---------+----------+------------+---------------+-------------
 17142 | postgres |     267 |        1 | postgres   |               |
 17144 | qantel   |   18020 |      103 | webobjects |               |
(2 rows)

sysagent=> delete from ethers;
DELETE 368
sysagent=> delete from host_mac_hist;
DELETE 169
sysagent=> vacuum full analyze verbose ethers;
INFO:  vacuuming "public.ethers"
INFO:  "ethers": found 0 removable, 368 nonremovable row versions in 4 pages
DETAIL:  368 dead row versions cannot be removed yet.
Nonremovable row versions range from 64 to 88 bytes long.
There were 55 unused item pointers.
Total free space (including removable row versions) is 3724 bytes.
0 pages are or will become empty, including 0 at the end of the table.
1 pages containing 3628 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ethers_hostname_key" now contains 368 row versions in 275 pages
DETAIL:  0 index row versions were removed.
223 index pages have been deleted, 223 are currently reusable.
CPU 0.01s/0.03u sec elapsed 0.23 sec.
INFO:  "ethers": moved 0 row versions, truncated 4 to 4 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.ethers"
INFO:  "ethers": 4 pages, 0 rows sampled, 0 estimated total rows
VACUUM
sysagent=> vacuum full analyze verbose host_mac_hist;
INFO:  vacuuming "public.host_mac_hist"
INFO:  "host_mac_hist": found 0 removable, 169 nonremovable row versions in 2 pages
DETAIL:  169 dead row versions cannot be removed yet.
Nonremovable row versions range from 64 to 80 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 3556 bytes.
0 pages are or will become empty, including 0 at the end of the table.
1 pages containing 3532 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "host_mac_hist": moved 0 row versions, truncated 2 to 2 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.host_mac_hist"
INFO:  "host_mac_hist": 2 pages, 0 rows sampled, 0 estimated total rows
VACUUM

Shut down WebObjects. Clear both tables. Do full vacuum. Re-populate tables. Then...

postgres=# select * from pg_locks where transaction is not null;
 relation | database | transaction | pid  |     mode      | granted
----------+----------+-------------+------+---------------+---------
          |          |     1245558 | 3110 | ExclusiveLock | t
(1 row)

postgres=# select * from pg_stat_activity;
 datid | datname  | procpid | usesysid | usename  | current_query | query_start
-------+----------+---------+----------+----------+---------------+-------------
 17142 | postgres |    3110 |        1 | postgres |               |
(1 row)

sysagent=> delete from ethers;
DELETE 368
sysagent=> delete from host_mac_hist;
DELETE 169
sysagent=> vacuum full analyze verbose ethers;
INFO:  vacuuming "public.ethers"
INFO:  "ethers": found 10030 removable, 0 nonremovable row versions in 98 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 760736 bytes.
98 pages are or will become empty, including 98 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.01s/0.01u sec elapsed 0.01 sec.
INFO:  index "ethers_hostname_key" now contains 0 row versions in 275 pages
DETAIL:  10030 index row versions were removed.
271 index pages have been deleted, 271 are currently reusable.
CPU 0.00s/0.08u sec elapsed 0.08 sec.
INFO:  "ethers": truncated 98 to 0 pages
INFO:  analyzing "public.ethers"
INFO:  "ethers": 0 pages, 0 rows sampled, 0 estimated total rows
VACUUM
sysagent=> vacuum full analyze verbose host_mac_hist;
INFO:  vacuuming "public.host_mac_hist"
INFO:  "host_mac_hist": found 169 removable, 0 nonremovable row versions in 2 pages
DETAIL:  0
Re: [HACKERS] Function to kill backend
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Magnus Hagander wrote: The function is pg_kill_backend(pid,signal). superuser-only, of course. Which simply sends a signal to the specified backend - querycancel, terminate, etc. If' we're going to have this shouldn't it be a proper command? And maybe an internal shutdown command to go with it? I don't like the idea at all, but if we were to have something it would definitely need to be a lot more constrained than send-any-signal-to-any-postgres-process ... even for a superuser, that's a mighty fat-gauge foot-gun. What sort of constraints do you have in mind? cheers andrew ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Problems Vacuum'ing
[EMAIL PROTECTED] (Jim Seymour) writes: Again the difference: With WebObjects running, deleting rows and trying to vacuum immediately, even full, fails. Shut-down WebObjects and I can. WebObjects is evidently holding an open transaction. Ergo, anything deleted after the start of that transaction isn't vacuumable. You need to do something about the client-side logic that is holding an open transaction without doing anything ... regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane [EMAIL PROTECTED] wrote: You should not need to use the Vitter algorithm for the block-level selection, since you can know the number of blocks in the table in advance. You can just use the traditional method of choosing each block or not with probability (k/K), where k = number of sample blocks still needed, K = number of blocks from here to the end. Sounds reasonable. I have to play around a bit more to get a feeling where the Vitter method gets more efficient. You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read. You mean once a block is sampled we inspect it in any case? This was not the way I had planned to do it, but I'll keep this idea in mind. Question: if the table size is less than N blocks, are you going to read every block or try to reduce the number of blocks sampled? Don't know yet. people are setting the stats target to 100 which means a sample size of 30000 --- how do the page-access counts look in that case?

 rel  | page
 size | reads
------+-------
  300 |   300
 3000 |  3000
 5000 |  4999
  10K |  9.9K
  30K | 25.8K
 300K |   85K
   1M |  120K
  10M |  190K
 100M |  260K
   1G |  330K

This is exactly the table I posted before (for sample size 3000) with every entry multiplied by 10. Well, not quite exactly, but the differences are far behind the decimal point. So for our purposes, for a given relation size the number of pages accessed is proportional to the sample size. Servus Manfred ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Problems Vacuum'ing
Tom Lane [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Jim Seymour) writes: Again the difference: With WebObjects running, deleting rows and trying to vacuum immediately, even full, fails. Shut-down WebObjects and I can. WebObjects is evidently holding an open transaction. Ergo, anything deleted after the start of that transaction isn't vacuumable. You need to do something about the client-side logic that is holding an open transaction without doing anything ... It certainly isn't holding open a transaction in the database I'm working with. It's unclear to me it's holding any transaction open, anywhere. This is all that showed up:

postgres=# select * from pg_locks where transaction is not null;
 relation | database | transaction |  pid  |     mode      | granted
----------+----------+-------------+-------+---------------+---------
          |          |     1245358 | 18020 | ExclusiveLock | t
          |          |     1245364 |   267 | ExclusiveLock | t
(2 rows)

postgres=# select * from pg_stat_activity;
 datid | datname  | procpid | usesysid |  usename   | current_query | query_start
-------+----------+---------+----------+------------+---------------+-------------
 17142 | postgres |     267 |        1 | postgres   |               |
 17144 | qantel   |   18020 |      103 | webobjects |               |
(2 rows)

I don't know what those are, but they list no database or relation. I get this just by running psql (this time at home):

jseymour=> select * from pg_locks where transaction is not null;
 relation | database | transaction | pid |     mode      | granted
----------+----------+-------------+-----+---------------+---------
          |          |        8938 | 307 | ExclusiveLock | t
(1 row)

jseymour=> select * from pg_stat_activity;
 datid | datname  | procpid | usesysid | usename  | current_query | query_start
-------+----------+---------+----------+----------+---------------+-------------
 17144 | jseymour |     307 |      101 | jseymour |               |
(1 row)

Without having touched a thing. Jim ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Large DB
Manfred Koizar [EMAIL PROTECTED] writes: You'd run the Vitter algorithm separately to decide whether to keep or discard each live row you find in the blocks you read. You mean once a block is sampled we inspect it in any case? This was not the way I had planned to do it, but I'll keep this idea in mind. Well, once we've gone to the trouble of reading in a block we definitely want to count the tuples in it, for the purposes of extrapolating the total number of tuples in the relation. Given that, I think the most painless route is simply to use the Vitter algorithm with the number-of-tuples-scanned as the count variable. You could dump the logic in acquire_sample_rows that tries to estimate where to read the N'th tuple from. If you like I can send you the Vitter paper off-list (I have a PDF of it). The comments in the code are not really intended to teach someone what it's good for ... regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Function to kill backend
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: it would definitely need to be a lot more constrained than send-any-signal-to-any-postgres-process ... even for a superuser, that's a mighty fat-gauge foot-gun. What sort of constraints do you have in mind? I'd limit it to SIGINT (query cancel) and SIGTERM (fast shutdown), and I'm not even real sure about SIGTERM. That facility is designed to work in the case of shutting down all backends together --- I'm not sure I want to promise that it behaves pleasantly to SIGTERM one backend and leave the rest going. Nor do I see a real good use-case for it. Also, no killing processes that aren't regular backends (eg, the bgwriter, the stats processes, and most especially the postmaster). Another point is that killing by PID is not necessarily what you want to do --- kill by transaction ID might be a better API, especially for query-cancel cases. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [GENERAL] Large DB
On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane [EMAIL PROTECTED] wrote: If you like I can send you the Vitter paper off-list (I have a PDF of it). The comments in the code are not really intended to teach someone what it's good for ... Yes, please. [Would have sent this off-list. But I'm blacklisted.] Servus Manfred ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Problems Vacuum'ing
[EMAIL PROTECTED] (Jim Seymour) writes: Tom Lane [EMAIL PROTECTED] wrote: WebObjects is evidently holding an open transaction. It certainly isn't holding open a transaction in the database I'm working with. Which database the transaction is in isn't real relevant... the logic is done globally so that it will be correct when vacuuming shared tables. It's unclear to me it's holding any transaction open, anywhere. Sure it is, assuming that PID 18020 is the session we're talking about.

postgres=# select * from pg_locks where transaction is not null;
 relation | database | transaction |  pid  |     mode      | granted
----------+----------+-------------+-------+---------------+---------
          |          |     1245358 | 18020 | ExclusiveLock | t
                            ^^^

This process has an open transaction number 1245358. That's what an exclusive lock on a transaction means.

 17142 | postgres |     267 |        1 | postgres   |               |
 17144 | qantel   |   18020 |      103 | webobjects |               |

These entries didn't make a lot of sense to me since the other examples you mentioned did not seem to be getting executed in the 'postgres' database --- but I assume PID 18020 is the one you are referring to as webobjects. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Better support for whole-row operations and composite
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: So you mean like an array, but with possibly mixed datatypes? '{1 , "abc def", 2.3}' Seems to make sense. The unresolved question in my mind is how to represent NULL elements. However, we have to solve that sooner or later for arrays too. Any thoughts? Good point. What's really ugly is that the external representation of string types differs depending on whether quotes are needed or not. If strings were *always* surrounded by quotes, we could just use the word NULL, without the quotes. Another option might be to use the ROW keyword, something like: ROW[1 , 'abc', 2.3] This is a separate issue, just as the ARRAY[] constructor has different uses from the array I/O representation. I do want some kind of runtime constructor, but ROW[...] doesn't get the job done because it doesn't provide any place to specify the rowtype name. Maybe we could combine ROW[...] with some sort of cast notation? ROW[1 , 'abc', 2.3] :: composite_type_name CAST(ROW[1 , 'abc', 2.3] AS composite_type_name) Does SQL99 provide any guidance here? The latter seems to agree with 6.12 (cast specification) of SQL2003. I'd think we'd want the former supported anyway as an extension to the standard. Almost. I ended up keeping TupleDescGetSlot as a live function, but its true purpose is only to ensure that the tupledesc gets registered with the type cache (see BlessTupleDesc() in CVS tip). The slot per se never gets used. I believe that CVS tip is source-code-compatible with existing SRFs, even though I adjusted all the ones in the distribution to stop using the TupleTableSlot stuff. Almost compatible. I found that, to my surprise, PL/R compiles with no changes after your commit. However it now segfaults (as I expected) on composite type arguments. Should be easy to fix though (I think, really haven't looked at it hard yet). 
The main point though is that row Datums now contain sufficient info embedded in them to allow runtime type lookup the same as we do for arrays. Sounds good to me. There are several in the PL sources now, for instance plpgsql does this with an incoming rowtype argument: Perfect -- thanks. As an aside, it would be quite useful to have support for arrays of tuples. Any idea on how to do that without needing to define an explicit array type for each tuple type? Hmm, messy ... I wonder now whether we still really need a separate pg_type entry for every array type. The original motivation for doing that has been at least partly subsumed by storing element type OIDs inside the arrays themselves. I wonder if we could go over to a scheme where, say, atttypid is the base type ID and attndims being nonzero is what you check to find out it's really an array of atttypid. Not sure how we could map that idea into function and expression args/results, though. Hmmm. I had thought maybe we could use a single datatype (anyarray?) with in/out functions that would need to do the right thing based on the element type. This would also allow, for example, arrays-of-arrays, which is the way that SQL99/2003 seem to allow for multidimensional arrays. Plan B would be to go ahead and create array types. Not sure I would want to do this for table rowtypes, but if we did it only for CREATE TYPE AS then it doesn't sound like an unreasonable amount of overhead. I was hoping we wouldn't need to do that. Joe ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Problems Vacuum'ing
On Fri, Apr 02, 2004 at 07:35:20PM -0500, Tom Lane wrote: [EMAIL PROTECTED] (Jim Seymour) writes: Again the difference: With WebObjects running, deleting rows and trying to vacuum immediately, even full, fails. Shut-down WebObjects and I can.

WebObjects is evidently holding an open transaction. Ergo, anything deleted after the start of that transaction isn't vacuumable. You need to do something about the client-side logic that is holding an open transaction without doing anything ...

But, if I read the code correctly, the oldest xmin vacuum cares about for a non-shared relation should be local to the database, shouldn't it? If this is so, why does it matter that he has an open transaction on a different database?

-- Alvaro Herrera (alvherre[a]dcc.uchile.cl) www.google.com: command-line interface for the web. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Function to kill backend
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: it would definitely need to be a lot more constrained than send-any-signal-to-any-postgres-process ... even for a superuser, that's a mighty fat-gauge foot-gun. What sort of constraints do you have in mind? I'd limit it to SIGINT (query cancel) and SIGTERM (fast shutdown), and I'm not even real sure about SIGTERM. That facility is designed to work in the case of shutting down all backends together --- I'm not sure I want to promise that it behaves pleasantly to SIGTERM one backend and leave the rest going. Nor do I see a real good use-case for it. Also, no killing processes that aren't regular backends (eg, the bgwriter, the stats processes, and most especially the postmaster). Another point is that killing by PID is not necessarily what you want to do --- kill by transaction ID might be a better API, especially for query-cancel cases. Seems like useful functionality. Right now, how does an administrator kill another backend from psql? They can't. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Inconsistent behavior on Array Is Null?
Greg Stark wrote: This really ought to work; it obviously shouldn't allow you to set a[5] and then surreptitiously move it to a[1]. But nor should it generate an error, since I may well have a specific meaning for a[5] and may be planning to fill in a[1]..a[4] later. The logical thing to do, I think, is to pre-fill the entries a[1]..a[4] with null. This could be implemented by actually storing the NULLs, or else by storing some notation that's used to adjust the base of the index to save space.

I agree. I had always envisioned something exactly like that once we supported NULL elements. As far as the implementation goes, I think it would be very similar to tuples -- a null bitmask that would exist if any elements are NULL.

A related question is how to deal with non-existent array elements. Until now, you could do:

    regression=# select f[0] from (select array[1,2]) as t(f);
     f
    ---

    (1 row)

Even though index 0 does not exist, you get a NULL value returned instead of an ERROR. I'd think if we hardwire a lower bound of 1, this should produce an ERROR. Similarly:

    regression=# select f[3] from (select array[1,2]) as t(f);
     f
    ---

    (1 row)

Should this produce an ERROR instead of returning NULL once existing array elements can be NULL?

One thing that can't be made to work like it does now is extending the array on the low end indefinitely:

    slo=> update test set a[1] = 1;
    UPDATE 1
    slo=> update test set a[0] = 0;
    UPDATE 1
    slo=> update test set a[-1] = -1;
    UPDATE 1

Right. In the new world order we're describing, the latter two examples would have to produce errors. Joe ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
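The semantics Greg and Joe converge on (lower bound hardwired at 1, NULL pre-fill when assigning past the current end, and an ERROR for out-of-range subscripts) can be sketched as a toy Python model. This is an illustration of the proposed behavior, not the server's array code:

```python
class OneBasedArray:
    """Toy model of the proposed array semantics: fixed lower bound of 1,
    NULL (None) pre-fill on assignment past the end, ERROR on bad reads."""

    def __init__(self):
        self.elems = []                      # elems[0] holds logical index 1

    def __setitem__(self, idx, value):
        if idx < 1:
            raise IndexError("lower bound is fixed at 1")   # a[0], a[-1] error out
        while len(self.elems) < idx:
            self.elems.append(None)          # pre-fill a[1]..a[idx-1] with NULL
        self.elems[idx - 1] = value

    def __getitem__(self, idx):
        if idx < 1 or idx > len(self.elems):
            raise IndexError("subscript out of range")      # instead of silent NULL
        return self.elems[idx - 1]

a = OneBasedArray()
a[5] = "x"                                   # a[1]..a[4] become NULL, a[5] = 'x'
assert a.elems == [None, None, None, None, "x"]
```

A real implementation would presumably store the NULLs via a null bitmask, as Joe suggests for tuples, rather than as explicit elements.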
Re: [HACKERS] Problems Vacuum'ing
On Fri, 2 Apr 2004, Alvaro Herrera wrote: On Fri, Apr 02, 2004 at 07:35:20PM -0500, Tom Lane wrote: [EMAIL PROTECTED] (Jim Seymour) writes: Again the difference: With WebObjects running, deleting rows and trying to vacuum immediately, even full, fails. Shut-down WebObjects and I can. WebObjects is evidently holding an open transaction. Ergo, anything deleted after the start of that transaction isn't vacuumable. You need to do something about the client-side logic that is holding an open transaction without doing anything ... But, if I read the code correctly, the oldest xmin vacuum cares about for a non-shared relation should be local to the database, shouldn't it? AFAICS it's the oldest transaction at the start of any of the transactions in this database, not the oldest transaction of any transaction in this database. ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Problems Vacuum'ing
Tom Lane [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Jim Seymour) writes: Tom Lane [EMAIL PROTECTED] wrote: WebObjects is evidently holding an open transaction. It certainly isn't holding open a transaction in the database I'm working with.

Which database the transaction is in isn't real relevant... the logic is done globally so that it will be correct when vacuuming shared tables.

It had occurred to me, early on, that if anything had an open transaction, that would perhaps cause what I was seeing. So I killed-off WebObjects. Ran my tests. Psql'd as yet another user, to another database, and did something like

    begin; insert into foo (bar) values ('Hello');

And then ran my tests. Vacuum'ing worked completely. It's unclear to me it's holding any transaction open, anywhere.

Sure it is, assuming that PID 18020 is the session we're talking about.

    postgres=# select * from pg_locks where transaction is not null;
     relation | database | transaction |  pid  |     mode      | granted
    ----------+----------+-------------+-------+---------------+---------
              |          |     1245358 | 18020 | ExclusiveLock | t
                                          ^^^^^

But I see entries like that if I just *start* *up* psql, without doing anything:

    Script started on Fri 02 Apr 2004 09:42:58 PM EST
    $ psql
    Password:
    Welcome to psql 7.4.2, the PostgreSQL interactive terminal.

    Type: \copyright for distribution terms
          \h for help with SQL commands
          \? for help on internal slash commands
          \g or terminate with semicolon to execute query
          \q to quit

    jseymour=> select * from pg_locks where transaction is not null;
     relation | database | transaction | pid  |     mode      | granted
    ----------+----------+-------------+------+---------------+---------
              |          |        8941 | 1480 | ExclusiveLock | t
    (1 row)

    jseymour=> select * from pg_stat_activity;
     datid | datname  | procpid | usesysid | usename  | current_query | query_start
    -------+----------+---------+----------+----------+---------------+-------------
     17144 | jseymour |    1480 |      101 | jseymour |               |
    (1 row)

    jseymour=> \q
    $ exit
    script done on Fri 02 Apr 2004 09:43:27 PM EST

What does that entry for pid 1480, transaction 8941 mean? This process has an open transaction number 1245358. 
That's what an exclusive lock on a transaction means.

     17142 | postgres |     267 |        1 | postgres   |               |
     17144 | qantel   |   18020 |      103 | webobjects |               |

These entries didn't make a lot of sense to me, since the other examples you mentioned did not seem to be getting executed in the 'postgres' database --- but I assume PID 18020 is the one you are referring to as webobjects.

I ran the pg_locks and pg_stat_activity selects as user postgres. The postgres db has nothing to do with either the WebObjects application or the script that's been populating the db I've been experimenting with. The point there was to show that the WebObjects application had nothing open other than whatever it is seems to be there when anything connects to a database (?) with psql (?).

Regards, Jim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Better support for whole-row operations and composite types
Joe Conway [EMAIL PROTECTED] writes: Tom Lane wrote: ... I believe that CVS tip is source-code-compatible with existing SRFs, even though I adjusted all the ones in the distribution to stop using the TupleTableSlot stuff. Almost compatible. I found that, to my surprise, PL/R compiles with no changes after your commit. However it now segfaults (as I expected) on composite type arguments. Should be easy to fix though (I think, really haven't looked at it hard yet).

Let me know what you find out --- if I missed a trick on compatibility, there's still plenty of time to fix it.

... I wonder if we could go over to a scheme where, say, atttypid is the base type ID and attndims being nonzero is what you check to find out it's really an array of atttypid. Not sure how we could map that idea into function and expression args/results, though. Hmmm. I had thought maybe we could use a single datatype (anyarray?) with in/out functions that would need to do the right thing based on the element type.

If we have just one datatype, how will the parser determine the type of a foo[subscript] expression? After thinking a bit, I don't see how to do that except by adding an out-of-line decoration to the underlying type, somewhat like we do for setof or atttypmod. This is doable as far as the backend itself is concerned, but the compatibility implications for clients and user-written extensions seem daunting :-(

regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Problems Vacuum'ing
Alvaro Herrera [EMAIL PROTECTED] writes: But, if I read the code correctly, the oldest xmin vacuum cares about for a non-shared relation should be local to the database, shouldn't it? It's the oldest xmin of any transaction that's local to your database, but those xmin values themselves were computed globally --- so what matters is the oldest transaction that was running when any local transaction started. In this case I expect it's the VACUUM's own transaction that's seeing the other guy as determining its xmin. We could fix this by making every transaction compute, and advertise in the PGPROC array, both local and global xmin values. In previous iterations of this discussion we concluded that the extra cycles (which would be spent in *every* transaction start) could not be justified by making VACUUM better able to reclaim space in the face of misbehaving clients. That conclusion might be wrong, but it's not instantly obvious that it is... regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
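The distinction Tom draws (the oldest xmin is taken over transactions local to your database, but each advertised xmin was itself computed globally) can be illustrated with a toy Python model. Field names are invented, and this deliberately glosses over the global-xmin-computation subtlety; it only shows why a separately tracked database-local xmin would change the outcome:

```python
def oldest_xmin(procs, my_db, local_only=False):
    """Toy model of GetOldestXmin over an invented PGPROC-like list.

    Each running transaction advertises an xmin and its database. Today's
    behavior considers every local backend's (globally computed) xmin; the
    proposed variant would also advertise a database-local xmin, letting
    vacuum ignore old transactions running against other databases."""
    candidates = [p["xmin"] for p in procs
                  if not local_only or p["db"] == my_db]
    return min(candidates) if candidates else None

procs = [
    {"xid": 100, "xmin": 90,  "db": "qantel"},    # idle-in-transaction session
    {"xid": 200, "xmin": 200, "db": "jseymour"},
]

# Today: the old transaction in another database holds back vacuum.
assert oldest_xmin(procs, "jseymour") == 90
# With a database-local xmin, vacuum in 'jseymour' could ignore it.
assert oldest_xmin(procs, "jseymour", local_only=True) == 200
```

As Tom notes below, the real cost question is computing and advertising both values at every transaction start.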
Re: [HACKERS] Function to kill backend
Bruce Momjian [EMAIL PROTECTED] writes: Seems like useful functionality. Right now, how does an administrator kill another backend from psql? They can't. The question to ask is should they be able to? I think any such facility is inherently a security risk, since it means that a remote attacker who's managed to break into your superuser account can randomly zap other backends. Now admittedly there's plenty of other mischief he can do with superuser privs, but that doesn't mean we should hand him a pre-loaded, pre-sighted cannon. Having to log into the database server locally to execute such operations doesn't seem that bad to me. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Problems Vacuum'ing
[EMAIL PROTECTED] (Jim Seymour) writes: But I see entries like that if I just *start* *up* psql, without doing anything: Sure. You are doing something when you execute select from pg_locks ... that command executes inside a transaction, just like any other Postgres operation. The problem you're facing is that WebObjects is creating a transaction that persists for a long period of time. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Function to kill backend
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Seems like useful functionality. Right now, how does an administrator kill another backend from psql? They can't. The question to ask is should they be able to? I think any such facility is inherently a security risk, since it means that a remote attacker who's managed to break into your superuser account can randomly zap other backends. Now admittedly there's plenty of other mischief he can do with superuser privs, but that doesn't mean we should hand him a pre-loaded, pre-sighted cannon. Having to log into the database server locally to execute such operations doesn't seem that bad to me. If they can read/write your data (as superuser), killing backends is the least worry. I can see it as useful as part of pg_stat_activity output. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
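For illustration, the constrained facility Tom sketches earlier in the thread (allow query cancel via SIGINT only, and only against regular backends, never the postmaster or bgwriter) might look like the following. This is a hypothetical Python sketch of the policy, not a real PostgreSQL API; the function name and checks are invented:

```python
import os
import signal

ALLOWED_SIGNALS = {signal.SIGINT}    # query cancel only, per Tom's constraint

def kill_backend(pid, regular_backend_pids, sig=signal.SIGINT):
    """Hypothetical constrained kill function (invented, not PostgreSQL's).

    Refuses anything except SIGINT, and refuses PIDs that are not known
    regular backends, so the postmaster, bgwriter, and stats processes
    cannot be targeted even by a superuser."""
    if sig not in ALLOWED_SIGNALS:
        raise PermissionError("only query cancel (SIGINT) is permitted")
    if pid not in regular_backend_pids:
        raise PermissionError("target is not a regular backend")
    os.kill(pid, sig)

# A disallowed signal is rejected before anything is sent:
try:
    kill_backend(12345, {12345}, signal.SIGKILL)
except PermissionError:
    pass                             # raises: only SIGINT is permitted
```

Kill-by-transaction-ID, as Tom suggests, would add a lookup from transaction ID to backend PID in front of the same checks.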
Re: [HACKERS] Better support for whole-row operations and composite
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: Almost compatible. I found that, to my surprise, PL/R compiles with no changes after your commit. However it now segfaults (as I expected) on composite type arguments. Should be easy to fix though (I think, really haven't looked at it hard yet). Let me know what you find out --- if I missed a trick on compatibility, there's still plenty of time to fix it.

I still haven't had time to look closely, and well may have been doing something non-standard all along, but in any case this is the current failing code:

    else if (function->arg_is_rel[i])
    {
        /* for tuple args, convert to a one row data.frame */
        TupleTableSlot *slot = (TupleTableSlot *) arg[i];
        HeapTuple       tuples = slot->val;
        TupleDesc       tupdesc = slot->ttc_tupleDescriptor;

        PROTECT(el = pg_tuple_get_r_frame(1, tuples, tupdesc));
    }

The problem was (I think -- I'll check a little later) that slot->ttc_tupleDescriptor is now '\0'.

Hmmm. I had thought maybe we could use a single datatype (anyarray?) with in/out functions that would need to do the right thing based on the element type. If we have just one datatype, how will the parser determine the type of a foo[subscript] expression? After thinking a bit, I don't see how to do that except by adding an out-of-line decoration to the underlying type, somewhat like we do for setof or atttypmod. This is doable as far as the backend itself is concerned, but the compatibility implications for clients and user-written extensions seem daunting :-(

I'll think-about/play-with this some more, hopefully this weekend. Thanks, Joe ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Better support for whole-row operations and composite types
Joe Conway [EMAIL PROTECTED] writes: I still haven't had time to look closely, and well may have been doing something non-standard all along, but in any case this is the current failing code:

    /* for tuple args, convert to a one row data.frame */
    TupleTableSlot *slot = (TupleTableSlot *) arg[i];
    HeapTuple       tuples = slot->val;
    TupleDesc       tupdesc = slot->ttc_tupleDescriptor;

Um. Well, the arg is not a TupleTableSlot * anymore, so this is guaranteed to fail. This isn't part of what I thought the documented SRF API was though. If you take the arg[i] value and pass it to GetAttributeByName or GetAttributeByNum it will work (with some compiler warnings) and AFAICS we never documented more than that. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Better support for whole-row operations and composite
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes:

    /* for tuple args, convert to a one row data.frame */
    TupleTableSlot *slot = (TupleTableSlot *) arg[i];
    HeapTuple       tuples = slot->val;
    TupleDesc       tupdesc = slot->ttc_tupleDescriptor;

Um. Well, the arg is not a TupleTableSlot * anymore, so this is guaranteed to fail. This isn't part of what I thought the documented SRF API was though.

I'm sure you're correct. The SRF API was for user defined functions, not procedural languages anyway. I'll look at how the other procedural languages handle tuple arguments.

If you take the arg[i] value and pass it to GetAttributeByName or GetAttributeByNum it will work (with some compiler warnings) and AFAICS we never documented more than that.

OK, thanks, Joe ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Better support for whole-row operations and composite types
Joe Conway [EMAIL PROTECTED] writes: ... The SRF API was for user defined functions, not procedural languages anyway. I'll look at how the other procedural languages handle tuple arguments. It was a dozen-or-so-lines change in each of the PL languages AFAIR. You will probably also want to look at what you do to return tuple results. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Better support for whole-row operations and composite
Tom Lane wrote: Joe Conway [EMAIL PROTECTED] writes: ... The SRF API was for user defined functions, not procedural languages anyway. I'll look at how the other procedural languages handle tuple arguments. It was a dozen-or-so-lines change in each of the PL languages AFAIR. You will probably also want to look at what you do to return tuple results. OK, thanks. Just for reference, what is arg[i] if it isn't a (TupleTableSlot *) anymore -- is it just a HeapTuple? Joe ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Better support for whole-row operations and composite types
Joe Conway [EMAIL PROTECTED] writes: Just for reference, what is arg[i] if it isn't a (TupleTableSlot *) anymore -- is it just a HeapTuple? No, it's a HeapTupleHeader pointer. You need to reconstruct a HeapTuple on top of that to work with heap_getattr and most other core backend routines. Also don't forget to ensure that you detoast the datum; this is not useful at the moment but will be important Real Soon Now. I added standard argument-fetch macros to fmgr.h to help with the detoasting bit. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Large DB
[time to move this to -hackers] On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane [EMAIL PROTECTED] wrote: Manfred Koizar [EMAIL PROTECTED] writes: The first step, however, (acquire_sample_rows() in analyze.c) has to read more rows than finally end up in the sample. It visits less than O(nblocks) pages but certainly more than O(1). A vague feeling tries to tell me that the number of page reads is somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which grow like O(ln(n)). Good guess. Vitter's paper says the expected time to sample n rows from a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out, my wild guess seems to be wrong. I don't doubt that Vitter's formula is correct, but it assumes that access to any tuple has the same cost. This does not apply to our problem, however. With 100 tuples per page, we access the first sample_size tuples at a cost of 0.01 sequential page reads per tuple. Later we use fewer and fewer tuples per page, which results in higher per-tuple cost. Near the end of a large relation we can expect to access only one tuple per page, and more and more pages are skipped, so that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample size of 3000) I got:

     rel size | page reads
    ----------+-----------
           30 |     30
          300 |    300      (expectation is something like 299.9995)
          500 |    499
           1K |    990
           3K |    2.6K
          30K |      8K
         100K |     12K
           1M |     19K
          10M |     26K
         100M |     33K

This growth rate is steeper than O(log(nblocks)).

I have an idea how this could be done with O(1) page reads. What I have in mind is a kind of Double Vitter algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks. I have to think harder though about what to do about those 400 pages that are not accessed when the sample size is 3000 ... 
The hard part is getting a genuinely random sample when we don't know N in advance. We do however know the table size in blocks, so if you're willing to make assumptions about constant tuple density you could do something different. (But the tuple density assumption is exactly the weak spot of what we've got, so I'm unconvinced that would be a big step forward.) Starting the scan at some random blocks should help against the common case of unusual distribution of dead tuples near the start of the relation. And I plan to factor information about dead tuple hits into an increasingly better estimation of dead/live tuple ratio. Servus Manfred ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
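Manfred's "Double Vitter" idea, bounding page reads by first sampling block numbers and then sampling rows from only those blocks, can be sketched in Python. This is a toy sketch under the constant-tuple-density assumption Tom flags as the weak spot; all names are invented, and it uses simple sampling without replacement rather than Vitter's reservoir algorithm:

```python
import random

def two_stage_sample(nblocks, rows_per_block, sample_size, rng=None):
    """Two-stage sampling sketch: pick sample_size block numbers first,
    then sample the target rows from just those blocks. Page reads are
    therefore bounded by sample_size, i.e. O(1) per sampled row, but the
    result is only uniform if every block holds rows_per_block live rows
    (the constant-density assumption)."""
    rng = rng or random.Random(0)               # seeded for reproducibility
    blocks = rng.sample(range(nblocks), min(sample_size, nblocks))
    # Pool of (block, row) slots in the chosen blocks only.
    pool = [(b, r) for b in blocks for r in range(rows_per_block)]
    return rng.sample(pool, sample_size)

rows = two_stage_sample(nblocks=30000, rows_per_block=100, sample_size=3000)
assert len(rows) == 3000
assert len({b for b, _ in rows}) <= 3000        # at most sample_size pages read
```

Compare the table above: for a 30K-page relation, one-stage row sampling reads about 8K pages, while the two-stage scheme touches at most 3000. The open question in the post, what to do about the roughly 400 chosen blocks that contribute no sampled row, is exactly the gap this sketch ignores.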