Re: [HACKERS] PL/Python warnings in CVS HEAD
On Tue, 2007-03-06 at 00:18 -0500, Tom Lane wrote: Sounds like #ifdef time to me --- but it seems a bit strange; wouldn't the Python guys have taken a bit more care for compatibility of user-supplied code? Yeah, I was a bit surprised as well. I won't claim to have any familiarity with the Python C API, though. Py_DEPRECATED() is simply defined as:

/* Py_DEPRECATED(version)
 * Declare a variable, type, or function deprecated.
 * Usage:
 *    extern int old_var Py_DEPRECATED(2.3);
 *    typedef int T1 Py_DEPRECATED(2.4);
 *    extern int x() Py_DEPRECATED(2.5);
 */
#if defined(__GNUC__) && ((__GNUC__ >= 4) || \
              ((__GNUC__ == 3) && (__GNUC_MINOR__ >= 1)))
#define Py_DEPRECATED(VERSION_UNUSED) __attribute__((__deprecated__))
#else
#define Py_DEPRECATED(VERSION_UNUSED)
#endif

Perhaps they provide a compatibility hack that you didn't spot? Quite possibly. Anyone have any suggestions? -Neil ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote: Simon Riggs wrote: But it would break the idea of letting a second seqscan follow in the first's hot cache trail, no? No, but it would make it somewhat harder to achieve without direct synchronization between scans. It could still work; let's see. I'm not sure that's an argument against fixing the problem with the buffer strategy though. We really want both, not just one or the other. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] user-defined tree methods in GIST
I need some help in building a new index structure using GiST. I have to include a procedure for restructuring the tree immediately after I have created the index using GiST. How do I do it? Actually I am trying to implement the Slim-tree index using GiST. To increase performance, a slim-down algorithm runs after creation of the Slim-tree. Where can I give a call to this slim-down procedure after the tree has been built using GiST? Regards Sharat.
Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces
Mike Rylander wrote: The patch adds support for default XML namespaces in xml2 by providing a mechanism for supplying a prefix to a named namespace URI. How does it support multiple namespaces in one document? -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces
On 3/6/07, Mike Rylander [EMAIL PROTECTED] wrote: Attached you'll find a patch that I've been kicking around for a while that I'd like to propose for inclusion in 8.3. I attempted to submit this through the original xml2 author (as far back as the 7.4 days) but got no response. It's really fairly trivial, but I will be using the features it provides in production soon, so I'd like to see it applied against the contrib xml2 module. The patch adds support for default XML namespaces in xml2 by providing a mechanism for supplying a prefix to a named namespace URI. It then wraps the namespace-capable functions in backward-compatible equivalents so that old code will not break. 1) And what about non-default namespaces? 2) What if my XPath query has a different prefix that should also be mapped to the same URI? (Not a frequent case, but this really can occur -- e.g. the XML doc has prefix 'local' for URI='http://127.0.0.1', but the XPath should have 'loc' for the same URI.) -- Best regards, Nikolay ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Aggressive freezing in lazy-vacuum
Tom Lane [EMAIL PROTECTED] wrote: This is a stand-alone patch for aggressive freezing. I'll propose to use OldestXmin instead of FreezeLimit as the freeze threshold in the circumstances below: I think it's a really bad idea to freeze that aggressively under any circumstances except being told to (ie, VACUUM FREEZE). When you freeze, you lose history information that might be needed later --- for forensic purposes if nothing else. I don't think we can supply such historical-database functionality here, because we can guarantee it only for INSERTed tuples, even if we pay attention. We've already enabled autovacuum by default, so we cannot predict when the next vacuum starts, and recently UPDATEd and DELETEd tuples are removed at random times. Furthermore, HOT will also accelerate removing expired tuples. Instead, we'd better use WAL or something like audit logs for keeping history information. You need to show a fairly amazing performance gain to justify that, and I don't think you can. Thank you for your advice. I found that aggressive freezing of already-dirty pages made things worse, but freezing pages that contain other tuples being frozen, or dead tuples, was useful. I did an acceleration test for XID-wraparound vacuum. I initialized the database with $ ./pgbench -i -s100 # VACUUM FREEZE accounts; # SET vacuum_freeze_min_age = 6; and repeated the following queries. CHECKPOINT; UPDATE accounts SET aid=aid WHERE random() < 0.005; SELECT count(*) FROM accounts WHERE xmin <> 2; VACUUM accounts; After the freeze threshold reached vacuum_freeze_min_age (run >= 3), VACUUM became faster with aggressive freezing. I think it came from piggybacking multiple freezing operations -- the number of unfrozen tuples was kept lower. 
* Durations of VACUUM [sec]

 run |  HEAD | freeze
-----+-------+--------
   1 |   5.8 |    8.2
   2 |   5.2 |    9.0
   3 | 118.2 |  102.0
   4 | 122.4 |   99.8
   5 | 121.0 |   79.8
   6 | 122.1 |   77.9
   7 | 123.8 |  115.5
-----+-------+--------
 avg |
 3-7 | 121.5 |   95.0

* Numbers of unfrozen tuples

 run |   HEAD | freeze
-----+--------+--------
   1 |  50081 |  50434
   2 |  99836 | 100072
   3 | 100047 |  86484
   4 | 100061 |  86524
   5 |  99766 |  87046
   6 |  99854 |  86824
   7 |  99502 |  86595
-----+--------+--------
 avg |
 3-7 |  99846 |  86695

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
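The averages in the posted tables can be sanity-checked with a few lines of Python; the ~22% figure at the end is my arithmetic on the posted durations, not a number claimed in the thread:

```python
# Durations of the XID-wraparound VACUUM runs posted above [sec].
head   = {1: 5.8, 2: 5.2, 3: 118.2, 4: 122.4, 5: 121.0, 6: 122.1, 7: 123.8}
freeze = {1: 8.2, 2: 9.0, 3: 102.0, 4: 99.8, 5: 79.8, 6: 77.9, 7: 115.5}

# The post averages runs 3-7, after the freeze threshold is reached.
runs = range(3, 8)
avg_head   = sum(head[r] for r in runs) / len(runs)
avg_freeze = sum(freeze[r] for r in runs) / len(runs)

print(round(avg_head, 1), round(avg_freeze, 1))   # 121.5 95.0, matching the post
print(round(1 - avg_freeze / avg_head, 3))        # 0.218: ~22% less VACUUM time
```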
Re: [HACKERS] PL/Python warnings in CVS HEAD
On 3/6/07, Neil Conway [EMAIL PROTECTED] wrote: On Tue, 2007-03-06 at 00:18 -0500, Tom Lane wrote: Perhaps they provide a compatibility hack that you didn't spot? Quite possibly. Anyone have any suggestions? Python guys discuss the situation here: http://www.python.org/dev/peps/pep-0353/ -- marko ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] A Survey on Defect Management Practices in Free/Open Source Software
Dear PostgreSQL Contributors, I seek help from designers, developers, testers, defect fixers, project managers, or anyone playing any other key role in Free/Open Source software development or maintenance, in carrying out a study on practices and problems of defect management in various Free/Open Source Software projects. The insights gained from the study can further help us to extract publicly accessible defect data and determine the impact of defect management practices on software quality. Please spend a few minutes of your precious time to fill up the Questionnaire. Most of the questions follow multiple-choice formats and are quite easy to answer. To have the Online Questionnaire, please visit: http://anu.puchd.ac.in/phpESP/public/survey.php?name=FOSS_Defect_Survey (You can also copy and paste this link into your browser, and hit the 'Return' key.) I hope you will find all the questions interesting and thought-provoking. Your answers will be kept anonymous. The data thus collected will only be used for research purposes. It would be nice if you could further refer this mail to others actively engaged with Free/Open Source Software development. If you have any queries or suggestions then feel free to contact. Thank You With regards, Anu Gupta Senior Lecturer Department of Computer Science and Applications, Panjab University, Chandigarh. INDIA In case of any problem in accessing/using the above mentioned link please contact: E-mail: [EMAIL PROTECTED] [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Aggressive freezing in lazy-vacuum
ITAGAKI Takahiro [EMAIL PROTECTED] writes: I don't think we can supply such a historical database functionality here, because we can guarantee it just only for INSERTed tuples even if we pay attention. We've already enabled autovacuum as default, so that we cannot predict when the next vacuum starts and recently UPDATEd and DELETEd tuples are removed at random times. Furthermore, HOT will also accelerate removing expired tuples. Instead, we'd better to use WAL or something like audit logs for keeping history information. Well, comparing the data to WAL is precisely the kind of debugging that I think Tom is concerned with. The hoped-for gain here is that vacuum finds fewer pages with tuples that exceed vacuum_freeze_min_age? That seems useful, though vacuum is still going to have to read every page, and I suspect most of the writes pertain to dead tuples, not freezing tuples. This strikes me as something that will be more useful once we have the DSM, especially if it ends up including a frozen map. Once we have the DSM vacuum will no longer be visiting every page, so it will be much easier for pages to get quite old and only be caught by a vacuum freeze. The less i/o that vacuum freeze has to do the better. If we get a freeze map then aggressive freezing would help keep pages out of that map so they never need to be vacuumed just to freeze the tuples in them. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] user-defined tree methods in GIST
On Tue, Mar 06, 2007 at 01:50:34PM +0530, sharath kumar wrote: I need some help in building new index structure using GIST. I have to include a procedure for restructuring the tree immediately after i have created the index using gist. How do i do it? Actually I am trying to implement the slim-tree index using gist. To increase the performance, a slim-down algorithm runs after creation of the slim-tree. Where can i give a call to this slim-down procedure after the tree has been built using gist? I don't quite understand what a slim-tree index is, but by my reading it's an algorithm to make picksplit work better. I don't think you need to touch the index machinery at all. Also, I don't think it's a good plan to try and manipulate the index wholesale after it's built. Once it's built, queries are going to want to use it, and any changes are going to need to be WAL-logged and made crash-safe, which seems quite a lot of work for something you can probably do during picksplit. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
[HACKERS] Auto creation of Partitions
Hi, This is to get feedback to meet the following TODO: - Simplify ability to create partitioned tables This would allow creation of partitioned tables without requiring creation of rules for INSERT/UPDATE/DELETE, and constraints for rapid partition selection. Options could include range and hash partition selection. There was some discussion on the pgsql mailing lists, which led to the above TODO: http://archives.postgresql.org/pgsql-hackers/2006-09/msg00189.php http://archives.postgresql.org/pgsql-hackers/2006-08/msg01874.php We can have the following syntax to support auto creation of partitions in PostgreSQL:

CREATE TABLE tabname (
    ...
    ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...),
 PARTITION partition_name CHECK(...) ...];

Here expr will be one of the column names as specified for the master table. Once we finalize the syntax, the above statement would end up carrying out the following activities (disclaimer: we might add or remove some activities based on the discussion here):

i) Create the master table.
ii) Create child tables based on the number of partitions specified, and make them inherit from the master table.
iii) Auto-generate rules (or triggers?) using the checks mentioned for the partitions, to handle INSERTs/DELETEs/UPDATEs and route them to the appropriate child. Note that checks specified directly on the master table will get inherited automatically.
iv) Based on the PRIMARY KEY, UNIQUE, and REFERENCES information specified, pass it on to the child tables.
v) If possible, add CHECK (false) to the master table to avoid any activity on it.

Some questions remain: 1) Whether we should use triggers or rules for step (iii) above. Maybe rules are the way to go. 2) What other attributes of the master (access permissions, e.g.), along with the ones specified in (iv), should be passed on to the children. 
3) Some implementation-specific issues, e.g. whether SPI_execute would be a good way of creating these rules. Comments appreciated, Regards, Nikhils EnterpriseDB http://www.enterprisedb.com
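As a sketch of what step (iii) would automate, this is the kind of inheritance-plus-rule boilerplate a user writes by hand today (table and partition names here are illustrative, and a real setup needs one rule or trigger per child):

```sql
-- Parent (master) table
CREATE TABLE measurement (city_id int, logdate date, peaktemp int);

-- One child per partition, inheriting from the parent, with a CHECK
-- constraint for constraint-exclusion-based partition selection
CREATE TABLE measurement_y2006 (
    CHECK (logdate >= DATE '2006-01-01' AND logdate < DATE '2007-01-01')
) INHERITS (measurement);

-- Hand-written rule routing INSERTs to the right child; the proposal
-- would generate these automatically from the PARTITION ... CHECK clauses.
CREATE RULE measurement_insert_y2006 AS
    ON INSERT TO measurement
    WHERE (NEW.logdate >= DATE '2006-01-01' AND NEW.logdate < DATE '2007-01-01')
    DO INSTEAD
    INSERT INTO measurement_y2006 VALUES (NEW.*);
```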
Re: [HACKERS] Latest plans for Utilities with HOT
On Mon, 2007-03-05 at 22:25 +0530, Pavan Deolasee wrote: Simon Riggs wrote: On Mon, 2007-03-05 at 21:39 +0530, Pavan Deolasee wrote: Currently each tuple is moved individually. You'd need to inspect the whole HOT chain on a page, calculate space for that and then try to move them all in one go. I was originally thinking that would be a problem, but it's not so bad - though it may cause us to end repair_frag() earlier than we otherwise would, depending upon how the game of Tetris plays out. Umm.. I still need to look deeper to understand the VACUUM FULL code, but ISTM that we can move tuple chains just the way it's done today, without bothering to keep HOT-update chains intact. The tuples may actually go into different pages and have an equal number of index entries. To my mind, this is not such a big problem because we shouldn't expect too many HOT-update chains while running VACUUM FULL. Isn't that true? Well, it's true enough to be a great argument. So what you're saying is: we do nothing and it just works. At least not too badly, and at the very least: no worse than it does today. [Oh dear! I had just finished writing a prototype of VACUUM FULL-with-reindex when I read this, so either way it looks like nothing more is needed on this utility. 1 down, 3 to go.] -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Updated propsoal for read-only queries on PITR slaves (SoC 2007)
Jim Nasby wrote: On Mar 1, 2007, at 8:45 AM, Florian G. Pflug wrote: Another possibility would be to move this setting into the recovery.conf. The problems with this approach is that the recovery.conf file is deleted after the information it contains is incorporated into pg_control. Thus, the readonly setting would need to be stored in pg_control too, making it impossible for the user to change it later (e.g, after interrupting and restarting WAL replay which is possible with 8.2) I think it would be best to very clearly divide setting up a cluster as a read-only slave from doing an actual recovery. One obvious way to do this would be to require that all read-only GUCs have to live in postgresql.conf and not recovery.conf. There's probably some other more elegant solutions as well. The main argument for putting this into recovery.conf is that it changes the behaviour only during recovery. Much like restore_command is part of the recovery.conf. But I agree that overall postgresql.conf seems saner. greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Auto creation of Partitions
NikhilS wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? 1) Whether we should use triggers/rules for step number (iii) above. Maybe rules is the way to go. Since this would basically be a case of the updatable rules problem, you should review those discussions in the past to check whether the issues mentioned there don't interfere with that plan. 2) What other attributes (access permissions e.g.) of the master along with the ones specified in (iv) should be passed on to the children. Moreover, how are later changes of those attributes propagated? -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
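Peter's point can be shown concretely: per-child unique indexes enforce uniqueness only within each child, so the same "primary key" value can appear in two partitions (table and index names below are illustrative):

```sql
CREATE TABLE parent (id int, val text);
CREATE TABLE child1 () INHERITS (parent);
CREATE TABLE child2 () INHERITS (parent);

-- The best that can be generated from "id PRIMARY KEY" on the parent today:
-- one unique index per child, since an index cannot span multiple tables.
CREATE UNIQUE INDEX child1_id_key ON child1 (id);
CREATE UNIQUE INDEX child2_id_key ON child2 (id);

INSERT INTO child1 VALUES (50, 'a');
INSERT INTO child2 VALUES (50, 'b');
-- Both inserts succeed: SELECT * FROM parent now shows id = 50 twice,
-- so the "primary key" is not globally unique across the partition set.
```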
Re: [HACKERS] GIST and TOAST
Teodor Sigaev [EMAIL PROTECTED] writes: A closer reading, however, shows that at least for cases like intarray, btree_gist, etc., the detoasting of an index value is being done in the gist decompress function, so the value seen via GISTENTRY in the other functions should already have been detoasted once. Right, any stored value from the index should be decompressed by the GiST decompress support method. The problem is that this is the only place in the code where we make wholesale assumptions that a datum that comes from a tuple (heap tuple or index tuple) isn't toasted. There are other places but they're always flagged with big comments explaining *why* the datum can't be toasted, and they're minor localized instances, not a whole subsystem. This was one of the assumptions that the packed varlena code depended on: that anyone looking at a datum from a tuple would always detoast it, even if they had formed the tuple themselves and never passed it through the toaster. The *only* place this has come up as a problem is in GIST. I would say we could just exempt all the GIST data types from packed varlenas, except that doesn't even solve the problem completely. There's at least one place, _int_gist.c, where the entry type is just a plain array. So unless I exempt *all* arrays, the arrays it gets out of the index tuples it forms will be packed and need to be detoasted. So I'm leaning towards going through all of the GIST modules and replacing all the (Type*)DatumGetPointer formulations with actual DatumGetType calls, and all the (Type*)PG_GETARG_POINTER() formulations with PG_GETARG_TYPE(). And having those macros call PG_DETOAST_DATUM(). How would you feel about that? There are two downsides I see: It's an extra check against the toast flag bits, which is extra cpu. But this is how the rest of the Postgres source works and we don't think the extra cpu cost is significant. There may be places that assume they won't leak detoasted copies of datums. 
If you could help point those places out they should just need PG_FREE_IF_COPY() calls, or in some cases a pg_detoast_datum_copy() call earlier in the correct memory context. This again is how the rest of the postgres source handles this. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote: NikhilS wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? We will not (I know it's a hard thing to do :) ); the intention is to use this information from the parent and make it a property of the child table. This will avoid the user having to manually specify CREATE INDEX and the like on all the child tables one-by-one. 1) Whether we should use triggers/rules for step number (iii) above. Maybe rules is the way to go. Since this would basically be a case of the updatable rules problem, you should review those discussions in the past to check whether the issues mentioned there don't interfere with that plan. The rules mentioned here will specify that all the inserts/updates/deletes should go into the proper child tables instead of the parent. I do not see the updatable rules problem with regards to this, but will check out the archives for discussion on this related to partitioning. 2) What other attributes (access permissions e.g.) of the master along with the ones specified in (iv) should be passed on to the children. Moreover, how are later changes of those attributes propagated? Once created, this will be a normal inheritance relationship between the tables, and all the existing commands will apply to both the parent and the child. The basic idea here is to automate as many things as possible at partition creation time. The user is free to make additional changes to the involved tables later too. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
NikhilS [EMAIL PROTECTED] writes: the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. Missed the start of this thread. A while back I had intended to add WITH INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for adding to the inheritance structure. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Auto creation of Partitions
On 3/6/07, NikhilS [EMAIL PROTECTED] wrote: Hi, On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. Missed the start of this thread. A while back I had intended to add WITH INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for adding to the inheritance structure. Yeah, this one aims to do pretty much the above as part of the auto creation of the inheritance-based partitions. And to add, maybe if there is consensus/demand for the WITH INDEXES idea mentioned above too, I could work on it as well. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote: NikhilS [EMAIL PROTECTED] writes: the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. Missed the start of this thread. A while back I had intended to add WITH INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for adding to the inheritance structure. Yeah, this one aims to do pretty much the above as part of the auto creation of the inheritance-based partitions. Regards, Nikhils -- Gregory Stark EnterpriseDB http://www.enterprisedb.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] GIST and TOAST
On 2007-03-06, Gregory Stark [EMAIL PROTECTED] wrote: Teodor Sigaev [EMAIL PROTECTED] writes: A closer reading, however, shows that at least for cases like intarray, btree_gist, etc., the detoasting of an index value is being done in the gist decompress function, so the value seen via GISTENTRY in the other functions should already have been detoasted once. Right, any stored value form index should be decompressed by GiST decompress support method. The problem is that this is the only place in the code where we make wholesale assumptions that a datum that comes from a tuple (heap tuple or index tuple) isn't toasted. The places in the intarray code that you tried to fix in your patch at the start of this thread are not dealing with data that came from a tuple, but from data that came from a decompress method. It's expected that the decompress method does the detoasting. So I think you've mis-analyzed the problem. That's especially true since you are claiming that the existing code is already buggy when in fact no such bugs have been reported (and clearly intarray has been running with toasted array values for years). -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Aggressive freezing in lazy-vacuum
ITAGAKI Takahiro [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote: I think it's a really bad idea to freeze that aggressively under any circumstances except being told to (ie, VACUUM FREEZE). When you freeze, you lose history information that might be needed later --- for forensic purposes if nothing else. I don't think we can supply such a historical database functionality here, because we can guarantee it just only for INSERTed tuples even if we pay attention. We've already enabled autovacuum as default, so that we cannot predict when the next vacuum starts and recently UPDATEd and DELETEd tuples are removed at random times. I said nothing about expired tuples. The point of not freezing is to preserve information about the insertion time of live tuples. And your test case is unconvincing, because no sane DBA would run with such a small value of vacuum_freeze_min_age. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] GIST and TOAST
The problem is that this is the only place in the code where we make wholesale assumptions that a datum that comes from a tuple (heap tuple or index tuple) isn't toasted. There are other places but they're always flagged with big comments explaining *why* the datum can't be toasted and they're minor localized instances, not a whole subsystem. This was one of the assumptions that the packed varlena code depended on: that anyone looking at a datum from a tuple would always detoast it even if they had formed the tuple themselves and never passed it through the toaster. The *only* place this has come up as a problem is in GIST. I'm afraid that we have some lack of understanding. The flow of operation with an indexed tuple in GiST is:
- read the tuple
- get the n-th attribute with the help of index_getattr
- call the user-defined decompress method, which should, at least, detoast the value
- the resulting value is passed to the other user-defined methods
Any new value produced by a user-defined method of GiST should be compressed by the user-defined compress method before packing into a tuple. The compress method should not toast the value - that is not its task. New values are always modified by the compress method before insertion. See gistinsert:gist.c and gistFormTuple:gistutil.c. So, index_form_tuple should toast the value, but the value is already compressed and lives in memory. Detoasting should be done by the decompress method, with the result living in memory, and only after that can the value be passed to the other user-defined methods. As I understand it, packing/unpacking of the varlena header is done during toasting/detoasting. So I don't understand the problem here. What is more, the GiST API doesn't limit the type of keys passed between user-defined GiST methods. It just says that a new value should be of the type on which the opclass was defined, and the output of the compress method should be of the type pointed to by the STORAGE option in CREATE OPERATOR CLASS. There may be places that assume they won't leak detoasted copies of datums. 
If you could help point those places out they should just need PG_FREE_IF_COPY() GiST code works in a separate memory context to prevent memory leaks. See gistinsert/gistbuildCallback/gistfindnext. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] user-defined tree methods in GIST
I need some help in building new index structure using GIST. I have to include a procedure for restructuring the tree immediately after i have created the index using gist. How do i do it? IMHO, you don't have a chance to do it without patching postgres or developing your own low-level function. Try to play with SP-GiST (http://www.cs.purdue.edu/spgist/). SP-GiST is a modification of GiST for Space Partitioning Trees. But their patch will not work with 8.2 and up because of the lack of concurrency support; 8.2 doesn't support indexes without concurrency. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Mar 5, 2007, at 8:34 PM, Greg Smith wrote: On Thu, 22 Feb 2007, Jim C. Nasby wrote: It would also be extremely useful to make checkpoint stats visible somewhere in the database (presumably via the existing stats mechanism)... I'm thinking just tracking how many pages had to be flushed during a checkpoint would be a good start. I'm in the middle of testing an updated version of the patch; once I nail down exactly what needs to be logged I'd planned to open a discussion on which of those things would be best served by pg_stats instead of a log. I decided specifically to aim for the logs for the checkpoint data because if you're in a situation where you are inserting fast enough that the checkpoints are spaced closely together, you'd end up having to poll pg_stats all the time to make sure you catch them all, which becomes even less efficient than just logging the data. It's always good to be able to log stuff for detailed troubleshooting, so that's a good place to start. The flipside is that it's much easier to machine-parse a table than to scrape the logs. And I don't think we'll generally care about each individual checkpoint; rather we'll want to look at things like 'checkpoints/hour' and 'checkpoint written pages/hour'. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Calculated view fields (8.1 != 8.2)
Hi all, I'm observing this behavior in 8.2:

kalman=# create table test ( a integer, b integer );
CREATE TABLE
kalman=# CREATE OR REPLACE FUNCTION sp_test ( INTEGER )
kalman-# RETURNS INTEGER AS'
kalman'# DECLARE
kalman'#    a_id_contract ALIAS FOR $1;
kalman'# BEGIN
kalman'#    RAISE NOTICE ''here'';
kalman'#    return 3;
kalman'# END;
kalman'# ' LANGUAGE 'plpgsql';
CREATE FUNCTION
kalman=# CREATE OR REPLACE VIEW v_test AS
kalman-# SELECT
kalman-#    a AS a,
kalman-#    sp_test(b) AS b
kalman-# FROM
kalman-#    test c
kalman-# ;
CREATE VIEW
kalman=# insert into test values (2,3);
INSERT 0 1
kalman=# select * from v_test;
NOTICE: here
 a | b
---+---
 2 | 3
(1 row)

kalman=# select a from v_test;
NOTICE: here
 a
---
 2
(1 row)

In version 8.1 the function sp_test is not evaluated in the case of "select a from v_test". If I mark the function as STABLE or IMMUTABLE then even with version 8.2 the function is not evaluated. Is this the intended behavior? I didn't see anything about it in the release notes. Regards Gaetano Mendola ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
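The workaround Gaetano mentions, written out as one statement (note this is only a sketch of the volatility marking: STABLE promises the function has no side effects, which the RAISE NOTICE used here for tracing technically violates):

```sql
CREATE OR REPLACE FUNCTION sp_test ( INTEGER )
RETURNS INTEGER AS'
DECLARE
    a_id_contract ALIAS FOR $1;
BEGIN
    RAISE NOTICE ''here'';
    return 3;
END;
' LANGUAGE 'plpgsql' STABLE;  -- STABLE lets the planner skip the call
                              -- when v_test.b is not referenced
```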
Re: [HACKERS] Trivial HugeTLB Benchmark
On Sun, 2007-03-04 at 10:14 -0800, Tom Lane wrote:
> If you did this only once, the results are not really trustworthy; you need to average several similar runs before you can have much confidence. pgbench's inter-run variation is usually upwards of 10%, so trying to draw conclusions about half-percentage-point differences without averaging is a waste of time.

Good point, thanks.

> Also, if scaling factor <= number of clients then what you're mostly measuring is update-contention behavior. Try it with -s 10 and -c 5; and don't forget to reinitialize the database for each run of tests to be sure it's fair.

I did another 18 runs, 9 each for huge pages and normal shared memory. The database was reinitialized before every third run with pgbench -i -s 10. The runs themselves were done with pgbench -s 10 -c 5 -t 1

Normal shared memory:
tps = 1835.929043 (including connections establishing)
tps = 1697.455165 (including connections establishing)
tps = 1378.393001 (including connections establishing)
tps = 1834.802729 (including connections establishing)
tps = 1630.100895 (including connections establishing)
tps = 1415.504943 (including connections establishing)
tps = 1864.908838 (including connections establishing)
tps = 1726.295622 (including connections establishing)
tps = 1323.679649 (including connections establishing)
Average: 1634.19 tps

Huge pages:
tps = 1867.400381 (including connections establishing)
tps = 1715.269338 (including connections establishing)
tps = 1259.314139 (including connections establishing)
tps = 1880.803336 (including connections establishing)
tps = 1885.351404 (including connections establishing)
tps = 1603.302855 (including connections establishing)
tps = 1884.888431 (including connections establishing)
tps = 1563.452093 (including connections establishing)
tps = 1361.896887 (including connections establishing)
Average: 1669.08

That works out to approximately a 2.1% performance boost for huge pages. 
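For the record, the averages and the relative boost can be recomputed straight from the per-run tps figures quoted above; a quick sketch:

```python
# Recompute the averages from the per-run pgbench tps figures listed above.
normal = [1835.929043, 1697.455165, 1378.393001, 1834.802729, 1630.100895,
          1415.504943, 1864.908838, 1726.295622, 1323.679649]
huge = [1867.400381, 1715.269338, 1259.314139, 1880.803336, 1885.351404,
        1603.302855, 1884.888431, 1563.452093, 1361.896887]

avg_normal = sum(normal) / len(normal)
avg_huge = sum(huge) / len(huge)
boost_pct = (avg_huge / avg_normal - 1) * 100

print(f"normal: {avg_normal:.2f} tps, huge pages: {avg_huge:.2f} tps, "
      f"boost: {boost_pct:.1f}%")
```

Note the large run-to-run spread (roughly 1260 to 1885 tps) relative to the mean difference, which is exactly why averaging several runs was needed here.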
It still doesn't seem very compelling but I could try to put together a patch for inclusion if people were interested in such a thing. -Ryan This electronic mail transmission and any accompanying attachments contain confidential information intended only for the use of the individual or entity named above. Any dissemination, distribution, copying or action taken in reliance on the contents of this communication by anyone other than the intended recipient is strictly prohibited. If you have received this communication in error please immediately delete the e-mail and either notify the sender at the above e-mail address or by telephone at 250.386.5323. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] ERROR: operator does not exist: integer !=- integer
I get it. scan.l converts '!=' to '<>':

644     /* Convert "!=" operator to "<>" for compatibility */
645     if (strcmp(yytext, "!=") == 0)
646             yylval.str = pstrdup("<>");
647     else
648             yylval.str = pstrdup(yytext);

Joshua D. Drake [EMAIL PROTECTED] wrote: Well yes it will work if you add a space, but technically the problem is the query should be written like this: 1 != -1 or 1 <> -1

Joshua D. Drake

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
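As a toy illustration of the rule quoted above (plain Python, not the flex code): the lexer greedily consumes operator characters, so '1!=-1' yields the nonexistent operator '!=-', while with a space the '!=' token is seen and rewritten to '<>'.

```python
def normalize_op(op: str) -> str:
    """Mimic scan.l's compatibility rule: '!=' is rewritten to '<>'."""
    return "<>" if op == "!=" else op

print(normalize_op("!="))   # -> <>  (rewritten for compatibility)
print(normalize_op("<>"))   # -> <>  (already the canonical form)
print(normalize_op("!=-"))  # -> !=- (greedy lexing swallowed the minus;
                            #         no such operator exists, hence the error)
```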
Re: [HACKERS] NTA access on Solaris
With copyout_max_cached being 8K:
                             ^^^ 4K

Working set  16K  32K  64K  128K  256K  512K  1M   2M   128M
Seconds      4.8  4.8  4.9  4.9   5.0   5.0   5.0  5.0  5.1

Sherry

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] GIST and TOAST
Andrew - Supernews [EMAIL PROTECTED] writes: The places in the intarray code that you tried to fix in your patch at the start of this thread are not dealing with data that came from a tuple, but from data that came from a decompress method. It's expected that the decompress method does the detoasting. So I think you've mis-analyzed the problem. That's especially true since you are claiming that the existing code is already buggy when in fact no such bugs have been reported (and clearly intarray has been running with toasted array values for years). I'm not claiming, I'm asking, because I can't tell. And it's not clear _int_gist.c has been running with toasted array values for years because it's limited to arrays of 100 integers (or perhaps 200 integers, there's a factor of 2 in the test). That's not enough to trigger toasting unless there are other large columns in the same table. I do know that with packed varlenas I get a crash in g_int_union among other places. I can't tell where the datum came from originally and how it ended up stored in packed format. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] NTA access on Solaris
On a 1P system system with 512K L2, it is more obvious why we shouldn't bypass L2 for small reads: The same readtest as my previous mail invoked as following: ./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100 With copyout_max_cached being 128K: Working set 16K 32K 64K 128K256K512K1M 2M 128M Seconds 4.2 4.0 4.1 4.1 5.7 7.0 7.1 7.0 7.5 With copyout_max_cached being 8K: Working set 16K 32K 64K 128K256K512K1M 2M 128M Seconds 4.8 4.8 4.9 4.9 5.0 5.0 5.0 5.0 5.1 Sherry On Mon, Mar 05, 2007 at 09:41:14PM -0800, Sherry Moore wrote: - Forwarded message from Sherry Moore [EMAIL PROTECTED] - Date: Mon, 5 Mar 2007 21:34:19 -0800 From: Sherry Moore [EMAIL PROTECTED] To: Tom Lane [EMAIL PROTECTED] Cc: Luke Lonergan [EMAIL PROTECTED], Mark Kirkwood [EMAIL PROTECTED], Pavan Deolasee [EMAIL PROTECTED], Gavin Sherry [EMAIL PROTECTED], PGSQL Hackers pgsql-hackers@postgresql.org, Doug Rady [EMAIL PROTECTED], Sherry Moore [EMAIL PROTECTED] Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant Hi Tom, Sorry about the delay. I have been away from computers all day. In the current Solaris release in development (Code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA) which bypasses L2 for most writes, and reads larger than copyout_max_cached (patchable, default to 128K). The block size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB to trigger NTA for reads, the access time with 16KB buffer or 128MB buffer are very close. I wrote readtest to simulate the access pattern of VACUUM (attached). tread is a 4-socket dual-core Opteron box. 
81 tread ./readtest -h Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count] -v: Verbose mode -N: Normalize results by number of reads -s size: Working set size (may specify K,M,G suffix) -n iter:Number of test iterations -f filename:Name of the file to read from -d [+|-]delta: Distance between subsequent reads -c count: Number of reads -h: Print this help With copyout_max_cached at 128K (in nanoseconds, NTA not triggered): 82 tread ./readtest -s 16k -f boot_archive 46445262 83 tread ./readtest -s 128M -f boot_archive 118294230 84 tread ./readtest -s 16k -f boot_archive -n 100 4230210856 85 tread ./readtest -s 128M -f boot_archive -n 100 6343619546 With copyout_max_cached at 4K (in nanoseconds, NTA triggered): 89 tread ./readtest -s 16k -f boot_archive 43606882 90 tread ./readtest -s 128M -f boot_archive 100547909 91 tread ./readtest -s 16k -f boot_archive -n 100 4251823995 92 tread ./readtest -s 128M -f boot_archive -n 100 4205491984 When the iteration is 1 (the default), the timing difference between using 16k buffer and 128M buffer is much bigger for both copyout_max_cached sizes, mostly due to the cost of TLB misses. When the iteration count is bigger, most of the page tables would be in Page Descriptor Cache for the later page accesses so the overhead of TLB misses become smaller. As you can see, when we do bypass L2, the performance with either buffer size is comparable. I am sure your next question is why the 128K limitation for reads. Here are the main reasons: - Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations are typically accessed again shortly after the read, while writes are usually not. Therefore, the default operation mode is to bypass L2 for writes, but not for reads. - The Opteron's L1 cache size is 64K. If reads are larger than 128KB, it would have displacement flushed itself anyway, so for large reads, I will also bypass L2. 
I am working on dynamically setting copyout_max_cached based on the L1 D-cache size on the system. The above heuristic should have worked well in Luke's test case. However, due to the fact that the reads was done as 16,000 8K reads rather than one 128MB read, the NTA code was not triggered. Since the OS code has to be general enough to handle with most workloads, we have to pick some defaults that might not work best for some specific operations. It is a calculated balance. Thanks, Sherry On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it). Cool. Maybe Sherry can comment on the question whether it's
[HACKERS] NTA access on Solaris
- Forwarded message from Sherry Moore [EMAIL PROTECTED] - Date: Mon, 5 Mar 2007 21:34:19 -0800 From: Sherry Moore [EMAIL PROTECTED] To: Tom Lane [EMAIL PROTECTED] Cc: Luke Lonergan [EMAIL PROTECTED], Mark Kirkwood [EMAIL PROTECTED], Pavan Deolasee [EMAIL PROTECTED], Gavin Sherry [EMAIL PROTECTED], PGSQL Hackers pgsql-hackers@postgresql.org, Doug Rady [EMAIL PROTECTED], Sherry Moore [EMAIL PROTECTED] Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant Hi Tom, Sorry about the delay. I have been away from computers all day. In the current Solaris release in development (Code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA) which bypasses L2 for most writes, and reads larger than copyout_max_cached (patchable, default to 128K). The block size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB to trigger NTA for reads, the access time with 16KB buffer or 128MB buffer are very close. I wrote readtest to simulate the access pattern of VACUUM (attached). tread is a 4-socket dual-core Opteron box. 
81 tread ./readtest -h Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count] -v: Verbose mode -N: Normalize results by number of reads -s size: Working set size (may specify K,M,G suffix) -n iter:Number of test iterations -f filename:Name of the file to read from -d [+|-]delta: Distance between subsequent reads -c count: Number of reads -h: Print this help With copyout_max_cached at 128K (in nanoseconds, NTA not triggered): 82 tread ./readtest -s 16k -f boot_archive 46445262 83 tread ./readtest -s 128M -f boot_archive 118294230 84 tread ./readtest -s 16k -f boot_archive -n 100 4230210856 85 tread ./readtest -s 128M -f boot_archive -n 100 6343619546 With copyout_max_cached at 4K (in nanoseconds, NTA triggered): 89 tread ./readtest -s 16k -f boot_archive 43606882 90 tread ./readtest -s 128M -f boot_archive 100547909 91 tread ./readtest -s 16k -f boot_archive -n 100 4251823995 92 tread ./readtest -s 128M -f boot_archive -n 100 4205491984 When the iteration is 1 (the default), the timing difference between using 16k buffer and 128M buffer is much bigger for both copyout_max_cached sizes, mostly due to the cost of TLB misses. When the iteration count is bigger, most of the page tables would be in Page Descriptor Cache for the later page accesses so the overhead of TLB misses become smaller. As you can see, when we do bypass L2, the performance with either buffer size is comparable. I am sure your next question is why the 128K limitation for reads. Here are the main reasons: - Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations are typically accessed again shortly after the read, while writes are usually not. Therefore, the default operation mode is to bypass L2 for writes, but not for reads. - The Opteron's L1 cache size is 64K. If reads are larger than 128KB, it would have displacement flushed itself anyway, so for large reads, I will also bypass L2. 
I am working on dynamically setting copyout_max_cached based on the L1 D-cache size on the system. The above heuristic should have worked well in Luke's test case. However, due to the fact that the reads was done as 16,000 8K reads rather than one 128MB read, the NTA code was not triggered. Since the OS code has to be general enough to handle with most workloads, we have to pick some defaults that might not work best for some specific operations. It is a calculated balance. Thanks, Sherry On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it). Cool. Maybe Sherry can comment on the question whether it's possible for a large-scale-memcpy to not take a hit on filling a cache line that wasn't previously in cache? I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler which is something I've never studied :-(. regards, tom lane -- Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym #include stdlib.h #include stdio.h #include ctype.h #include unistd.h #include fcntl.h #include sys/param.h #include sys/time.h #include sys/mman.h #include errno.h #include thread.h #include signal.h #include strings.h #include libgen.h #define KB(a) (a*1024) #define MB(a) (KB(a)*1024) static void usage(char *s) { fprintf(stderr, Usage: %s [-v] [-N] -s size -n iter [-d delta] [-c count]\n, s); fprintf(stderr, \t-v:\t\tVerbose mode\n \t-N:\t\tNormalize results by number of reads\n
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom, Sorry about the delay. I have been away from computers all day. In the current Solaris release in development (Code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA) which bypasses L2 for most writes, and reads larger than copyout_max_cached (patchable, default to 128K). The block size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB to trigger NTA for reads, the access time with 16KB buffer or 128MB buffer are very close. I wrote readtest to simulate the access pattern of VACUUM (attached). tread is a 4-socket dual-core Opteron box. 81 tread ./readtest -h Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count] -v: Verbose mode -N: Normalize results by number of reads -s size: Working set size (may specify K,M,G suffix) -n iter:Number of test iterations -f filename:Name of the file to read from -d [+|-]delta: Distance between subsequent reads -c count: Number of reads -h: Print this help With copyout_max_cached at 128K (in nanoseconds, NTA not triggered): 82 tread ./readtest -s 16k -f boot_archive 46445262 83 tread ./readtest -s 128M -f boot_archive 118294230 84 tread ./readtest -s 16k -f boot_archive -n 100 4230210856 85 tread ./readtest -s 128M -f boot_archive -n 100 6343619546 With copyout_max_cached at 4K (in nanoseconds, NTA triggered): 89 tread ./readtest -s 16k -f boot_archive 43606882 90 tread ./readtest -s 128M -f boot_archive 100547909 91 tread ./readtest -s 16k -f boot_archive -n 100 4251823995 92 tread ./readtest -s 128M -f boot_archive -n 100 4205491984 When the iteration is 1 (the default), the timing difference between using 16k buffer and 128M buffer is much bigger for both copyout_max_cached sizes, mostly due to the cost of TLB misses. When the iteration count is bigger, most of the page tables would be in Page Descriptor Cache for the later page accesses so the overhead of TLB misses become smaller. 
As you can see, when we do bypass L2, the performance with either buffer size is comparable. I am sure your next question is why the 128K limitation for reads. Here are the main reasons: - Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations are typically accessed again shortly after the read, while writes are usually not. Therefore, the default operation mode is to bypass L2 for writes, but not for reads. - The Opteron's L1 cache size is 64K. If reads are larger than 128KB, it would have displacement flushed itself anyway, so for large reads, I will also bypass L2. I am working on dynamically setting copyout_max_cached based on the L1 D-cache size on the system. The above heuristic should have worked well in Luke's test case. However, due to the fact that the reads was done as 16,000 8K reads rather than one 128MB read, the NTA code was not triggered. Since the OS code has to be general enough to handle with most workloads, we have to pick some defaults that might not work best for some specific operations. It is a calculated balance. Thanks, Sherry On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it). Cool. Maybe Sherry can comment on the question whether it's possible for a large-scale-memcpy to not take a hit on filling a cache line that wasn't previously in cache? I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler which is something I've never studied :-(. 
regards, tom lane -- Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym #include stdlib.h #include stdio.h #include ctype.h #include unistd.h #include fcntl.h #include sys/param.h #include sys/time.h #include sys/mman.h #include errno.h #include thread.h #include signal.h #include strings.h #include libgen.h #define KB(a) (a*1024) #define MB(a) (KB(a)*1024) static void usage(char *s) { fprintf(stderr, Usage: %s [-v] [-N] -s size -n iter [-d delta] [-c count]\n, s); fprintf(stderr, \t-v:\t\tVerbose mode\n \t-N:\t\tNormalize results by number of reads\n \t-s size:\tWorking set size (may specify K,M,G suffix)\n \t-n iter:\tNumber of test iterations\n \t-f filename:\tName of the file to read from\n \t-d [+|-]delta:\tDistance between subsequent reads\n \t-c count:\tNumber of reads\n \t-h:\t\tPrint this help\n ); exit(1); } #define ABS(x) ((x) = 0 ? (x) : -(x)) static void format_num(size_t v, size_t *new, char *code) { if (v % (1024 * 1024 * 1024) == 0) { *new = v / (1024 * 1024 *
Re: [HACKERS] Auto creation of Partitions
NikhilS wrote: Hi, On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote: NikhilS wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? We will not (I know its a hard thing to do :) ), the intention is to use this information from the parent and make it a property of the child table. This will avoid the step for the user having to manually specify CREATE INDEX and the likes on all the children tables one-by-one. I think a way can be devised to maintain the primary key and unique constraints. If a search is done on the parent table, the planner knows to rewrite the query as a union (or union all) of all child tables that relate to the where clause, or all child tables if the where clause is not on the column/s used to partition, then this concept should be able to be converted to indexes as well, so that when a primary or unique index from a child table is inserted to, then each of the related child indexes is consulted to ensure uniqueness. This would only apply if the partitioning was not done by the primary or unique column. 1) Whether we should use triggers/rules for step number (iii) above. Maybe rules is the way to go. Since this would basically be a case of the updatable rules problem, you should review those discussions in the past to check whether the issues mentioned there don't interfere with that plan. The rules mentioned here will be to specify that all the inserts/updates/deletes should go into proper children tables instead of the parent. I do not see the updateable rules problem with regards to this, but will check out the archives for discussion on this related to partitioning. I would think that a trigger would be a better choice as I see the need (or at least the possibility) for more than just a rewrite. 
When a value is inserted that is outside of a condition currently covered by an existing child table, a new child will need to be spawned to contain the new data. So say we partition by year and month of a date column? As new dates are added, new child tables would be created each month. Or is this beyond the current plan and left to manual creation? Will ALTER TABLE be extended to handle partitions? This will allow partitioning existing tables (maybe without blocking access?) and allow things like ALTER TABLE mytable ADD PARTITION (mydate within 200703) and ALTER TABLE mytable DROP PARTITION (mydate within 199912) or would dropping be covered by DELETE FROM mytable where mydate <= 199912 ? Could such a syntax be devised for date columns? (month of mydate) or similar to auto create partitions based on the year and month of a date column? Or will we just do CHECK(mydatecol >= 1/3/07 and mydatecol <= 31/3/07) for each month of data? Also (day of mydatecol) to partition based on year and day of year. Another syntax possibility - range(myserialcol of 50) where new child tables are created every 50 rows? Maybe I'm looking at auto-maintenance which is beyond any current planning? -- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
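The spawn-on-demand routing being discussed can be sketched in a few lines. This is purely an illustration: the child-table naming scheme (mytable_y2007m03) and the "create the child if it doesn't exist" behavior are hypothetical stand-ins for what a routing trigger would do, not proposed syntax or an actual implementation.

```python
from datetime import date

def child_table(parent: str, d: date) -> str:
    """Hypothetical naming scheme: one child per year+month,
    e.g. mytable_y2007m03 for March 2007."""
    return f"{parent}_y{d.year:04d}m{d.month:02d}"

def route_insert(parent: str, d: date, existing: set) -> tuple:
    """Return (table, created): spawn the child if no existing child
    covers this date, as a trigger-based approach would have to."""
    child = child_table(parent, d)
    created = child not in existing
    if created:
        existing.add(child)  # stand-in for CREATE TABLE ... INHERITS (parent)
    return child, created

tables = set()
print(route_insert("mytable", date(2007, 3, 15), tables))  # spawns the child
print(route_insert("mytable", date(2007, 3, 20), tables))  # reuses it
```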
Re: [HACKERS] GIST and TOAST
Teodor Sigaev [EMAIL PROTECTED] writes: I'm afraid that we have some lack of understanding. Flow of operation with indexed tuple in gist is: - read tuple - get n-th attribute with a help of index_getattr - call user-defined decompress method which should, at least, detoast value - result value is passed to other user-defined method So when does index_form_tuple get called? So, index_form_tuple should toast value, but value is already compressed and live in memory. Detoasting of value should be done by decompress method and live in memory, and so, only after that value can be passed to other user-defined method. Does every data type define a compress/decompress method? Even if it's not a data type that normally gets very large? As I understand, packing/unpacking varlena header is doing during toasting/detoastiong. So, I'm not understand the problem here. Well we cheated a bit and had heap/index_form_tuple convert the data to packed format. This saves having to push small tuples through the toaster. So now tuples can magically become toasted as soon as they go into a tuple even if they never get pushed through the toaster. There may be places that assume they won't leak detoasted copies of datums. If you could help point those places out they should just need PG_FREE_IF_COPY() GiST code works in separate memory context to prevent memory leaks. See gistinsert/gistbuildCallback/gistfindnext. So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of using DatumGetPointer and PG_GETARG_POINTER and having to manually cast everywhere, no? It seems like there's a lot of extra pain to maintain the code in the present style with all the manual casts. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
On Mon, Mar 05, 2007 at 09:04:46PM -0600, Jim Nasby wrote: FWIW, this sounds like a subset of the Query By Example stuff that someone is working on. I don't have a URL handy since I'm on a plane, but I think google can find it. It's now called ObelisQ http://pgfoundry.org/projects/qbe Cheers, D On Mar 3, 2007, at 8:12 AM, ranbeer makin wrote: Here is a description of what the SKYLINE operator is: --- Suppose you wish to purchase books and you are looking for books with high rating and low price. However, both the criteria of selecting books are complementary since books of higher rating are generally more expensive. For finding such books, you'll query the database system of the book store which will return a set of interesting books. The word 'interesting' implies all the books which are as good or better in both the dimensions (rating and price) and better in at least one dimension. This set of interesting points forms the Skyline. Skyline operator finds points which are not dominated by other data points. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. For specifying the Skyline queries, we extend SQL SELECT statement by an optional SKYLINE OF clause as given below: SELECT ... FROM ... WHERE... GROUP BY ... HAVING... SKYLINE OF [DISTINCT] d1 [MIN | MAX | DIFF], .., dm [MIN | MAX | DIFF] ORDER BY... Where, d1, d2 , , dm denote the dimensions of the Skyline, and MIN, MAX, DIFF specify whether the value in that dimension should be minimized, maximized, or simply be different. When DIFF is specified, two tuples are compared only if the value of the attribute on which DIFF is applied is different. When DISTINCT clause is specified and if there are two or more tuples with the same values of skyline attributes, then only one of them is retained in the skyline set. Otherwise, all of them are retained. Let's consider the above example of purchasing books with high rating and low price. 
Book Name          Rating (out of 5)  Price (Rs)
Prodigal Daughter  3                  250
The city of Joy    5                  400
Vanishing Acts     2                  250
The Notebook       4                  300
Fountain Head      5                  350
Dear John          5                  500

Table 1. Sample of book database

Now, in order to get books with high rating and low price, you simply can issue the following query:

SELECT * FROM Books SKYLINE OF rating MAX, price MIN;

The Skyline set returned will be:

Book Name          Rating (out of 5)  Price (Rs)
Prodigal Daughter  3                  250
The Notebook       4                  300
Fountain Head      5                  350

Table 2. Skyline set

From this set, you can now make your choice of books, by weighing your personal preferences for price and rating of the books. For more information, you can refer to: S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421-430, 2001.

---

Thanks.

On 3/3/07, Martijn van Oosterhout kleptog@svana.org wrote: On Sat, Mar 03, 2007 at 07:02:41PM +0530, ranbeer makin wrote: We at International Institute of Information Technology (IIIT) Hyderabad, India, have extended the Postgres database system with the skyline operation. For this work, we were guided by our Prof. Kamalakar Karlapalem (http://www.iiit.ac.in/~kamal/). <snip> Can this piece of work contribute to PostgreSQL? If yes, then we'll send out a detailed report of this project including changes made, issues involved/need to be solved, limitations, future work, and the source code etc. Well, that kind of depends. I have no idea what Skyline means, so telling us what it is would be a good start. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. 
-BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFF6XrkIB7bNG8LQkwRAqw8AJ0UKAy41OMxdgLUdY1G+e7R6/jGPwCZAQY4 9uCKFUW65UBIx7fpogR75Yo= =6Yc0 -END PGP SIGNATURE- -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! Consider donating to PostgreSQL: http://www.postgresql.org/about/donate ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if
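The dominance definition in the description above maps directly onto code. A small Python sketch of the naive O(n^2) skyline over Table 1 (just the definition applied literally, not the optimized algorithm from the ICDE paper):

```python
def dominates(a, b):
    """a dominates b when a is at least as good in both dimensions
    (rating MAX, price MIN) and strictly better in at least one."""
    return (a[1] >= b[1] and a[2] <= b[2]) and (a[1] > b[1] or a[2] < b[2])

def skyline(rows):
    """Keep every row not dominated by any other row."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]

books = [  # (name, rating out of 5, price in Rs) from Table 1
    ("Prodigal Daughter", 3, 250),
    ("The city of Joy",   5, 400),
    ("Vanishing Acts",    2, 250),
    ("The Notebook",      4, 300),
    ("Fountain Head",     5, 350),
    ("Dear John",         5, 500),
]
print([name for name, _, _ in skyline(books)])
# -> ['Prodigal Daughter', 'The Notebook', 'Fountain Head']
```

The result matches Table 2: The city of Joy and Dear John are dominated by Fountain Head (same rating, lower price), and Vanishing Acts by Prodigal Daughter (same price, higher rating).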
Re: [HACKERS] GIST and TOAST
Gregory Stark [EMAIL PROTECTED] writes: Andrew - Supernews [EMAIL PROTECTED] writes: So I think you've mis-analyzed the problem. That's especially true since you are claiming that the existing code is already buggy when in fact no such bugs have been reported (and clearly intarray has been running with toasted array values for years). I'm not claiming, I'm asking, because I can't tell. And it's not clear _int_gist.c has been running with toasted array values for years because it's limited to arrays of 100 integers (or perhaps 200 integers, there's a factor of 2 in the test). That's not enough to trigger toasting unless there are other large columns in the same table. Actually I just realized the other large columns in the table would be irrelevant. It's not whether the value is toasted in the table that matters, only whether it gets compressed by index_form_tuple. And it can't be, since 400 bytes isn't large enough to trigger compression. Unless someone's using multi-column intarray gist indexes with very large arrays, which I'm not convinced anyone is. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
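A rough size check on that claim. The header overhead and the roughly 2 kB toast threshold below are approximations I'm assuming for 8 kB pages, not numbers stated in the thread; the point is only that a 100-or-200-element int4[] is nowhere near the threshold:

```python
INT4_SIZE = 4           # bytes per int4 element
ARRAY_OVERHEAD = 24     # rough varlena + array header overhead (assumption)
TOAST_THRESHOLD = 2000  # ~ 1/4 of an 8 kB page (approximate default)

def array_bytes(n_elems: int) -> int:
    """Approximate on-disk size of a one-dimensional int4[]."""
    return ARRAY_OVERHEAD + n_elems * INT4_SIZE

for n in (100, 200):
    size = array_bytes(n)
    print(f"{n} ints: ~{size} bytes, over threshold: {size > TOAST_THRESHOLD}")
```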
Re: [HACKERS] Synchronized Scan update
JD, Don't get me wrong, I want things to be easily understandable as well, but the reason you cite above pretty much means we'd need to remove most of postgresql.conf, including all bgwriter, vacuum cost delay, and autovac settings. Not to mention commit delay and others ;). Wouldn't that be nice! The explosion of GUC settings is primarily a result of not enough information. The reason there are 7 bgwriter settings, for example, is that we have no idea what those settings should be and are hoping that people will tinker with them and tell us. Someday when I can fully profile bgwriter, we'll just have one setting: bgwriter_aggressive, set to a number between 0 and 9. -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] GIST and TOAST
So when does index_form_tuple get called? The single call of index_form_tuple in GiST core is in gistFormTuple which initially compress any indexed value with a help of their compress methods. Only tuples formed by gistFormTuple could be inserted in index. Does every data type define a compress/decompress method? Even if it's not a data type that normally gets very large? Yes, any GiST opclass should have such methods. In trivial case it just returns input value. As I remember, only R-Tree emulation over boxes, contrib/seg and contrib/cube have simple compress method. Well we cheated a bit and had heap/index_form_tuple convert the data to packed format. This saves having to push small tuples through the toaster. So now tuples can magically become toasted as soon as they go into a tuple even if they never get pushed through the toaster. Ok, it should be safe for GiST except some possible memory management issue. index_form_tuple in GiST works in GiST's memory context which is short-live. Is it possible issue for your patch? BTW, that's connected to GIN too. So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of using DatumGetPointer and PG_GETARG_POINTER and having to manually cast everywhere, no? It seems like there's a lot of extra pain to maintain the code in the present style with all the manual casts. Of course, I agree. Just PG_FREE_IF_COPY is extra call in support methods. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces
On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote: Mike Rylander wrote: The patch adds support for default XML namespaces in xml2 by providing a mechanism for supplying a prefix to a named namespace URI. How does it support multiple namespaces in one document? It supports one default (unprefixed) namespace URI per document, which ISTM is the overwhelmingly common case (and the itch that I must scratch). -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces
On 3/6/07, Nikolay Samokhvalov [EMAIL PROTECTED] wrote: On 3/6/07, Mike Rylander [EMAIL PROTECTED] wrote: Attached you'll find a patch that I've been kicking around for a while that I'd like to propose for inclusion in 8.3. I attempted to submit this through the original xml2 author (as far back as the 7.4 days) but got no response. It's really fairly trivial, but I will be using the features it provides in production soon, so I'd like to see it applied against the contrib xml2 module. The patch adds support for default XML namespaces in xml2 by providing a mechanism for supplying a prefix to a named namespace URI. It then wraps the namespace-capable functions in backward-compatible equivalents so that old code will not break. 1) And what about non-default namespaces? I'm not sure I understand. If the namespace already has a prefix then it works fine. This patch simply gives a known non-prefixed namespace URI a prefix so one can write XPath that looks like //marc:datafield[@tag='245']/marc:subfield[@code='a'] instead of //*[local-name()='datafield' and @tag='245']/*[local-name()='subfield' and @code='a'] A little two-node example is painful enough, now imagine a non-trivial example with backtracking conditionals... :P 2) What if my XPath query has a different prefix that should also be mapped to the same URI? (Not a frequent case, but this really can occur -- e.g. the XML doc has prefix 'local' for URI='http://127.0.0.1', but the XPath should have 'loc' for the same URI.) Both prefixes work fine as multiple prefixes can map to the same URI. -- Best regards, Nikolay -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] GIST and TOAST
And it's not clear _int_gist.c has been running with toasted array values for years because it's limited to arrays of 100 integers (or perhaps 200 integers, there's a factor of 2 in the test). That's not enough to trigger toasting unless there are other large columns in the same table. That was an intended limitation to prevent indexing of huge arrays. The gist__int_ops compression method is oriented toward small arrays and isn't effective on big ones. I do know that with packed varlenas I get a crash in g_int_union among other places. I can't tell where the datum came from originally and how it ended up stored in packed format. Can you provide your patch (in its current state) and a test suite? Or a backtrace at least. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] GIST and TOAST
Does every data type define a compress/decompress method? Even if it's not a data type that normally gets very large? Yes, any GiST opclass should have such methods. In trivial case it just returns input value. As I remember, only R-Tree emulation over boxes, contrib/seg and contrib/cube have simple compress method. Hm, if they just return the original datum without detoasting it then it could be an issue. I'll check. Well we cheated a bit and had heap/index_form_tuple convert the data to packed format. This saves having to push small tuples through the toaster. So now tuples can magically become toasted as soon as they go into a tuple even if they never get pushed through the toaster. Ok, it should be safe for GiST except some possible memory management issue. index_form_tuple in GiST works in GiST's memory context which is short-live. Is it possible issue for your patch? BTW, that's connected to GIN too. index_form_tuple doesn't leak memory. packed varlena format just has a shorter header so it can store the header and then copy the data to the new location. It doesn't have to create a copy of the data (except in the tuple, obviously). But it means index_getattr can return a toasted tuple. I see several calls to index_getattr that immediately put the datum into a GISTENTRY and call support functions like the union function. For example gistMakeUnionItVec does this. So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of using DatumGetPointer and PG_GETARG_POINTER and having to manually cast everywhere, no? It seems like there's a lot of extra pain to maintain the code in the present style with all the manual casts. Of course, I agree. Just PG_FREE_IF_COPY is extra call in support methods. Well if you're doing everything in short-lived memory contexts then we don't even need this. btree procedures do it because the btree code expects to be able to do comparisons without having to set up short-lived memory contexts. 
-- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
[EMAIL PROTECTED] (Jim Nasby) writes: FWIW, this sounds like a subset of the Query By Example stuff that someone is working on. I don't have a URL handy since I'm on a plane, but I think google can find it. The pgFoundry project is here... http://pgfoundry.org/projects/qbe And yes, indeed, this sounds quite a lot like what Meredith Patterson presented at the Toronto conference. -- (reverse (concatenate 'string ofni.secnanifxunil @ enworbbc)) http://linuxfinances.info/info/finances.html Rules of the Evil Overlord #189. I will never tell the hero Yes I was the one who did it, but you'll never be able to prove it to that incompetent old fool. Chances are, that incompetent old fool is standing behind the curtain. http://www.eviloverlord.com/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Auto creation of Partitions
[EMAIL PROTECTED] (Peter Eisentraut) writes: NikhilS wrote: iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it on to the children tables. How will you maintain a primary key in such a table, considering that indexes can't span multiple tables? On the one hand, I seem to recall seeing multiple-table-spanning indices being on the TODO list. On the other, it may be that this sort of partitioning is only usable for scenarios where it is only needful to maintain uniqueness on a partition-by-partition basis. 1) Whether we should use triggers/rules for step number (iii) above. Maybe rules is the way to go. Since this would basically be a case of the updatable rules problem, you should review those discussions in the past to check whether the issues mentioned there don't interfere with that plan. 2) What other attributes (access permissions e.g.) of the master along with the ones specified in (iv) should be passed on to the children. Moreover, how are later changes of those attributes propagated? I hear rumour of there being a more comprehensive proposal on this in the works... -- (reverse (concatenate 'string ofni.secnanifxunil @ enworbbc)) http://linuxfinances.info/info/finances.html Rules of the Evil Overlord #189. I will never tell the hero Yes I was the one who did it, but you'll never be able to prove it to that incompetent old fool. Chances are, that incompetent old fool is standing behind the curtain. http://www.eviloverlord.com/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] GIST and TOAST
input value. As I remember, only the R-Tree emulation over boxes, contrib/seg and contrib/cube have a simple compress method. Hm, if they just return the original datum without detoasting it then it could be an issue. I'll check. seg and box aren't varlena types, but cube is and it seems broken :(. g_cube_decompress and g_cube_compress don't detoast values. I'll fix that. index_form_tuple doesn't leak memory. packed varlena format just has a shorter header so it can store the header and then copy the data to the new location. It doesn't have to create a copy of the data (except in the tuple, obviously). Nice, now that's clear. But it means index_getattr can return a toasted tuple. I see several calls to index_getattr that immediately put the datum into a GISTENTRY and call support functions like the union function. For example gistMakeUnionItVec does this. From gistMakeUnionItVec:

    datum = index_getattr(itvec[j], i + 1, giststate->tupdesc, &IsNull);
    if (IsNull)
        continue;
    gistdentryinit(giststate, i, evec->vector + evec->n, datum, ...)

gistdentryinit calls the user-defined decompress method. The source of confusion is that there are three similar functions/macros: gistentryinit, gistcentryinit and gistdentryinit :) Those names were chosen by the authors who initially developed GiST in pgsql. Well if you're doing everything in short-lived memory contexts then we don't even need this. Sure -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Plan invalidation vs. unnamed prepared statements
[ cc'd to pgsql-jdbc which seems the group most likely to be affected by any protocol change ] So I've been working on a plan cache module per my earlier proposal, and I've run up against a problem with getting exec_parse_message to use it. The problem is that the current rather hackish handling of unnamed prepared statements doesn't fit in. Per the documentation, unnamed statements are supposed to be optimized for the case of executing a query only once and then discarding it. In the current code this largely just means that we avoid copying the parse/plan trees into the normal PreparedStatement cache, preferring to let them sit in the context where they were generated (which means that any detritus generated by the parser/planner can't be recovered until we discard the unnamed statement, but that seems a good tradeoff in this situation). To use the plan cache for unnamed statements, there's going to have to be more overhead (more tree-copying) in this code path; moreover having the unnamed statement's plan in the cache will result in distributed overhead for checking it to see if it's still valid. This overhead is largely going to be wasted if the statement is always discarded immediately after use. I can think of several options for dealing with this: A. Just accept the extra overhead, thereby preserving the current behavior of unnamed statements, and gaining the benefit that plan invalidation will work correctly in the few cases where an unnamed statement's plan lasts long enough to need replanning. B. Don't store the unnamed statement in the plan cache. To make sure it's not used anymore when the plan might be stale, forcibly discard the unnamed statement after execution. This would get rid of a lot of overhead but would mean a significant change in the protocol-level behavior. It's hard to guess how many clients might be broken by it --- conceivably not any, but that seems too optimistic :-( C. Don't store the unnamed statement in the plan cache. 
To make sure it's not used anymore when the plan might be stale, don't analyze or plan at Parse-message time, but postpone *all* that work until Bind; and always discard the plan after Execute. We could still do raw parsing at Parse time, since that's independent of database state, but all but the most trivial syntactic errors would now occur at Bind not Parse time, as well as the majority of the time expenditure. This still amounts to a change in the protocol semantics, although it's a lot more subtle than plan B. Also there's a problem if the client does Describe Statement before Bind: we still have to run parse analysis before we can answer, and if we then throw that away, we have no very good way to guarantee that the statement still has the same description when it's subsequently executed; plus we end up doing parse analysis twice. D. Don't store the unnamed statement in the plan cache, and just ignore the possibility that its plan might become stale before use. That's exactly what happens now, but considering that the whole point of the plan inval work is to seal off such pitfalls, I can't say that I care for this alternative. Comments? I'm leaning to plan A but wanted to see if anyone would support plan B or sees a way to fix plan C. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
And yes, indeed, this sounds quite a lot like what Meredith Patterson presented at the Toronto conference. This would be good to have, though, since Meredith's work has some problematic IP encumbrances. Question, though: is the SKYLINE syntax part of a standard anywhere? -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Auto creation of Partitions
On Wed, Mar 07, 2007 at 02:43:39AM +1030, Shane Ambler wrote: I think a way can be devised to maintain the primary key and unique constraints. If a search is done on the parent table, the planner knows to rewrite the query as a union (or union all) of all child tables that relate to the where clause, or all child tables if the where clause is not on the column/s used to partition, then this concept should be able to be converted to indexes as well, so that when a primary or unique index from a child table is inserted to, then each of the related child indexes is consulted to ensure uniqueness. But that's where it breaks down: you not only need to check that the row you're inserting is unique, you need to make sure that other people trying to insert the same value see it. After all, a unique index does contain multiple instances of any particular value, it's just that the DB has verified that none of these are visible to the same transaction. If there is a collision a lock is taken that waits until the colliding insert commits or rejects. With b-tree it's easy, every value can only appear in one place, so it's a very simple lock. The reason why GiST doesn't support unique indexes is due to the fact that any particular value could appear in many places, so to stop concurrent inserts you need to lock *every* page in the tree that another backend might want to insert the record. I fear this will also be a problem for indexes over multiple tables, you will need to lock the page in every index that some other DB might want to insert the value into. All while avoiding deadlocks. I think the problems are related. If you can solve it for multiple tables, you can solve it for GiST indexes also. This would only apply if the partitioning was not done by the primary or unique column. That's the easy case, no problem there. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. 
To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Gaetano Mendola [EMAIL PROTECTED] writes: [ 8.2 evaluates volatile functions in the targetlist of a view ] If I mark the function as STABLE or IMMUTABLE then even with version 8.2 the function is not evaluated. Is this the intended behavior? Yes; people complained that we needed to be more careful about the number of times volatile functions get evaluated. I didn't see anything about it in the release notes. <listitem> <para> Do not flatten subqueries that contain <literal>volatile</literal> functions in their target lists (Jaime Casanova) </para> <para> This prevents surprising behavior due to multiple evaluation of a <literal>volatile</literal> function (such as <function>random()</function> or <function>nextval()</function>). It might cause performance degradation in the presence of functions that are unnecessarily marked as <literal>volatile</literal>. </para> </listitem> regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] GIST and TOAST
Teodor Sigaev [EMAIL PROTECTED] writes: And it's not clear _int_gist.c has been running with toasted array values for years because it's limited to arrays of 100 integers (or perhaps 200 integers, there's a factor of 2 in the test). That's not enough to trigger toasting unless there are other large columns in the same table. That was an intended limitation to prevent indexing of huge arrays. The gist__int_ops compression method is oriented toward small arrays and isn't effective on big ones. Right, so it's possible nobody sees any toasted arrays with _int_gist.c since they never get very large. It looks like index_form_tuple will never compress anything under 512b so I guess it's safe currently. I do know that with packed varlenas I get a crash in g_int_union among other places. I can't tell where the datum came from originally and how it ended up stored in packed format. Can you provide your patch (in its current state) and a test suite? Or a backtrace at least. It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion in there to cause it to generate a core dump. You can download the core dump and binary from http://community.enterprisedb.com/varlena/core._int http://community.enterprisedb.com/varlena/postgres._int The last patch (without the assertion) is at: http://community.enterprisedb.com/varlena/patch-varvarlena-14.patch.gz What I'm seeing is this:

(gdb) f 3
#3  0xb7fd924b in inner_int_union (a=0x84e41f4, b=0xb64220d0) at _int_tool.c:81
81          CHECKARRVALID(b);

The array is actually garbage:

(gdb) p *b
$2 = {vl_len_ = 141, ndim = 0, dataoffset = 5888, elemtype = 0}

What's going on is that the va_1byte header is 141 which is 0x80 | 13.
So it's actually only 13 bytes with a 1 byte header or a 12 byte array:

(gdb) p *(varattrib*)b
$3 = {va_1byte = {va_header = 141 '\215', va_data = }, va_external = {
    va_header = 141 '\215', va_padding = \000\000, va_rawsize = 0,
    va_extsize = 5888, va_valueid = 0, va_toastrelid = 0}, va_compressed = {
    va_header = 141, va_rawsize = 0, va_data = }, va_4byte = {
    va_header = 141, va_data = }}
(gdb) bt
#0  0xb7e6a947 in raise () from /lib/tls/libc.so.6
#1  0xb7e6c0c9 in abort () from /lib/tls/libc.so.6
#2  0x082fec97 in ExceptionalCondition (conditionName=0xb7fdd3b9 "!(!((b)->dataoffset != 0))", errorType=0xb7fdd371 "FailedAssertion", fileName=0xb7fdd347 "_int_tool.c", lineNumber=81) at assert.c:51
#3  0xb7fd924b in inner_int_union (a=0x84e41f4, b=0xb64220d0) at _int_tool.c:81
#4  0xb7fd547d in g_int_picksplit (fcinfo=0xbf9e43f0) at _int_gist.c:403
#5  0x08304d9c in FunctionCall2 (flinfo=0xbf9e5a94, arg1=139342312, arg2=3214821160) at fmgr.c:1154
#6  0x08094fd3 in gistUserPicksplit (r=0xb6078d4c, entryvec=0x84e31e8, attno=0, v=0xbf9e4728, itup=0x84e2ddc, len=142, giststate=0xbf9e4b94) at gistsplit.c:306
#7  0x08095deb in gistSplitByKey (r=0xb6078d4c, page=0xb6420220 , itup=0x84e2ddc, len=142, giststate=0xbf9e4b94, v=0xbf9e4728, entryvec=0x84e31e8, attno=0) at gistsplit.c:548
#8  0x080874bd in gistSplit (r=0xb6078d4c, page=0xb6420220 , itup=0x84e2ddc, len=142, giststate=0xbf9e4b94) at gist.c:943
#9  0x080850fa in gistplacetopage (state=0xbf9e49e0, giststate=0xbf9e4b94) at gist.c:329
#10 0x080871eb in gistmakedeal (state=0xbf9e49e0, giststate=0xbf9e4b94) at gist.c:873
#11 0x08084f21 in gistdoinsert (r=0xb6078d4c, itup=0x84e2ce4, freespace=819, giststate=0xbf9e4b94) at gist.c:278
#12 0x08084cf5 in gistbuildCallback (index=0xb6078d4c, htup=0x84c8c30, values=0xbf9e4a98, isnull=0xbf9e4a78 , tupleIsAlive=1 '\001', state=0xbf9e4b94) at gist.c:201
#13 0x080fc81f in IndexBuildHeapScan (heapRelation=0xb60d6860, indexRelation=0xb6078d4c, indexInfo=0x84cd620, callback=0x8084c27 gistbuildCallback,
callback_state=0xbf9e4b94) at index.c:1548
#14 0x08084bdd in gistbuild (fcinfo=0xbf9e60e8) at gist.c:150
#15 0x08305630 in OidFunctionCall3 (functionId=782, arg1=3054332000, arg2=3053948236, arg3=139253280) at fmgr.c:1460
#16 0x080fc363 in index_build (heapRelation=0xb60d6860, indexRelation=0xb6078d4c, indexInfo=0x84cd620, isprimary=0 '\0') at index.c:1296
#17 0x080fb86a in index_create (heapRelationId=21361, indexRelationName=0x84a531c text_idx, indexRelationId=21366, indexInfo=0x84cd620, accessMethodObjectId=783, tableSpaceId=0, classObjectId=0x84cd60c, coloptions=0x84cd6ac, reloptions=0, isprimary=0 '\0', isconstraint=0 '\0', allow_system_table_mods=0 '\0', skip_build=0 '\0', concurrent=0 '\0') at index.c:794
#18 0x0815f3e4 in DefineIndex (heapRelation=0x84a5354, indexRelationName=0x84a531c text_idx, indexRelationId=0, accessMethodName=0x84a5380 gist, tableSpaceName=0x0, attributeList=0x84a5448, predicate=0x0, rangetable=0x0, options=0x0, unique=0 '\0', primary=0 '\0', isconstraint=0 '\0', is_alter_table=0 '\0', check_rights=1 '\001', skip_build=0 '\0', quiet=0 '\0', concurrent=0
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps. If there was some way to do that, it'd be what I'd vote for. I still don't know how to make this take advantage of the OS buffer cache. However, no DBA tuning is a huge advantage, I agree with that. If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. Can you guys give me some pointers about how to walk through the shared buffers, reading the pages that I need, while being sure not to read a page that's been evicted, and also not potentially causing a performance regression somewhere else? Given the partitioning of the buffer lock that Tom did it might not be that horrible for many cases, either, since you'd only need to scan through one partition. We also don't need an exact count, either. Perhaps there's some way we could keep a counter or something... Exact count of what? The pages already in cache? Regards, Jeff Davis ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.) regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
Josh Berkus josh@agliodbs.com writes: Question, though: is the SKYLINE syntax part of a standard anywhere? There's certainly not anything like that in SQL2003. I'm also kind of wondering if the main use-cases couldn't be met with suitable multi-input custom aggregates, which is something we already have as of 8.2. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Plan invalidation vs. unnamed prepared statements
Tom Lane [EMAIL PROTECTED] writes: B. Don't store the unnamed statement in the plan cache. To make sure it's not used anymore when the plan might be stale, forcibly discard the unnamed statement after execution. This would get rid of a lot of overhead but would mean a significant change in the protocol-level behavior. It's hard to guess how many clients might be broken by it --- conceivably not any, but that seems too optimistic :-( Can we forcibly discard it if *any* messages are received that might invalidate a plan? So basically it would work fine unless anyone in the system does any DDL at all? I guess that has the downside of introducing random unpredictable failures. Or stash the query string and replan it (possibly in the query cache this time) if someone executes it a second time? Can't say I like either of those options much, just trying to brainstorm. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.) I agree that it's a good idea to stick with the current implementation which is, as far as I can see, meeting all of my performance goals. Regards, Jeff Davis ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Plan invalidation vs. unnamed prepared statements
Gregory Stark [EMAIL PROTECTED] writes: Can we forcibly discard it if *any* messages are received that might invalidate a plan? So basically it would work fine unless anyone in the system does any DDL at all? I guess that has the downside of introducing random unpredictable failures. Ugh :-( Or stash the query string and replan it (possibly in the query cache this time) if someone executes it a second time? I think that's either my plan A or C. The main problem with uncontrolled replanning is that there's no way to detect a change in the query properties. For example suppose the query is SELECT * FROM foo and we've already told the client (via Describe Statement) that that returns two integer columns. If an inval now arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse, ALTER COLUMN TYPE), we've got a problem. If we just blindly replan then we'll return tuples that do not match the previously given row description, which will certainly break most clients. The plan caching module has enough infrastructure to detect and complain about these sorts of situations, and it also knows how to manage lock acquisition so that once we've decided a plan is still good, the tables won't change underneath us while we use the plan. I don't see any way to make comparable guarantees without the overhead that goes with the cache manager. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Auto creation of Partitions
Martijn van Oosterhout wrote: On Wed, Mar 07, 2007 at 02:43:39AM +1030, Shane Ambler wrote: I think a way can be devised to maintain the primary key and unique constraints. If a search is done on the parent table, the planner knows to rewrite the query as a union (or union all) of all child tables that relate to the where clause, or all child tables if the where clause is not on the column/s used to partition, then this concept should be able to be converted to indexes as well, so that when a primary or unique index from a child table is inserted to, then each of the related child indexes is consulted to ensure uniqueness. But that's where it breaks down: you not only need to check that the row you're inserting is unique, you need to make sure that other people trying to insert the same value see it. This sounds like what is really needed is a way to lock a certain condition, namely the existence or non-existence of a record with certain values in certain fields. This would not only help this case, it would also help RI triggers, because those wouldn't have to acquire a share lock on the referenced rows anymore. As you pointed out, this would also make unique GiST indices possible. No real idea how to do this, though :-( greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps. If there was some way to do that, it'd be what I'd vote for. I still don't know how to make this take advantage of the OS buffer cache. Yep, I don't see any way to do that. I think we could live with that, though. If we went with the sync_scan_offset approach, you'd have to leave a lot of safety margin in that as well. However, no DBA tuning is a huge advantage, I agree with that. If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. Can you guys give me some pointers about how to walk through the shared buffers, reading the pages that I need, while being sure not to read a page that's been evicted, and also not potentially causing a performance regression somewhere else? You could take a look at BufferSync, for example. It walks through the buffer cache, syncing all dirty buffers. FWIW, I've attached a function I wrote some time ago when I was playing with the same idea for vacuums. A call to the new function loops through the buffer cache and returns the next buffer that belongs to a certain relation. I'm not sure that it's correct and safe, and there aren't many comments, but it should work if you want to play with it...
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	5 Jan 2007 22:19:37 -0000	1.214
--- src/backend/storage/buffer/bufmgr.c	22 Jan 2007 16:38:37 -0000
***************
*** 97,102 ****
--- 97,134 ----
  static void AtProcExit_Buffers(int code, Datum arg);
  
+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+ 	static int last_buf_id = 0;
+ 	int new_buf_id;
+ 	volatile BufferDesc *bufHdr;
+ 
+ 	/* Make sure we will have room to remember the buffer pin */
+ 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+ 
+ 	new_buf_id = last_buf_id;
+ 	do
+ 	{
+ 		if (++new_buf_id >= NBuffers)
+ 			new_buf_id = 0;
+ 
+ 		bufHdr = &BufferDescriptors[new_buf_id];
+ 		LockBufHdr(bufHdr);
+ 
+ 		if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+ 		{
+ 			PinBuffer_Locked(bufHdr);
+ 			last_buf_id = new_buf_id;
+ 			return BufferDescriptorGetBuffer(bufHdr);
+ 		}
+ 		UnlockBufHdr(bufHdr);
+ 	} while (new_buf_id != last_buf_id);
+ 	last_buf_id = new_buf_id;
+ 	return InvalidBuffer;
+ }
+ 
  /*
   * ReadBuffer -- returns a buffer containing the requested
   * block of the requested relation. If the blknum

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Auto creation of Partitions
NikhilS wrote: We will not (I know it's a hard thing to do :) ), the intention is to use this information from the parent and make it a property of the child table. This will avoid the step of the user having to manually specify CREATE INDEX and the like on all the child tables one by one. But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Auto creation of Partitions
Florian, This sounds like what is really needed is a way to lock a certain condition, namely the existence or non-existence of a record with certain values in certain fields. This would not only help this case, it would also help RI triggers, because those wouldn't have to acquire a share lock on the referenced rows anymore. That's called predicate locking and it's very, very hard to do. -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
Tom, I'm also kind of wondering if the main use-cases couldn't be met with suitable multi-input custom aggregates, which is something we already have as of 8.2. Actually, given that skyline of is *only* for aggregate sorting (as far as I can tell) it doesn't present the complications which QBE did for using a function interface. Ranbeer, would it be possible to use an aggregate function syntax instead of the SKYLINE OF syntax extension? This allows us to sidestep the issue of non-standard additions to the reserved word list. -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Auto creation of Partitions
Peter Eisentraut [EMAIL PROTECTED] writes: But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [JDBC] [HACKERS] Plan invalidation vs. unnamed prepared statements
I think C is how the JDBC driver is written. We name the statements if they have been used more than prepareThreshold times. So we have a mechanism by which to allow statements to be cached, or not. Dave On 6-Mar-07, at 1:14 PM, Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: Can we forcibly discard it if *any* messages are received that might invalidate a plan? So basically it would work fine unless anyone in the system does any DDL at all? I guess that has the downside of introducing random unpredictable failures. Ugh :-( Or stash the query string and replan it (possibly in the query cache this time) if someone executes it a second time? I think that's either my plan A or C. The main problem with uncontrolled replanning is that there's no way to detect a change in the query properties. For example suppose the query is SELECT * FROM foo and we've already told the client (via Describe Statement) that that returns two integer columns. If an inval now arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse, ALTER COLUMN TYPE), we've got a problem. If we just blindly replan then we'll return tuples that do not match the previously given row description, which will certainly break most clients. The plan caching module has enough infrastructure to detect and complain about these sorts of situations, and it also knows how to manage lock acquisition so that once we've decided a plan is still good, the tables won't change underneath us while we use the plan. I don't see any way to make comparable guarantees without the overhead that goes with the cache manager. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
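The named-vs-unnamed decision Dave describes can be pictured as a simple per-statement use counter. This is an illustrative sketch only, not the JDBC driver's actual source; the `PREPARE_THRESHOLD` constant and the struct are invented for the example:

```c
/* Illustrative sketch (not the actual JDBC driver code) of the
 * prepareThreshold behavior described above: a statement is executed as
 * an unnamed, one-shot statement until it has been used
 * PREPARE_THRESHOLD times, then promoted to a named server-side
 * statement so its plan can be kept and reused. */
#define PREPARE_THRESHOLD 5      /* hypothetical threshold value */

typedef struct PreparedStatement
{
    const char *sql;
    int         use_count;       /* executions seen so far */
    int         is_named;        /* promoted to a named statement? */
} PreparedStatement;

/* Returns 1 if this execution goes through a named (plan-cached)
 * statement, 0 if it runs as an unnamed one-shot statement. */
int
statement_execute(PreparedStatement *stmt)
{
    stmt->use_count++;
    if (!stmt->is_named && stmt->use_count >= PREPARE_THRESHOLD)
        stmt->is_named = 1;      /* from now on, Parse with a name */
    return stmt->is_named;
}
```

The effect is that one-off queries never pay plan-caching overhead, while statements a client reuses heavily eventually get cached plans.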
Re: [HACKERS] GIST and TOAST
It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion in there to cause it to generate a core dump. Wow, catch that, see attached patch. g_int_decompress doesn't return a detoasted array in case it was empty. Previously it was safe because an empty array never has been toasted. Should I commit it, or will you include it in your patch? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/

*** ./contrib/intarray.orig/./_int_gist.c	Tue Mar  6 20:59:23 2007
--- ./contrib/intarray/./_int_gist.c	Tue Mar  6 21:41:54 2007
***************
*** 232,238 ****
--- 232,247 ----
  	CHECKARRVALID(in);
  	if (ARRISVOID(in))
+ 	{
+ 		if (in != (ArrayType *) DatumGetPointer(entry->key))
+ 		{
+ 			retval = palloc(sizeof(GISTENTRY));
+ 			gistentryinit(*retval, PointerGetDatum(in),
+ 						  entry->rel, entry->page, entry->offset, FALSE);
+ 			PG_RETURN_POINTER(retval);
+ 		}
  		PG_RETURN_POINTER(entry);
+ 	}
  	lenin = ARRNELEMS(in);

---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.) The scenario that I'm worried about is that you have a table that's slightly larger than RAM. If you issue many seqscans on that table, one at a time, every seqscan will have to read the whole table from disk, even though say 90% of it is in cache when the scan starts. This can be alleviated by using a large enough sync_scan_offset, but a single setting like that is tricky to tune, especially if your workload is not completely constant. Tune it too low, and you don't get much benefit, tune it too high and your scans diverge and you lose all benefit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
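The tuning trade-off Heikki describes can be made concrete with a small sketch of how a sync_scan_offset might be applied (hypothetical function and parameter names; the real proposal coordinates scan positions through shared memory):

```c
typedef unsigned int BlockNumber;

/* Sketch of the sync_scan_offset idea: start a new sequential scan
 * 'offset' blocks behind the block last reported by a scan already in
 * progress, hoping those blocks are still cached.  Tuned too low,
 * little of the cache trail is reused; too high, and the new scan
 * starts on blocks that have already been evicted. */
BlockNumber
sync_scan_start(BlockNumber current_scan_block,
                BlockNumber offset,          /* safety margin in blocks */
                BlockNumber rel_nblocks)     /* relation size in blocks */
{
    if (offset >= rel_nblocks)
        return 0;                 /* margin covers the whole relation */
    if (current_scan_block >= offset)
        return current_scan_block - offset;
    /* back up past block 0, wrapping to the end of the relation */
    return rel_nblocks - (offset - current_scan_block);
}
```

With a varying workload the right `offset` changes over time, which is exactly why a single static setting is hard to tune.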
Re: [HACKERS] Auto creation of Partitions
Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. +1 -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Plan invalidation vs. unnamed prepared statements
Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: Can we forcibly discard it if *any* messages are received that might invalidate a plan? So basically it would work fine unless anyone in the system does any DDL at all? I guess that has the downside of introducing random unpredictable failures. Ugh :-( Or stash the query string and replan it (possibly in the query cache this time) if someone executes it a second time? I think that's either my plan A or C. The main problem with uncontrolled replanning is that there's no way to detect a change in the query properties. For example suppose the query is SELECT * FROM foo and we've already told the client (via Describe Statement) that that returns two integer columns. If an inval now arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse, ALTER COLUMN TYPE), we've got a problem. If we just blindly replan then we'll return tuples that do not match the previously given row description, which will certainly break most clients. It will always be a good question what the user expects as a result of 'SELECT * FROM...'. For example, a client may use ODBC or some other interface for DB communication. On the first step he retrieves information about the table and its datatypes; on the second he tries to fetch rows (using interface functions). The client application won't even guess that the table could have changed between these two steps. It's impossible to avoid such situations, because we can't know how the user retrieves information about the results he will expect. The plan caching module has enough infrastructure to detect and complain about these sorts of situations, and it also knows how to manage lock acquisition so that once we've decided a plan is still good, the tables won't change underneath us while we use the plan. I don't see any way to make comparable guarantees without the overhead that goes with the cache manager. It's a required overhead. The result should be valid at execution time, not at prepare time.
The cache manager is the best fit for this. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
Josh Berkus wrote: Tom, I'm also kind of wondering if the main use-cases couldn't be met with suitable multi-input custom aggregates, which is something we already have as of 8.2. Actually, given that skyline of is *only* for aggregate sorting (as far as I can tell) it doesn't present the complications which QBE did for using a function interface. There are people at a Venezuelan university working on SKYLINE OF and other operators on Postgres. I had a look at their work because they asked for help in the Spanish list. Not only did they add the SKYLINE OF clause, they also had some mods to the ORDER BY clause, and a couple of other grammar changes as well. While SKYLINE OF itself could probably be folded into aggregates, the other stuff is not likely to be amenable to such treatment. Also, keep in mind that there were plenty of changes in the executor. This stuff is not likely to be very easy to implement efficiently using our extant executor machinery; note that Ranbeer mentioned implementation of block nested loop and other algorithms. Not sure how easy it would be to fold that stuff into the optimizer for multi-input aggregates, instead of hardwiring it to the SKYLINE OF syntax. There's a certain group in the Venezuelan Uni that was about to finish their thesis. They promised me a look into their report; maybe I can give further input from that and maybe merge Ranbeer's stuff with it. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Auto creation of Partitions
Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. Maybe. The most obvious use for automatic partitioning that I can think of would be based on the value of a timestamptz field rather than any PK. Of course I tend to work more in the OLTP field than in DW type apps, where other considerations might apply. cheers andrew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Arrays of Complex Types
On Fri, Mar 02, 2007 at 06:59:50PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: 1. In src/backend/commands/tablecmds.c, change DefineRelation as follows: * After the first call to heap_create_with_catalog, construct and do another call to it for the array type. I'm still not happy about the idea of doing this for every relation (and doing it for sequences and indexes would be the height of wastefulness). How about we only do it for composite types? How about doing it for user-defined tables, views and composite types, and skipping ? * Add an appropriate pg_depend entry. 2. Change RemoveRelation to reflect the above. You only need one of those two: either you drop by hand or you let the dependency machinery deal with it. Not both. pg_depend it is, then :) Does the above make sense? Have I missed anything critical? Ummm ... making it actually work? Possibly that just falls out, but I'm not sure. If it turns out that it does Just Work, you might take a stab at arrays of domains too. OK. I noticed something in src/backend/commands/tablecmds.c which worries me, namely that it ignores functions and views. It should at least be checking that the typeoid isn't in pg_proc.prorettype or pg_proc.proargtypes, and if possible, the DECLARE section of pl/pgsql functions. Is there a way to do SQL at that place in the back-end, or is there some different kind of Magick(TM) needed to access these kinds of things at that level? Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote! Consider donating to PostgreSQL: http://www.postgresql.org/about/donate ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Arrays of Complex Types
David Fetter [EMAIL PROTECTED] writes: I noticed something in src/backend/commands/tablecmds.c which worries me, namely that it ignores functions and views. What? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Arrays of Complex Types
On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: I noticed something in src/backend/commands/tablecmds.c which worries me, namely that it ignores functions and views. What? The "it" in question is find_composite_type_dependencies(). Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote! Consider donating to PostgreSQL: http://www.postgresql.org/about/donate ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Arrays of Complex Types
David Fetter [EMAIL PROTECTED] writes: On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: I noticed something in src/backend/commands/tablecmds.c which worries me, namely that it ignores functions and views. What? The it in question is, find_composite_type_dependencies() All that that's interested in is whether there are stored values of the datatype somewhere. Views don't have any storage, and a function definition doesn't either. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Arrays of Complex Types
On Tue, Mar 06, 2007 at 04:24:36PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: I noticed something in src/backend/commands/tablecmds.c which worries me, namely that it ignores functions and views. What? The "it" in question is find_composite_type_dependencies(). All that that's interested in is whether there are stored values of the datatype somewhere. Views don't have any storage, and a function definition doesn't either. I see. Perhaps I've misunderstood what this thing was for, then. What is it that checks whether it's OK to change a composite type, then? Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote! Consider donating to PostgreSQL: http://www.postgresql.org/about/donate ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Auto creation of Partitions
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Josh Berkus Sent: Tuesday, 6 March 2007 19:45 To: pgsql-hackers@postgresql.org Cc: Florian G. Pflug; Martijn van Oosterhout; Shane Ambler; NikhilS; Peter Eisentraut Subject: Re: [HACKERS] Auto creation of Partitions Florian, This sounds like what is really needed is a way to lock a certain condition, namely the existence or non-existence of a record with certain values in certain fields. This would not only help this case, it would also help RI triggers, because those wouldn't have to acquire a share lock on the referenced rows anymore. That's called predicate locking and it's very, very hard to do. That's definitely not needed. Rather, we need something good enough that we can finally enforce RI ourselves in the general case. This is currently not possible to do easily, except in C code. This means we need to look at all the rows that exist, but are normally invisible to our view of the database. Still I'm not sure about all cases, as the MVCC model is quite tricky and I'm not sure whether my ideas about it are valid. The basic idea is that you need to guarantee the constraint for the 'single underlying model' (with everything visible) and for your view (under your visibility rules). I believe, but am not certain, that under these conditions any (valid) snapshot will obey the desired constraints. - Joris Dobbelsteen ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote: - Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations is typically accessed again shortly after the read, while writes are usually not. Therefore, the default operation mode is to bypass L2 for writes, but not for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in the OS kernel is correct and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't currently do. If we were to request 16 blocks at a time, we would get this benefit on Solaris, at least. The copyout_max_cached parameter can be patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very soon afterwards when reading sequentially (not writes). Reducing the working set size is an effective technique in improving performance if we don't have a kernel that does NTA or we don't read in big enough chunks (we need both to get NTA to kick in).

and what you haven't said:

- all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test results have been stand-alone, so we've not done any measurements in that area. I say this because you make the point that reducing the working set size of write workloads has no effect on the L2 cache issue, but ISTM it's still potentially a cache spoiling issue.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Plan invalidation vs. unnamed prepared statements
On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote: A. Just accept the extra overhead, thereby preserving the current behavior of unnamed statements, and gaining the benefit that plan invalidation will work correctly in the few cases where an unnamed statement's plan lasts long enough to need replanning. With connection pooling, multiple sessions will execute each statement. If we check the cache each time this does seem more expensive for each individual session, but we should gain synergy from other similar sessions. Taken across multiple sessions, A will be a win because it will reduce planning overhead by ~99%. C. Don't store the unnamed statement in the plan cache. To make sure it's not used anymore when the plan might be stale, don't analyze or plan at Parse-message time, but postpone *all* that work until Bind; and always discard the plan after Execute. We could still do raw parsing at Parse time, since that's independent of database state, but all but the most trivial syntactic errors would now occur at Bind not Parse time, as well as the majority of the time expenditure. ISTM there will be some cases where the current behaviour will not be maintained if we implement A exactly. One thing I've not seen mentioned is the effect of constants on various plans. The current system plans at Bind time so it can make longer term decisions based upon the values of initial parameters. So I'd say we need to check the cache at Parse time, but if we do need to plan, continue to do this at Bind time (and so don't write to plan cache until that point). That might mean we end up giving some of our benefit away if multiple sessions all concurrently plan a previously unplanned query. That does seem less likely and in any case much better than taking a step backwards in query planning of parameterised queries. Also, some of those plans are only currently possible with actual constants, specifically predicate proving for partial indexes and constraint exclusion. 
Parameter to constant folding may change the plan completely and make it non-reusable anyhow. How would we cope with that type of prepared query with plan inval? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
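One way to picture the invalidation mechanics under discussion is a generation counter: every invalidation message bumps a global counter, and a cached plan built under an older generation is rebuilt before it is used. This is a hand-rolled illustrative sketch with invented names, not the backend's actual sinval or plan-cache code:

```c
#include <string.h>

/* Illustrative sketch of plan invalidation via a generation counter.
 * The real plan cache also revalidates the plan's result shape and
 * re-acquires locks; this only shows the replan-when-stale logic. */
static unsigned long inval_generation = 0;

typedef struct CachedPlan
{
    char          query[256];            /* stashed query string */
    unsigned long built_in_generation;   /* generation at plan time */
    int           replans;               /* times the plan was rebuilt */
} CachedPlan;

void
plan_invalidate_all(void)                /* e.g. on any DDL message */
{
    inval_generation++;
}

void
cached_plan_build(CachedPlan *plan, const char *query)
{
    strncpy(plan->query, query, sizeof(plan->query) - 1);
    plan->query[sizeof(plan->query) - 1] = '\0';
    plan->built_in_generation = inval_generation;
}

/* Returns 1 if the plan had to be rebuilt before execution. */
int
cached_plan_execute(CachedPlan *plan)
{
    if (plan->built_in_generation != inval_generation)
    {
        /* stale: replan from the stashed query string */
        plan->built_in_generation = inval_generation;
        plan->replans++;
        return 1;
    }
    return 0;
}
```

Simon's point about parameter-dependent plans is exactly what this sketch glosses over: replanning from the string is only safe if the rebuilt plan still matches what the client was told about the result.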
Re: [HACKERS] Plan invalidation vs. unnamed prepared statements
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote: A. Just accept the extra overhead, thereby preserving the current behavior of unnamed statements, and gaining the benefit that plan invalidation will work correctly in the few cases where an unnamed statement's plan lasts long enough to need replanning. With connection pooling, multiple sessions will execute each statement. If we check the cache each time this does seem more expensive for each individual session, but we should gain synergy from other similar sessions. It seems fairly unlikely to me that client code would try to share an unnamed statement across multiple application threads; the entire point is that it's for one-off queries. Or did you miss the point that the plan cache is local per-backend? ISTM there will be some cases where the current behaviour will not be maintained if we implement A exactly. One thing I've not seen mentioned is the effect of constants on various plans. There is none. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] GIST and TOAST
Teodor Sigaev [EMAIL PROTECTED] writes: It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion in there to cause it to generate a core dump. Wow, catch that, see attached patch. g_int_decompress doesn't returns detoasted array in case it was empty. Previously it was safe because empty array never has been toasted. Ah, thanks a bunch. Should I commit it or you'll include in your patch? I'll include it in the patch I guess since it's fine the way it is until the patch hits. Now I'll try running the regressions again with the gist datatypes like hstore etc all packed as well. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:47 +, Heikki Linnakangas wrote: Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.) The scenario that I'm worried about is that you have a table that's slightly larger than RAM. If you issue many seqscans on that table, one at a time, every seqscan will have to read the whole table from disk, even though say 90% of it is in cache when the scan starts. If you're issuing sequential scans one at a time, that 90% of the table that was cached is probably not cached any more, unless the scans are close together in time without overlapping (serial sequential scans). And the problem you describe is no worse than current behavior, where you have exactly the same problem. This can be alleviated by using a large enough sync_scan_offset, but a single setting like that is tricky to tune, especially if your workload is not completely constant. Tune it too low, and you don't get much benefit, tune it too high and your scans diverge and you lose all benefit. I see why you don't want to manually tune this setting, however it's really not that tricky. You can be quite conservative and still use a good fraction of your physical memory. I will come up with some numbers and see how much we have to gain. Regards, Jeff Davis ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
On Tue, 6 Mar 2007, Alvaro Herrera wrote: Also, keep in mind that there were plenty of changes in the executor. This stuff is not likely to be very easy to implement efficiently using our extant executor machinery; note that Ranbeer mentioned implementation of block nested loop and other algorithms. Not sure how easy it would be to fold that stuff into the optimizer for multi-input aggregates, instead of hardwiring it to the SKYLINE OF syntax. Yes, there's been a lot of work on calculating skylines efficiently, with different sorting techniques and so on. This is the most interesting part of the idea. You could calculate the query Ranbeer gave using pure SQL and, perhaps, the use of some covariance aggregates or something already. Of course, it gets harder when you want to calculate across many dimensions. Personally, I'd love to see some of these newer data analysis capabilities added to PostgreSQL -- or at least put out there as interesting patches. Thanks, Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
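For readers unfamiliar with the operator: a skyline keeps the tuples not dominated by any other tuple. A naive nested-loop version in two dimensions (both minimized, e.g. the classic "cheap and close to the beach" hotel query) looks like the sketch below; the block-nested-loop algorithms Ranbeer mentioned apply the same dominance test window by window so the candidate set fits in work memory. Names here are invented for illustration:

```c
typedef struct Point { double x, y; } Point;

/* a dominates b when it is no worse in both dimensions and strictly
 * better in at least one (both dimensions minimized here) */
static int
dominates(Point a, Point b)
{
    return a.x <= b.x && a.y <= b.y && (a.x < b.x || a.y < b.y);
}

/* Naive O(n^2) skyline: writes the non-dominated points of pts[] into
 * out[] (which must have room for n points) and returns their count. */
int
skyline(const Point *pts, int n, Point *out)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        int dominated = 0;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(pts[j], pts[i]))
            {
                dominated = 1;
                break;
            }
        if (!dominated)
            out[count++] = pts[i];
    }
    return count;
}
```

This is also why the feature is hard to express as a plain aggregate: the result is a variable-size set of rows, not a single value, and efficient algorithms depend on sort order.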
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 6, 2007, at 12:17 AM, Tom Lane wrote: Jim Nasby [EMAIL PROTECTED] writes: An idea I've been thinking about would be to have the bgwriter or some other background process actually try and keep the free list populated, The bgwriter already tries to keep pages just in front of the clock sweep pointer clean. True, but that still means that each backend has to run the clock sweep. AFAICT that's something that backends will serialize on (due to BufFreelistLock), so it would be best to make StrategyGetBuffer as fast as possible. It certainly seems like grabbing a buffer off the free list is going to be a lot faster than running the clock sweep. That's why I think it'd be better to have the bgwriter run the clock sweep and put enough buffers on the free list to try and keep up with demand. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 6: explain analyze is your friend
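The serialization concern becomes clearer with a sketch of the clock sweep itself. This is a simplified toy version, not StrategyGetBuffer: every backend needing a buffer must advance the shared sweep hand, decrementing usage counts until it finds an unpinned buffer whose count has reached zero. The suggestion above is to let the bgwriter run these laps in the background and keep the free list stocked instead:

```c
/* Toy model of the clock-sweep victim search (not PostgreSQL source).
 * Assumes at least one unpinned buffer exists, otherwise it loops
 * forever; the real code also retries under contention. */
#define NBUFFERS 8

static int usage_count[NBUFFERS];   /* bumped on each buffer access */
static int pin_count[NBUFFERS];     /* pinned buffers can't be evicted */
static int sweep_hand = 0;          /* shared clock hand */

/* returns the index of the victim buffer to evict/reuse */
int
clock_sweep(void)
{
    for (;;)
    {
        int buf = sweep_hand;
        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        if (pin_count[buf] > 0)
            continue;               /* skip pinned buffers */
        if (usage_count[buf] == 0)
            return buf;             /* found a victim */
        usage_count[buf]--;         /* give it one more lap */
    }
}
```

Since the hand and counts are shared state, each backend's search is serialized behind a lock, which is what makes a pre-stocked free list attractive.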
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote: We also don't need an exact count, either. Perhaps there's some way we could keep a counter or something... Exact count of what? The pages already in cache? Yes. The idea being if you see there's 10k pages in cache, you can likely start 9k pages behind the current scan point and still pick everything up. But this is nowhere near as useful as the bitmap idea, so I'd only look at it if it's impossible to make the bitmaps work. And like others have said, that should wait until there's at least a first-generation patch that's going to make it into 8.3. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Aggressive freezing in lazy-vacuum
Gregory Stark [EMAIL PROTECTED] wrote: The hoped-for gain here is that vacuum finds fewer pages with tuples that exceed vacuum_freeze_min_age? That seems useful, though vacuum is still going to have to read every page and I suspect most of the writes pertain to dead tuples, not freezing tuples. Yes. VACUUM dirties pages only to freeze exceeded tuples in particular cases, and I think we can reduce the writes by keeping the number of unfrozen tuples low. There are three additional costs in FREEZE. 1. CPU cost for changing the xids of target tuples. 2. Write cost for WAL entries of FREEZE (log_heap_freeze). 3. Write cost for newly created dirty pages. I did additional freezing in the following two cases. We'll have created dirty buffers and WAL entries for the required operations then, so I think the additional costs of 2 and 3 are ignorable, though 1 still affects us.

| - There is another tuple to be frozen in the same page.
| - There are other dead tuples in the same page.
|   Freezing is delayed until the heap vacuum phase.

This strikes me as something that will be more useful once we have the DSM, especially if it ends up including a frozen map. Once we have the DSM vacuum will no longer be visiting every page, so it will be much easier for pages to get quite old and only be caught by a vacuum freeze. The less i/o that vacuum freeze has to do the better. If we get a freeze map then aggressive freezing would help keep pages out of that map so they never need to be vacuumed just to freeze the tuples in them. Yeah, I was planning a 2 bits/page DSM exactly for this purpose. One of the bits means to-be-vacuumed and another means to-be-frozen. It helps us avoid full scanning of the pages for XID wraparound vacuums, but DSM should be more reliable and not lose any information. I made an attempt to accomplish it in DSM, but I understand the need to demonstrate it works as designed to you.
Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
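Itagaki's policy above can be summarized in a few lines. This is an illustrative sketch with made-up names; real XID age arithmetic must handle wraparound, which is ignored here:

```c
typedef unsigned int TransactionId;

/* Sketch of the aggressive-freezing decision: costs 2 and 3 above (WAL
 * and dirty-page writes) nearly vanish when the page is already being
 * dirtied for another reason, so freeze every qualifying tuple on such
 * pages instead of only those past the age cutoff. */
int
should_freeze(TransactionId tuple_xmin,
              TransactionId next_xid,
              TransactionId freeze_min_age,
              int page_already_dirty)   /* another tuple frozen, or dead
                                         * tuples removed, on this page */
{
    TransactionId age = next_xid - tuple_xmin;  /* wraparound ignored */

    if (age >= freeze_min_age)
        return 1;               /* mandatory: tuple exceeds the cutoff */
    if (page_already_dirty)
        return 1;               /* opportunistic: nearly free to freeze */
    return 0;
}
```

Only cost 1 (rewriting the xids) remains for the opportunistic case, which is the trade-off being debated in the thread.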
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote: On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote: We also don't need an exact count, either. Perhaps there's some way we could keep a counter or something... Exact count of what? The pages already in cache? Yes. The idea being if you see there's 10k pages in cache, you can likely start 9k pages behind the current scan point and still pick everything up. But this is nowhere near as useful as the bitmap idea, so I'd only look at it if it's impossible to make the bitmaps work. And like others have said, that should wait until there's at least a first-generation patch that's going to make it into 8.3. You still haven't told me how we take advantage of the OS buffer cache with the bitmap idea. What makes you think that my current implementation is nowhere near as useful as the bitmap idea? My current implementation is making use of OS buffers + shared memory; the bitmap idea can only make use of shared memory, and is likely throwing the OS buffers away completely. I also suspect that the bitmap idea relies too much on the idea that there's a contiguous cache trail in the shared buffers alone. Any deviation from that -- which could be caused by PG's page replacement algorithm, especially in combination with a varied load pattern -- would negate any benefit from the bitmap idea. I feel much more confident that there will exist a trail of pages that are cached in *either* the PG shared buffers *or* the OS buffer cache. There may be holes/gaps in either one, but it's much more likely that they combine into a contiguous series of cached pages. Do you have an idea how I might test this claim? Regards, Jeff Davis ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
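For concreteness, the bitmap idea Jeff contrasts with his implementation might look like this toy sketch (sizes and names invented): at scan start, set one bit per block of the relation found in shared buffers, then have the scan visit the marked blocks. As noted above, this sees only PostgreSQL's own buffers, not blocks cached by the OS:

```c
#include <limits.h>

/* Toy per-scan bitmap of blocks found in shared buffers at scan start
 * (hand-waving sketch, not a proposed implementation). */
#define MAX_BLOCKS 1024
#define WORD_BITS  (sizeof(unsigned int) * CHAR_BIT)

typedef struct ScanBitmap
{
    unsigned int words[MAX_BLOCKS / WORD_BITS];
} ScanBitmap;

/* mark a block as cached in shared buffers */
void
bitmap_set(ScanBitmap *bm, unsigned int block)
{
    bm->words[block / WORD_BITS] |= 1u << (block % WORD_BITS);
}

/* returns the next cached block at or after 'start', or -1 if none */
int
bitmap_next(const ScanBitmap *bm, unsigned int start)
{
    for (unsigned int b = start; b < MAX_BLOCKS; b++)
        if (bm->words[b / WORD_BITS] & (1u << (b % WORD_BITS)))
            return (int) b;
    return -1;
}
```

Jeff's objection is visible here: any block cached only by the OS never gets a bit set, so a scan driven purely by this bitmap rereads those blocks from the kernel rather than trailing a warm cache.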
Re: [HACKERS] Aggressive freezing in lazy-vacuum
Tom Lane [EMAIL PROTECTED] wrote: I said nothing about expired tuples. The point of not freezing is to preserve information about the insertion time of live tuples. I don't know what good it will do -- for debugging? Why don't you use CURRENT_TIMESTAMP? And your test case is unconvincing, because no sane DBA would run with such a small value of vacuum_freeze_min_age. I intended to use the value for an accelerated test. The penalties of freezing are spread over the long term in normal use, but we surely pay them bit by bit. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote: Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps. If there was some way to do that, it'd be what I'd vote for. I still don't know how to make this take advantage of the OS buffer cache. Yep, I don't see any way to do that. I think we could live with that, though. If we went with the sync_scan_offset approach, you'd have to leave a lot of safety margin in that as well. Right, there would certainly have to be a safety margin with sync_scan_offset. However, your plan only works when the shared buffers are dominated by this sequential scan. Let's say you have 40% of physical memory for shared buffers, and say that 50% are being used for hot pages in other parts of the database. That means you have access to only 20% of physical memory to optimize for this sequential scan, and 20% of the physical memory is basically unavailable (being used for other parts of the database). In my current implementation, you could set sync_scan_offset to 1.0 (meaning 1.0 x shared_buffers), giving you 40% of physical memory that would be used for starting this sequential scan. In this case, that should be a good margin of error, considering that as much as 80% of the physical memory might actually be in cache (OS or PG cache). This all needs to be backed up by testing, of course. 
I'm just extrapolating some numbers that look vaguely reasonable to me. Regards, Jeff Davis ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain as with reducing the working set size. My conclusion: Fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect. - Luke Msg is shrt cuz m on ma treo -Original Message- From: Sherry Moore [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time To: Simon Riggs Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant Hi Simon, and what you haven't said - all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test results have been stand-alone, so we've not done any measurements in that area. I say this because you make the point that reducing the working set size of write workloads has no effect on the L2 cache issue, but ISTM its still potentially a cache spoiling issue. What I wanted to point out was that (reiterating to avoid requoting), - My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing. - In general, application should reduce the size of the working set to reduce the penalty of TLB misses and cache misses. - If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller. Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject. Thanks, Sherry -- Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
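For context on the NTA (non-temporal access) mechanism Luke and Sherry are measuring: on x86, non-temporal store instructions write data while bypassing the L2 cache, which is exactly the cache-pollution avoidance being discussed. Below is a minimal editor's sketch of the idea, not their test code; the SSE2 path, alignment requirements, and memcpy fallback are illustrative assumptions.

```c
/* Sketch: copy a buffer with non-temporal (cache-bypassing) stores on x86,
 * falling back to memcpy elsewhere.  Illustrative only; this is not the
 * kernel's NTA heuristic discussed in the thread. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

void copy_nontemporal(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    /* The non-temporal path needs 16-byte alignment and a multiple of 16. */
    if (((uintptr_t) dst % 16) == 0 && ((uintptr_t) src % 16) == 0 &&
        n % 16 == 0)
    {
        const __m128i *s = (const __m128i *) src;
        __m128i *d = (__m128i *) dst;

        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();           /* order the streaming stores before returning */
        return;
    }
#endif
    memcpy(dst, src, n);        /* portable fallback: normal cached stores */
}
```

The streaming stores leave the destination out of L2, so a large copy done this way does not evict the hot working set the way an ordinary memcpy would.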
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Tue, 6 Mar 2007, Jim Nasby wrote: The flipside is that it's much easier to machine-parse a table rather than trying to scrape the logs. Now you might realize why I've been so vocal on the SQL log export implementation details. And I don't think we'll generally care about each individual checkpoint; rather we'll want to look at things like 'checkpoints/hour' and 'checkpoint written pages/hour'. After a few months of staring at this data, I've found averages like that misleading. The real problem areas correlate with the peak pages written at any one checkpoint. Lowering that value is really the end-game for optimizing the background writer, and the peaks are what will nail you with a nasty fsync pause at checkpoint time. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
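Greg's point that peaks matter more than averages is simple to act on once checkpoints are logged: track the per-checkpoint maximum rather than an hourly mean. A sketch follows; the log line format "checkpoint: wrote N pages" is a hypothetical stand-in, since the actual logging format was still under discussion in this thread.

```c
/* Sketch: extract the peak pages-written figure from checkpoint log lines.
 * The "checkpoint: wrote %d pages" format is a hypothetical stand-in for
 * whatever the final logging patch emits. */
#include <stdio.h>
#include <string.h>

/* Scan newline-separated log text; return the largest pages-written value
 * seen in any single checkpoint line, or -1 if none matched. */
int peak_checkpoint_pages(const char *log)
{
    int peak = -1;
    const char *p = log;

    while (p && *p)
    {
        int pages;
        const char *hit = strstr(p, "checkpoint: wrote ");

        if (!hit)
            break;
        if (sscanf(hit, "checkpoint: wrote %d pages", &pages) == 1 &&
            pages > peak)
            peak = pages;
        p = hit + 1;            /* continue searching past this match */
    }
    return peak;
}
```

An hourly average over the same lines could be low while a single checkpoint wrote a huge burst; tracking the maximum surfaces exactly the fsync-pause risk Greg describes.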
Re: [HACKERS] Aggressive freezing in lazy-vacuum
ITAGAKI Takahiro [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote: I said nothing about expired tuples. The point of not freezing is to preserve information about the insertion time of live tuples. I don't know what good it will do -- for debugging? Exactly. As an example, I've been chasing offline a report from Merlin Moncure about duplicate entries in a unique index; I still don't know what exactly is going on there, but the availability of knowledge about which transactions inserted which entries has been really helpful. If we had a system designed to freeze tuples as soon as possible, that info would have been gone forever pretty soon after the problem happened. I don't say that this behavior can never be acceptable, but you need much more than a marginal performance improvement to convince me that it's worth the loss of forensic information. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith [EMAIL PROTECTED] wrote: After a few months of staring at this data, I've found averages like that misleading. The real problem areas correlate with the peak pages written at any one checkpoint. Lowering that value is really the end-game for optimizing the background writer, and the peaks are what will nail you with a nasty fsync pause at checkpoint time. If you already have the technical knowledge to derive the best settings from activity logs, could you write it into the code? I'm hoping for some kind of automatic control feature in bgwriter if its best configuration varies with usage and activity. BTW, I'm planning two changes in bgwriter. [Load distributed checkpoint] http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php [Automatic adjustment of bgwriter_lru_maxpages] http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php I have some results showing that if we have plenty of time for checkpoints, bgwriter_all_maxpages is not such an important parameter, because it is adjusted to shared_buffers / duration of checkpoint. Also, my recommended bgwriter_lru_maxpages is the average number of recycled buffers per cycle, which is hardly tunable manually. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
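ITAGAKI's idea of setting bgwriter_lru_maxpages to the average number of recycled buffers per cycle can be sketched as a small moving-average controller. This is an editor's illustration of the idea, not the actual patch; the smoothing factor and headroom multiplier are invented for the example.

```c
/* Sketch: self-tuning lru_maxpages as an exponential moving average of
 * buffers recycled per bgwriter cycle, per ITAGAKI's description.  The
 * smoothing factor and headroom multiplier are illustrative guesses. */
typedef struct
{
    double avg_recycled;    /* smoothed recycled-buffers-per-cycle */
    double alpha;           /* smoothing factor, e.g. 0.5 */
    double headroom;        /* write slightly ahead of demand, e.g. 1.1 */
} LruTuner;

/* Feed the count of buffers recycled this cycle; returns the suggested
 * lru_maxpages for the next cycle (floored at 1). */
int lru_tuner_update(LruTuner *t, int recycled_this_cycle)
{
    /* standard EMA update: move a fraction 'alpha' toward the new sample */
    t->avg_recycled += t->alpha * (recycled_this_cycle - t->avg_recycled);

    int suggestion = (int) (t->avg_recycled * t->headroom + 0.5);
    return suggestion > 0 ? suggestion : 1;
}
```

The appeal of a controller like this is exactly the point ITAGAKI makes: the right value tracks the workload's recycle rate, which a DBA cannot realistically tune by hand.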
Re: [HACKERS] Auto creation of Partitions
Hi Shane, Maybe I'm looking at auto-maintenance which is beyond any current planning? Many of your suggestions are useful, but auto-maintenance will be beyond the current plan. Regards, Nikhils EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/7/07, Tom Lane [EMAIL PROTECTED] wrote: Peter Eisentraut [EMAIL PROTECTED] writes: But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. regards, tom lane Sure, but as Chris mentioned earlier, wouldn't it be useful to maintain uniqueness on a partition-by-partition basis too? Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Andrew Dunstan wrote: Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: But when I say CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ... then I expect that the primary key will be enforced across all partitions. We currently sidestep that issue by not offering seemingly transparent partitioning. But if you are planning to offer that, the unique index issue needs to be solved, and I see nothing in your plan about that. Agreed, it needs to Just Work. I think it'd still be useful though if we only support auto-partitioning on the primary key, and that restriction avoids the indexing problem. Maybe. The most obvious use for automatic partitioning that I can think of would be based on the value of a timestamptz field rather than any PK. Of course I tend to work more in the OLTP field than in DW type apps, where other considerations might apply. I second that - partitioning on some kind of timestamp field is a common use case here too ... Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote: On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote: Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps. If there was some way to do that, it'd be what I'd vote for. I still don't know how to make this take advantage of the OS buffer cache. Maybe it should not? Mostly there can be use of the OS cache only if it is much bigger than the shared buffer cache. It may make sense to forget about the OS cache and just tell those who can make use of sync scans to set most of memory aside for shared buffers. Then we can do better predictions/lookups of how much of a table is actually in memory. Dual caching is usually not very beneficial anyway, not to mention the difficulty of predicting any dual-caching effects. Yep, I don't see any way to do that. I think we could live with that, though. If we went with the sync_scan_offset approach, you'd have to leave a lot of safety margin in that as well. Right, there would certainly have to be a safety margin with sync_scan_offset. However, your plan only works when the shared buffers are dominated by this sequential scan. Let's say you have 40% of physical memory for shared buffers, and say that 50% are being used for hot pages in other parts of the database.
That means you have access to only 20% of physical memory to optimize for this sequential scan, and 20% of the physical memory is basically unavailable (being used for other parts of the database). The simplest thing, in case the table is much bigger than the buffer cache usable for it, is to start the second scan at the point the first scan is traversing *now*, and hope that the scans will stay together. Or start at some fixed lag, which makes the first scan always the one issuing reads while the second just freerides on buffers already in cache. It may even be a good idea to throttle the second scan to stay N pages behind, if the OS readahead gets confused when the same file is read from multiple processes. If the table is smaller than the cache, then just scan it without syncing. Trying to read buffers in the same order, starting from near the point where pages are still in the shared buffer cache, seems good mostly for the case where the table is as big as or just a little larger than the cache. In my current implementation, you could set sync_scan_offset to 1.0 (meaning 1.0 x shared_buffers), giving you 40% of physical memory that would be used for starting this sequential scan. In this case, that should be a good margin of error, considering that as much as 80% of the physical memory might actually be in cache (OS or PG cache). This all needs to be backed up by testing, of course. I'm just extrapolating some numbers that look vaguely reasonable to me. If there is an easy way to tell PG give me this page only if it is in shared cache already, then a good approach might be to start the 2nd scan at the point where the 1st is now, and move in both directions simultaneously, like this: First scan is at page N.
Second scan:
M = N-1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                    -- FOLLOW FIRST READER
        READ PAGE N; N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE:    -- READ OLDER CACHED PAGES
        READ PAGE M; M--
    ELSE IF FIRST READER STILL GOING:         -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N; N++
    ELSE:                                     -- BECOME 1st READER
        READ PAGE N; N++
    PROCESS PAGE
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE
This should work reasonably well for LRU caches, and it may be made to work with a clock sweep scheme if the sweep arranges pages to purge in file order. If we could make the IF PAGE x IS IN CACHE part also know about the OS cache, this could make use of the OS cache as well. Do any of you know about a way to READ PAGE ONLY IF IN CACHE on *nix systems? -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
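On Hannu's closing question: there is no portable read-only-if-cached primitive, but on Linux and the BSDs one can approximate it by probing with mincore() before issuing the read. An editor's sketch follows; the probe is only a hint, since a page can be evicted between the probe and the pread(), which is acceptable for a heuristic like the scheme above.

```c
/* Sketch: approximate "READ PAGE ONLY IF IN CACHE" with an mincore() probe.
 * Not atomic (the page could be evicted between probe and pread), so it is
 * only a heuristic, which is all the sync-scan use needs. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define PROBE_PAGE_SIZE 4096    /* illustrative; use sysconf(_SC_PAGESIZE) */

/* Read one page at 'offset' into 'buf' only if the OS cache likely holds
 * it.  Returns bytes read, 0 if not cached, -1 on error. */
ssize_t read_if_cached(int fd, off_t offset, char *buf)
{
    void *map = mmap(NULL, PROBE_PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
    unsigned char vec = 0;
    int cached;

    if (map == MAP_FAILED)
        return -1;
    cached = (mincore(map, PROBE_PAGE_SIZE, &vec) == 0) && (vec & 1);
    munmap(map, PROBE_PAGE_SIZE);

    if (!cached)
        return 0;
    return pread(fd, buf, PROBE_PAGE_SIZE, offset);
}
```

In the pseudocode above, the IF PAGE x IS IN CACHE test could call something like this in addition to probing shared buffers, which would cover the OS cache too.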
Re: [HACKERS] Auto creation of Partitions
Maybe. The most obvious use for automatic partitioning that I can think of would be based on the value of a timestamptz field rather than any PK. Of course I tend to work more in the OLTP field than in DW type apps, where other considerations might apply. I second that - partitioning on some kind of timestamp field is a common use case here too ... Partitioning period needs to work. It doesn't matter what the user chooses as their partition key. Timestamp is an obvious choice, but there are others, such as serial, where you just partition every million rows (for example) to keep things manageable. Joshua D. Drake Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend