Re: [HACKERS] Index Page Split logging
On Jan 2, 2008 2:25 AM, Simon Riggs [EMAIL PROTECTED] wrote: On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: If we could log *only* the insert that caused the split, rather than the split itself, we would avoid that situation entirely. How are you going to avoid the need to run user-defined functions (specifically, the btree comparison functions) during replay? Seems like a good objection. Just exercising my lateral thought muscles. Can this be explained in more detail??? Thanks, Gokul.
Re: [HACKERS] Index Page Split logging
On Wed, Jan 02, 2008 at 02:49:35PM +0530, Gokulakannan Somasundaram wrote: On Jan 2, 2008 2:25 AM, Simon Riggs [EMAIL PROTECTED] wrote: On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote: How are you going to avoid the need to run user-defined functions (specifically, the btree comparison functions) during replay? Seems like a good objection. Just exercising my lateral thought muscles. Can this be explained in more detail??? If the goal is to only store the insert, then we need to determine during recovery which page the record needs to be added to. To do this you need to go through the index, which can only be done by calling user-defined functions. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Those who make peaceful revolution impossible will make violent revolution inevitable. -- John F Kennedy
Re: [HACKERS] Index Page Split logging
On Jan 2, 2008 3:35 PM, Martijn van Oosterhout [EMAIL PROTECTED] wrote: If the goal is to only store the insert, then we need to determine during recovery which page the record needs to be added to. To do this you need to go through the index, which can only be done by calling user-defined functions. Correct me if I am wrong. User-defined functions act as comparison functions in GiST. But when do they act as comparison functions in b-tree? I am not able to understand exactly that part. Thanks for the explanation. -- Thanks, Gokul.
Re: [HACKERS] Index Page Split logging
On Wed, Jan 02, 2008 at 04:04:48PM +0530, Gokulakannan Somasundaram wrote: On Jan 2, 2008 3:35 PM, Martijn van Oosterhout [EMAIL PROTECTED] wrote: If the goal is to only store the insert, then we need to determine during recovery which page the record needs to be added to. To do this you need to go through the index, which can only be done by calling user-defined functions. Correct me if I am wrong. User-defined functions act as comparison functions in GiST. But when do they act as comparison functions in b-tree? I am not able to understand exactly that part. All indexes are done by user-defined functions, even b-trees. People can make their own b-tree indexes by defining an operator class. Note that user-defined in this case means anything called via the fmgr interface. For reference, hash indexes are defined by user-defined functions as well... It's all about the interaction between Index AMs and operator classes. Have a nice day, -- Martijn van Oosterhout
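To make the fmgr point concrete, here is a hedged sketch of a user-defined b-tree operator class. All names and the elided function body are hypothetical, and a real class would pair the comparison function with matching custom operators rather than reusing the built-in ones shown here for brevity:

```sql
-- Hypothetical sketch: every b-tree, built-in or not, is driven by an
-- operator class whose support function 1 is the comparison function
-- that would have to run during WAL replay.
CREATE FUNCTION my_cmp(text, text) RETURNS integer
    AS $$ SELECT 0 $$  -- body elided; must return <0, 0 or >0
    LANGUAGE sql IMMUTABLE;

CREATE OPERATOR CLASS my_text_ops
    FOR TYPE text USING btree AS
        OPERATOR 1 < , OPERATOR 2 <= , OPERATOR 3 = ,
        OPERATOR 4 >= , OPERATOR 5 > ,
        FUNCTION 1 my_cmp(text, text);
```

Even the default text_ops class is wired up the same way; the system makes no distinction between built-in and user-supplied classes.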
Re: [HACKERS] Index Page Split logging
All indexes are done by user-defined functions, even b-trees. People can make their own b-tree indexes by defining an operator class. Note that user-defined in this case means anything called via the fmgr interface. Again, I think I have one more wrong understanding. My understanding is: we are discussing user-defined functions because they might actually be mutable functions that the user has classified as immutable. This might cause some problems while replaying the log. In the case of hash indexes, if the hash function is mutable, then the user has a corrupted index. Is there some other problem caused by user-defined functions that will stall recovery in the proposed idea? In our standard b-tree (with no user-defined operator classes), there shouldn't be any problem with replaying, right? Also, shouldn't we call a b-tree with user-defined op-classes a gist-btree? If not, what is the difference between the two? Again, thanks for the explanation. -- Thanks, Gokul.
Re: [HACKERS] Index Page Split logging
On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote: All indexes are done by user-defined functions, even b-trees. People can make their own b-tree indexes by defining an operator class. Note that user-defined in this case means anything called via the fmgr interface. Again, I think I have one more wrong understanding. My understanding is: we are discussing user-defined functions because they might actually be mutable functions that the user has classified as immutable. This is where it gets a bit beyond my depth, so someone may have to correct me. The point is that during recovery the system is not yet fully running and not everything works yet. For example, what happens if the system crashed halfway through updating a page in pg_proc and that page needs to be recovered from WAL? Yet to insert into the index you need to be able to read pg_am, pg_amproc and pg_proc at least, probably more. The point being, you can't rely on anything except WAL during recovery. In our standard b-tree (with no user-defined operator classes), there shouldn't be any problem with replaying, right? Even inserting into our standard b-tree involves looking up the operator class and associated functions and using the fmgr interface. There is no difference in the system between the built-in operator classes and user-defined ones. Also, shouldn't we call a b-tree with user-defined op-classes a gist-btree? If not, what is the difference between the two? gist-btree is a nice example, but IIRC it takes more diskspace and can't handle uniqueness. Have a nice day, -- Martijn van Oosterhout
Re: [HACKERS] Index Page Split logging
On Wed, 2008-01-02 at 13:54 +0100, Martijn van Oosterhout wrote: On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote: All indexes are done by user-defined functions, even b-trees. People can make their own b-tree indexes by defining an operator class. Note that user-defined in this case means anything called via the fmgr interface. Again, I think I have one more wrong understanding. My understanding is: we are discussing user-defined functions because they might actually be mutable functions that the user has classified as immutable. This is where it gets a bit beyond my depth, so someone may have to correct me. The point is that during recovery the system is not yet fully running and not everything works yet. For example, what happens if the system crashed halfway through updating a page in pg_proc and that page needs to be recovered from WAL? Yet to insert into the index you need to be able to read pg_am, pg_amproc and pg_proc at least, probably more. That's right; shame I forgot this before I started the thread... -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Index Page Split logging
On Jan 2, 2008 6:24 PM, Martijn van Oosterhout [EMAIL PROTECTED] wrote: On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote: All indexes are done by user-defined functions, even b-trees. People can make their own b-tree indexes by defining an operator class. Note that user-defined in this case means anything called via the fmgr interface. Again, I think I have one more wrong understanding. My understanding is: we are discussing user-defined functions because they might actually be mutable functions that the user has classified as immutable. This is where it gets a bit beyond my depth, so someone may have to correct me. The point is that during recovery the system is not yet fully running and not everything works yet. For example, what happens if the system crashed halfway through updating a page in pg_proc and that page needs to be recovered from WAL? Yet to insert into the index you need to be able to read pg_am, pg_amproc and pg_proc at least, probably more. The point being, you can't rely on anything except WAL during recovery. Thanks a lot for the nice explanation. That was something new for me :(( -- Thanks, Gokul.
Re: [HACKERS] Index Page Split logging
On Wed, Jan 02, 2008 at 01:17:03PM +, Simon Riggs wrote: That's right; shame I forgot this before I started the thread... Actually, I think your idea has merit, it's just not as easy as originally thought. All splits, including multiple-level splits, can be described as a sequence of split page X with items {S} going to the left followed by an insert item into page X. The trick is that for multi-level splits you have to split from the top down. Then each split becomes a simple loggable operation. But, I think in the live system we split from the bottom up (the split and insert is a single operation), so I don't think you can actually combine the current split algorithm with the logging operation I suggest above. Have a nice day, -- Martijn van Oosterhout
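The two loggable operations described above can be sketched in miniature. Pages are modelled here as sorted lists of keys, purely to illustrate the shape of the proposed log records, not the actual btree code:

```python
# Sketch of replaying a page split as two simple log records:
# ("split", page, items_going_left) followed by ("insert", page, item).
# Illustration of the logging idea only; not PostgreSQL's nbtree code.

def replay(pages, records):
    for rec in records:
        if rec[0] == "split":
            _, page, left_items = rec
            # Items in left_items move to a new left page; the rest stay.
            pages[page] = [k for k in pages[page] if k not in left_items]
            pages["new_" + page] = sorted(left_items)
        elif rec[0] == "insert":
            _, page, item = rec
            pages[page] = sorted(pages[page] + [item])
    return pages

pages = {"X": [10, 20, 30, 40]}
records = [("split", "X", {10, 20}), ("insert", "X", 35)]
print(replay(pages, records))
```

Note how neither record needs a comparison function during replay: the split record carries the exact item set going left, and the insert names its target page directly, which is the attraction of the idea.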
[HACKERS] Autovacuum Table List Ordering
OK, so I was going through the autovacuum code and I noticed that when we construct the table list to vacuum in a database per run, we don't process the tables in any particular order. Since we pick the list up from pg_class, it may be ordered by OID. But when we select a database we have a priority algorithm, which selects the databases in XID-wraparound danger first, and in an order such that the database nearest to XID wraparound is picked first. So wouldn't it make sense to pick up the tables in a similar order as well, e.g. sorting the list on (dead tuples - calculated threshold)? This way we would vacuum the tables most in need first. Without an order I can imagine a scenario where a table which is near a wraparound or has many dead rows keeps waiting, possibly hitting the XID wraparound, while another big table takes a lot of time. With the launcher/worker architecture in 8.3 the situation will improve, since multiple workers can vacuum multiple tables in parallel, but having some order there would also help, I would think. Thanks, -- Usama Munir Dar http://www.linkedin.com/in/usamadar Consultant Architect Cell:+92 321 5020666 Skype: usamadar
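The proposed ordering can be sketched as follows; the stats fields and the threshold formula are simplified stand-ins modelled on the autovacuum threshold parameters, not the actual autovacuum code:

```python
# Sketch: order the per-database table list so the table furthest past
# its vacuum threshold is processed first. Fields and formula are
# simplified placeholders.

def vacuum_threshold(reltuples, base_threshold=50, scale_factor=0.2):
    # Analogous to autovacuum_vacuum_threshold +
    # autovacuum_vacuum_scale_factor * reltuples.
    return base_threshold + scale_factor * reltuples

def order_tables(tables):
    # Most urgent first: largest (dead_tuples - threshold) wins.
    return sorted(
        tables,
        key=lambda t: t["dead_tuples"] - vacuum_threshold(t["reltuples"]),
        reverse=True,
    )

tables = [
    {"name": "small_hot", "reltuples": 1_000, "dead_tuples": 900},
    {"name": "big_cold", "reltuples": 1_000_000, "dead_tuples": 1_000},
    {"name": "big_hot", "reltuples": 1_000_000, "dead_tuples": 500_000},
]
print([t["name"] for t in order_tables(tables)])
```

A full version would also need a wraparound-urgency term, mirroring the database-selection logic, so that age-critical tables jump the queue regardless of dead-tuple counts.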
[HACKERS] Slow count(*)
Hi, We are looking for a patch that will help us count using the indexes. Our product is about 20 times slower on Postgres compared to MS SQL Server. Any ideas? Danny Abraham BMC Software CTMD Business Unit 972-52-4286-513 [EMAIL PROTECTED]
Re: [HACKERS] Slow count(*)
On 02/01/2008, Abraham, Danny [EMAIL PROTECTED] wrote: Hi, We are looking for a patch that will help us count using the indexes. Our product is about 20 times slower on Postgres compared to MS SQL Server. Any ideas? There is no such patch, and there will not be. Use materialized views or similar techniques. Are you sure that all your problems are in SELECT COUNT(*) alone? Please check all your slow queries. Regards Pavel Stehule Danny Abraham BMC Software CTMD Business Unit 972-52-4286-513 [EMAIL PROTECTED]
Re: [HACKERS] Slow count(*)
On Wed, 02.01.2008 at 09:29:24 -0600, Abraham, Danny wrote: Hi, We are looking for a patch that will help us count using the indexes. Our product is about 20 times slower on Postgres compared to MS SQL Server. Any ideas? Please show us your SQL and the execution plan (EXPLAIN or, better, EXPLAIN ANALYSE) and read our FAQ. Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: - Header) GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
[HACKERS] Index performance
Hi, I have a table with a primary key made of two columns. One of these has about 150 distinct values which are unique IDs, and the other has over 3 million almost unique data values. This table is added to in real time, at least 10 rows per second. If I do a select which uses the pkey index, where equal to the ID column, and greater than one of the values, which should return about 1500 rows, it sometimes takes half a minute to return, and other times takes only seconds. Is it the number of rows being added in real time that is maybe causing the index to be locked? I tried creating a partial index where equal to the ID column, and that sped it up considerably and consistently, but this is a headache to maintain. Any ideas? Thanks Brian -- Brian Modra Land line: +27 23 5411 462 Mobile: +27 79 183 8059 6 Jan Louw Str, Prince Albert, 6930 Postal: P.O. Box 2, Prince Albert 6930 South Africa
Re: [HACKERS] Slow count(*)
On Wed, Jan 02, 2008 at 09:29:24AM -0600, Abraham, Danny wrote: We are looking for a patch that will help us count using the indexes. Is this for SELECT count(*) FROM table; or SELECT count(1) FROM table WHERE ...? The latter _will_ use an index, if the index is correct, the statistics are right, and the index selectivity is worth the cost of reading the index. The former will not use an index at all, because the answer depends on visibility, and you can't know that without reading the table. If you're counting how many rows are in the table (for, for instance, display purposes), you probably need to do something else. Our product is about 20 times slower on Postgres compared to MS SQL Server. Any ideas? Not without the queries, the EXPLAIN ANALYZE plans, and some information about the database. A
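For display-style counts where an estimate is acceptable, one common workaround (a technique, not a patch) is to read the planner's estimate from the catalog; its accuracy depends on how recently the table was ANALYZEd:

```sql
-- Approximate row count without scanning the table or any index:
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'mytable';  -- 'mytable' is a placeholder name
```

Exact whole-table counts, by contrast, inherently require visibility checks against the heap, which is why no index-only shortcut exists.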
Re: [HACKERS] Index performance
On Wed, Jan 02, 2008 at 05:53:35PM +0200, Brian Modra wrote: This table is added to in real time, at least 10 rows per second. [. . .] If I do a select which uses the pkey index, where equal to the ID column, and greater than one of the values, which should return about 1500 rows, it sometimes takes 1/2 minute to return, and other times takes only seconds. Is it the number of rows being added in real time, that is maybe causing the index to be locked? No, it's probably a bad plan. A minimum 10 rows/second is probably just making the statistics for the table look bad. You likely want to SET STATISTICS wider on the 1st (~150 distinct values) column, and then run ANALYSE on the table very frequently. Are you updating or deleting at all? If so, that will also affect things: you need to perform very frequent VACUUM on that table in that case. Aside from that generic advice, it's impossible to say more without EXPLAIN ANALYSE output for the slow and fast examples. A
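The advice above might look like this in SQL; the table and column names are placeholders, and 500 is just an example statistics target:

```sql
-- Widen the sampled statistics on the ~150-distinct-value ID column,
-- then refresh the stats so the planner sees current data:
ALTER TABLE readings ALTER COLUMN device_id SET STATISTICS 500;
ANALYZE readings;
```

With a wider target, ANALYZE keeps a larger histogram and most-common-values list for that column, which helps the planner estimate selectivity correctly on a fast-growing table.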
[HACKERS] Dynamic Partitioning using Segment Visibility Maps
Happy New Year, everybody. This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Recap: Very Large Table Use Case Many large tables have a regular access pattern: - new rows inserted frequently - some updates and deletes on the recent data - older data becomes effectively read-only - at some point data may be removed in bulk from the table Most tables vary on what recent means; some are insert only, many not. A typical example might be a large history table where the last 6 months data is updated, but after that the data has almost zero change. Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan. Currently the partitioning knowledge is provided by user declarations in the form of constraints on tables linked together by views or inheritance. We also currently do this in constraint-first fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data. In-table Partitioning - In the common use-case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but is also true for any columns where the id column is assigned by a Sequence. HOT helps this to remain true over time. Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can then realise that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say). 
If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. We can imagine that we do this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that could be proved would never satisfy the query predicate. Same thing as partitioning, but within the table. Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size. Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. The implicit constraints can then allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if constraints don't match. If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do. Other scans types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. 
Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive,
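The segment-exclusion idea above can be sketched in miniature: keep min/max per read-only segment and skip segments whose range cannot satisfy the predicate. The segment size and stats layout here are illustrative assumptions, not the proposed on-disk format:

```python
# Sketch of segment exclusion for a "col >= value" predicate.
# Each segment records min/max of its (read-only) tuples.

segments = [
    {"rows": range(0, 100), "min": 0, "max": 99},
    {"rows": range(100, 200), "min": 100, "max": 199},
    {"rows": range(200, 300), "min": 200, "max": 299},
]

def scan_ge(segments, value):
    out = []
    for seg in segments:
        # Segment exclusion: no tuple in this segment can match.
        if seg["max"] < value:
            continue
        out.extend(r for r in seg["rows"] if r >= value)
    return out

print(len(scan_ge(segments, 250)))  # only the last segment is read
```

The implicit min/max constraints play the role of the CHECK constraints in today's constraint exclusion, but are derived from the data rather than declared.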
Re: [HACKERS] Slow count(*)
On Wed, 2008-01-02 at 09:29 -0600, Abraham, Danny wrote: Our product is about 20 times slower on Postgres compared to MS SQL Server. If you want to have a cross-platform product then you must consider how to access multiple systems both accurately and quickly. Not much point catering for the different SQL dialects and then ignoring the performance differences. All products are not the same; you will find many advantages with Postgres. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Table rewrites vs. pending AFTER triggers
On Tue, 2008-01-01 at 16:09 -0500, Tom Lane wrote: Paranoia would suggest forbidding *any* form of ALTER TABLE when there are pending trigger events, but maybe that's unnecessarily strong. That works for me. Such a combination makes no sense, so banning it is the right thing to do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Table rewrites vs. pending AFTER triggers
Simon Riggs wrote: On Tue, 2008-01-01 at 16:09 -0500, Tom Lane wrote: Paranoia would suggest forbidding *any* form of ALTER TABLE when there are pending trigger events, but maybe that's unnecessarily strong. That works for me. Such a combination makes no sense, so banning it is the right thing to do. +1. Doesn't make much sense to me either. cheers andrew
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Like it? Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as is currently used? It seems possible to remove the MVCC fields and hence get more tuples per page; whether this would actually be a net performance gain or loss seems like a difficult question to answer, and it would definitely be a complexity increase. Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. A lot of your design sounds like the EROS store, with the Checkpoint Area being, in current and all previous versions of Postgres, the only place data is stored. Data in EROS also has a Home Location, which is where the data ends up after a checkpoint, and sounds somewhat like the proposed read-only segments. Checkpoints here serve a similar purpose to checkpoints in PG, so the following analogy may get a bit confusing. When you're reading the paper I'd try and imagine the checkpoints not occurring as one event, but spread across time as the database recognizes that data is now (or has been marked as) read-only. The home locations would then store only the read-only copies of the data and all the churn would (if the recognition of read-only data works) be done in the checkpoint area. Sam [1] http://www.eros-os.org/papers/storedesign2002.pdf
Re: [HACKERS] Index performance
Thanks, I think you have me on the right track. I'm testing a vacuum analyse now to see how long it takes, and then I'll set it up to run automatically every night (so that it has a chance to complete before about 6am). On 02/01/2008, Andrew Sullivan [EMAIL PROTECTED] wrote: On Wed, Jan 02, 2008 at 05:53:35PM +0200, Brian Modra wrote: This table is added to in real time, at least 10 rows per second. [. . .] If I do a select which uses the pkey index, where equal to the ID column, and greater than one of the values, which should return about 1500 rows, it sometimes takes 1/2 minute to return, and other times takes only seconds. Is it the number of rows being added in real time, that is maybe causing the index to be locked? No, it's probably a bad plan. A minimum 10 rows/second is probably just making the statistics for the table look bad. You likely want to SET STATISTICS wider on the 1st (~150 distinct values) column, and then run ANALYSE on the table very frequently. Are you updating or deleting at all? If so, that will also affect things: you need to perform very frequent VACUUM on that table in that case. Aside from that generic advice, it's impossible to say more without EXPLAIN ANALYSE output for the slow and fast examples. A -- Brian Modra
Re: [HACKERS] Table rewrites vs. pending AFTER triggers
Is there a reason why we allow DDL inside a transaction and allow it to be rolled back? If we committed the previous transaction as soon as we encounter a DDL statement, and committed the DDL too (without waiting for the explicit commit), would that affect some use cases? I actually mean to say that DDL could be declared as self-committing. That would get rid of these exceptions. Am I missing something? Thanks, Gokul. On Jan 3, 2008 12:02 AM, Andrew Dunstan [EMAIL PROTECTED] wrote: Simon Riggs wrote: On Tue, 2008-01-01 at 16:09 -0500, Tom Lane wrote: Paranoia would suggest forbidding *any* form of ALTER TABLE when there are pending trigger events, but maybe that's unnecessarily strong. That works for me. Such a combination makes no sense, so banning it is the right thing to do. +1. Doesn't make much sense to me either. cheers andrew
Re: [HACKERS] Table rewrites vs. pending AFTER triggers
Gokulakannan Somasundaram [EMAIL PROTECTED] writes: I actually mean to say that DDLs can be declared as self-committing. Egad, an Oracle lover in our midst. Most of us think that roll-back-able DDL is one of the best features of Postgres, and certainly one of our best selling points vis-a-vis Oracle. Don't expect us to give it up. regards, tom lane
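As one illustration of what roll-back-able DDL buys (a sketch, with hypothetical object names): a migration script can mix schema changes and data changes and still abort cleanly as a unit if any step fails:

```sql
BEGIN;
CREATE TABLE accounts_new (id serial PRIMARY KEY, balance numeric NOT NULL);
ALTER TABLE accounts_new ADD COLUMN updated_at timestamp DEFAULT now();
-- ... data migration steps; suppose one of them errors out ...
ROLLBACK;  -- the table and the column change disappear atomically
```

On a self-committing system the half-built objects would survive the failure and need manual cleanup, which is exactly the scenario transactional DDL avoids.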
Re: [HACKERS] Table rewrites vs. pending AFTER triggers
On Jan 3, 2008 12:44 PM, Tom Lane [EMAIL PROTECTED] wrote: Gokulakannan Somasundaram [EMAIL PROTECTED] writes: I actually mean to say that DDLs can be declared as self-committing. Egad, an Oracle lover in our midst. :) True, it's the effect of working more with Oracle. I made the suggestion here because it might reduce some if-conditions. Most of us think that roll-back-able DDL is one of the best features of Postgres, and certainly one of our best selling points vis-a-vis Oracle. Don't expect us to give it up. Can you please explain any specific use case where DDL is necessary within a transaction? -- Thanks, Gokul.