[PERFORM] The right SHMMAX and FILE_MAX
Hi. I'm on a 64-bit CentOS 5 system: quad-core processor, 8GB RAM and tons of data storage (1 TB SATA II disks). The current SHMMAX and SHMALL are (commas added for legibility):

kernel.shmmax = 68,719,476,736
kernel.shmall = 4,294,967,296

Now, according to my reading of the PG manual and this list, a good recommended value for SHMMAX is (shared_buffers * 8192). My postgresql.conf settings at the moment are:

max_connections = 300
shared_buffers = 300MB
effective_cache_size = 2000MB

By this calculation, shared_buffers * 8192 comes to 2,457,600,000,000. That's a humongous number, so either the principle for SHMMAX is amiss, or I am reading it wrongly?

Similarly with "fs.file-max". There are articles like this one: http://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec72.html Is this relevant for PostgreSQL performance at all, or should I skip it?

Thanks for any pointers!

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
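[Editor's note: the "(shared_buffers * 8192)" rule of thumb dates from when shared_buffers was specified as a count of 8 kB buffers, so it presumably means "buffer count × 8192 bytes" — i.e. SHMMAX only needs to exceed shared_buffers expressed in bytes (plus some overhead), not the megabyte figure times 8192. A minimal sketch of the arithmetic, assuming the 300MB setting above:]

```sql
-- shared_buffers = 300MB is 300 * 1024 * 1024 = 314,572,800 bytes,
-- i.e. (300 * 1024 * 1024) / 8192 = 38,400 buffers of 8192 bytes each.
SELECT 300 * 1024 * 1024          AS shared_buffers_bytes,  -- 314572800
       (300 * 1024 * 1024) / 8192 AS buffer_count;          -- 38400
```

So SHMMAX here merely needs to be somewhat larger than ~315 MB; the existing 68 GB kernel setting is already far more than enough. fs.file-max governs open file descriptors system-wide, not shared memory, so it is a separate and usually minor concern.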
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Sat, Apr 30, 2011 at 5:12 PM, Jeff Janes wrote:
> gist indices are designed to make this type of thing fast, by using
> techniques to rule out most of those comparisons without actually
> performing them. I don't know enough about the
> guts of either your distance function or the gist indexes to know if
> you can do it this way, but if you can it would certainly be the way
> to go.

It is definitely a good idea to consider a gist index for eliminating most of a large dataset, if applicable. Do a little reading on the topic and, hopefully, its applicability (or not) will become apparent.

However, as someone who has built a number of ad servers over the years, for several of the larger ad networks, the first thing I'd do is separate your ad serving from real-time interaction with your database, no matter what the underlying technology. If your ad serving is dependent on your database, it means that hiccups in the database will have an impact on ad serving, which is rarely tolerable. And god forbid you should need to take the db down for a period of maintenance. The reliability and performance required of most ad servers is often greater than what should reasonably be expected of a relational database, particularly if there are other processes accessing the database, as is the case with your system. The first rule of ad serving is that no outage of backend systems should ever be able to prevent or impact front-end ad serving.

Some kind of in-memory cache of doc/ad mappings which the ad server interacts with will serve you in good stead and will be much easier to scale horizontally than most relational db architectures lend themselves to. If you have an ever-increasing set of documents and ads, you'll inevitably wind up 'sharding' your dataset across multiple db hosts in order to maintain performance - which creates a number of maintenance complexities.
Easier to maintain a single database and do analytics over a single data source, but insulate it from the real-time performance requirements of your ad serving. Even something as simple as a process that pushes the most recent doc/ad mappings into a memcache instance could be sufficient - and you can scale your memcache across as many hosts as is necessary to deliver the lookup latencies that you require no matter how large the dataset. Similarly, if you are updating the database from the ad server with each ad served in order to record an impression or click, you'll be far better off logging those and then processing the logs in bulk on a periodic basis. If subsequent impressions are dependent upon what has already been served historically, then use your memcache instance (or whatever structure you eventually choose to utilize) to handle those lookups. This gives you the convenience and flexibility of a relational system with SQL for access, but without the constraints of the capabilities of a single host limiting real-time performance of the system as a whole.
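[Editor's note: a minimal sketch of the "log impressions, process in bulk" pattern described above — all table and file names here are illustrative, not from the original system:]

```sql
-- The ad server appends impressions to a flat log file; a periodic job
-- bulk-loads the file into a staging table and folds it into the
-- analytics tables in one pass, instead of one INSERT per impression.
CREATE TABLE impression_log_staging (
    ad_id     int         NOT NULL,
    doc_id    int         NOT NULL,
    logged_at timestamptz NOT NULL
);

-- Bulk load the latest log file (path is illustrative):
COPY impression_log_staging FROM '/var/log/adserver/impressions.csv' CSV;

-- Fold the batch into an aggregate table, then clear the staging area.
INSERT INTO ad_impression_counts (ad_id, impressions)
SELECT ad_id, count(*) FROM impression_log_staging GROUP BY ad_id;
TRUNCATE impression_log_staging;
```

COPY plus one set-based INSERT is typically orders of magnitude cheaper than per-impression writes, and the log files double as a replayable record if the database is down.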
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Sat, Apr 30, 2011 at 3:29 PM, Joel Reymont wrote:
>
> On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:
>
>> But what exactly are you inserting? The queries you reported below
>> are not the same as the ones you originally described.
>
> I posted the wrong query initially. The only difference is in the table that
> holds the probability array.
>
> I'm inserting document id and ad id pairs to show that this ad is not linked
> to this document. The mapping table has a primary key on the serial document
> id.

Having the (doc_id, ad_id) pair be missing from the table is a far more efficient way to show that the ad is not linked to the document (i.e. that it is below the threshold). Provided that you are careful that there are no other reasons that the pair could be missing--but if you are not careful about that, then I don't see how storing the full matrix will save you anyway.

>> In particular, they do not seem to use the "threshold" parameter that
>> the original ones did, whose job is presumably to cut the 2 million
>> down to a much smaller number that meet the threshold. But how much
>> smaller is that number?
>
> The 5 minutes is with a threshold large enough to be irrelevant. I would like
> to optimize the process before I apply the threshold to cut down the number
> of rows.
>
>> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
>> anything. Are you sure the parameter shouldn't be <=0.50 instead?
>
> No, ignore the threshold for now.

OK, but it seems to me that you are starting out by ruling out the one optimization that is most likely to work.

>> Also, you previously said you didn't mind if this process took a
>> couple minutes, as long as it didn't interfere with other things going
>> on in the database. So you probably need to describe what those other
>> things going on in the database are.
>
> Those other things are ad serving which boils down to a lookup of ad ids
> linked to the document.
> This is a lookup from the mapping table using the primary key that goes on at
> the same time as a large number of mappings are being inserted into
> the same table.

What numbers do you get for lookups per second when inserts are also going on, versus when they are not going on?

The way I would approach this is by making two independent tasks, one that inserts records at your anticipated rate "insert into foo select generate_series from generate_series(1,10);" in a loop, and another that generates select load against a separate table (like pgbench -S), and see how the two interact with each other by competing for CPU and IO.

You could throttle the insert process by adding pg_sleep() as a column in one of your selects, so it pauses at every row. But due to the granularity of pg_sleep, you might have to put it in a CASE expression so it is invoked on only a random subset of the rows rather than each row. But once throttled, will it be able to keep up with the flow of new docs and ads?

> I don't think I can do the matching when serving an ad, though, as I will
> still need to scan millions of probability vectors (one per doc) to calculate
> the distance between current document and existing ads.

gist indices are designed to make this type of thing fast, by using techniques to rule out most of those comparisons without actually performing them. I don't know enough about the guts of either your distance function or the gist indexes to know if you can do it this way, but if you can it would certainly be the way to go.

Cheers, Jeff
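[Editor's note: the pg_sleep()-in-a-CASE idea above can be sketched like this; the constants are illustrative, not tuned against the real workload:]

```sql
-- pg_sleep() invoked as a select column: the CASE fires for roughly 1% of
-- rows (random() < 0.01), pausing ~10 ms each time, so the statement
-- trickles through its rows instead of running flat out.
SELECT gs,
       CASE WHEN random() < 0.01 THEN pg_sleep(0.01) END AS throttle
FROM generate_series(1, 100000) AS gs;
```

Raising the probability or the sleep interval slows the writer further; whether a throttled insert can still keep up with the flow of new docs and ads is exactly the open question and has to be measured.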
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting? The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that holds the probability array.

I'm inserting document id and ad id pairs to show that this ad is not linked to this document. The mapping table has a primary key on the serial document id.

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold. But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply the threshold to cut down the number of rows.

> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
> anything. Are you sure the parameter shouldn't be <=0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind if this process took a
> couple minutes, as long as it didn't interfere with other things going
> on in the database. So you probably need to describe what those other
> things going on in the database are.

Those other things are ad serving which boils down to a lookup of ad ids linked to the document. This is a lookup from the mapping table using the primary key that goes on at the same time as a large number of mappings are being inserted into the same table.

Documents are uploaded into the system at a rate of 10k per day, once every couple of seconds. I wish I could get rid of storing the mapping as that table is gonna grow absolutely huge when each new ad matches tens or hundreds of thousands of documents.

I don't think I can do the matching when serving an ad, though, as I will still need to scan millions of probability vectors (one per doc) to calculate the distance between current document and existing ads.
Then again, the number of ads in the system will always be a fraction of the number of documents so, perhaps, the matching of document to ads can be done at runtime.

> Also, you might have a data correctness problem. If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it. As new ads are entered, they need to be matched with existing documents. As new documents are entered, they need to be matched with existing ads. Both ads and docs are represented by probability vectors of 150 floats so it's the same distance calculation.

> then if new rows are added to each table during overlapping transaction, the
> new ads against new docs comparison will not actually happen. You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!

-- - for hire: mac osx device driver ninja, kernel extensions and usb drivers -++--- http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont -++---
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes
> about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting? The queries you reported below are not the same as the ones you originally described. In particular, they do not seem to use the "threshold" parameter that the original ones did, whose job is presumably to cut the 2 million down to a much smaller number that meet the threshold. But how much smaller is that number? This will have a large effect on how long the insert takes.

...

> Digging deeper into the distance function,
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(, topics) AS distance FROM docs) AS tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab (cost=0.00..38.00 rows=53 width=12) (actual
> time=0.027..20429.299 rows=202 loops=1)
>   Output: tab.id, tab.distance
>   Filter: (tab.distance <= 50::double precision)
>   -> Seq Scan on public.docs (cost=0.00..358333.50 rows=160 width=36)
>      (actual time=0.025..19908.200 rows=202 loops=1)
>      Output: docs.id, divergence((::double precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0;" is not accomplishing anything. Are you sure the parameter shouldn't be <=0.50 instead?

Also, you previously said you didn't mind if this process took a couple minutes, as long as it didn't interfere with other things going on in the database. So you probably need to describe what those other things going on in the database are.

Also, you might have a data correctness problem.
If the plan is to scan new ads against all docs, and new docs against all ads; then if new rows are added to each table during overlapping transactions, the new ads against new docs comparison will not actually happen. You will probably need to add manual locking to get around this problem.

Cheers, Jeff
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Apr 30, 2011, at 7:36 PM, Kevin Grittner wrote:

> It may even be amenable to knnGiST indexing (a new feature coming in
> 9.1), which would let you do your select with an ORDER BY on the
> distance.

I don't think I can wait for 9.1, need to go live in a month, with PostgreSQL or without.

> PostgreSQL has a lot of very cool features you just don't have in any other
> product! :-)

There's a strong NoSQL lobby here and I'm trying my best to show that PG can handle the job!

-- - for hire: mac osx device driver ninja, kernel extensions and usb drivers -++--- http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont -++---
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote: > If this is where most of the time is, the next thing is to run it > with EXPLAIN ANALYZE, and post the output. I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows. Still, the difference between 5 minutes and 30s must be the insert. SELECT (t).doc_id, (t).distance FROM (SELECT docs_within_distance('{ 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.586099770475, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.167233562858, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667, 0.001667 }', 50.0) as t) AS x; This takes 27.44 seconds EXPLAIN ANALYZE VERBOSE Subquery 
Scan on x (cost=0.00..0.27 rows=1 width=32) (actual time=22422.418..23835.468 rows=202 loops=1)
  Output: (t).doc_id, (t).distance
  -> Result (cost=0.00..0.26 rows=1 width=0) (actual time=22422.410..23184.086 rows=202 loops=1)
     Output: docs_within_distance(('{}'::double precision[])::topics, 50::double precision)
Total runtime: 23948.563 ms

Topics is defined thusly: CREATE DOMAIN topics AS float[150];

Digging deeper into the distance function,

EXPLAIN ANALYZE VERBOSE
SELECT *
FROM (SELECT id, divergence(, topics) AS distance FROM docs) AS tab
WHERE tab.distance <= 50.0;

Subquery Scan on tab (cost=0.00..38.00 rows=53 width=12) (actual time=0.027..20429.299 rows=202 loops=1)
  Output: tab.id, tab.distance
  Filter: (tab.distance <= 50::double precision)
  -> Seq Scan on public.docs (cost=0.00..358333.50 rows=160 width=36) (actual time=0.025..19908.200 rows=202 loops=1)
     Output: docs.id, divergence((::double precision[])::topics, docs.topics)
Total runtime: 20550.019 ms

I can't dig any deeper because divergence is a C function.

Thanks, Joel

-- - for hire: mac osx device driver ninja, kernel extensions and usb drivers -++--- http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont -++---
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
Joel Reymont wrote:

> I'm calculating distance between probability vectors, e.g. topics
> that a document belongs to and the topics of an ad.
>
> The distance function is already a C function. Topics are
> float8[150].
>
> Distance is calculated against all documents in the database

There's probably a way to index that so that you don't need to do a full calculation against all documents in the database each time. It may even be amenable to knnGiST indexing (a new feature coming in 9.1), which would let you do your select with an ORDER BY on the distance. PostgreSQL has a lot of very cool features you just don't have in any other product! :-)

-Kevin
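[Editor's note: for illustration only, a knnGiST-backed query in 9.1 could look roughly like the following. It assumes an operator class for the topics vector type exposing a distance operator <-> — a hypothetical here, since no such opclass ships for a float8[150] domain:]

```sql
-- Hypothetical: requires a gist opclass for "topics" with ordering support.
CREATE INDEX docs_topics_idx ON docs USING gist (topics);

-- The index satisfies the ORDER BY directly, returning the nearest
-- documents first without computing the distance to every row.
SELECT id,
       topics <-> '{...query vector...}'::topics AS distance
FROM docs
ORDER BY topics <-> '{...query vector...}'::topics
LIMIT 100;
```

The point is the shape of the query: the distance expression appears in ORDER BY with a LIMIT, which is what lets a KNN-capable index prune the scan.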
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
[rearranging to correct for top-posting]

Joel Reymont wrote:
> Kevin Grittner wrote:
>> Joel Reymont wrote:
>>
>>> We have 2 million documents now and linking an ad to all of them
>>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>>
>> How long does it take to run just the SELECT part of the INSERT
>> by itself?
>
> Are you suggesting eliminating the physical linking and
> calculating matching documents on the fly?

I'm not suggesting anything other than it being a good idea to determine where the time is being spent before trying to make it faster. You showed this as the apparent source of the five minute delay:

INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
    FROM (SELECT ads_within_distance(topics, threshold) AS t
            FROM docs WHERE id = doc) AS x;

What we don't know is how much of that time is due to writing to the doc_ads table, and how much is due to reading the other tables. We can find that out by running this:

SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
          FROM docs WHERE id = doc) AS x;

If this is where most of the time is, the next thing is to run it with EXPLAIN ANALYZE, and post the output. It's a whole different set of things to try to tune if that part is fast and the INSERT itself is slow. Of course, be aware of caching effects when you time this.

-Kevin
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
I'm calculating distance between probability vectors, e.g. topics that a document belongs to and the topics of an ad.

The distance function is already a C function. Topics are float8[150].

Distance is calculated against all documents in the database so it's a full table scan.

Sent from my comfortable recliner

On 30/04/2011, at 19:04, Pierre C wrote:

> If you want to search by geographical coordinates, you could use a gist index
> which can optimize that sort of things (like retrieving all rows which fit in
> a box).
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
If you want to search by geographical coordinates, you could use a gist index which can optimize that sort of thing (like retrieving all rows which fit in a box).
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
Calculating distance involves giving an array of 150 float8 to a pgsql function, then calling a C function 2 million times (at the moment), giving it two arrays of 150 float8. Just calculating distance for 2 million rows and extracting the distance takes less than a second. I think that includes sorting by distance and sending 100 rows to the client.

Are you suggesting eliminating the physical linking and calculating matching documents on the fly?

Is there a way to speed up my C function by giving it all the float arrays, calling it once and having it return a set of matches? Would this be faster than calling it from a select, once for each array?

Sent from my comfortable recliner

On 30/04/2011, at 18:28, Kevin Grittner wrote:

> Joel Reymont wrote:
>
>> We have 2 million documents now and linking an ad to all of them
>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>
> How long does it take to run just the SELECT part of the INSERT by
> itself?
>
> -Kevin
Re: [PERFORM] stored proc and inserting hundreds of thousands of rows
Joel Reymont wrote:

> We have 2 million documents now and linking an ad to all of them
> takes 5 minutes on my top-of-the-line SSD MacBook Pro.

How long does it take to run just the SELECT part of the INSERT by itself?

-Kevin
[PERFORM] stored proc and inserting hundreds of thousands of rows
I have a stored proc that potentially inserts hundreds of thousands, potentially millions, of rows (below). This stored proc is part of the sequence of creating an ad campaign and links an ad to documents it should be displayed with. A few of these stored procs can run concurrently as users create ad campaigns.

We have 2 million documents now and linking an ad to all of them takes 5 minutes on my top-of-the-line SSD MacBook Pro. Last but not least, the system has to quickly serve ads while documents are being linked, which is a problem at the moment.

What can I do to make linking documents to ads faster or have less impact on the system? I would like the system to be as responsive with serving ads while the linking itself is allowed to take a few minutes. One thing I'm concerned with, for example, is the whole multi-million row insert running within the stored proc transaction. I think inserting rows one by one or in small batches may be an improvement. I don't know how to accomplish this, though.

Thanks, Joel

---

CREATE DOMAIN doc_id AS varchar(64);
CREATE DOMAIN id AS int;

CREATE TABLE doc_ads (
  doc_id   id    NOT NULL REFERENCES docs,
  ad_id    id    NOT NULL REFERENCES ads,
  distance float NOT NULL
);

CREATE INDEX doc_ads_idx ON doc_ads(doc_id);

CREATE OR REPLACE FUNCTION link_doc_to_ads(doc id, threshold float)
RETURNS void AS $$
BEGIN
  INSERT INTO doc_ads (doc_id, ad_id, distance)
    SELECT doc, (t).ad_id, (t).distance
      FROM (SELECT ads_within_distance(topics, threshold) AS t
              FROM docs WHERE id = doc) AS x;
  ANALYZE doc_ads;
END;
$$ LANGUAGE plpgsql;

-- - for hire: mac osx device driver ninja, kernel extensions and usb drivers -++--- http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont -++---
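[Editor's note: one hedged sketch of the "small batches" idea mentioned above — materialize the matches once into a temporary table, then move them into doc_ads in bounded slices, committing between slices from the client rather than inside one plpgsql transaction. Names follow the schema above; doc and threshold stand in for the function's parameters and the slice bounds are illustrative:]

```sql
-- Step 1: compute the full match set once into a temp table (no indexes,
-- no FK checks, so this is the cheap part).
CREATE TEMP TABLE doc_ads_staging AS
SELECT doc AS doc_id, (t).ad_id, (t).distance
FROM (SELECT ads_within_distance(topics, threshold) AS t
      FROM docs WHERE id = doc) AS x;

-- Step 2: from the client, copy one bounded slice at a time into doc_ads,
-- issuing COMMIT between slices so each transaction stays small:
INSERT INTO doc_ads (doc_id, ad_id, distance)
SELECT doc_id, ad_id, distance
FROM doc_ads_staging
WHERE ad_id >= 0 AND ad_id < 10000;   -- next slice: 10000..20000, etc.
```

Shorter transactions mean less lock and I/O pressure on the concurrent primary-key lookups that serve ads, at the cost of the linking run no longer being atomic.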
Re: [PERFORM] REINDEX takes half a day (and still not complete!)
On Sat, Apr 30, 2011 at 05:26:36PM +0800, Phoenix Kiula wrote:
> On Sat, Apr 30, 2011 at 4:07 PM, Greg Smith wrote:
>> On 04/23/2011 03:44 PM, Robert Haas wrote:
>>> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula wrote:
>>>> Postgres is 8.2.9.
>>>
>>> An upgrade would probably help you a lot, and as others have said it
>>> sounds like your hardware is failing, so you probably want to deal with that
>>> first.
>>>
>>> I am a bit surprised, however, that no one seems to have mentioned using
>>> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...
>>
>> Don't know if it was for this reason or not for not mentioning it by others,
>> but CLUSTER isn't so great in 8.2. The whole "not MVCC-safe" bit does not
>> inspire confidence on a production server.
>
> To everyone. Thanks so much for everything, truly. We have managed to
> salvage the data by exporting it in bits and pieces.
>
> 1. First the schema only
> 2. Then pg_dump of specific small tables
> 3. Then pg_dump of timed bits of the big mammoth table
>
> Not to jinx it, but the newer hardware seems to be doing well. I am on
> 9.0.4 now and it's pretty fast.
>
> Also, as has been mentioned in this thread and other discussions on
> the list, just doing a dump and then fresh reload has compacted the DB
> to nearly 1/3rd of its previously reported size!
>
> I suppose that's what I am going to do on a periodic basis from now
> on. There is a lot of DELETE/UPDATE activity. But I wonder if the
> vacuum stuff really should do something that's similar in function?
> What do the high-end enterprise folks do -- surely they can't be
> dumping/restoring every quarter or so, or are they?
>
> Anyway, many many thanks to the lovely folks on this list. Much appreciated!

The autovacuum and space management in 9.0 is dramatically more effective and efficient than that of 8.2.
Unless you have an odd corner case, there really should be no reason for a periodic dump/restore. This is not your grandmother's Oldsmobile... :)

Regards, Ken
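[Editor's note: rather than scheduling dump/restores on faith, the standard statistics views (present in 9.0) show directly whether autovacuum is keeping up with the DELETE/UPDATE churn — something like:]

```sql
-- Tables with the most dead tuples, and when they were last vacuumed.
-- A steadily growing n_dead_tup despite recent autovacuum runs suggests
-- the autovacuum settings need tuning for that table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

If a particular table churns far faster than the rest, per-table autovacuum settings (e.g. a lower autovacuum_vacuum_scale_factor via ALTER TABLE ... SET) are usually a better fix than periodic dump/restore.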
Re: [PERFORM] REINDEX takes half a day (and still not complete!)
On Sat, Apr 30, 2011 at 4:07 PM, Greg Smith wrote:
> On 04/23/2011 03:44 PM, Robert Haas wrote:
>> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula wrote:
>>> Postgres is 8.2.9.
>>
>> An upgrade would probably help you a lot, and as others have said it
>> sounds like your hardware is failing, so you probably want to deal with that
>> first.
>>
>> I am a bit surprised, however, that no one seems to have mentioned using
>> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...
>
> Don't know if it was for this reason or not for not mentioning it by others,
> but CLUSTER isn't so great in 8.2. The whole "not MVCC-safe" bit does not
> inspire confidence on a production server.

To everyone. Thanks so much for everything, truly. We have managed to salvage the data by exporting it in bits and pieces.

1. First the schema only
2. Then pg_dump of specific small tables
3. Then pg_dump of timed bits of the big mammoth table

Not to jinx it, but the newer hardware seems to be doing well. I am on 9.0.4 now and it's pretty fast.

Also, as has been mentioned in this thread and other discussions on the list, just doing a dump and then fresh reload has compacted the DB to nearly 1/3rd of its previously reported size!

I suppose that's what I am going to do on a periodic basis from now on. There is a lot of DELETE/UPDATE activity. But I wonder if the vacuum stuff really should do something that's similar in function? What do the high-end enterprise folks do -- surely they can't be dumping/restoring every quarter or so, or are they?

Anyway, many many thanks to the lovely folks on this list. Much appreciated!
Re: [PERFORM] REINDEX takes half a day (and still not complete!)
On 04/23/2011 03:44 PM, Robert Haas wrote:
> On Apr 17, 2011, at 11:30 AM, Phoenix Kiula wrote:
>> Postgres is 8.2.9.
>
> An upgrade would probably help you a lot, and as others have said it
> sounds like your hardware is failing, so you probably want to deal with that
> first.
>
> I am a bit surprised, however, that no one seems to have mentioned using
> CLUSTER rather than VACUUM or REINDEX. Sometimes that's worth a try...

Don't know if it was for this reason or not for not mentioning it by others, but CLUSTER isn't so great in 8.2. The whole "not MVCC-safe" bit does not inspire confidence on a production server.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us