Re: [PERFORM] Postgres Benchmark Results
On 21.05.2007 at 23:51, Greg Smith wrote: The standard pgbench transaction includes a select, an insert, and three updates. I see. Didn't know that, but it makes sense. Unless you went out of your way to turn it off, your drive is caching writes; every Seagate SATA drive I've ever seen does so by default. 1062 tps with 3-4 clients just isn't possible with your hardware otherwise. Btw: it wasn't my hardware in this test! cug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Postgres Benchmark Results
Jim C. Nasby wrote: On Sun, May 20, 2007 at 08:00:25PM +0200, Zoltan Boszormenyi wrote: I also went into benchmarking mode last night for my own amusement when I read on the linux-kernel ML that NCQ support for nForce5 chips was released. I tried current PostgreSQL 8.3devel CVS. pgbench over a local TCP connection with 25 clients and 3000 transactions/client gave me around 445 tps before applying NCQ support, 680 tps after. It went over 840 tps after adding the HOT v7 patch, still with 25 clients. It topped out at 1062 tps with 3-4 clients. I used a single Seagate 320GB SATA2 drive for the test, which has less than 40GB free. So it's already at the end of the disk, giving smaller transfer rates than at the beginning. Filesystem is ext3. Dual-core Athlon64 X2 4200 in 64-bit mode. I have never seen such performance before on a desktop machine. I'd be willing to bet money that the drive is lying about commits/fsync. It could well be the case. Each committed transaction essentially requires one revolution of the drive with pg_xlog on it, so a 15kRPM drive limits you to 250 TPS. By revolution, you mean one 360-degree turn of the platter, yes? On the other hand, if you have multiple clients, isn't the 250 COMMITs/sec limit true only per client? Of course, assuming that the disk subsystem has more TCQ/NCQ threads than the actual number of DB clients. BTW, PostgreSQL sees a big speed boost if you mount ext3 with the option data=writeback. Note that doing that probably has a negative impact on data recovery after a crash for non-database files. I haven't touched the FS options. I can even use ext2 if I want non-recoverability. :-) -- Zoltán Böszörményi Cybertec Geschwinde & Schönig GmbH http://www.postgresql.at/
Re: [PERFORM] Postgres Benchmark Results
Greg Smith wrote: On Mon, 21 May 2007, Guido Neitzer wrote: Yes, that's right, but if a lot of the transactions are selects, there is no entry in the xlog for them and most of the stuff can come from the cache - read from memory, which is blazing fast compared to any disk... And this was a pgbench test - I don't know what the benchmark really does, but if I remember correctly it is mostly reading. The standard pgbench transaction includes a select, an insert, and three updates. All five finished equals one transaction; the fact that the SELECT statement in there could be executed much faster were it to happen on its own doesn't matter. Because it does the most work on the biggest table, the entire combination is usually driven mostly by how long the UPDATE to the accounts table takes. The TPS numbers can certainly be no larger than the rate at which you can execute that. As has been pointed out, every time you commit a transaction the disk has to actually write that out before it's considered complete. Unless you have a good caching disk controller (which your nForce5 is not) you're limited to 120 TPS with a 7200RPM drive and 250 with a 15000RPM one. While it's possible to improve slightly on this using the commit_delay feature, I haven't been able to replicate even a 100% improvement that way when running pgbench, and to get even close to that level of improvement would require a large number of clients. Unless you went out of your way to turn it off, your drive is caching writes; every Seagate SATA drive I've ever seen does so by default. 1062 tps with 3-4 clients just isn't possible with your hardware otherwise. If you turn that feature off with: hdparm -W0 /dev/hda (might be /dev/sda with the current driver) that will disable the disk caching and you'll be reporting accurate numbers--which will be far lower than you're seeing now. And AFAIR, according to a comment on LKML some time ago, it greatly decreases your disk's MTBF as well.
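Greg's per-drive figures follow from simple rotational arithmetic: an honest drive can complete at most one synchronous commit per platter revolution, so the single-client ceiling is RPM/60. A minimal sketch of that calculation (just encoding the reasoning above, not a measurement):

```python
def max_commit_tps(rpm: int) -> float:
    """Upper bound on single-client commit rate for a drive that
    honestly flushes each commit: one commit per revolution."""
    return rpm / 60.0  # revolutions per second

print(max_commit_tps(7200))   # 7200 RPM desktop SATA drive -> 120.0
print(max_commit_tps(15000))  # 15000 RPM SCSI drive        -> 250.0
```

Any sustained TPS well above this bound (like the 1062 tps reported) means commits are landing in a volatile write cache rather than on the platter.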
But thanks for the great insights, anyway. I already knew that nForce5 is not a caching controller. :-) I meant it's a good desktop performer. And having a good UPS and a somewhat oversized Enermax PSU helps avoid crashes with the sometimes erratic power line. While your results are an interesting commentary on how fast the system can run when it has a write cache available, and the increase with recent code is interesting, your actual figures here are a fantasy. The database isn't working properly, and a real system using this hardware would be expected to become corrupted if run for long enough. I have a paper at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm you might want to read that goes into more detail than you probably want to know on this subject, if you'd like to read more about it--and you really, really should if you intend to put important data into a PostgreSQL database. Thanks, I will read it. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Zoltán Böszörményi Cybertec Geschwinde & Schönig GmbH http://www.postgresql.at/
Re: [PERFORM] Postgres Benchmark Results
- Deferred Transactions: since adding a comment to a blog post doesn't need the same guarantees as submitting a paid order, it makes sense that the application could tell Postgres which transactions we care about if power is lost. This will massively boost performance for websites, I believe. This would be massively useful. Very often all I care about is that the transaction is semantically committed; that is, that other transactions starting from that moment will see the modifications done, as opposed to the data actually being persisted to disk. In particular, I have a situation where I attempt to utilize available hardware by using concurrency. The problem is that I have to either hugely complicate my client code or COMMIT more often than I would like in order to satisfy dependencies between different transactions. If a deferred/delayed commit were possible, I could get all the performance benefit without the code complexity, and with no penalty (because in this case persistence is not important). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
[PERFORM] Feature suggestion : FAST CLUSTER
Well, CLUSTER is so slow (and it doesn't cluster the toast tables associated with the table to be clustered). However, when people use CLUSTER they use it to speed up their queries. For that, the table does not need to be perfectly in order. So, here is a new idea for CLUSTER:

- choose a chunk size (about 50% of your RAM)
- set up disk sorts for all indexes
- seq scan the table:
  - take a chunk of chunk_size
  - sort it (in memory)
  - write it into the new table file
  - while we have the data on hand, also send the indexed columns' data into the corresponding disk sorts
- finish the index disk sorts and rebuild indexes

This does not run a complete sort on the table. It would be about as fast as your seq scan disk throughput. Obviously, the end result is not as good as a real CLUSTER, since the table will be made up of several ordered chunks. Therefore, a range lookup on the clustered columns would need at most N seeks (one per chunk), versus 1 for a truly clustered table. But it only scans the table once and writes it once, even counting the index rebuild. I would think that, with this approach, if people can CLUSTER a large table in 5 minutes instead of hours, they will use it instead of not using it. Therefore, even if the resulting table is not as optimal as a fully clustered table, it will still be much better than the non-clustered case.
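The chunked approach above can be modeled outside the database. This is a minimal sketch (rows stand in for heap tuples, the chunk size is hypothetical) showing why the output is a sequence of ordered runs rather than one fully ordered file:

```python
def fast_cluster(rows, key, chunk_size):
    """Model of the proposed FAST CLUSTER: read sequentially,
    sort each chunk in memory, append it to the new table file."""
    out = []
    for i in range(0, len(rows), chunk_size):
        chunk = sorted(rows[i:i + chunk_size], key=key)
        out.extend(chunk)  # each chunk is ordered; the whole file is not
    return out

rows = [5, 1, 4, 2, 8, 3, 7, 6]
print(fast_cluster(rows, key=lambda r: r, chunk_size=4))
# -> [1, 2, 4, 5, 3, 6, 7, 8]: two ordered runs of four
```

With N chunks, a range lookup on the cluster key touches at most one contiguous region per chunk, hence the "at most N seeks" claim in the proposal.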
Re: [PERFORM] Postgres Benchmark Results
Alvaro Herrera [EMAIL PROTECTED] writes: Scott Marlowe wrote: I thought you were limited to 250 or so COMMITs to disk per second, and since more than one client can be committed at once, you could do greater than 250 tps, as long as you had more than one client providing input. Or was I wrong? My impression is that you are correct in theory -- this is the commit_delay feature. But it seems that the feature does not work as well as one would like; and furthermore, it is disabled by default. Even without commit_delay, a client will commit any pending WAL records when it syncs the WAL. The clients waiting to commit their records will find them already synced when they get woken up. However, as mentioned a while back, in practice it doesn't work quite right and you should expect to get about half the expected performance. So even with 10 clients you should expect to see 5*120 tps on a 7200rpm drive and 5*250 tps on a 15krpm drive. Heikki posted a patch that experimented with fixing this. Hopefully it'll be fixed for 8.4. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
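Gregory's rule of thumb can be put into numbers. This sketch just encodes the figures quoted above (ideal group commit lets all waiting clients share one revolution; in practice roughly half of that), not a measured model:

```python
def expected_group_commit_tps(clients: int, rpm: int) -> float:
    """Rule-of-thumb multi-client commit rate from the thread:
    ideally clients * (rpm/60) tps, but expect about half that."""
    ideal = clients * (rpm / 60.0)
    return ideal / 2

print(expected_group_commit_tps(10, 7200))   # 5 * 120 = 600.0
print(expected_group_commit_tps(10, 15000))  # 5 * 250 = 1250.0
```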
Re: [PERFORM] Postgres Benchmark Results
What's interesting here is that on a couple of metrics the green curve is actually *better* until it takes that nosedive at 500 MB. Obviously it's not better on average hits/s, the most obvious metric. But on deviation and worst-case hits/s it's actually doing better. Note that while the average hits/s between 100 and 500 is over 600 tps for Postgres, there is a consistent smattering of plot points spread all the way down to 200 tps, well below the 400-500 tps that MySQL is getting. Some of those are undoubtedly caused by things like checkpoints and vacuum runs. Hopefully the improvements that are already in the pipeline will reduce them. I mention this only to try to move some of the focus from the average performance to trying to remove the pitfalls that affect 1-10% of transactions and screw the worst-case performance. In practical terms, it's the worst case that governs perceptions, not the average case. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
[PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
I found several posts about INSERT/UPDATE performance in this group, but actually none was really what I am searching an answer for... I have a simple reference table WORD_COUNTS that contains the count of words that appear in a word array storage in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
) WITHOUT OIDS;

I have some PL/pgSQL code in a stored procedure like

FOR r IN select id, array_of_words from word_storage LOOP
  begin
    -- insert the missing words
    insert into WORD_COUNTS ( word, count )
    ( select d_word, 0
        from ( select distinct (r.array_of_words)[s.index] as d_word
                 from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index)
             ) as distinct_words
       where d_word not in ( select word from WORD_COUNTS ) );
    -- update the counts
    update WORD_COUNTS
       set count = COALESCE( count, 0 ) + 1
     where word in ( select distinct (r.array_of_words)[s.index] as word
                       from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index) );
  exception when others then
    error_count := error_count + 1;
  end;
  record_count := record_count + 1;
END LOOP;

This code runs extremely slowly. It takes about 10 minutes to process 10,000 records, and the word storage has more than 2 million records to be processed. Does anybody have know-how about populating such reference tables, and what can be optimized in this situation? Maybe the generate_series() call to unnest the array is the place where I lose the performance? Are set-based updates/inserts more efficient than single inserts/updates run in smaller loops? Thanks for your help, Valentine Gogichashvili
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
On 22 May 2007 01:23:03 -0700, valgog [EMAIL PROTECTED] wrote: I found several posts about INSERT/UPDATE performance in this group, but actually it was not really what I am searching an answer for... I have a simple reference table WORD_COUNTS that contains the count of words that appear in a word array storage in another table. CREATE TABLE WORD_COUNTS ( word text NOT NULL, count integer, CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word) ) WITHOUT OIDS; Is there any reason why count is not NOT NULL? (That should simplify your code by removing the COALESCE.) INSERT is more efficient than UPDATE, because an UPDATE is always a delete followed by an insert. Oh, and GROUP BY is nearly always quicker than DISTINCT, and can (always?) be rewritten as such. I'm not 100% sure why it's different, but it is. Peter.
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
valgog wrote: I found several posts about INSERT/UPDATE performance in this group, but actually it was not really what I am searching an answer for... I have a simple reference table WORD_COUNTS that contains the count of words that appear in a word array storage in another table. I think this is the root of your problem, I'm afraid. You're trying to count individual words when you're storing an array of words. I don't think any of the GiST/GIN indexes will help you with this either. However, since "you don't want to start from here" isn't very useful here and now:

1. See what the performance (explain analyse) of the select distinct...generate_series() statement is. I think you're right and it's going to be slow.
2. You're looping through each row of word_storage and counting separately. Write it as one query if possible.
3. As Peter says, don't insert then update; start with an empty table and just insert totals for the lot (see #2).

I'd probably write the query in plperl/python or something else that supports hash/dictionary structures. Then just process the whole word_storage into the hash - assuming you only have a few thousand distinct words, that shouldn't take up too much memory. -- Richard Huxton Archonet Ltd
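Richard's hash/dictionary suggestion amounts to a single sequential pass that accumulates counts in memory, then one INSERT per distinct word. A minimal Python sketch of the idea (the `word_storage` rows here are made-up stand-ins for the real table):

```python
from collections import defaultdict

# Stand-in for: select array_of_words from word_storage
word_storage = [
    ["foo", "bar", "foo"],
    ["bar", "baz"],
]

# One pass over all rows; no per-row INSERT/UPDATE round trips.
counts = defaultdict(int)
for array_of_words in word_storage:
    for word in array_of_words:
        counts[word] += 1

# Each (word, count) pair becomes a single INSERT into an empty
# WORD_COUNTS table -- no updates, no NOT IN subqueries.
print(sorted(counts.items()))
# -> [('bar', 2), ('baz', 1), ('foo', 2)]
```

In plperl or plpython the same structure applies; memory use is proportional to the number of distinct words, not the number of rows.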
[PERFORM] is file size relevant in choosing index or table scan?
Hi, I have a table with a file size of 400 MB and an index of 100 MB. Does PostgreSQL take the file sizes of both the table and the index into account when determining if it should do a table or an index scan? TIA Joost
Re: [PERFORM] is file size relevant in choosing index or table scan?
Joost Kraaijeveld wrote: Hi, I have a table with a file size of 400 MB and an index of 100 MB. Does PostgreSQL take the file sizes of both the table and the index into account when determining if it should do a table or an index scan? In effect yes, although it will think in terms of row sizes and disk blocks. It also considers how many rows it thinks it will fetch and whether the rows it wants are together or spread amongst many blocks. It also tries to estimate what the chances are of those blocks being cached in RAM vs still on disk. So: 1 row from a 4-million-row table, accessed by primary key = index scan. 20 rows from a 200-row table = seq scan (probably). In between = depends on your postgresql.conf. -- Richard Huxton Archonet Ltd
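Richard's two extremes can be caricatured as a selectivity threshold. This is only a crude illustration of the trade-off he describes (the 1% cutoff is an invented number; the real planner costs random vs. sequential page reads using postgresql.conf parameters, not a fixed fraction):

```python
def likely_plan(rows_fetched: int, total_rows: int) -> str:
    """Toy model: few random index probes beat one sequential pass
    only when the fetched fraction is tiny. Threshold is illustrative."""
    selectivity = rows_fetched / total_rows
    return "index scan" if selectivity < 0.01 else "seq scan"

print(likely_plan(1, 4_000_000))  # primary-key lookup -> index scan
print(likely_plan(20, 200))       # 10% of a tiny table -> seq scan
```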
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
I have rewritten the code like

existing_words_array := ARRAY(
    select word
      from WORD_COUNTS
     where word = ANY ( array_of_words ) );
not_existing_words_array := ARRAY(
    select distinct_word
      from ( select distinct (array_of_words)[s.index] as distinct_word
               from generate_series(1, array_upper( array_of_words, 1 )) as s(index)
           ) as distinct_words
     where distinct_word <> ALL ( existing_words_array ) );
-- insert the missing words
if not_existing_words_array is not null then
  insert into WORD_COUNTS ( word, count )
  ( select word, 1
      from ( select not_existing_words_array[s.index] as word
               from generate_series( 1, array_upper( not_existing_words_array, 1 )) as s(index)
           ) as distinct_words );
end if;
-- update the counts
if existing_words_array is not null then
  update WORD_COUNTS
     set count = COALESCE( count, 0 ) + 1
   where sw_word = ANY ( existing_words_array );
end if;

Now it processes a million records in 14 seconds... so the problem was probably that the NOT IN WORD_COUNTS lookup was way too expensive.
Re: [PERFORM] Postgres Benchmark Results
Note that while the average hits/s between 100 and 500 is over 600 tps for Postgres, there is a consistent smattering of plot points spread all the way down to 200 tps, well below the 400-500 tps that MySQL is getting. Yes, these are due to checkpointing, mostly. Also, note that a real forum would not insert 100 posts/s, so it would not feel this effect. But in order to finish the benchmark in a reasonable amount of time, we have to push on the inserts. Some of those are undoubtedly caused by things like checkpoints and vacuum runs. Hopefully the improvements that are already in the pipeline will reduce them. I am re-running it with other tuning, notably cost-based vacuum delay and less frequent checkpoints, and it is a *lot* smoother. These take a full night to run, so I'll post more results when I have useful stuff to show. This has proven to be a very interesting trip to benchmarkland...
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
On Tue, 22 May 2007 10:23:03 +0200, valgog [EMAIL PROTECTED] wrote: I found several posts about INSERT/UPDATE performance in this group, but actually it was not really what I am searching an answer for... I have a simple reference table WORD_COUNTS that contains the count of words that appear in a word array storage in another table. Mmm. If I were you, I would create a procedure that flattens all the arrays and returns all the words:

PROCEDURE flatten_arrays RETURNS SETOF TEXT
FOR word_array IN SELECT word_array FROM your_table LOOP
    FOR i IN 1...array_upper( word_array ) LOOP
        RETURN NEXT tolower( word_array[ i ] )

So, SELECT * FROM flatten_arrays() returns all the words in all the arrays. To get the counts quickly, I'd do this:

SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word

You can then populate your counts table very easily and quickly, since it's just a seq scan and a hash aggregate. One second for 10.000 rows would be slow.
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
On May 22, 12:14 pm, [EMAIL PROTECTED] (PFC) wrote: You can then populate your counts table very easily and quickly, since it's just a seq scan and hash aggregate. Good idea indeed! Will try this approach.
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
On May 22, 12:00 pm, valgog [EMAIL PROTECTED] wrote: Now it processes a million records in 14 seconds... Sorry... this code did not update anything at all, as I forgot about the NULL values... I had to COALESCE practically everything and use array_upper()... I do not have the performance numbers for the inserts and updates yet...
[PERFORM] Performance comparison of indexes over timestamp fields
Hi all, I have some tables where all the queries that will be executed are timestamp-driven, so it'd be nice to have an index over those fields. On older versions of PostgreSQL, at least in my experience, queries on timestamp fields, even having indexes, were performing quite badly: mainly sequential scans were performed. Now I have a newer version of PostgreSQL and I've done some tests comparing the performance of an index over a timestamp field with a numeric field. To do so, I have the following table:

          Table "public.payment_transactions"
     Column     |            Type             |   Modifiers
----------------+-----------------------------+---------------
 transaction_id | character varying(32)       | not null
 timestamp_in   | timestamp without time zone | default now()
 credits        | integer                     |
 epoch_in       | bigint                      |
 epoch_in2      | double precision            |
Indexes:
    "pk_paytrans_transid" PRIMARY KEY, btree (transaction_id)
    "idx_paytrans_epochin" btree (epoch_in)
    "idx_paytrans_epochin2" btree (epoch_in2)
    "idx_paytrans_timestamp" btree (timestamp_in)

timestamp_in is the timestamp; epoch_in and epoch_in2 are the epoch equivalents of the timestamp, to test how the indexes perform. We have three different indexes (for testing purposes): one over a timestamp field, one over an int8 and one over a double precision field. While doing the tests this table has about 100.000 entries.
To test the different indexes, I have executed the following:

Index over timestamp_in (timestamp):

# explain analyze select * from payment_transactions
  where timestamp_in between '2007-02-13'::timestamp and '2007-02-15'::timestamp;

 Index Scan using idx_paytrans_timestamp on payment_transactions (cost=0.00..1480.24 rows=1698 width=138) (actual time=11.693..310.402 rows=1587 loops=1)
   Index Cond: ((timestamp_in >= '2007-02-13 00:00:00'::timestamp without time zone) AND (timestamp_in <= '2007-02-15 00:00:00'::timestamp without time zone))
 Total runtime: 318.328 ms
(3 rows)

Index over epoch_in (int8):

# explain analyze select * from payment_transactions
  where epoch_in between extract( epoch from '2007-02-13'::date )::int8
                     and extract( epoch from '2007-02-15'::date )::int8;

 Index Scan using idx_paytrans_epochin on payment_transactions (cost=0.00..1483.24 rows=1698 width=138) (actual time=34.369..114.943 rows=1587 loops=1)
   Index Cond: ((epoch_in >= 1171321200::bigint) AND (epoch_in <= 1171494000::bigint))
 Total runtime: 120.804 ms
(3 rows)

Index over epoch_in2 (double precision):

# explain analyze select * from payment_transactions
  where epoch_in2 between extract( epoch from '2007-02-13'::date )
                      and extract( epoch from '2007-02-15'::date );

 Index Scan using idx_paytrans_epochin2 on payment_transactions (cost=0.00..1479.24 rows=1698 width=138) (actual time=26.115..51.357 rows=1587 loops=1)
   Index Cond: ((epoch_in2 >= 1171321200::double precision) AND (epoch_in2 <= 1171494000::double precision))
 Total runtime: 57.065 ms
(3 rows)

As you can see, the time differences are very big:

 timestamp:         318.328 ms
 int8 index:        120.804 ms
 double precision:   57.065 ms

Is this normal? Am I doing anything wrong? As a rule of thumb, is it better to store epochs than timestamps? Thank you very much -- Arnau
Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance
Le mardi 22 mai 2007, Richard Huxton a écrit : valgog wrote: I found several post about INSERT/UPDATE performance in this group, but actually it was not really what I am searching an answer for... I have a simple reference table WORD_COUNTS that contains the count of words that appear in a word array storage in another table. I think this is the root of your problem, I'm afraid. You're trying to count individual words when you're storing an array of words. I don't think any of the Gist/GIN indexes will help you with this either. However, since you don't want to start from here isn't very useful here and now: 1. See what the performance (explain analyse) of the select distinct...generate_series() statement is. I think you're right and it's going to be slow. 2. You're looping through each row of word_storage and counting separately. Write it as one query if possible. 3. As Peter says, don't insert then update, start with an empty table and just insert totals for the lot (see #2). I'd probably write the query in plperl/python or something else that supports hash/dictionary structures. Then just process the whole word_storage into the hash - assuming you only have a few thousand distinct words that shouldn't take up too much memory. +1 I made something very similar, and using PL/pgsql is very slow, when using perl is very quick. I have also use partioning because of cost of update (copy last partition to the new, adding the new count, so there is only insert, and drop old table if you want) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Performance comparison of indexes over timestamp fields
On 5/22/07, Arnau [EMAIL PROTECTED] wrote: On older versions of PostgreSQL, at least in my experience, queries on timestamp fields, even having indexes, were performing quite badly: mainly sequential scans were performed. PostgreSQL uses B-tree indexes for scalar values. For an expression such as t between a and b, I believe it's going to match both sides of the table independently (i.e., t >= a and t <= b) and intersect these subsets. This is inefficient. You should get better performance by mapping timestamps to a one-dimensional plane and indexing them using GiST. GiST implements an R-tree-like structure that supports bounding-box searches. This involves setting up a functional index:

create index ... on payment_transactions using gist (
  box(point(extract(epoch from time), 0),
      point(extract(epoch from time), 0)) box_ops)

I'm using box() here because GiST doesn't have a concept of points. Then insert as usual, and then query with something like:

select ... from payment_transactions
where box(point(extract(epoch from '2006-04-01'::date), 0),
          point(extract(epoch from '2006-08-01'::date), 0))
   && box(point(extract(epoch from time), 0),
          point(extract(epoch from time), 0));

PostgreSQL should be able to exploit the GiST index by recognizing that the box() expression operand is already computed in the index. This is less convenient and less portable -- I would love for PostgreSQL to provide syntactic sugar and special-casing to make this transparent -- but worth it if you are dealing with a lot of range searches. Now I have a newer version of PostgreSQL and I've done some tests comparing the performance of an index over a timestamp field with a numeric field.
To do so, I have the following table: Table public.payment_transactions: transaction_id character varying(32) not null, timestamp_in timestamp without time zone default now(), credits integer, epoch_in bigint, epoch_in2 double precision [snip] A timestamp is stored internally as an 8-byte double-precision float. Therefore, timestamp_in and epoch_in2 should behave identically. While doing the tests this table has about 100.000 entries. Make sure PostgreSQL is able to keep the entire table in memory by setting shared_buffers; you don't want to be hitting the disk. Make sure you run analyze on the table before you execute the test. To test the different indexes I have executed the following: Your query plans are roughly identical. The difference in the timings implies that you only ran the queries once. I suggest you run each query at least 10 times, and report the individual numbers (the "Total runtime" parts of the output) you get. Arithmetic means are not that interesting. Alexander.
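Alexander's point about single runs and arithmetic means can be illustrated with numbers: one cold-cache run dominates the mean while the median stays representative. The runtimes below are hypothetical figures shaped like the thread's results, not real measurements:

```python
import statistics

# Hypothetical "Total runtime" figures (ms) from ten runs of the same
# EXPLAIN ANALYZE; the first run pays the cold-cache penalty.
runtimes = [318.3, 61.2, 58.9, 60.4, 57.1, 59.8, 58.2, 60.9, 57.7, 58.5]

print(f"min={min(runtimes)} "
      f"median={statistics.median(runtimes):.2f} "
      f"mean={statistics.mean(runtimes):.1f}")
# The mean (85.1) is dragged far above the typical run (~59 ms)
# by the single 318 ms outlier -- which is why individual numbers
# tell you more than the average.
```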
Re: [PERFORM] Performance comparison of indexes over timestamp fields
On Tue, May 22, 2007 at 02:39:33PM +0200, Alexander Staubo wrote: PostgreSQL uses B-tree indexes for scalar values. For an expression such as t between a and b, I believe it's going to match both sides of the table independently (ie., t >= a and t <= b) and intersect these subsets. This is inefficient. A B-tree index can satisfy range queries such as this. You should get better performance by mapping timestamps to a one-dimensional plane and indexing them using GiST. GiST implements an R-tree-like structure that supports bounding-box searches. You may be thinking of interval overlaps? /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Performance comparison of indexes over timestamp fields
On 5/22/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote: On Tue, May 22, 2007 at 02:39:33PM +0200, Alexander Staubo wrote: PostgreSQL uses B-tree indexes for scalar values. For an expression such as t between a and b, I believe it's going to match both sides of the table independently (ie., t >= a and t <= b) and intersect these subsets. This is inefficient. A B-tree index can satisfy range queries such as this. You're right, and I'm wrong -- my head is not in the right place today. B-trees are inefficient for intervals, but quite satisfactory for range searches. Alexander. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
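For the record, the plain B-tree approach the thread settles on would look something like this (a sketch using Arnau's table from earlier in the thread; the index name is made up):

```sql
-- An ordinary B-tree index handles BETWEEN range scans directly;
-- no GiST/box mapping is needed for one-dimensional ranges.
CREATE INDEX payment_transactions_timestamp_in_idx
    ON payment_transactions (timestamp_in);

SELECT *
FROM payment_transactions
WHERE timestamp_in BETWEEN '2006-04-01' AND '2006-08-01';
```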
[PERFORM] Tips &amp; Tricks for validating hardware/os
Hi, Out of curiosity, can anyone share his tips &amp; tricks to validate a machine before labelling it as 'ready to use postgres - you probably won't trash my data today'? I'm looking for a way to stress test components, especially kernel/disk, to have confidence &gt; 0 that I can use postgres on top of it. Any secret trick is welcome (beside the memtest one :) Thanks! -- stephane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] Tips &amp; Tricks for validating hardware/os
On 5/22/07, Stephane Bailliez [EMAIL PROTECTED] wrote: Out of curiosity, can anyone share his tips &amp; tricks to validate a machine before labelling it as 'ready to use postgres - you probably won't trash my data today'? I'm looking for a way to stress test components, especially kernel/disk, to have confidence &gt; 0 that I can use postgres on top of it. Any secret trick is welcome (beside the memtest one :) Compile the Linux kernel -- it's a pretty decent stress test. You could run pgbench, which comes with PostgreSQL (as part of the contrib package). Give a database size that's larger than the amount of physical memory in the box. Alexander. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Tips &amp; Tricks for validating hardware/os
Out of curiosity, can anyone share his tips &amp; tricks to validate a machine before labelling it as 'ready to use postgres - you probably won't trash my data today'? I'm looking for a way to stress test components, especially kernel/disk, to have confidence &gt; 0 that I can use postgres on top of it. That would be running a filesystem benchmark, pulling the plug, then counting the dead. http://sr5tech.com/write_back_cache_experiments.htm ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Performance comparison of indexes over timestamp fields
Arnau [EMAIL PROTECTED] writes: As you can see the time differences are very big:

  timestamp:        318.328 ms
  int8 index:       120.804 ms
  double precision:  57.065 ms

As already suggested elsewhere, you probably weren't sufficiently careful in taking your measurements. A look at the code says that int8 comparison ought to be the fastest of these. If timestamps are implemented as floats (which you didn't say) their comparison speed ought to be *exactly* the same as floats, because the comparison functions are line-for-line the same. If timestamps are implemented as int8 then they should be similar to int8 comparisons, maybe a tad slower due to an extra level of function call. But in any case it seems likely that the actual comparison function calls would be just a small fraction of the runtime. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[PERFORM] Domains versus Check Constraints
Are there any performance improvements that come from using a domain over a check constraint (aside from the ease of management component)? thanks -- Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 Phone: 877-258-8987/919-463-0999 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
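For readers weighing the two options in Chander's question, the alternatives look like this (a minimal sketch; the type, table, and column names are made up for illustration):

```sql
-- Domain: the constraint travels with a reusable type.
CREATE DOMAIN positive_credits AS integer CHECK (VALUE > 0);
CREATE TABLE accounts_a (credits positive_credits);

-- Check constraint: declared per column on each table that needs it.
CREATE TABLE accounts_b (credits integer CHECK (credits > 0));
```

Both enforce the same rule on insert/update; the domain mainly centralizes the definition, which is the "ease of management" aspect the question already sets aside.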
[PERFORM] Drop table vs Delete record
My application has two threads: one inserts thousands of records per second into a table (t1), and the other periodically deletes expired records (also in thousands) from the same table. So, we have one thread adding rows while the other thread is trying to delete rows. In a short time the overall performance of any sql statements on that instance degrades. (ex. Select count(*) from t1 takes more than a few seconds with less than 10K rows). My question is: Would any sql statement perform better if I would rename the table to t1_%indx periodically, create a new table t1 (for new inserts) and just drop the tables with expired records rather than doing a delete record? (t1 is a simple table with many rows and no constraints). (I know I could run vacuum analyze) Thanks, Orhan A.
Re: [PERFORM] Drop table vs Delete record
Consider table partitioning (it's described in the manual). Andreas -- Ursprüngl. Mitteil. -- Betreff:[PERFORM] Drop table vs Delete record Von:Orhan Aglagul [EMAIL PROTECTED] Datum: 22.05.2007 18:42 My application has two threads: one inserts thousands of records per second into a table (t1), and the other periodically deletes expired records (also in thousands) from the same table. So, we have one thread adding rows while the other thread is trying to delete rows. In a short time the overall performance of any sql statements on that instance degrades. (ex. Select count(*) from t1 takes more than a few seconds with less than 10K rows). My question is: Would any sql statement perform better if I would rename the table to t1_%indx periodically, create a new table t1 (for new inserts) and just drop the tables with expired records rather than doing a delete record? (t1 is a simple table with many rows and no constraints). (I know I could run vacuum analyze) Thanks, Orhan A. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
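A sketch of the inheritance-based partitioning the manual describes (as of the 8.x era; table and partition names here are made up). Expiring old data then becomes a cheap DROP TABLE rather than a bulk DELETE plus vacuum:

```sql
CREATE TABLE t1 (
    id      bigint,
    payload text,
    created timestamptz DEFAULT now()
);

-- One child table per time slice; the application (or an insert
-- rule/trigger on t1) directs new rows at the current child.
CREATE TABLE t1_2007_05_22 (
    CHECK (created >= '2007-05-22' AND created < '2007-05-23')
) INHERITS (t1);

-- Expiring a whole slice (an older child created the same way)
-- is a metadata operation, not a row-by-row DELETE:
DROP TABLE t1_2007_05_21;
```

Queries against t1 still see all children, and with constraint_exclusion enabled the planner can skip partitions whose CHECK range cannot match.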
Re: [PERFORM] Drop table vs Delete record
Checking out right now. Thanks for the fast response. -Original Message- From: Andreas Kostyrka [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 22, 2007 11:49 AM To: Orhan Aglagul Cc: pgsql-performance@postgresql.org Subject: AW: [PERFORM] Drop table vs Delete record Consider table partitioning (it's described in the manual). Andreas ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] Postgres Benchmark Results
On Tue, 22 May 2007, Gregory Stark wrote: However as mentioned a while back in practice it doesn't work quite right and you should expect to get 1/2 the expected performance. So even with 10 clients you should expect to see 5*120 tps on a 7200 rpm drive and 5*250 tps on a 15krpm drive. I would agree that's the approximate size of the upper-bound. There are so many factors that go into the effectiveness of commit_delay that I wouldn't word it so strongly as to say you can expect that much benefit. The exact delay amount (which can be hard to set if your client load varies greatly), size of the transactions, balance of seek-bound reads vs. memory based ones in the transactions, serialization in the transaction stream, and so many other things can slow the effective benefit. Also, there are generally other performance issues in the types of systems you would think would get the most benefit from this parameter that end up slowing things down anyway. I've been seeing a best case of closer to 2*single tps rather than 5* on my single-drive systems with no write caching, but I'll admit I haven't done an exhaustive look at it yet (too busy with the real systems that have good controllers). One of these days... -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
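For reference, the group-commit knobs under discussion are set like this (the values are purely illustrative, not recommendations -- as Greg notes, the right delay is hard to pick and depends on the workload):

```sql
-- commit_delay: microseconds to sleep before flushing WAL, in the
-- hope that other transactions commit in the same fsync.
SET commit_delay = 100;

-- commit_siblings: only apply the delay when at least this many
-- other transactions are currently active.
SET commit_siblings = 5;
```

Both can also be set in postgresql.conf; the rotational bounds quoted above come from one commit per platter revolution (7200 rpm / 60 = 120 commits/s, 15000 rpm / 60 = 250 commits/s) when no write cache or group commit is helping.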