Re: [PERFORM] Improving PostgreSQL insert performance
Alvaro>Something like INSERT INTO .. VALUES ('col1', 'col2'), ('col1', 'col2'), ('col1', 'col2')>I did not Frits>try that, to be honest. pgjdbc does automatically rewrite insert values(); into insert ... values(),(),(),() when reWriteBatchedInserts=true. I don't expect manual multivalues to be noticeably faster there. Frits>https://etc.to/confluence/display/~admjal/PostgreSQL+performance+tests Do you really intend to measure just a single insert operation? It looks odd, as typical applications would execute inserts for quite a while before they terminate. You are including lots of warmup overheads (e.g. JIT-compilation), so your approach does not measure peak performance. On the other hand, you are not measuring enough time to catch things like "DB log switch". Would you please use JMH as a load driver? Here's an example: https://github.com/pgjdbc/pgjdbc/blob/master/ubenchmark/src/main/java/org/postgresql/benchmark/statement/InsertBatch.java Vladimir >
Re: [PERFORM] Improving PostgreSQL insert performance
Frits, Would you mind sharing the source code of your benchmark? >BTW: It seems you need a recent driver for this; I'm using postgresql-42.1.1.jar Technically speaking, reWriteBatchedInserts was introduced in 9.4.1209 (2016-07-15) Vladimir
Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT
What could cause this? Note that there is no ANALYZE. Can you capture pstack and/or perf report while explain hangs? I think it should shed light on the activity of PostgreSQL. Vladimir -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query taking long time
I have simplified the query and added the last advise that you told me: Query: explain analyze select * from (select * from entity_compounddict2document where name='ranitidine') as a order by a.hepval; Do you need full result? If you need just top-n rows, then index on entity_compounddict2document(name, a.hepval) might help. Regards, Vladimir Sitnikov
Re: [PERFORM] Performance of complicated query
This leads to the WHERE clause, WHERE read_datetime = max_read, and hence I'm only summing the last read for each device for each patient. Is reads table insert-only? Do you have updates/deletes of the historical rows? 3. Can I modify my tables to make this query (which is the crux of my application) run faster? Can you have a second reads table that stores only up to date values? That will eliminate max-over completely, enable efficient usage in other queries, and make your queries much easier to understand by humans and computers. PS. read_datetime = max_read is prone to what if two measurements have same date errors. PPS. distinct MAX(max_read) OVER(PARTITION BY patient_id) AS latest_read looks like a complete mess. Why don't you just use group by? Regards, Vladimir
Re: [PERFORM] FW: performance issue with a 2.5gb joinded table
Daniel, Somehow oracle seems to know that a right join is the better way to go. In fact, PostgreSQL is just doing the same thing: it hashes smaller table and scans the bigger one. Could you please clarify how do you consume 25M rows? It could be the difference of response times comes not from the PostgreSQL itself, but from the client code. Could you please add the following information? 1) Execution time of simple query that selects MAX of all the required columns select max(test1.slsales_batch) , max(test1.slsales_checksum), I mean not explain (analyze, buffers), but simple execution. The purpose of MAX is to split overhead of consuming of the resultset from the overhead of producing it. 2) explain (analyze, buffers) for the same query with maxes. That should reveal the overhead of explain analyze itself. 3) The output of the following SQLPlus script (from Oracle): set linesize 1000 pagesize 1 trimout on trimspool on time on timing on spool slow_query.lst select /*+ gather_plan_statistics */ max(test1.slsales_batch) , max(test1.slsales_checksum), ..; select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST')); spool off That would display detailed statistics on execution time similar to the explain (analyze, buffers). 4) Could you please clarify how did you migrate test1 table? I guess the order of rows in that table might affect overall execution time. Sorted table would be more CPU cache friendly, thus giving speedup. (see [1] for similar example). As far as I understand, simple create table as select * from test1 order by slsales_date_id, slsales_prod_id should improve cache locality. [1]: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array -- Regards, Vladimir Sitnikov
Re: [PERFORM] filesystems benchmark
I'm trying to compare different filesystems for postgres using pgbench. The problem I've seen is that my IO wait is very very low. Is there a way I can get pgbench to do more to push that up a little? Why do you expect to see high IO wait? Does your database fit in system memory? If you do not modify the data and it is not large enough, it is unlikely you will hit IO wait. Do you have write cache enabled for the file system/device? Even if you have some DML statements, it would not be that easy to saturate write cache enabled storage (especially, when it fits in memory). Regards, Vladimir Sitnikov
Re: [PERFORM] query plan with index having a btrim is different for strings of different length
tii=# explain analyze SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smith')); Is there an index on m_object_paper.assignment? It could solve the problem. With current indices on btrim(last_name) and owner you are just throwing the rows away (you have 521382 rows with smith, 15494737 with owner=-1 and only 58 of them have both smith/jordan and -1). Consider creating index on m_object_paper using btree(lower(btrim(x_lastname))) where owner=-1; (it might add firstname column there as per Tom's suggestion) Or just index on (owner, lower(...)) if you have other queries with different values for owner. One more point that could improve bitmap scans is greater value for work_mem. You'll need 8*15494737 ~ 130Mb == 13 for work_mem (however, that is way too high unless you have lots of RAM and just couple of active database sessions) Regards, Vladimir Sitnikov
Re: [PERFORM] Slow SQL query (14-15 seconds)
Could you please try this one: SELECT pk_societe_id, denomination_commerciale, denomination_sociale, numero_client, COALESCE(stats_commandes.nombre, 0) AS societe_nbre_commandes, COALESCE(stats_adresses_livraison.nombre, 0) AS societe_adresses_livraison_quantite, COALESCE(stats_adresses_facturation.nombre, 0) AS societe_adresses_facturation_quantite, COALESCE(NULLIF(admin_email,''), NULLIF(admin_bis_email,''), NULLIF(admin_ter_email,''), 'n/a') AS email, COALESCE(NULLIF(admin_tel,''), NULLIF(admin_bis_tel,''), NULLIF(admin_ter_tel,''), 'n/a') AS telephone, remise_permanente, is_horeca FROM societes LEFT JOIN ( SELECT societes.pk_societe_id AS societe_id, COUNT(commandes.pk_commande_id) AS nombre, max(case when delivery_date_livraison BETWEEN (NOW() - '1 year'::interval) AND NOW() then 1 end) AS il_y_avait_un_commande FROM commandes INNER JOIN clients ON commandes.fk_client_id = clients.pk_client_id INNER JOIN societes ON clients.fk_societe_id = societes.pk_societe_id GROUP BY societes.pk_societe_id ) AS stats_commandes ON stats_commandes.societe_id = societes.pk_societe_id LEFT JOIN ( SELECT fk_societe_id AS societe_id, COUNT(pk_adresse_livraison_id) AS nombre, FROM societes_adresses_livraison WHERE is_deleted = FALSE GROUP BY fk_societe_id ) AS stats_adresses_livraison ON stats_adresses_livraison.societe_id = societes.pk_societe_id LEFT JOIN ( SELECT fk_societe_id AS societe_id, COUNT(pk_adresse_facturation_id) AS nombre FROM societes_adresses_facturation WHERE is_deleted = FALSE GROUP BY fk_societe_id ) AS stats_adresses_facturation ON stats_adresses_facturation.societe_id = societes.pk_societe_id WHERE societes.is_deleted = FALSE and il_y_avait_un_commande=1 ORDER BY LOWER(denomination_commerciale); Bien a vous, Vladimir Sitnikov
Re: [PERFORM] Increasing select max(datecol) from bilkaib where datecol=date'2008-11-01' and (cr='00' or db='00') speed
On Wed, Nov 12, 2008 at 9:02 AM, Andrus [EMAIL PROTECTED] wrote: There are columns kuupaev date, cr char(10), db char(10) and regular indexes for all those fields. bilkaib table contains large number of rows. The following query takes too much time. How to make it faster ? I think PostgreSql should use multiple indexes as bitmaps to speed it. I am afraid I do not see a way to use bitmaps to get any improvement here: the server will still need to read the whole indices to figure out the answer. I suggest you to create two more indices: create index date_with_zero_cr on bilkaib(date) where cr='00'; create index date_with_zero_db on bilkaib(date) where db='00'; And rewrite query as follows: select greatest( (select max(date) from bilkaib where datecol=date'2008-11-01' and cr='00'), (select max(date) from bilkaib where datecol=date'2008-11-01' and db='00')) Regards, Vladimir Sitnikov
Re: [PERFORM] slow full table update
Recheck Cond: ((sid 2) AND (sid 3)) - Bitmap Index Scan on pk_songs2 (cost=0.00..151.59 rows=8931 width=0) (actual time=4.071..4.071 rows=9579 loops=1) Index Cond: ((sid 2) AND (sid 3)) Is there a way to run this query on sigle throughpass with no Recheck Cond? Recheck Cond is somewhat misleading here. Bitmap Index Scan has almost void recheck impact in case the whole bitmap fits in work_mem. That means bitmap scan degrades when the number of rows in table (not the total number of returned rows) is greater than work_mem*1024*8. 60'000 rows bitmap scan will require 60'000/8=7'500 bytes ~ 8Kbytes of memory to run without additional recheck, thus I do not believe it hurts you in this particular case Regards, Vladimir Sitnikov
Re: [PERFORM] Increasing select max(datecol) from bilkaib where datecol=date'2008-11-01' and (cr='00' or db='00') speed
This query finds initial balance date befeore given date. If you are not interested in other balances except initial ones (the ones that have '00') the best way is to create partial indices that I have suggested. That will keep size of indices small, while providing good performance (constant response time) bilkaib table contains several year transactions so it is large. That is not a problem for the particular case. However, when you evaluate query performance, it really makes sense giving number of rows in each table (is 100K rows a large table? what about 10M rows?) and other properties of the data stored in the table (like number of rows that have cr='00') Alternatively if you create an index on (cr, bilkaib) and one on (db, bilkaib) then you will be able to use other values in the query too. That means if you create one index on biklaib (cr, datecol) and another index on (db, datecol) you will be able to improve queries like select greatest( (select max(date) from bilkaib where datecol=date'2008-11-01' and cr=XXX), (select max(date) from bilkaib where datecol=date'2008-11-01' and db=YYY)). with arbitrary XXX and YYY. I am not sure if you really want this. I'm sorry I do'nt understand this. What does the (cr, bilkaib) syntax mean? I believe that should be read as (cr, datecol). Should I create two functions indexes and re-write query as Vladimir suggests or is there better appoach ? I am afraid PostgreSQL is not smart enough to rewrite query with or into two separate index scans. There is no way to improve the query significantly without rewriting it. Note: for this case indices on (datecol), (cr) and (db) are not very helpful. Regards, Vladimir Sitnikov
Re: [PERFORM] Using index for IS NULL query
Yes, NULL values are not stored in the index, but you may create functional index on Are you sure NULL values are not stored? btree, gist and bitmap index and search for NULL values. select amname, amindexnulls, amsearchnulls from pg_am; amname | amindexnulls | amsearchnulls +--+--- btree | t| t hash | f| f gist | t| t gin| f| f bitmap | t| t (5 rows) Sincerely yours, Vladimir Sitnikov
Re: [PERFORM] Improve Seq scan performance
Maybe there's some hybrid type possible where you can scan the index to find large table regions that are known /not/ to contain tuples of interest and seek over them in your scan. I wouldn't know, really, but it sounds like it'd probably be more I/O than a pure seq scan (given the reading of the index too) unless the table had the values of interest rather neatly clustered. It'd also surely use more memory and CPU time processing the whole index to find table regions without values of interest. Is that what you meant, though? Not exactly. I mean the following: there are cases when index scan even over non-clustered values is a complete win (basically, it is a win when the number of returned values is relatively small no matter is it due to selectivity or due to limit clause). The test case that I have provided creates a 667 pages long table and 30 pages long index thus a complete scan of the index is 22 times faster in terms of I/O. Suppose you want to find all the values that contain '%123%'. Currently PostgreSQL will do a sec scan, while the better option might be (and it is) to loop through all the items in the index (it will cost 30 I/O), find records that truly contain %123% (it will find 20 of them) and do 20 I/O to check tuple visiblity. That is 50 I/O versus 667 for seq scan. A b-tree index cannot be used on a LIKE query with a leading wildcard. See the FAQ. Unfortunately it is true. I would love to improve that particular case. In addition, if your database is not in the C locale you can't use an ordinary index for LIKE queries. See the FAQ. You need to create a text_pattern_ops index instead: create index i_ix_txt on seq_test(i text_pattern_ops); Good catch. However, that does not change the results. PostgresSQL does the same amount of 2529 I/O for index scan on '%123%' for some unknown reason. set enable_seqscan=off -- Index Scan reads 2529 pages for some reason. I would expect *30 *(index size) + *20 *(number of matching entries) = 50 pages maximum, that is 10 times better than with seq scan. Index Scan using i_ix on seq_test (cost=0.00..1643.74 rows=356 width=508) (actual time=0.334..16.746 rows=*20 *loops=1 read_shared=2529(2529) read_local=0(0) flush=0 local_flush=0 file_read=0 file_write=0) Filter: (i ~~ '%123%'::text) Total runtime: 16.863 ms I think it's reading the whole index, because it can't do a prefix search if there's a leading wildcard. I'm a bit confused, though, since I thought in this case it couldn't actually execute the query w/o a sequential scan, and would just use one irrespective of the enable_seqscan param. That's what happens here. Please, follow the case carefully: the index is only 30 pages long. Why is PostgreSQL doing 2529 I/O? It drives me crazy. Regards, Vladimir Sitnikov
Re: [PERFORM] Improve Seq scan performance
Lutischán Ferenc wrote: It is possible to make an index on the table, and make a seq index scan on this values? My understanding is that this isn't possible in PostgreSQL, because indexes do not contain information about tuple visibility. Data read from the index might refer to tuples that've been deleted as far as your transaction is concerned, or to tuples that were created after your snapshot was taken. My understanding is even though indices do not contain information on tuple visibility, index could be used to filter out records that is known to make no match. Since btree index stores exact values, PostgreSQL could scan through the index and skip those entries that do not contain '%aaa%'. That will dramatically improve cases where the criteria has good selectivity, since index has much more compact structure than table. As far as I understand, it is discouraged to implement/suggest patches during Commitfest, however, I would love to treat the following case as a performance bug and add it to the TODO list: create table seq_test as select cast(i as text) i, repeat('*', 500) padding from generate_series(1,1) as s(i); create index i_ix on seq_test(i); vacuum analyze verbose seq_test; -- index i_ix now contains 1 row versions in *30 *pages -- seq_test: found 0 removable, 1 nonremovable row versions in *667 * pages explain analyze select * from seq_test where i like '%123%'; -- Seq Scan reads 667 pages (as expected) Seq Scan on seq_test (cost=0.00..792.00 rows=356 width=508) (actual time=0.129..9.071 rows=20 loops=1 read_shared=*667*(667) read_local=0(0) flush=0 local_flush=0 file_read=0 file_write=0) Filter: (i ~~ '%123%'::text) Total runtime: 9.188 ms set enable_seqscan=off -- Index Scan reads 2529 pages for some reason. I would expect *30 *(index size) + *20 *(number of matching entries) = 50 pages maximum, that is 10 times better than with seq scan. Index Scan using i_ix on seq_test (cost=0.00..1643.74 rows=356 width=508) (actual time=0.334..16.746 rows=*20 *loops=1 read_shared=2529(2529) read_local=0(0) flush=0 local_flush=0 file_read=0 file_write=0) Filter: (i ~~ '%123%'::text) Total runtime: 16.863 ms Hopefully, there will be a clear distinction between filtering via index and filtering via table access. Regards, Vladimir Sitnikov
Re: [PERFORM] indexing for distinct search in timestamp based table
You might get great improvement for '%' cases using index on channel_name(field, start_time) and a little bit of pl/pgsql Basically, you need to implement the following algorithm: 1) curr_field = ( select min(field) from ad_log ) 2) record_exists = ( select 1 from ad_log where field=cur_field and _all_other_conditions limit 1 ) 3) if record_exists==1 then add curr_field to the results 3) curr_field = (select min(field) from ad_log where field curr_field ) 4) if curr_field is not null then goto 2 I believe it might make sense implement this approach in the core (I would call it index distinct scan) That could dramatically improve select distinct column from table and select column from table group by column kind of queries when there exists an index on column and a particular column has very small number of distinct values. For instance: say a table has 10'000'000 rows, while column of interest has only 20 distinct values. In that case, the database will be able to get every of those 20 values in virtually 20 index lookups. What does the community think about that?