Re: [PERFORM] Improving PostgreSQL insert performance

2017-06-11 Thread Vladimir Sitnikov
Alvaro> Something like
Alvaro> INSERT INTO .. VALUES ('col1', 'col2'), ('col1', 'col2'), ('col1', 'col2')
Frits> I did not try that, to be honest.

pgjdbc automatically rewrites insert ... values() into insert ...
values(),(),(),() when reWriteBatchedInserts=true, so I don't expect manually
written multivalue inserts to be noticeably faster there.
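
For illustration (the table and values here are made up), the rewrite turns a
batch of single-row statements into one multi-row statement, roughly:

-- what the application executes as a JDBC batch:
insert into t (col1, col2) values ('a', 'b');
insert into t (col1, col2) values ('c', 'd');
insert into t (col1, col2) values ('e', 'f');

-- what the driver actually sends when reWriteBatchedInserts=true:
insert into t (col1, col2) values ('a', 'b'), ('c', 'd'), ('e', 'f');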


Frits>https://etc.to/confluence/display/~admjal/PostgreSQL+performance+tests

Do you really intend to measure just a single insert operation?
It looks odd, as typical applications would execute inserts for quite a
while before they terminate.

You are including lots of warmup overheads (e.g. JIT-compilation), so your
approach does not measure peak performance.
On the other hand, you are not measuring enough time to catch things like
"DB log switch".

Would you please use JMH as a load driver?
Here's an example:
https://github.com/pgjdbc/pgjdbc/blob/master/ubenchmark/src/main/java/org/postgresql/benchmark/statement/InsertBatch.java


Vladimir



Re: [PERFORM] Improving PostgreSQL insert performance

2017-06-09 Thread Vladimir Sitnikov
Frits,

Would you mind sharing the source code of your benchmark?

>BTW: It seems you need a recent driver for this; I'm
using postgresql-42.1.1.jar

Technically speaking, reWriteBatchedInserts was introduced in 9.4.1209
(2016-07-15)

Vladimir


Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-05 Thread Vladimir Sitnikov
 What could cause this? Note that there is no ANALYZE.

Can you capture pstack and/or perf report while explain hangs?
I think it should shed light on the activity of PostgreSQL.

Vladimir




Re: [PERFORM] Query taking long time

2014-03-03 Thread Vladimir Sitnikov
 I have simplified the query and added the last advice that you gave me:

 Query:

  explain analyze select * from (select * from
entity_compounddict2document  where name='ranitidine') as a order by
a.hepval;

Do you need the full result set?

If you need just the top-n rows, then an index on
entity_compounddict2document(name, hepval) combined with a LIMIT might help.
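
For example, something along these lines (the index name and the LIMIT value
here are made up):

create index entity_compounddict2document_name_hepval
    on entity_compounddict2document (name, hepval);

select *
  from entity_compounddict2document
 where name = 'ranitidine'
 order by hepval
 limit 100;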

Regards,
Vladimir Sitnikov


Re: [PERFORM] Performance of complicated query

2013-05-23 Thread Vladimir Sitnikov
 This leads to the WHERE clause, WHERE read_datetime = max_read, and hence
 I'm only summing the last read for each device for each patient.
Is the reads table insert-only? Do you have updates/deletes of the
historical rows?

 3. Can I modify my tables to make this query (which is the crux of my
 application) run faster?
Could you keep a second reads table that stores only the up-to-date values?
That would eliminate the max-over-window completely, enable efficient use in
other queries, and make your queries much easier for both humans and
computers to understand.
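
A rough sketch of what I mean (the table layout and column names are guesses,
adjust them to your schema; run the maintenance statements in the same
transaction as the insert into reads):

create table latest_reads (
    patient_id    integer not null,
    device_id     integer not null,
    read_datetime timestamp not null,
    value         numeric,
    primary key (patient_id, device_id)
);

-- on every new measurement ($1..$4 are the values just inserted into reads):
update latest_reads
   set read_datetime = $3, value = $4
 where patient_id = $1 and device_id = $2 and read_datetime < $3;

insert into latest_reads (patient_id, device_id, read_datetime, value)
select $1, $2, $3, $4
 where not exists (select 1 from latest_reads
                    where patient_id = $1 and device_id = $2);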

PS. read_datetime = max_read is prone to "what if two measurements have the
same date" errors.
PPS. distinct MAX(max_read) OVER (PARTITION BY patient_id) AS latest_read
looks like a complete mess. Why not just use GROUP BY?
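
For instance, the same thing without the window function looks roughly like
this (a sketch, ignoring the surrounding joins):

select patient_id, max(read_datetime) as latest_read
  from reads
 group by patient_id;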


Regards,
Vladimir


Re: [PERFORM] FW: performance issue with a 2.5gb joinded table

2013-01-09 Thread Vladimir Sitnikov
Daniel,

 Somehow oracle seems to know that a right join is the better way to go.

In fact, PostgreSQL is just doing the same thing: it hashes the smaller table
and scans the bigger one.

Could you please clarify how you consume the 25M rows?
It could be that the difference in response times comes not from PostgreSQL
itself, but from the client code.

Could you please add the following information?
1) Execution time of a simple query that selects MAX of all the required
columns: select max(test1.slsales_batch), max(test1.slsales_checksum), ...

I mean not explain (analyze, buffers), but a simple execution.
The purpose of MAX is to separate the overhead of consuming the result set
from the overhead of producing it.

2) explain (analyze, buffers) for the same query with maxes. That should
reveal the overhead of explain analyze itself.

3) The output of the following SQLPlus script (from Oracle):
  set linesize 1000 pagesize 1 trimout on trimspool on time on timing on
  spool slow_query.lst
  select /*+ gather_plan_statistics */ max(test1.slsales_batch) ,
max(test1.slsales_checksum), ..;
  select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS
LAST'));
  spool off

  That would display detailed statistics on execution time similar to the
explain (analyze, buffers).

4) Could you please clarify how you migrated the test1 table?
I guess the order of rows in that table might affect the overall execution
time. A sorted table would be more CPU-cache friendly, thus giving a speedup
(see [1] for a similar example).
As far as I understand, a simple create table as select * from test1 order by
slsales_date_id, slsales_prod_id should improve cache locality.


[1]:
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
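
For point 4, the rebuild I have in mind is simply something like this (a
sketch; indexes and constraints would have to be recreated on the new table):

create table test1_sorted as
select *
  from test1
 order by slsales_date_id, slsales_prod_id;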

-- 
Regards,
Vladimir Sitnikov


Re: [PERFORM] filesystems benchmark

2008-12-15 Thread Vladimir Sitnikov

 I'm trying to compare different filesystems for postgres using pgbench.
  The problem I've seen is that my IO wait is very very low.  Is there a way
 I can get pgbench to do more to push that up a little?

Why do you expect to see high IO wait?

Does your database fit in system memory? If you do not modify the data and
the database is small enough to stay cached, it is unlikely you will hit IO
wait.

Do you have a write cache enabled for the file system/device? Even if you
have some DML statements, it would not be that easy to saturate write-cache
enabled storage (especially when the working set fits in memory).
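
A quick sanity check is to compare the on-disk size of the pgbench database
with the amount of RAM, for example:

select pg_size_pretty(pg_database_size(current_database()));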


Regards,
Vladimir Sitnikov


Re: [PERFORM] query plan with index having a btrim is different for strings of different length

2008-12-10 Thread Vladimir Sitnikov

 tii=# explain analyze SELECT m_object_paper.id FROM m_object_paper,
 m_assignment WHERE m_object_paper.assignment = m_assignment.id AND
 m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND
 lower(btrim(x_firstname)) = lower(btrim('Jordan')) and
 lower(btrim(x_lastname)) = lower(btrim('Smith'));

Is there an index on m_object_paper.assignment? It could solve the
problem.

With the current indices on btrim(x_lastname) and owner you are just throwing
rows away (you have 521382 rows with 'smith', 15494737 with owner=-1, and
only 58 of them have both smith/jordan and owner=-1).

Consider creating an index on m_object_paper using
btree(lower(btrim(x_lastname))) where owner=-1 (it might make sense to add
the firstname column there, as per Tom's suggestion).

Or just an index on (owner, lower(...)) if you have other queries with
different values for owner.
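
Spelled out, that would be something like this (the index names are made up):

create index m_object_paper_lastname_owner_ix
    on m_object_paper (lower(btrim(x_lastname)))
 where owner = -1;

-- or, if other owner values are queried as well:
create index m_object_paper_owner_lastname_ix
    on m_object_paper (owner, lower(btrim(x_lastname)));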

One more point that could improve the bitmap scans is a greater value for
work_mem. You would need about 8*15494737 bytes ~ 130MB of work_mem (however,
that is way too high unless you have lots of RAM and just a couple of active
database sessions).


Regards,
Vladimir Sitnikov


Re: [PERFORM] Slow SQL query (14-15 seconds)

2008-11-13 Thread Vladimir Sitnikov
Could you please try this one:

SELECT pk_societe_id,
       denomination_commerciale,
       denomination_sociale,
       numero_client,
       COALESCE(stats_commandes.nombre, 0) AS societe_nbre_commandes,
       COALESCE(stats_adresses_livraison.nombre, 0) AS societe_adresses_livraison_quantite,
       COALESCE(stats_adresses_facturation.nombre, 0) AS societe_adresses_facturation_quantite,
       COALESCE(NULLIF(admin_email, ''), NULLIF(admin_bis_email, ''),
                NULLIF(admin_ter_email, ''), 'n/a') AS email,
       COALESCE(NULLIF(admin_tel, ''), NULLIF(admin_bis_tel, ''),
                NULLIF(admin_ter_tel, ''), 'n/a') AS telephone,
       remise_permanente,
       is_horeca
FROM societes
LEFT JOIN (
       SELECT societes.pk_societe_id AS societe_id,
              COUNT(commandes.pk_commande_id) AS nombre,
              MAX(CASE WHEN delivery_date_livraison
                       BETWEEN (NOW() - '1 year'::interval) AND NOW()
                       THEN 1 END) AS il_y_avait_un_commande
       FROM commandes
       INNER JOIN clients ON commandes.fk_client_id = clients.pk_client_id
       INNER JOIN societes ON clients.fk_societe_id = societes.pk_societe_id
       GROUP BY societes.pk_societe_id
     ) AS stats_commandes
       ON stats_commandes.societe_id = societes.pk_societe_id
LEFT JOIN (
       SELECT fk_societe_id AS societe_id,
              COUNT(pk_adresse_livraison_id) AS nombre
       FROM societes_adresses_livraison
       WHERE is_deleted = FALSE
       GROUP BY fk_societe_id
     ) AS stats_adresses_livraison
       ON stats_adresses_livraison.societe_id = societes.pk_societe_id
LEFT JOIN (
       SELECT fk_societe_id AS societe_id,
              COUNT(pk_adresse_facturation_id) AS nombre
       FROM societes_adresses_facturation
       WHERE is_deleted = FALSE
       GROUP BY fk_societe_id
     ) AS stats_adresses_facturation
       ON stats_adresses_facturation.societe_id = societes.pk_societe_id
WHERE societes.is_deleted = FALSE
  AND il_y_avait_un_commande = 1
ORDER BY LOWER(denomination_commerciale);

Bien a vous,
Vladimir Sitnikov


Re: [PERFORM] Increasing select max(datecol) from bilkaib where datecol<=date'2008-11-01' and (cr='00' or db='00') speed

2008-11-12 Thread Vladimir Sitnikov
On Wed, Nov 12, 2008 at 9:02 AM, Andrus [EMAIL PROTECTED] wrote:

 There are columns
 kuupaev date,  cr char(10), db char(10)
 and regular indexes  for all those fields.
 bilkaib table contains large number of rows.

 The following query takes too much time.
 How to make it faster ?
 I think PostgreSql should use multiple indexes as bitmaps to speed it.

I am afraid I do not see a way to use bitmaps to get any improvement here:
the server will still need to read the whole indices to figure out the
answer.

I suggest you create two more indices:

create index date_with_zero_cr on bilkaib(datecol) where cr='00';
create index date_with_zero_db on bilkaib(datecol) where db='00';

And rewrite the query as follows:
select greatest(
   (select max(datecol) from bilkaib where datecol<=date'2008-11-01' and cr='00'),
   (select max(datecol) from bilkaib where datecol<=date'2008-11-01' and db='00'));


Regards,
Vladimir Sitnikov


Re: [PERFORM] slow full table update

2008-11-12 Thread Vladimir Sitnikov

 Recheck Cond: ((sid > 2) AND (sid < 3))
   ->  Bitmap Index Scan on pk_songs2  (cost=0.00..151.59 rows=8931
 width=0) (actual time=4.071..4.071 rows=9579 loops=1)
         Index Cond: ((sid > 2) AND (sid < 3))

 Is there a way to run this query in a single pass, with no Recheck Cond?

Recheck Cond is somewhat misleading here.

A Bitmap Index Scan has almost no recheck overhead as long as the whole
bitmap fits in work_mem. That means a bitmap scan degrades when the number of
rows in the table (not the total number of returned rows) is greater than
work_mem*1024*8. A 60'000-row bitmap scan will require 60'000/8 = 7'500 bytes
~ 8KB of memory to run without an additional recheck, thus I do not believe
it hurts you in this particular case.


Regards,
Vladimir Sitnikov


Re: [PERFORM] Increasing select max(datecol) from bilkaib where datecol<=date'2008-11-01' and (cr='00' or db='00') speed

2008-11-12 Thread Vladimir Sitnikov

 This query finds the initial balance date before a given date.

If you are not interested in balances other than the initial ones (the ones
that have '00'), the best way is to create the partial indices that I have
suggested.
That will keep the indices small while providing good performance
(constant response time).


 bilkaib table contains several years of transactions, so it is large.

That is not a problem in this particular case. However, when you evaluate
query performance, it really makes sense to give the number of rows in each
table (is 100K rows a large table? what about 10M rows?) and other properties
of the data stored in the table (like the number of rows that have cr='00').


 Alternatively if you create an index on (cr, bilkaib) and one on (db,
 bilkaib) then you will be able to use other values in the query too.

That means if you create one index on bilkaib (cr, datecol) and another
index on (db, datecol) you will be able to improve queries like
select greatest(
   (select max(datecol) from bilkaib where datecol<=date'2008-11-01' and cr=XXX),
   (select max(datecol) from bilkaib where datecol<=date'2008-11-01' and db=YYY))
with arbitrary XXX and YYY. I am not sure if you really want this.
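
That is (the index names are made up):

create index bilkaib_cr_datecol_ix on bilkaib (cr, datecol);
create index bilkaib_db_datecol_ix on bilkaib (db, datecol);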


 I'm sorry I don't understand this.
 What does the (cr, bilkaib) syntax mean?

I believe that should be read as (cr, datecol).



 Should I create the two function indexes and re-write the query as Vladimir
 suggests, or is there a better approach?

I am afraid PostgreSQL is not smart enough to rewrite a query with "or" into
two separate index scans. There is no way to improve the query significantly
without rewriting it.

Note: for this case, separate indices on (datecol), (cr) and (db) are not
very helpful.

Regards,
Vladimir Sitnikov


Re: [PERFORM] Using index for IS NULL query

2008-11-11 Thread Vladimir Sitnikov
 Yes, NULL values are not stored in the index, but you may create functional
 index on

Are you sure NULL values are not stored? btree, gist and bitmap do index NULL
values and can search for them:

select amname, amindexnulls, amsearchnulls from pg_am;

 amname | amindexnulls | amsearchnulls
--------+--------------+---------------
 btree  | t            | t
 hash   | f            | f
 gist   | t            | t
 gin    | f            | f
 bitmap | t            | t
(5 rows)
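
Whether the planner actually uses an index for an IS NULL predicate depends
on the server version and the data, but it is easy to check with EXPLAIN (a
sketch; the table and column are made up):

create table t (col integer);
create index t_col_ix on t (col);
set enable_seqscan = off;
-- if the plan shows an index scan, IS NULL searches can use the index here
explain select * from t where col is null;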


Sincerely yours,
Vladimir Sitnikov


Re: [PERFORM] Improve Seq scan performance

2008-11-10 Thread Vladimir Sitnikov

 Maybe there's some hybrid type possible where you can scan the index to
 find large table regions that are known /not/ to contain tuples of interest
 and seek over them in your scan. I wouldn't know, really, but it sounds like
 it'd probably be more I/O than a pure seq scan (given the reading of the
 index too) unless the table had the values of interest rather neatly
 clustered. It'd also surely use more memory and CPU time processing the
 whole index to find table regions without values of interest.



 Is that what you meant, though?

Not exactly. I mean the following: there are cases when an index scan even
over non-clustered values is a complete win (basically, it is a win when the
number of returned values is relatively small, no matter whether that is due
to selectivity or due to a limit clause).
The test case that I have provided creates a 667-page table and a 30-page
index, thus a complete scan of the index is 22 times cheaper in terms of I/O.

Suppose you want to find all the values that contain '%123%'. Currently
PostgreSQL will do a seq scan, while the better option might be (and it is)
to loop through all the items in the index (it will cost 30 I/Os), find the
records that actually contain '123' (it will find 20 of them) and do 20 I/Os
to check tuple visibility. That is 50 I/Os versus 667 for the seq scan.



 A b-tree index cannot be used on a LIKE query with a leading wildcard. See
 the FAQ.

Unfortunately it is true. I would love to improve that particular case.

 In addition, if your database is not in the C locale you can't use an
 ordinary index for LIKE queries. See the FAQ. You need to create a
 text_pattern_ops index instead:

 create index i_ix_txt on seq_test(i text_pattern_ops);

Good catch. However, that does not change the results: PostgreSQL does the
same amount of 2529 I/Os for an index scan on '%123%' for some unknown reason.




  set enable_seqscan=off
 -- Index Scan reads 2529 pages for some reason. I would expect 30 (index
 size) + 20 (number of matching entries) = 50 pages maximum, that is 10
 times better than with seq scan.
 Index Scan using i_ix on seq_test  (cost=0.00..1643.74 rows=356 width=508)
 (actual time=0.334..16.746 rows=20 loops=1 read_shared=2529(2529)
 read_local=0(0) flush=0 local_flush=0 file_read=0 file_write=0)
  Filter: (i ~~ '%123%'::text)
 Total runtime: 16.863 ms


 I think it's reading the whole index, because it can't do a prefix search
 if there's a leading wildcard. I'm a bit confused, though, since I thought
 in this case it couldn't actually execute the query w/o a sequential scan,
 and would just use one irrespective of the enable_seqscan param. That's what
 happens here.

Please, follow the case carefully: the index is only 30 pages long. Why is
PostgreSQL doing 2529 I/Os? It drives me crazy.


Regards,
Vladimir Sitnikov


Re: [PERFORM] Improve Seq scan performance

2008-11-09 Thread Vladimir Sitnikov
 Lutischán Ferenc wrote:

  Is it possible to make an index on the table, and make a seq index scan on
 these values?


 My understanding is that this isn't possible in PostgreSQL, because indexes
 do not contain information about tuple visibility. Data read from the index
 might refer to tuples that've been deleted as far as your transaction is
 concerned, or to tuples that were created after your snapshot was taken.

My understanding is that even though indices do not contain information on
tuple visibility, an index could be used to filter out records that are known
not to match. Since a btree index stores exact values, PostgreSQL could scan
through the index and skip those entries that do not contain '%aaa%'. That
would dramatically improve cases where the criterion has good selectivity,
since an index has a much more compact structure than a table.

As far as I understand, it is discouraged to implement/suggest patches
during a CommitFest; however, I would love to treat the following case as a
performance bug and add it to the TODO list:


create table seq_test
 as select cast(i as text) i, repeat('*', 500) padding from
generate_series(1,10000) as s(i);

create index i_ix on seq_test(i);

vacuum analyze verbose seq_test;
-- index i_ix now contains 10000 row versions in 30 pages
-- seq_test: found 0 removable, 10000 nonremovable row versions in 667 pages

explain analyze select * from seq_test where i like '%123%';
-- Seq Scan reads 667 pages (as expected)
Seq Scan on seq_test  (cost=0.00..792.00 rows=356 width=508) (actual
time=0.129..9.071 rows=20 loops=1 read_shared=667(667) read_local=0(0)
flush=0 local_flush=0 file_read=0 file_write=0)
  Filter: (i ~~ '%123%'::text)
Total runtime: 9.188 ms

set enable_seqscan=off
-- Index Scan reads 2529 pages for some reason. I would expect 30 (index
size) + 20 (number of matching entries) = 50 pages maximum, that is 10
times better than with seq scan.
Index Scan using i_ix on seq_test  (cost=0.00..1643.74 rows=356 width=508)
(actual time=0.334..16.746 rows=20 loops=1 read_shared=2529(2529)
read_local=0(0) flush=0 local_flush=0 file_read=0 file_write=0)
  Filter: (i ~~ '%123%'::text)
Total runtime: 16.863 ms

Hopefully, there will be a clear distinction between filtering via index and
filtering via table access.


Regards,
Vladimir Sitnikov


Re: [PERFORM] indexing for distinct search in timestamp based table

2008-09-05 Thread Vladimir Sitnikov
You might get a great improvement for the '%' cases using an index on
channel_name(field, start_time) and a little bit of pl/pgsql.

Basically, you need to implement the following algorithm (a pl/pgsql sketch
follows the steps below):
 1) curr_field = ( select min(field) from ad_log )
 2) record_exists = ( select 1 from ad_log where field = curr_field and
_all_other_conditions limit 1 )
 3) if record_exists == 1 then add curr_field to the results
 4) curr_field = ( select min(field) from ad_log where field > curr_field )
 5) if curr_field is not null then goto 2
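
A minimal pl/pgsql sketch of that loop (the function name and the
_all_other_conditions placeholder are illustrative; field is assumed to be
text, so adjust the type and the WHERE clause to your query):

create or replace function distinct_fields() returns setof text as $$
declare
    curr_field text;
begin
    select min(field) into curr_field from ad_log;
    while curr_field is not null loop
        -- emit the value only if a row matching all the other conditions exists
        if exists (select 1 from ad_log
                    where field = curr_field
                      /* and _all_other_conditions */) then
            return next curr_field;
        end if;
        -- jump to the next distinct value using the (field, start_time) index
        select min(field) into curr_field
          from ad_log where field > curr_field;
    end loop;
    return;
end;
$$ language plpgsql;

-- usage: select * from distinct_fields();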


I believe it might make sense to implement this approach in the core (I would
call it an "index distinct scan").

That could dramatically improve "select distinct column from table" and
"select column from table group by column" kinds of queries when there
exists an index on column and that column has a very small number of
distinct values.

For instance: say a table has 10'000'000 rows, while the column of interest
has only 20 distinct values. In that case, the database will be able to get
each of those 20 values in roughly 20 index lookups.

What does the community think about that?