[PERFORM] Different plan for very similar queries

2015-05-29 Thread Peter J. Holzer
wdsah=> select version();
                                            version
------------------------------------------------------------------------------------------------
 PostgreSQL 9.1.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)

I plan to upgrade to Debian 8 (with Postgres 9.4) soon, so the problem
may go away, but I would still like to understand what is happening
here.

IRL the queries are a bit more complicated (they involve two additional
tables), but I can demonstrate it with just two:

wdsah=> \d facttable_stat_fta4
          Table "public.facttable_stat_fta4"
       Column        |            Type             | Modifiers
---------------------+-----------------------------+-----------
 macrobondtimeseries | character varying(255)      | not null
 date                | date                        | not null
 value               | double precision            |
 berechnungsart      | character varying           |
 einheit             | character varying           |
 kurzbezeichnung     | character varying           |
 partnerregion       | character varying           |
 og                  | character varying           |
 sitcr4              | character varying           |
 warenstrom          | character varying           |
 valid_from          | timestamp without time zone |
 from_job_queue_id   | integer                     |
 kommentar           | character varying           |
Indexes:
    "facttable_stat_fta4_pkey" PRIMARY KEY, btree (macrobondtimeseries, date)
    "facttable_stat_fta4_berechnungsart_idx" btree (berechnungsart)
    "facttable_stat_fta4_einheit_idx" btree (einheit)
    "facttable_stat_fta4_og_idx" btree (og)
    "facttable_stat_fta4_partnerregion_idx" btree (partnerregion)
    "facttable_stat_fta4_sitcr4_idx" btree (sitcr4)
    "facttable_stat_fta4_warenstrom_idx" btree (warenstrom)

wdsah=> select count(*) from facttable_stat_fta4;
  count
----------
 43577941
(1 row)

wdsah=> \d term
                     Table "public.term"
         Column         |            Type             |       Modifiers
------------------------+-----------------------------+------------------------
 facttablename          | character varying           |
 columnname             | character varying           |
 term                   | character varying           |
 concept_id             | integer                     | not null
 language               | character varying           |
 register               | character varying           |
 hidden                 | boolean                     |
 cleansing_job_queue_id | integer                     | not null default (-1)
 meta_insert_dt         | timestamp without time zone | not null default now()
 meta_update_dt         | timestamp without time zone |
 valid_from             | timestamp without time zone |
 from_job_queue_id      | integer                     |
Indexes:
    "term_concept_id_idx" btree (concept_id)
    "term_facttablename_columnname_idx" btree (facttablename, columnname)
    "term_facttablename_idx" btree (facttablename)
    "term_facttablename_idx1" btree (facttablename) WHERE facttablename IS NOT NULL AND columnname::text = 'macrobondtimeseries'::text
    "term_language_idx" btree (language)
    "term_register_idx" btree (register)
    "term_term_ftidx" gin (to_tsvector('simple'::regconfig, term::text))
    "term_term_idx" btree (term)
Check constraints:
    "term_facttablename_needs_columnname_chk" CHECK (facttablename IS NULL OR columnname IS NOT NULL)
Foreign-key constraints:
    "term_concept_id_fkey" FOREIGN KEY (concept_id) REFERENCES concept(id) DEFERRABLE

wdsah=> select count(*) from term;
  count
---------
 6109087
(1 row)

The purpose of the query is to find all terms which occur in a given
column of the fact table (again, IRL this is a bit more complicated);
basically, it is an optimized version of SELECT DISTINCT.
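In other words, a simplified sketch (the real queries involve more tables,
but this is the shape used in the plans below):

```sql
-- Instead of scanning the whole 43M-row fact table:
--   SELECT DISTINCT einheit FROM facttable_stat_fta4;
-- look up the candidate terms in the small term table and probe the
-- fact table's index for existence of each one:
SELECT term
FROM term t
WHERE t.facttablename = 'facttable_stat_fta4'
  AND t.columnname = 'einheit'
  AND EXISTS (SELECT 1
              FROM facttable_stat_fta4 f
              WHERE f.einheit = t.term);
```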

Some of my columns have very few distinct members:

wdsah=> select * from pg_stats where tablename='facttable_stat_fta4' and
attname in ('einheit', 'berechnungsart', 'warenstrom');
 schemaname |      tablename      |    attname     | inherited | null_frac | avg_width | n_distinct | most_common_vals |  most_common_freqs  | histogram_bounds | correlation
------------+---------------------+----------------+-----------+-----------+-----------+------------+------------------+---------------------+------------------+-------------
 public     | facttable_stat_fta4 | berechnungsart | f         |         0 |         2 |          2 | {n,m}            | {0.515167,0.484833} |                  |    0.509567
 public     | facttable_stat_fta4 | einheit        | f         |         0 |         3 |          2 | {EUR,kg}         | {0.515167,0.484833} |                  |    0.491197
 public     | facttable_stat_fta4 | warenstrom     | f         |         0 |         2 |          2 | {X,M}            | {0.580267,0.419733} |                  |
Re: [PERFORM] Different plan for very similar queries

2015-05-29 Thread Peter J. Holzer
On 2015-05-29 10:55:44 +0200, Peter J. Holzer wrote:
 wdsah=> explain analyze select facttablename, columnname, term, concept_id,
 t.hidden, language, register
 from term t where facttablename='facttable_stat_fta4' and
 columnname='einheit' and exists (select 1 from facttable_stat_fta4 f where
 f.einheit=t.term);
                                                                  QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop Semi Join  (cost=0.00..384860.48 rows=1 width=81) (actual time=0.061..0.119 rows=2 loops=1)
    ->  Index Scan using term_facttablename_columnname_idx on term t  (cost=0.00..391.46 rows=636 width=81) (actual time=0.028..0.030 rows=3 loops=1)
          Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'einheit'::text))
    ->  Index Scan using facttable_stat_fta4_einheit_idx on facttable_stat_fta4 f  (cost=0.00..384457.80 rows=21788970 width=3) (actual time=0.027..0.027 rows=1 loops=3)
          Index Cond: ((einheit)::text = (t.term)::text)
  Total runtime: 0.173 ms
 (6 rows)
 
[...]
 wdsah=> explain analyze select facttablename, columnname, term, concept_id,
 t.hidden, language, register
 from term t where facttablename='facttable_stat_fta4' and
 columnname='berechnungsart' and exists (select 1 from facttable_stat_fta4 f
 where f.berechnungsart=t.term);
                                                                  QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------
  Merge Semi Join  (cost=316864.57..319975.79 rows=1 width=81) (actual time=7703.917..30948.271 rows=2 loops=1)
    Merge Cond: ((t.term)::text = (f.berechnungsart)::text)
    ->  Index Scan using term_term_idx on term t  (cost=0.00..319880.73 rows=636 width=81) (actual time=7703.809..7703.938 rows=3 loops=1)
          Filter: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text))
    ->  Index Scan using facttable_stat_fta4_berechnungsart_idx on facttable_stat_fta4 f  (cost=0.00..2545748.85 rows=43577940 width=2) (actual time=0.089..16263.582 rows=21336180 loops=1)
  Total runtime: 30948.648 ms
 (6 rows)

A couple of additional observations:

The total cost of both queries is quite similar, so random variations
might push the planner in one direction or the other. Indeed, after
dropping and recreating indexes (I tried GIN indexes, as suggested by
Heikki in [1]) and running ANALYZE after each change, I have now reached
a state where both queries use the fast plan.

In the first case the query planner seems to add the cost of the two
index scans to get the total cost, despite the fact that for a semi join
the second index scan can be aborted after the first hit (so either the
cost of the second scan should be a lot less than 384457.80 or it needs
to be divided by a large factor for the semi join).

In the second case the cost of the second index scan (2545748.85) is
either completely ignored or divided by a large factor: It doesn't seem
to contribute much to the total cost.

hp


[1] http://hlinnaka.iki.fi/2014/03/28/gin-as-a-substitute-for-bitmap-indexes/

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/




[PERFORM] Fwd: Postgres is using 100% CPU

2015-05-29 Thread Ashik S L
Hi All,

I am using a Postgres database on a Red Hat machine with 4 GB of RAM.
As soon as it starts inserting rows into the database, CPU usage
reaches 100%. It comes back down to normal after 40 minutes. I tried
some tuning on the database, but the result was the same. I am not a
Postgres expert. We are also not seeing this on all machines; only a
few machines show this issue. Any help on this would be appreciated.

Thanks,
Ashik


Re: [PERFORM] Different plan for very similar queries

2015-05-29 Thread Tomas Vondra

Hi,

On 05/29/15 11:51, Peter J. Holzer wrote:

A couple of additional observations:

The total cost of both queries is quite similar, so random variations
might push the planner in one direction or the other. Indeed, after
dropping and recreating indexes (I tried GIN indexes, as suggested by
Heikki in [1]) and running ANALYZE after each change, I have now
reached a state where both queries use the fast plan.


I don't think bitmap indexes are a particularly good match for this use
case. The queries need to check the existence of a few records, and btree
indexes are great for that - the first plan is very fast.


Why exactly the second query uses a much slower plan, I'm not sure. I
believe I've found an issue in planning semi joins (reported to
pgsql-hackers a few minutes ago), but I may be wrong and the code may be OK.


Can you try forcing the same plan for the second query, using the
enable_* flags? E.g.


   SET enable_mergejoin = off;

will disable merge joins and push the optimizer towards a different
join type. You may have to disable a few more node types until you get
the same plan as for the first query, i.e.


   nestloop semi join
     ->  index scan
     ->  index scan
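A session-level sketch of this, using the slow query from the thread (the
exact set of flags needed is a guess; disable more node types until the
nested loop semi join shows up):

```sql
-- These settings affect only the current session; RESET restores defaults.
SET enable_mergejoin = off;
SET enable_hashjoin = off;   -- possibly needed as well

EXPLAIN ANALYZE
SELECT facttablename, columnname, term, concept_id,
       t.hidden, language, register
FROM term t
WHERE facttablename = 'facttable_stat_fta4'
  AND columnname = 'berechnungsart'
  AND EXISTS (SELECT 1 FROM facttable_stat_fta4 f
              WHERE f.berechnungsart = t.term);

RESET enable_mergejoin;
RESET enable_hashjoin;
```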

See this for more info:

   http://www.postgresql.org/docs/9.1/static/runtime-config-query.html

Also, have you tuned the PostgreSQL configuration? How?

Can you provide the dataset? Not necessarily all the columns, it should 
be sufficient to provide the columns used in the join/where clauses:


term - facttablename, columnname, term
facttable_stat_fta4 - einheit, berechnungsart

That'd make reproducing the problem much easier.
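For example, one way to export just those columns from psql (the file
names here are made up):

```sql
-- \copy runs client-side, so the files land on the machine running psql.
\copy (select facttablename, columnname, term from term) to 'term_cols.csv' csv header
\copy (select einheit, berechnungsart from facttable_stat_fta4) to 'fact_cols.csv' csv header
```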


In the first case the query planner seems to add the cost of the two
index scans to get the total cost, despite the fact that for a semi
join the second index scan can be aborted after the first hit (so
either the cost of the second scan should be a lot less than
384457.80 or it needs to be divided by a large factor for the semi
join).

In the second case the cost of the second index scan (2545748.85) is
either completely ignored or divided by a large factor: It doesn't
seem to contribute much to the total cost.


I believe this is a consequence of the semi join semantics: the explain
output shows total costs and row counts as if the whole relation were
scanned (in this case all 43M rows), but the optimizer only propagates a
fraction of that cost estimate upward, depending on how much of the
relation it expects to scan. In this case it expects to scan only a tiny
part of the index, so the impact on the total cost is small.


A bit confusing, yeah.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres is using 100% CPU

2015-05-29 Thread Tomas Vondra



On 05/29/15 20:10, Ashik S L wrote:

Hi All,

I am using a Postgres database on a Red Hat machine with 4 GB of RAM.
As soon as it starts inserting rows into the database, CPU usage
reaches 100%. It comes back down to normal after 40 minutes. I tried
some tuning on the database, but the result was the same. I am not a
Postgres expert. We are also not seeing this on all machines; only a
few machines show this issue. Any help on this would be appreciated.


Ashik, before pointing you to this list, I asked for some basic
information that is needed when diagnosing issues like this: database
size, Postgres version, etc. We can't really help you without this info,
because right now all we know is that you're doing some inserts (while
earlier you mentioned updates), and that it's slow.


Also, can you please provide info about the configuration and what
changes you made when tuning it?


Have you seen this?

  https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[PERFORM] Postgres is using 100% CPU

2015-05-29 Thread Ashik S L
Hi All,

I am using a Postgres database on a Red Hat machine with 4 GB of RAM.
As soon as it starts inserting rows into the database, CPU usage
reaches 100%. It comes back down to normal after 40 minutes. I tried
some tuning on the database, but the result was the same. I am not a
Postgres expert. We are also not seeing this on all machines; only a
few machines show this issue. Any help on this would be appreciated.

Thanks,
Ashik




Re: [PERFORM] Postgres is using 100% CPU

2015-05-29 Thread Yves Dorfsman

 machine. As soon as it starts inserting rows into the Postgres DB, it
 will reach 100% CPU. It will come down to normal after 40 minutes. I
 tried some tuning

How many rows are you inserting at once? And how (SQL INSERT? COPY? \copy?
using a temp or unlogged table?)?
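For reference, the usual fast paths look something like this (table and
file names below are made up):

```sql
-- Batched multi-row INSERT in one transaction: much cheaper than
-- one transaction per row.
BEGIN;
INSERT INTO mytable (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');
COMMIT;

-- Or bulk load a client-side file via psql, which is faster still:
-- \copy mytable (id, val) from 'data.csv' csv
```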


-- 
http://yves.zioup.com
gpg: 4096R/32B0F416


