[PERFORM] Nested loop issue

2014-04-08 Thread Manoj Gadi
Hi All,

I have been looking for a solution to a problem where my query executes for a 
very long time because it runs into a nested loop issue.

I have run EXPLAIN ANALYZE, and it shows the query taking a very long time due 
to nested loops.

On the DB side, there are indices in place for all the required columns. By 
setting nested loops off there is a drastic improvement in performance (from 
40,000 ms to 600 ms), but I know this is not good practice.
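
For reference, the session-level toggle being described here is roughly the 
following - a diagnostic sketch only, not a recommended permanent setting:

   SET enable_nestloop = off;
   -- run the problem query under EXPLAIN ANALYZE and compare timings
   RESET enable_nestloop;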

My postgres version is 9.3.2 on linux.

Please find the link for the query plan below :

http://explain.depesz.com/s/l9o


Also, find below the query that is being executed.

SELECT DISTINCT
   Sektion/Fachbereich.parent,
   Studienfach.ltxt,
   SUM(CASE
  WHEN Studiengang.faktor IS NOT NULL
  AND Studiengang.faktor = 0 THEN Studiengang.faktor * 
Studierende.summe
  ELSE Studierende.summe
   END)

FROM (
   SELECT sos_stg_aggr.tid_stg, sos_stg_aggr.ca12_staat, 
sos_stg_aggr.geschlecht, sos_stg_aggr.alter, sos_stg_aggr.hzbart, 
sos_stg_aggr.hmkfzkz, sos_stg_aggr.hmkfz, sos_stg_aggr.semkfzkz, 
sos_stg_aggr.semkfz, sos_stg_aggr.hzbkfzkz, sos_stg_aggr.hzbkfz, 
sos_stg_aggr.hrst, sos_stg_aggr.studiengang_nr, sos_stg_aggr.fach_nr, 
sos_stg_aggr.fach_sem_zahl, sos_stg_aggr.sem_rueck_beur_ein, 
sos_stg_aggr.kz_rueck_beur_ein, sos_stg_aggr.klinsem, sos_stg_aggr.hssem, 
sos_stg_aggr.stuart, sos_stg_aggr.stutyp, sos_stg_aggr.stufrm, 
sos_stg_aggr.stichtag, sos_stg_aggr.summe, sos_stg_aggr.hzbart_int, 
sos_stg_aggr.matrikel_nr, sos_stg_aggr.ch27_grund_beurl, 
sos_stg_aggr.ch62_grund_exmatr, sos_stg_aggr.hzbnote, 
textcat(sos_stg_aggr.studiengang_nr::text, sos_stg_aggr.fach_nr::text) AS 
koepfe_faelle
   FROM sos_stg_aggr

   union all

   SELECT sos_stg_aggr.tid_stg, sos_stg_aggr.ca12_staat, 
sos_stg_aggr.geschlecht, sos_stg_aggr.alter, sos_stg_aggr.hzbart, 
sos_stg_aggr.hmkfzkz, sos_stg_aggr.hmkfz, sos_stg_aggr.semkfzkz, 
sos_stg_aggr.semkfz, sos_stg_aggr.hzbkfzkz, sos_stg_aggr.hzbkfz, 
sos_stg_aggr.hrst, sos_stg_aggr.studiengang_nr, sos_stg_aggr.fach_nr, 
sos_stg_aggr.fach_sem_zahl, sos_stg_aggr.sem_rueck_beur_ein, 
sos_stg_aggr.kz_rueck_beur_ein, sos_stg_aggr.klinsem, sos_stg_aggr.hssem, 
sos_stg_aggr.stuart, sos_stg_aggr.stutyp, sos_stg_aggr.stufrm, 
sos_stg_aggr.stichtag, sos_stg_aggr.summe, sos_stg_aggr.hzbart_int, 
sos_stg_aggr.matrikel_nr, sos_stg_aggr.ch27_grund_beurl, 
sos_stg_aggr.ch62_grund_exmatr, sos_stg_aggr.hzbnote, '21' AS koepfe_faelle
   FROM sos_stg_aggr
   where sos_stg_aggr.tid_stg in (select distinct lehr_stg_ab_tid from 
lehr_stg_ab2fb)

) AS Studierende
INNER JOIN (
   select astat::integer, trim(druck) as druck from sos_k_status

) AS Rückmeldestatus
ON (
   Studierende.kz_rueck_beur_ein = Rückmeldestatus.astat
)
INNER JOIN (
   select tid, trim(name) as name from sos_stichtag

) AS Stichtag
ON (
   Studierende.stichtag = Stichtag.tid
)
INNER JOIN (
   select abschluss, kz_fach, stg, pversion, regel, trim(text) as text, fb, 
lehr, anteil, tid,null as faktor from lehr_stg_ab
where lehr_stg_ab.tid not in (select lehr_stg_ab_tid from lehr_stg_ab2fb)

union
select abschluss, kz_fach, stg, pversion, regel, trim(text) as text, 
lehr_stg_ab2fb.fb, lehr, anteil, tid,faktor from lehr_stg_ab
inner join lehr_stg_ab2fb
on lehr_stg_ab2fb.lehr_stg_ab_tid = lehr_stg_ab.tid

) AS Studiengang
ON (
   Studierende.tid_stg = Studiengang.tid
)
INNER JOIN (
   select astat, astfr, astgrp, fb, trim(ltxt) as ltxt, stg from k_stg

) AS Studienfach
ON (
   Studiengang.stg = Studienfach.stg
)
AND (
   Studienfach.ltxt IS NOT NULL
)
INNER JOIN (
   select instnr, ch110_institut, btrim(druck) as druck, btrim(parent) as 
parent from unikn_k_fb

) AS Sektion/Fachbereich
ON (
   Studiengang.fb = Sektion/Fachbereich.instnr
)
INNER JOIN (
   select apnr, trim(druck) as druck from cifx where key=613

) AS Hörerstatus
ON (
   Studierende.hrst = Hörerstatus.apnr
)
WHERE
(
   Sektion/Fachbereich.druck = 'FB Biologie'
)
AND
 (
   (
  Hörerstatus.druck = 'Haupthörer/in'
  AND Stichtag.name = 'Amtl. Statistik Land'
  AND Rückmeldestatus.druck IN ('Beurlaubung', 'Ersteinschreibung', 
'Neueinschreibung', 'Rückmeldung')
  AND Studierende.sem_rueck_beur_ein = 20132
   )
)
GROUP BY
   Sektion/Fachbereich.parent,
   Studienfach.ltxt


According to my analysis, the WHERE clause after the UNION ALL is taking a lot 
of time to execute.

Any help with an alternative way to represent the query, or with what the cause 
of the issue might be, would be very helpful.


Thanks in advance,
Manoj




[PERFORM] performance drop when function argument is evaluated in WHERE clause

2014-04-08 Thread Gerardo Herzig
Hi all. I have a function that uses a simple select between 3 tables. There 
is a function argument to help choose how a WHERE clause applies. This is the 
code section:

select * from
[...]
where case $3  
  when 'I' then [filter 1]
  when 'E' then [filter 2]
  when 'P' then [filter 3]
else true end

When the function is called with, say, parameter $3 = 'I', the function runs in 
250 ms,
but when there is no CASE involved and I call it directly with [filter 1], the 
function runs in 70 ms.

Looks like the CASE is doing something nasty.
Any hints about this?

Thanks!
Gerardo





[PERFORM] query against large table not using sensible index to find very small amount of data

2014-04-08 Thread Andrew W. Gibbs
I have a fairly large table (~100M rows), let's call it events, and
among other things it has a couple of columns on it, columns that
we'll call entity_type_id (an integer) and published_at (a
timestamp).  It has, among others, indices on (published_at) and
(entity_type_id, published_at).

A very common query against this table is of the form...

SELECT * FROM events WHERE entity_type_id = XXX ORDER BY published_at DESC 
LIMIT 25;

... to get the most recent 25 events from the table for a given type
of entity, and generally the query planner does the expected thing of
using the two-part index on (entity_type_id, published_at).  Every now
and again, though, I have found the query planner deciding that it
ought to use the single column (published_at) index.  This can,
unsurprisingly, result in horrendous performance if events for a given
entity type are rare, as we end up with a very long walk of an index.

I had this happen again yesterday and I noticed something of
particular interest pertaining to the event.  Specifically, the query
was for an entity type that the system had only seen for the first
time one day prior, and furthermore the events table had not been
analyzed by the statistics collector for a couple of weeks.

My intuition is that the query planner, when working with an enormous
table, and furthermore encountering an entity type that the statistics
collector had never previously seen, would assume that the number of
rows in the events table of that entity type would be very small, and
therefore the two-part index on (entity_type_id, published_at) would
be the right choice.  Nonetheless, an EXPLAIN was showing usage of the
(published_at) index, and since there were only ~20 rows in the entire
events table for that entity type the queries were getting the worst
possible execution imaginable, i.e. reading in the whole table to find
the rows that hit, but doing it with the random I/O of an index walk.

As an experiment, I ran a VACUUM ANALYZE on the events table, and then
re-ran the EXPLAIN of the query, and...  Same query plan again...
Maybe for whatever issue I am having the random sampling nature of the
statistics collector made it unhelpful, i.e. in its sampling of the
~100M rows it never hit a single row that had the new entity type
specified?

Other possibly relevant pieces of information...  The entity type
column has a cardinality in the neighborhood of a couple dozen.
Meanwhile, for some of the entity types there is a large and ongoing
number of events, and for other entity types there is a smaller and
more sporadic number of events.  Every now and again a new entity type
shows up.

I can't understand why the query planner would make this choice.
Maybe it has gotten ideas into its head about the distribution of
data?  Or maybe there is a subtle bug that my data set is triggering?
Or maybe I need to turn some knobs on statistics collection?  Or maybe
it's all of these things together?  I worry that even if there is a
knob turning exercise that helps that we're still going to get burned
whenever a new entity type shows up until we re-run ANALYZE, assuming
that I can find a fix that involves tweaking statistics collection.  I
just can't fathom how it would ever be the case that Postgres's choice
of index usage in this case would make sense.  It doesn't even slot
cleanly into the problem space of "why did Postgres do a sequential
scan instead of an index scan?".  If you're doing a query of the
described form and the entity type is specified, wouldn't the two-part
index theoretically _always_ yield better performance than the
one-part index?  Maybe I have a flawed understanding of the cost of
using various indexes?  Maybe there is something analogous between
sequential-versus-index-scan and one-part-versus-two-part-index scan
choices?

FWIW, we're running on 8.4.X and using the out-of-the-box
default_statistics_target setting and haven't dabbled with setting
table level statistics configurations.

Thoughts?  Recommended reading?

  -- AWG




Re: [PERFORM] performance drop when function argument is evaluated in WHERE clause

2014-04-08 Thread Tom Lane
Gerardo Herzig gher...@fmed.uba.ar writes:
 Hi all. I have a function that uses a simple select between 3 tables. There 
 is a function argument to help choose how a WHERE clause applies. This is the 
 code section:
 select * from
 [...]
 where case $3  
   when 'I' then [filter 1]
   when 'E' then [filter 2]
   when 'P' then [filter 3]
 else true end

 When the function is called with, say, parameter $3 = 'I', the funcion run in 
 250ms,
 but when there is no case involved, and i call directly with [filter 1] the 
 function runs in 70ms.

 Looks like the CASE is doing something nasty.
 Any hints about this?

Don't do it like that.  You're preventing the optimizer from understanding
which filter applies.  Better to write three separate SQL commands
surrounded by an if/then/else construct.

(BTW, what PG version is that?  I would think recent versions would
realize that dynamically generating a plan each time would work around
this.  Of course, that approach isn't all that cheap either.  You'd
probably still be better off splitting it up manually.)
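
For illustration, a minimal PL/pgSQL sketch of the split being described here;
the table, column, and function names are invented:

create or replace function fetch_rows(p_val integer, p_mode text)
returns setof some_table
language plpgsql stable as $$
begin
  if p_mode = 'I' then
    return query select * from some_table where col_i = p_val;   -- [filter 1]
  elsif p_mode = 'E' then
    return query select * from some_table where col_e = p_val;   -- [filter 2]
  elsif p_mode = 'P' then
    return query select * from some_table where col_p = p_val;   -- [filter 3]
  else
    return query select * from some_table;                        -- no filter
  end if;
end;
$$;

Each branch contains a plain query, so the planner only ever sees the filter
that actually applies.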

regards, tom lane




Re: [PERFORM] query against large table not using sensible index to find very small amount of data

2014-04-08 Thread Tom Lane
Andrew W. Gibbs awgi...@awgibbs.com writes:
 A very common query against this table is of the form...

 SELECT * FROM events WHERE entity_type_id = XXX ORDER BY published_at DESC 
 LIMIT 25;

 ... to get the most recent 25 events from the table for a given type
 of entity, and generally the query planner does the expected thing of
 using the two-part index on (entity_type_id, published_at).  Every now
 and again, though, I have found the query planner deciding that it
 ought use the single column (published_at) index.

What is the estimated row count according to EXPLAIN when it does that,
versus when it chooses the better plan?

 FWIW, we're running on 8.4.X and using the out-of-the-box
 default_statistics_target setting and haven't dabbled with setting
 table level statistics configurations.

8.4.X is due to reach EOL in July, so you really ought to be thinking
about an upgrade.  It's not clear from the given info whether this issue
is fixable with stats configuration adjustments, is a bug already fixed
in later versions, or neither, but we're unlikely to make any significant
changes in the 8.4 planner code at this point...

regards, tom lane




Re: [PERFORM] PGSQL, checkpoints, and file system syncs

2014-04-08 Thread Shaun Thomas
 Is there something I can set in the PGSQL parameters or in the file system
 parameters to force a steady flow of writes to disk rather than waiting for
 a sync system call? Mounting with commit=1 did not make a difference.

The PostgreSQL devs actually had a long talk with the Linux kernel devs over 
this exact issue. While we wait for the results of that to bear some fruit, I'd 
recommend using the dirty_background_bytes and dirty_bytes settings both on the 
VM side and on the host server. To avoid excessive flushes, you want to avoid 
having more dirty memory than the system can handle in one gulp.

The dirty_bytes setting begins flushing to disk synchronously when the amount of 
dirty memory reaches that value, while dirty_background_bytes flushes in the 
background when the amount of dirty memory hits the specified limit. It's the 
background flushing that will prevent your current problems, and it should be 
set at the same level as the amount of write cache your system has available.

So if you are on a 1GB RAID card, set it to 1GB. Once you have 1GB of dirty 
memory (from a checkpoint or whatever), Linux will begin flushing.
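
As a concrete illustration (the numbers are examples only - substitute the size 
of your controller's write cache), the corresponding entries in /etc/sysctl.conf 
might look like:

   # background writeback starts once ~1GB is dirty (matching a 1GB RAID cache)
   vm.dirty_background_bytes = 1073741824
   # hard limit at which writes become synchronous; keep it above the background value
   vm.dirty_bytes = 2147483648

Apply with sysctl -p (or at runtime with sysctl -w) and watch the Dirty figure 
in /proc/meminfo to see the effect.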

This is a pretty well-known issue on Linux systems with large amounts of RAM. 
Most VM servers fit that profile, so I'm not surprised it's hurting you.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [PERFORM] performance drop when function argument is evaluated in WHERE clause

2014-04-08 Thread Gerardo Herzig
Tom, thanks (as always) for your answer. This is a 9.1.12. I have to say, I'm 
not very happy about if-elif-else'ing at all.
The conditional filter is a pretty common pattern in our functions; I would 
have to add (and maintain) a substantial amount of extra code.

And I don't really understand why the optimizer has issues, since the arguments 
are immutable strings and should (or at least could) be evaluated only once.

Thanks again for your time!

Gerardo



- Original message -
 From: Tom Lane t...@sss.pgh.pa.us
 To: Gerardo Herzig gher...@fmed.uba.ar
 CC: pgsql-performance@postgresql.org, pgsql-sql pgsql-...@postgresql.org
 Sent: Tuesday, 8 April 2014 10:50:01
 Subject: Re: [PERFORM] performance drop when function argument is evaluated in 
 WHERE clause
 
 Gerardo Herzig gher...@fmed.uba.ar writes:
  Hi all. I have a function that uses a simple select between 3
  tables. There is a function argument to help choose how a WHERE
  clause applies. This is the code section:
  select * from
  [...]
  where case $3
when 'I' then [filter 1]
when 'E' then [filter 2]
when 'P' then [filter 3]
  else true end
 
  When the function is called with, say, parameter $3 = 'I', the
  funcion run in 250ms,
  but when there is no case involved, and i call directly with
  [filter 1] the function runs in 70ms.
 
  Looks like the CASE is doing something nasty.
  Any hints about this?
 
 Don't do it like that.  You're preventing the optimizer from
 understanding
 which filter applies.  Better to write three separate SQL commands
 surrounded by an if/then/else construct.
 
 (BTW, what PG version is that?  I would think recent versions would
 realize that dynamically generating a plan each time would work
 around
 this.  Of course, that approach isn't all that cheap either.  You'd
 probably still be better off splitting it up manually.)
 
   regards, tom lane
 




Re: [SQL] Re: [PERFORM] performance drop when function argument is evaluated in WHERE clause

2014-04-08 Thread Tom Lane
Gerardo Herzig gher...@fmed.uba.ar writes:
 Tom, thanks (as allways) for your answer. This is a 9.1.12. I have to say, im 
 not very happy about if-elif-else'ing at all. 
 The conditional filter es a pretty common pattern in our functions, i would 
 have to add (and maintain) a substantial amount of extra code.

In that case consider moving to 9.2 or later.  I believe it'd handle
this scenario better.

regards, tom lane




Re: [PERFORM] Performance regressions in PG 9.3 vs PG 9.0

2014-04-08 Thread uher dslij
As a follow up to this issue, on Graeme's suggestion in a private email,

I checked the statistics in both databases, and they were the same (these
values as seen from pg_stat_user_tables do not seem to propagate to slave
databases, however).  I even ran a manual ANALYZE on the master database in
the 9.3.2 cluster; it did not affect query performance in the least bit.

We've installed all versions of postgres and tested the same query on the
same data:

PG 9.0.x : 196 ms
PG 9.1.13 : 181 ms
PG 9.2.8 : 861 ms
PG 9.3.4 : 861 ms

The EXPLAINs all pretty much look like my original post.  The planner in
9.2 and above is simply not using bitmap heap scans or bitmap index scans?
 What could be the reason for this?

Thanks in advance,


On Mon, Apr 7, 2014 at 2:34 PM, uher dslij cod...@gmail.com wrote:

 Hi,

 We recently upgraded from pg 9.0.5 to 9.3.2 and we are observing much
 higher load on our hot standbys (we have 3).  As you can see from the query
 plans below, we have some queries that are running 4-5 times slower now,
 many due to what looks like a bad plan in 9.3.  Are there any known issues
 with query plan regressions in 9.3?  Any ideas about how I can get back the
 old planning behavior with 9.3.2?

 Thanks in advance for any help!

 On Production System
 --*Postgres 9.3.2*
 Intel(R) Xeon(R) CPU E5649 (2.53 Ghz 6-core)
 12 GB RAM
 Intel 710 SSD

 -

 explain analyze select distinct on (t1.id) t1.id, t1.hostname as name, 
 t1.active, t1.domain_id, t1.base, t1.port, t1.inter_domain_flag from location 
 t1, host t2, container t3, resource_location t4 where t2.id = 34725278 and 
 t3.id = t2.container_id and t4.location_id = t1.id and t4.parent_id in 
 (select * from parentContainers(t3.id)) and t1.license is not null and 
 (t1.license_end_date is null or t1.license_end_date >= current_date) and 
 t1.active <> 0 and t3.active <> 0 and t4.active <> 0 and t1.domain_id = 
 t2.domain_id and t2.domain_id = t3.domain_id and t3.domain_id = t4.domain_id 
 and (0 = 0 or t1.active <> 0);
                                          QUERY PLAN
 ------------------------------------------------------------------------------------------------
  Unique  (cost=313.44..313.45 rows=1 width=35) (actual time=989.836..989.837 rows=1 loops=1)
    ->  Sort  (cost=313.44..313.44 rows=1 width=35) (actual time=989.836..989.837 rows=1 loops=1)
          Sort Key: t1.id
          Sort Method: quicksort  Memory: 25kB
          ->  Nested Loop  (cost=1.27..313.43 rows=1 width=35) (actual time=922.484..989.791 rows=1 loops=1)
                Join Filter: (SubPlan 1)
                Rows Removed by Join Filter: 742
                ->  Nested Loop  (cost=0.99..33.80 rows=1 width=53) (actual time=0.174..5.168 rows=934 loops=1)
                      Join Filter: (t2.domain_id = t1.domain_id)
                      ->  Nested Loop  (cost=0.71..11.23 rows=1 width=18) (actual time=0.101..0.103 rows=1 loops=1)
                            ->  Index Scan using host_pkey on host t2  (cost=0.29..5.29 rows=1 width=12) (actual time=0.041..0.042 rows=1 loops=1)
                                  Index Cond: (id = 34725278::numeric)
                            ->  Index Scan using container_pkey on container t3  (cost=0.42..5.43 rows=1 width=12) (actual time=0.057..0.058 rows=1 loops=1)
                                  Index Cond: (id = t2.container_id)
                                  Filter: ((active <> 0::numeric) AND (t2.domain_id = domain_id))
                      ->  Index Scan using idx_location_domain_id on location t1  (cost=0.28..18.55 rows=8 width=35) (actual time=0.065..3.768 rows=934 loops=1)
                            Index Cond: (domain_id = t3.domain_id)
                            Filter: ((license IS NOT NULL) AND (active <> 0::numeric) AND ((license_end_date IS NULL) OR (license_end_date >= ('now'::cstring)::date)))
                            Rows Removed by Filter: 297
                ->  Index Scan using idx_resource_location_domain_id on resource_location t4  (cost=0.28..27.63 rows=1 width=21) (actual time=0.532..0.849 rows=1 loops=934)
                      Index Cond: (domain_id = t1.domain_id)
                      Filter: ((active <> 0::numeric) AND (t1.id = location_id))
                      Rows Removed by Filter: 1003
                SubPlan 1
                  ->  Function Scan on parentcontainers  (cost=0.25..500.25 rows=1000 width=32) (actual time=0.253..0.253 rows=2 loops=743)
  Total runtime: 990.045 ms
 (26 rows)


 On test box:
 --*Postgres 9.0.2*
 Intel(R) Xeon(R) CPU E5345 (2.33 Ghz 4-core)
 8 GB RAM
 6 x SAS 10K RAID 10

 --

 explain analyze select distinct on (t1.id) t1.id, t1.hostname as name, 
 t1.active, t1.domain_id, t1.base, t1.port, t1.inter_domain_flag from location 
 t1, host t2, container t3, 

Re: [PERFORM] Nested loop issue

2014-04-08 Thread Dhananjay Singh
REPLACE -- where sos_stg_aggr.tid_stg in (select distinct lehr_stg_ab_tid from 
lehr_stg_ab2fb)

WITH -- where exists (select 1 from lehr_stg_ab2fb where 
lehr_stg_ab2fb.lehr_stg_ab_tid = sos_stg_aggr.tid_stg)

Similarly, rewrite the others, such as -- lehr_stg_ab.tid not in (select 
lehr_stg_ab_tid from lehr_stg_ab2fb) -- with NOT EXISTS.


This should improve performance, depending on the results of the inner query.
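
Spelled out against the tables from the original query, the suggested rewrite 
would look roughly like this (a sketch, not tested against the real schema):

   -- correlated EXISTS instead of IN (SELECT DISTINCT ...)
   SELECT sos_stg_aggr.*
   FROM sos_stg_aggr
   WHERE EXISTS (SELECT 1
                 FROM lehr_stg_ab2fb
                 WHERE lehr_stg_ab2fb.lehr_stg_ab_tid = sos_stg_aggr.tid_stg);

   -- and NOT EXISTS instead of NOT IN, which also avoids surprises with NULLs
   SELECT lehr_stg_ab.*
   FROM lehr_stg_ab
   WHERE NOT EXISTS (SELECT 1
                     FROM lehr_stg_ab2fb
                     WHERE lehr_stg_ab2fb.lehr_stg_ab_tid = lehr_stg_ab.tid);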

Regards
Dhananjay
OpenSCG


[PERFORM] Optimizing Time Series Access

2014-04-08 Thread Robert Burgholzer
I am looking for advice on dealing with large tables of environmental model
data and looking for alternatives to my current optimization approaches.
Basically, I have about 1 billion records stored in a table which I access
in groups of roughly 23 million at a time, which means that I have
somewhere in the neighborhood of 400-500 sets of 23 million points.

The 23 million rows that I pull at a time are keyed on 3 different columns,
it's all indexed, and retrieval happens in, say, 2-3 minutes (my hardware is
so-so).  So, my thought is to use some kind of caching, and I wonder if I can
get advice - here are my thoughts on options; I would love to hear others:

* Use cached tables for this - since my # of actual data groups is small,
why not just retrieve them once, then keep them around in a specially named
table? (I do this with some other stuff, using a 30 day cache expiration.)
* Use some sort of stored procedure?  I don't even know whether such a thing
really exists in PG or how it works.
* Use table partitioning?
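
To make the last option concrete, here is a minimal sketch of inheritance-based
partitioning as it worked on 9.x; the table and column names are invented for
illustration:

   -- hypothetical parent table holding model output keyed by run/group
   CREATE TABLE model_output (
       run_id integer NOT NULL,
       ts     timestamp NOT NULL,
       value  double precision
   );

   -- one child table per group of ~23M rows; the CHECK constraint lets
   -- constraint_exclusion skip irrelevant children when run_id is fixed
   CREATE TABLE model_output_run_1 (
       CHECK (run_id = 1)
   ) INHERITS (model_output);

   CREATE INDEX ON model_output_run_1 (ts);

   -- a query such as
   --   SELECT * FROM model_output WHERE run_id = 1 AND ts >= '2014-01-01';
   -- then only scans model_output_run_1 (with constraint_exclusion = partition,
   -- the default)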

Thanks,
/r/b

-- 
--
Robert W. Burgholzer
 'Making the simple complicated is commonplace; making the complicated
simple, awesomely simple, that's creativity.'  - Charles Mingus
Athletics: http://athleticalgorithm.wordpress.com/
Science: http://robertwb.wordpress.com/
Wine: http://reesvineyard.wordpress.com/


Re: [PERFORM] Performance regressions in PG 9.3 vs PG 9.0

2014-04-08 Thread Tom Lane
uher dslij cod...@gmail.com writes:
 The EXPLAINs all pretty much look like my original post.  The planner in
 9.2 and above is simply not using bitmap heap scans or bitmap index scans?
  What could be the reason for this?

I don't see any reason to think this is a planner regression.  The
rowcount estimates are pretty far off in both versions; so it's just a
matter of luck that 9.0 is choosing a better join order than 9.3.

I'd try cranking up the statistics targets for the join columns
(particularly domain_id) and see if that leads to better estimates.
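
For reference, that knob is turned per column and followed by a fresh ANALYZE; 
the target value below is only an example:

   ALTER TABLE location ALTER COLUMN domain_id SET STATISTICS 1000;
   ANALYZE location;

(and likewise for the other joined tables that carry domain_id).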

regards, tom lane




Re: [PERFORM] query against large table not using sensible index to find very small amount of data

2014-04-08 Thread Jeff Janes
On Tue, Apr 8, 2014 at 6:39 AM, Shaun Thomas stho...@optionshouse.com wrote:


  Other possibly relevant pieces of information...  The entity type
  column has a cardinality in the neighborhood of a couple dozen.
  Meanwhile, for some of the entity types there is a large and ongoing
  number of events, and for other entity types there is a smaller and
  more sporadic number of events.  Every now and again a new entity
  type shows up.

 With that as the case, I have two questions for you:

 1. Why do you have a low cardinality column as the first column in an
 index?


Because if he didn't have it, the planner would never be able to use it.
 Remember, the problem is when the planner chooses NOT to use that index.

Cheers,

Jeff


Re: [PERFORM] Performance regressions in PG 9.3 vs PG 9.0

2014-04-08 Thread uher dslij
Thanks for your reply Tom.  I've found that the culprit is the function
parentContainers(), which recurses in a folder structure and looks like
this:

create function parentContainers(numeric) returns setof numeric
as '
select parentContainers( (select container_id from container where id =
$1 ) )
union
select id from container where id = $1
' language sql stable returns null on null input;


It is declared stable, but I know that is just a planner hint, so it doesn't
guarantee that it will only get called once.  If I replace the function
call with the two values this function returns,


On Tue, Apr 8, 2014 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 uher dslij cod...@gmail.com writes:
  The EXPLAINs all pretty much look like my original post.  The planner in
  9.2 and above is simply not using bitmap heap scans or bitmap index
 scans?
   What could be the reason for this?

 I don't see any reason to think this is a planner regression.  The
 rowcount estimates are pretty far off in both versions; so it's just a
 matter of luck that 9.0 is choosing a better join order than 9.3.

 I'd try cranking up the statistics targets for the join columns
 (particularly domain_id) and see if that leads to better estimates.

 regards, tom lane



Re: [PERFORM] Performance regressions in PG 9.3 vs PG 9.0

2014-04-08 Thread uher dslij
Sorry for the premature send on that last email.  Here is the full one:

Thanks for your reply Tom.  I've found that the culprit is the function
parentContainers(), which recurses up a folder structure and looks like
this:

create function parentContainers(numeric) returns setof numeric
as '
select parentContainers( (select container_id from container where id =
$1 ) )
union
select id from container where id = $1
' language sql stable returns null on null input;

It is declared stable, but I know that STABLE is just a planner hint, so it
doesn't guarantee that it will only get called once.  If I replace the
function call with the two values this function returns, I get < 1 ms
runtime on all versions of pg.  So there is data to support the statement
that we were relying on planner luck before and that luck has run out.

What is the best practice to ensure a stable function only gets called
once?  Should I use a CTE to cache the result?  Is there a better way?

Thanks in advance,


On Tue, Apr 8, 2014 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 uher dslij cod...@gmail.com writes:
  The EXPLAINs all pretty much look like my original post.  The planner in
  9.2 and above is simply not using bitmap heap scans or bitmap index
 scans?
   What could be the reason for this?

 I don't see any reason to think this is a planner regression.  The
 rowcount estimates are pretty far off in both versions; so it's just a
 matter of luck that 9.0 is choosing a better join order than 9.3.

 I'd try cranking up the statistics targets for the join columns
 (particularly domain_id) and see if that leads to better estimates.

 regards, tom lane



Re: [PERFORM] query against large table not using sensible index to find very small amount of data

2014-04-08 Thread 'Andrew W. Gibbs'
Your understanding of the utility of multi-part indices does not jibe
with my own.

While I agree that a partial index might be in order here, that ought
just be a performance optimization that lowers the footprint of the
index from an index size and index maintenance standpoint, not
something that governs when the index is used for an item whose entity
type rarely comes up in the table.  If a couple of the entity types
were to constitute 80% of the events, then using a partial index would
reduce the performance strain of maintaining the index by 80%, but
this ought not govern the query planner's behavior when doing queries
on entity types that were not among those.

My general understanding of the utility of multi-part indices is that
they will come into play when some number of the leading columns
appear in the query as fixed values and furthermore if a subsequent
column appears as part of a ranging operation.

I know that a b-tree structure isn't exactly the same as a
binary-tree, but it is roughly equivalent for the purposes of our
conversation...  I believe you can think of multi-part indices as
(roughly) equivalent either to nested binary trees, or as equivalent
to a binary tree whose keys are the concatenation of the various
columns.  In the former case, doing a range scan would be a matter of
hopping through the nested trees until you got to the terminal range
scan operation, and in the latter case doing a range scan would be a
matter of finding the first node in the tree that fell within the
values for your concatenation and then walking through the tree.  Yes,
that's not exactly what happens with a b-tree, but it's pretty
similar, the main differences being performance operations, I believe.

Given that, I don't understand how a multi-part index in which the
column over which I intend to range comes _earlier_ than the column(s)
that I intend to hold fixed would be helpful.  This is especially
true given that the timestamp columns are at the granularity of
_milliseconds_ and my data set sees a constant stream of inputs with
bursts up to ~100 events per second.  I think what you are describing
could only make sense if the date column were at a large granularity,
e.g. hours or days.

Or maybe I have missed something...

  -- AWG

On Tue, Apr 08, 2014 at 01:39:41PM +, Shaun Thomas wrote:
 
  Other possibly relevant pieces of information...  The entity type
  column has a cardinality in the neighborhood of a couple dozen.
  Meanwhile, for some of the entity types there is a large and ongoing
  number of events, and for other entity types there is a smaller and
  more sporadic number of events.  Every now and again a new entity
  type shows up.
 
 With that as the case, I have two questions for you:
 
 1. Why do you have a low cardinality column as the first column in an index?
 2. Do you have any queries at all that only use the entity type as the only 
 where clause?
 
 I agree that the planner is probably wrong here, but these choices aren't 
 helping. The low cardinality of the first column causes very large buckets 
 that don't limit results very well at all. Combined with the order-by clause, 
 the planner really wants to walk the date index backwards to find results 
 instead. I would do a couple of things.
 
 First, remove the type/date index. Next, do a count of each type in the table 
 with something like this:
 
 SELECT type_id, count(1)
   FROM my_table
  GROUP BY 1
 
 Any type that is more than 20% of the table will probably never be useful in 
 an index. At this point, you have a choice. You can create a new index with 
 date and type *in that order* or create a new partial index on date and type 
 that also ignores the top matches. For instance, if you had a type that was 
 90% of the values, this would be my suggestion:
 
 CREATE INDEX idx_foo_table_date_event_type_part ON foo_table (event_date, 
 event_type)
  WHERE event_type != 14;
 
 Or whatever. If the IDs are basically evenly distributed, it won't really 
 matter.
 
 In any case, index order matters. The planner wants to restrict data as 
 quickly as possible. If you provide an order clause, it wants to read the 
 index in that order. Your specified type as the first column disrupts that, 
 so it has to fetch the values first, which is usually more expensive. Even if 
 that's wrong in your particular case, planner stats are not precise enough to 
 know that.
 
 Either way, try moving the indexes around. I can't think of many indexes in 
 our database where I have the low cardinality value as the first column. 
 Databases have an easier time managing many shallow buckets of values, than a 
 few deep ones.
 
 --
 Shaun Thomas
 OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
 312-676-8870
 stho...@optionshouse.com
 

Re: [PERFORM] Performance regressions in PG 9.3 vs PG 9.0

2014-04-08 Thread Tom Lane
uher dslij cod...@gmail.com writes:
 Thanks for your reply Tom.  I've found that the culprit is the function
 parentContainers(), which recurses up a folder structure and looks like
 this:

Hmm ... I had noticed the execution of that in a subplan, but it appeared
that the subplan was being done the same number of times and took about
the same amount of time in both 9.0 and 9.3, so I'd discounted it as the
source of trouble.  Still, it's hard to argue with experimental evidence.

 create function parentContainers(numeric) returns setof numeric
 as '
 select parentContainers( (select container_id from container where id =
 $1 ) )
 union
 select id from container where id = $1
 ' language sql stable returns null on null input;

Yeah, that looks like performance trouble waiting to happen --- it's not
clear what would bound the recursion, for one thing.  Have you considered
replacing this with a RECURSIVE UNION construct?  Wasn't there in 9.0
of course, but 9.3 can do that.
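
For illustration, a recursive-CTE sketch of what parentContainers() computes,
based on the function body quoted above (42 stands in for the starting id):

   WITH RECURSIVE parents AS (
       SELECT id, container_id FROM container WHERE id = 42
       UNION
       SELECT c.id, c.container_id
       FROM container c
       JOIN parents p ON c.id = p.container_id
   )
   SELECT id FROM parents;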

regards, tom lane

