Re: [PERFORM] Tons of free RAM. Can't make it go away.

2012-10-27 Thread Віталій Тимчишин
Sorry for the late response, but maybe you are still struggling.

It can be that some query (or queries) uses a lot of work memory, either
because of a high work_mem setting or because of a planner error. In that
case, the moment the query runs it will take memory that is later returned
and becomes free. Usually this shows up as a spike in active memory with a
lot of free memory afterwards.

2012/10/22 Shaun Thomas stho...@optionshouse.com

 Hey everyone!

 This is pretty embarrassing, but I've never seen this before. This is our
 system's current memory allocation from 'free -m':

                      total    used    free  buffers  cached
 Mem:                 72485   58473   14012        3   34020
 -/+ buffers/cache:           24449   48036

 So, I've got 14GB of RAM that the OS is just refusing to use for disk or
 page cache. Does anyone know what might cause that?

 Our uname -sir, for reference:

 Linux 3.2.0-31-generic x86_64


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Paged Query

2012-07-11 Thread Віталій Тимчишин
On Monday, July 9, 2012, Misa Simic misa.si...@gmail.com wrote:


 2012/7/9 Gregg Jaskiewicz gryz...@gmail.com

 Use cursors.
 By far the most flexible. offset/limit have their down sides.


 Well, I am not aware of any downsides of LIMIT OFFSET that do not exist in
any other solutions for paged queries...

where key > last-previous-key order by key

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Paged Query

2012-07-11 Thread Віталій Тимчишин
On Monday, July 9, 2012, Misa Simic misa.si...@gmail.com wrote:


 2012/7/9 Gregg Jaskiewicz gryz...@gmail.com

 Use cursors.
 By far the most flexible. offset/limit have their down sides.


 Well, I am not aware of any downsides of LIMIT OFFSET that do not exist in
any other solutions for paged queries...

'where key > last-value order by key limit N' performs much better
for large offsets.
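
For illustration, a minimal sketch of this keyset-pagination pattern (table
and column names are hypothetical):

-- page 1: no key seen yet
select id, payload from items order by id limit 50;

-- page 2: pass the largest id seen on the previous page (say it was 50)
select id, payload from items where id > 50 order by id limit 50;

Unlike OFFSET, the index scan can start right at the last key instead of
reading and discarding all earlier rows.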
P.S. Sorry for the previous email - hit send too early.

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Paged Query

2012-07-06 Thread Віталій Тимчишин
What language are you using? Usually there is an iterator with a chunked
fetch option (like setFetchSize in Java JDBC). So you pass the query without
a limit and then read as many results as you need. Note that the query plan
in this case won't be optimized for your limit, and I don't remember whether
postgres has an optimize-for-N-rows statement option.
Also, if your statement is ordered by some key, you can use the general
paging technique where you rerun the query with a key > max_prev_value
filter to get the next chunk.
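
A minimal sketch of the server-side cursor approach mentioned in this
thread (table and column names are hypothetical); each FETCH pulls one
chunk:

begin;
declare page_cur cursor for
    select id, payload from items order by id;
fetch 50 from page_cur;   -- first chunk
fetch 50 from page_cur;   -- next chunk, and so on
close page_cur;
commit;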

On Wednesday, July 4, 2012, Hermann Matthes hermann.matt...@web.de wrote:
 I want to implement a paged query feature, where the user can enter in
a dialog how many rows he wants to see. After displaying the first page of
rows, he can push a button to display the next/previous page.
 On the database level I could use limit to implement this feature. My
problem now is that the user is not permitted to view all rows. For every
row a permission check is performed, and if permission is granted, the row
is added to the list of rows sent to the client.
 If, for example, the user has entered a page size of 50 and I use limit
50 to fetch only 50 records, what should I do if he is only permitted to
see 20 of these 50 records? There may be more records he can view.
 But if I don't use limit, what happens if the query would return
5,000,000 rows? Would my result set contain 5,000,000 rows, or would the
performance of the database go down?

 Thanks in advance
 Hermann



-- 
Best regards,
 Vitalii Tymchyshyn


[PERFORM] correlated exists with join is slow.

2012-06-18 Thread Віталій Тимчишин
Hello.

Today I found a query that I thought would be fast but turned out to be
slow. The problem is a correlated exists with a join - the planner does not
want to use a correlated nested loop to perform the exists check.
Even if I force it to use a nested loop, it materializes the join
uncorrelated and then filters it. It's OK when the exists does not contain
a join. Also, the good old left join ... where X is null form works fast.
Note that I could see the same problem for both exists and not exists.
Below is a test case (tested on 9.1.4) with explains.


create temporary table o(o_val,c_val) as select v, v/2 from
generate_series(1,100) v;
create temporary table i(o_ref, l_ref) as select
generate_series(1,100), generate_series(1,10);
create temporary table l(l_val, l_name) as select v, 'n_' || v from
generate_series(1,10) v;
create index o_1 on o(o_val);
create index o_2 on o(c_val);
create index i_1 on i(o_ref);
create index i_2 on i(l_ref);
create index l_1 on l(l_val);
create index l_2 on l(l_name);
analyze o;
analyze i;
analyze l;
explain analyze select 1 from o where not exists (select 1 from i join l on
l_ref = l_val where l_name='n_2' and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/Rvw
explain analyze select 1 from o where not exists (select 1 from i join l on
l_ref = l_val where l_val=2 and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/fVHw
explain analyze select 1 from o where not exists (select 1 from i where
l_ref=2 and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/HgN
explain analyze select 1 from o left join i on o_ref=o_val left join l on
l_ref = l_val and l_name='n_2' where o_ref is null and c_val=33;
-- http://explain.depesz.com/s/mLA
set enable_hashjoin=false;
explain analyze select 1 from o where not exists (select 1 from i join l on
l_ref = l_val where l_name='n_2' and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/LYu
rollback;

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] correlated exists with join is slow.

2012-06-18 Thread Віталій Тимчишин
Glad to hear postgresql becomes better and better :)

2012/6/18 Tom Lane t...@sss.pgh.pa.us

 Віталій Тимчишин tiv...@gmail.com writes:
  Today I've found a query that I thought will be fast turned out to be
 slow.
  The problem is correlated exists with join - it does not want to make
  correlated nested loop to make exists check.

 9.2 will make this all better.  These are exactly the type of case where
 you need the parameterized path stuff.

regards, tom lane


-- 
Best regards,
 Vitalii Tymchyshyn


[PERFORM] SSD selection

2012-05-15 Thread Віталій Тимчишин
Hello, all.

We've reached the point where we would like to try SSDs. We've got a
central DB, currently 414 GB in size and increasing. The working set does
not fit into our 96GB RAM server anymore.
So, the main question is what to take. Here's what we've got:
1) Intel 320. Good, but slower than current-generation SandForce drives
2) Intel 330. Looks like a cheap 520 without a capacitor
3) Intel 520. Faster than the 320. No capacitor.
4) OCZ Vertex 3 Pro - not available, even on the OCZ site
5) OCZ Deneva - can't find it in my country :)
We are using an Areca controller with BBU. So for me the question is: can
the 520 series be set up to handle fsyncs correctly? We've got the Areca to
handle buffering.
-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Maximum number of sequences that can be created

2012-05-13 Thread Віталій Тимчишин
2012/5/11 Robert Klemme shortcut...@googlemail.com

 On Fri, May 11, 2012 at 12:50 PM, Vidhya Bondre meetvbon...@gmail.com
 wrote:
  Is there any max limit set on sequences that can be created on the
 database
  ? Also would like to know if we create millions of sequences in a single
 db
  what is the downside of it.


AFAIK sequences are accounted as relations. A large list of relations may
slow down various system utilities such as vacuuming (or may not, depending
on the queries and indexes on pg_class).



 On the contrary: what would be the /advantage/ of being able to create
 millions of sequences?  What's the use case?


We are using sequences as statistics counters - they produce almost no
performance impact and we can tolerate their non-transactional nature. I
can imagine someone who wants to have a sequence per user or per row of
some other relation.
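
A minimal sketch of that counter pattern (the sequence name is
hypothetical); nextval is cheap and never blocks, but the count is not
rolled back on transaction abort:

create sequence page_hits_counter;

-- on each event:
select nextval('page_hits_counter');

-- to read the current value:
select last_value from page_hits_counter;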

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] bad planning with 75% effective_cache_size

2012-04-18 Thread Віталій Тимчишин
How about

with par_ as (select * from product_parent where parent_name like 'aa%' )
select distinct product_code from product p_
inner join par_ on p_.parent_id=par_.id
limit 2

?


2012/4/3 Istvan Endredy istvan.endr...@gmail.com

 Hi,

 i've ran into a planning problem.


 select distinct product_code from product p_
 inner join product_parent par_ on p_.parent_id=par_.id
 where par_.parent_name like 'aa%' limit 2


 If effective_cache_size is smaller (32MB), planning is ok and query is
 fast. (10ms)
 In the worst case (effective_cache_size=6GB) the speed depends on the
 value of 'limit' (in select): if it is smaller, query is slower. (12ms)



-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] SeqScan with full text search

2012-04-17 Thread Віталій Тимчишин
2012/4/16 Tomek Walkuski tomek.walku...@gmail.com

 Hello group!

 I have query like this:

 SELECT
  employments.candidate_id AS candidate_id,
  SUM(TS_RANK(employers.search_vector, TO_TSQUERY('simple', 'One:* |
 Two:* | Three:* | Four:*'), 2)) AS ts_rank
 FROM
  employments
 INNER JOIN
  employers ON employments.employer_id = employers.id
 AND
  employers.search_vector @@ TO_TSQUERY('simple', 'One:* | Two:* |
 Three:* | Four:*')
 GROUP BY
  candidate_id;

 And it results with this:

 http://explain.depesz.com/s/jLM

 The JOIN between employments and employers is the culprit. I'm unable
 to get rid of the seq scan, and setting enable_seqscan to off makes
 things even worse.

 Is there any way to get rid of this JOIN?


Have you got an index on employments.employer_id? It seems to me that only
some employments come out of the join, so an index would help here. What's
the plan with seqscan off?

P.S. I don't see why all employments are needed. Maybe I am misreading
something? For me it's at most 2616 employments out of 1606432.

Best regards, Vitalii Tymchyshyn

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] database slowdown while a lot of inserts occur

2012-04-02 Thread Віталій Тимчишин
A few words regarding small inserts and a lot of fsyncs:
If this is your problem, you can fix it by using a battery-backed RAID
card. A similar effect can be reached by turning synchronous commit off.
Note that the latter may lose the last few commits in case of a sudden
reboot, but you can at least test whether moving to a BBU will help you.
(Dunno if this setting can be changed with SIGHUP, without a restart.)
Note that this may still be a lot of random writes, and in the case of
RAID5 - a lot of random reads too. I don't think batching will help other
applications; it is a tool to help the application that uses batching. If
you have random writes, look at HOT updates - they may help you if you
follow their requirements.
Check your checkpoints - the application writes to the commit log first
(sequential writes), then during checkpoints the data is written to the
tables (random writes) - longer checkpoints may make your life easier. Try
to increase checkpoint_segments.
If you have a lot of data written - try to move your commit logs to another
drive/partition.
If you have a good RAID card with memory and BBU, you may try to disable
the read cache on it (leaving only the write cache). Read caching is
usually better done at the OS level (with much more memory), and fast
writes need a BBU-protected write cache.
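
A minimal sketch of the settings mentioned above (the values are only
illustrative, not recommendations):

-- per-session, no restart needed; trades the last few commits for speed:
set synchronous_commit = off;

-- in postgresql.conf, to spread checkpoint I/O out (needs a reload):
--   checkpoint_segments = 32
--   checkpoint_completion_target = 0.9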

Best regards, Vitalii Tymchyshyn

2012/3/29 Campbell, Lance la...@illinois.edu

  PostgreSQL 9.0.x

 We have around ten different applications that use the same database.
 When one particular application is active it does an enormous number of
 inserts.  Each insert is very small.  During this time the database seems
 to slow down in general.  The application in question is inserting into a
 particular table that is not used by the other applications.

 1) What should I do to confirm that the database is the issue
 and not the applications?

 2) How can I identify where the bottleneck is occurring if the
 issue happens to be with the database?

 I have been using PostgreSQL for eight years.  It is an amazing database.

 Thanks,

 Lance Campbell

 Software Architect

 Web Services at Public Affairs

 217-333-0382





-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Subquery flattening causing sequential scan

2011-12-28 Thread Віталій Тимчишин
On 27.12.2011 20:13, Tom Lane t...@sss.pgh.pa.us wrote:

 Jim Crate jim...@gmail.com writes:
  My question is why does it do a seq scan when it flattens this
  subquery into a JOIN?

 Because it thinks there will be 3783 rows out of the msg scan, which if
 true would make your desired nestloop join a serious loser.

But the second plan is estimated as cheaper. I thought this should make it
the one used, unless it is not considered at all. Could it be a
collapse-limit problem or something like that?


Re: [PERFORM] will the planner ever use an index when the condition is <> ?

2011-12-18 Thread Віталій Тимчишин
On 17.12.2011 18:25, Filip Rembiałkowski plk.zu...@gmail.com wrote:

 Normally there is no chance it could work,
 because (a) the planner does not know all possible values of a column,
 and (b) btree indexes cannot search on not equal operator.


Why so? a<>b is the same as (a<b or a>b), so the planner should check this option.
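
A minimal sketch of the manual rewrite (names are hypothetical); each arm
of the OR is an index-friendly range condition:

-- instead of:
select * from t where x <> 5;

-- write:
select * from t where x < 5 or x > 5;

Whether an index scan actually wins still depends on selectivity: if almost
all rows satisfy x <> 5, a seq scan is the right plan anyway.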


Re: [PERFORM] Performance die when COPYing to table with bigint PK

2011-08-05 Thread Віталій Тимчишин
In my tests it greatly depends on whether the index writes are random or
sequential. My test time goes down from a few hours to seconds if I append
to the end of the index.
As for me, the best comparison would be to make two int4 columns holding
the same data as the int8 one, with two indexes, then perform the test. My
bet is that it will be slower than int8.

On Thursday, August 4, 2011, Robert Ayrapetyan robert.ayrapet...@comodo.com wrote:
 All you are saying disproves following:

 in experiment I replaces bigint index:

 CREATE INDEX ix_t_big ON test.t USING btree (id_big) TABLESPACE tblsp_ix;

 with 4 (!) other indexes:

 If you look at the rest of my mail - you would notice a 50-times
 difference in performance.
 What would you say?

 That accessing a page from RAM is more than 50 times as fast as a
 random access of that page from disk.

 -Kevin




 --
 Ayrapetyan Robert,
 Comodo Anti-Malware Data Processing Analysis and Management System
(CAMDPAMS)
 http://repo-qa.camdpams.odessa.office.comodo.net/mediawiki/index.php


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] big distinct clause vs. group by

2011-04-25 Thread Віталій Тимчишин
2011/4/23 Robert Haas robertmh...@gmail.com

 On Apr 18, 2011, at 1:13 PM, Uwe Bartels uwe.bart...@gmail.com wrote:
  Hi Robert,
 
  thanks for your answer.
  the aggregate function I was talking about is the function I need to use
 for the non-group by columns like min() in my example.
  There are of course several function to choose from, and I wanted to know
 which causes as less as possible resources.

 Oh, I see. min() is probably as good as anything. You could also create a
 custom aggregate that just always returns its first input. I've occasionally
 wished we had such a thing as a built-in.


I once made a single-value grouping function - it checks that all its
non-null input values are equal and returns that value, or raises an error
if there are two different values.
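
A minimal sketch of such an aggregate (all names are hypothetical):

create function single_value_acc(anyelement, anyelement)
returns anyelement as $$
begin
    if $1 is null then
        return $2;                 -- remember the first non-null value
    elsif $2 is not null and $1 is distinct from $2 then
        raise exception 'more than one distinct value in group';
    end if;
    return $1;
end;
$$ language plpgsql immutable;

create aggregate single_value(anyelement) (
    sfunc = single_value_acc,
    stype = anyelement
);

-- usage: select dept_id, single_value(dept_name) from emp group by dept_id;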

Best regards, Vitalii Tymchyshyn



-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Shouldn't we have a way to avoid risky plans?

2011-03-24 Thread Віталій Тимчишин
2011/3/23 Tom Lane t...@sss.pgh.pa.us

 Claudio Freire klaussfre...@gmail.com writes:
  On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus j...@agliodbs.com wrote:
  On 3/23/11 10:35 AM, Claudio Freire wrote:
   *  consider plan bailout: execute a tempting plan, if it takes too
  long or its effective cost raises well above the expected cost, bail
  to a safer plan

  That would actually solve this particular case.  It would still require
  us to have some definition of safer though.

  In my head, safer = better worst-case performance.

 If the planner starts operating on the basis of worst case rather than
 expected-case performance, the complaints will be far more numerous than
 they are today.

 This could be GUC-controllable, like plan_safety=0..1 with a low default
value. It could influence the costs of plans whose cost changes dramatically
with small table changes and/or where the statistics are uncertain. It could
also be used as a direct hint for such dangerous queries by changing the GUC
for a session/single query.


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Talking about optimizer, my long dream

2011-02-27 Thread Віталій Тимчишин
On February 27, 2011, at 19:59, Robert Haas robertmh...@gmail.com wrote:

 2011/2/4 Віталій Тимчишин tiv...@gmail.com:
  Hi, all.
  All this optimizer vs hints thread reminded me about a crazy idea that
 got to my head some time ago.
  I currently have two problems with the postgresql optimizer:
  1) Dictionary tables. A very usual thing is something like select * from
  big_table where dictionary_id = (select id from dictionary where
  name=value). This works awfully if the dictionary_id distribution is not
 uniform.

 Does it work better if you write it as a join?


 SELECT b.* FROM big_table b, dictionary d WHERE b.dictionary_id = d.id
 AND d.name = 'value'

 I would like to see a concrete example of this not working well,
 because I've been writing queries like this (with MANY tables) for
 years and it's usually worked very well for me.

 Here you are:
 PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
(Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit
create table a(dict int4, val int4);
create table b(dict int4, name text);
create index c on a(dict);
insert into b values (1, 'small'), (2, 'large');
insert into a values (1,1);
insert into a select 2,generate_series(1,10000);
analyze a;
analyze b;
test=# explain analyze select * from a where dict=1;
 QUERY PLAN

-
 Index Scan using c on a  (cost=0.00..8.27 rows=1 width=8) (actual
time=0.014..0.016 rows=1 loops=1)
   Index Cond: (dict = 1)
 Total runtime: 0.041 ms
(3 rows)
test=# explain analyze select * from a where dict=2;
 QUERY PLAN

-
 Seq Scan on a  (cost=0.00..170.01 rows=1 width=8) (actual
time=0.014..6.876 rows=1 loops=1)
   Filter: (dict = 2)
 Total runtime: 13.419 ms
(3 rows)
test=# explain analyze select * from a,b where a.dict=b.dict and b.name
='small';
QUERY PLAN

---
 Hash Join  (cost=1.04..233.55 rows=5000 width=18) (actual
time=0.047..13.159 rows=1 loops=1)
   Hash Cond: (a.dict = b.dict)
   ->  Seq Scan on a  (cost=0.00..145.01 rows=10001 width=8) (actual
time=0.009..6.633 rows=10001 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=1 width=10) (actual time=0.011..0.011
rows=1 loops=1)
 ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual
time=0.006..0.008 rows=1 loops=1)
   Filter: (name = 'small'::text)
 Total runtime: 13.197 ms
(7 rows)
test=# explain analyze select * from a,b where a.dict=b.dict and b.name
='large';
QUERY PLAN

---
 Hash Join  (cost=1.04..233.55 rows=5000 width=18) (actual
time=0.074..21.476 rows=1 loops=1)
   Hash Cond: (a.dict = b.dict)
   ->  Seq Scan on a  (cost=0.00..145.01 rows=10001 width=8) (actual
time=0.012..7.085 rows=10001 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=1 width=10) (actual time=0.021..0.021
rows=1 loops=1)
 ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual
time=0.015..0.016 rows=1 loops=1)
   Filter: (name = 'large'::text)
 Total runtime: 28.293 ms
(7 rows)

It simply doesn't know that small=1 and large=2, so it never uses a nested
loop + index scan:
test=# set enable_hashjoin=false;
SET
test=# explain analyze select * from a,b where a.dict=b.dict and b.name
='small';
   QUERY PLAN


 Nested Loop  (cost=0.00..253.28 rows=5000 width=18) (actual
time=0.041..0.047 rows=1 loops=1)
   ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual
time=0.010..0.012 rows=1 loops=1)
 Filter: (name = 'small'::text)
   ->  Index Scan using c on a  (cost=0.00..189.75 rows=5000 width=8)
(actual time=0.021..0.023 rows=1 loops=1)
 Index Cond: (a.dict = b.dict)
 Total runtime: 0.089 ms
(6 rows)

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Why we don't want hints Was: Slow count(*) again...

2011-02-11 Thread Віталій Тимчишин
2011/2/10 Tobias Brox tobi...@gmail.com

 On 4 February 2011 04:46, Josh Berkus j...@agliodbs.com wrote:
  Optimizer hints are used to work around problems in the optimizer and
  introduce upgrade and maintenance issues. We would rather have the
  problems reported and fixed. We have discussed a more sophisticated
  system of per-class cost adjustment instead, but a specification remains
  to be developed.

 I have no clue about how hints works in Oracle ... I've never been
 working enterprise level on anything else than Postgres.  Anyway,
 today I just came over an interesting problem in our production
 database today - and I think it would be a benefit to be able to
 explicitly tell the planner what index to use (the dev team is adding
 redundant attributes and more indexes to solve the problem - which
 worries me, because we will run into serious problems as soon as there
 won't be enough memory for all the frequently-used indexes).

 We have users and transactions, and we have transaction types.  The
 transaction table is huge.  The users are able to interactively check
 their transaction listings online, and they have some simple filter
 options available as well.  Slightly simplified, the queries done
 looks like this:

   select * from account_transaction where account_id=? order by
 created desc limit 25;

   select * from account_transaction where trans_type_id in ( ...
 long, hard-coded list ...) and account_id=? order by created desc
 limit 25;

 and we have indexes on:

 account_transaction(account_id, created)

   account_transaction(account_id, trans_type_id, created)

 If the list is hard-coded, you can create a partial index on
account_transaction(account_id, created desc) where trans_type_id in ( ...
long, hard-coded list ...)
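
A minimal sketch (the index name and the type list are hypothetical):

create index account_transaction_typed_idx
    on account_transaction (account_id, created desc)
    where trans_type_id in (1, 2, 3, 5, 8);

The filtered query can then be answered by this one index, while the
unfiltered query keeps using account_transaction(account_id, created).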


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] How to best use 32 15k.7 300GB drives?

2011-02-05 Thread Віталій Тимчишин
2011/2/4 Robert Haas robertmh...@gmail.com

 On Fri, Feb 4, 2011 at 4:19 AM, Vitalii Tymchyshyn tiv...@gmail.com
 wrote:
  Why do you expect such a invasive code changes? I know little about
  postgresql code layering, but what I propose (with changing delete to
  truncate) is:
  1) Leave tuple addressing as it is now

 i.e. a block number and a slot position within the block?

 Seems like you'd need file,block,slot.


No, that's what I mean. Leave it as it is. You will have the file's logical
length (fixed for all but the last one, 1GB currently) and the file's
actual length, which can be less (if the file is truncated). In the latter
case you still have these empty blocks that don't exist at all. Actually
the simplest implementation could be to tell the file system to drop this
part of the file and pretend it's all zeros, but I don't think many FSs
(OSes?) support this. So each file still has its fixed N blocks, and the
file number is still blocknumber / N.


-- 
Best regards,
 Vitalii Tymchyshyn


[PERFORM] Talking about optimizer, my long dream

2011-02-04 Thread Віталій Тимчишин
Hi, all.

All this optimizer vs hints thread reminded me about a crazy idea that got
to my head some time ago.
I currently have two problems with the postgresql optimizer:
1) Dictionary tables. A very usual thing is something like select * from
big_table where dictionary_id = (select id from dictionary where
name=value). This works awfully if the dictionary_id distribution is not
uniform. The thing that helps is to retrieve the subselect value first and
then simply do select * from big_table where dictionary_id=id_value, as
sketched below.
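
A minimal sketch of that two-step workaround (names and the returned id
are hypothetical):

-- step 1: resolve the dictionary value separately
select id from dictionary where name = 'value';  -- suppose it returns 42

-- step 2: use the literal, so the planner can apply the per-value
-- statistics for dictionary_id
select * from big_table where dictionary_id = 42;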
2) Complex queries. If there are over 3 levels of subselects, the
optimizer's row estimates often become less and less correct as we go up
the levels. At around the 3rd level this often leads to wrong choices. The
thing that helps is to create temporary tables from the subselects, analyze
them, and then do the main select using these temporary tables.
While the first one could be fixed by introducing some correlation
statistics, I don't think there is any simple way to fix the second one.

But what if the optimizer could, in some cases, say fetch this and this,
and then I'll plan the other part of the query based on statistics of what
you've fetched?

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [HACKERS] [PERFORM] Slow count(*) again...

2011-02-03 Thread Віталій Тимчишин
2011/2/3 da...@lang.hm


 If the table is not large enough to fit in ram, then it will compete for
 I/O, and the user will have to wait.

 what I'm proposing is that as the records are created, the process doing
 the creation makes copies of the records (either all of them, or some of
 them if not all are needed for the analysis, possibly via shareing memory
 with the analysis process), this would be synchronous with the load, not
 asynchronous.

 this would take zero I/O bandwidth, it would take up some ram, memory
 bandwidth, and cpu time, but a load of a large table like this is I/O
 constrained.

 it would not make sense for this to be the default, but as an option it
 should save a significant amount of time.

 I am making the assumption that an Analyze run only has to go over the data
 once (a sequential scan of the table if it's < ram for example) and gathers
 stats as it goes.

 with the current code, this is a completely separate process that knows
 nothing about the load, so if you kick it off when you start the load, it
 makes a pass over the table (competing for I/O), finishes, you continue to
 update the table, so it makes another pass, etc. As you say, this is a bad
 thing to do. I am saying to have an option that ties the two together,
 essentially making the data feed into the Analyze run be a fork of the data
 coming out of the insert run going to disk. So the Analyze run doesn't do
 any I/O and isn't going to complete until the insert is complete. At which
 time it will have seen one copy of the entire table.

 Actually those are two different problems. One is to make analyze more
automatic, making a select right after an insert more clever by providing
statistics to it.
The other is to make it take fewer IO resources.
I don't like it being embedded into insert (unless the threshold can be
determined before the inserts start), simply because it is more CPU/memory
that will slow down each insert. And if you add a knob that is disabled by
default, this will be no better than a manual analyze.


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [HACKERS] [PERFORM] Slow count(*) again...

2011-02-03 Thread Віталій Тимчишин
2011/2/4 Mladen Gogala mladen.gog...@vmsinfo.com

 Josh Berkus wrote:

 However, since this system wasn't directly compatible with Oracle Hints,
 folks pushing for hints dropped the solution as unsatisfactory. This is
 the discussion we have every time: the users who want hints specifically
 want hints which work exactly like Oracle's, and aren't interested in a
 system designed for PostgreSQL.  It's gotten very boring; it's like the
 requests to support MySQL-only syntax.


 Actually, I don't want Oracle hints. Oracle hints are ugly and cumbersome.
 I would prefer something like this:


 http://dev.mysql.com/doc/refman/5.0/en/index-hints.html

 As far as I can see, this would have to be embedded into the query, would
it not? You can achieve something like this by setting variables right
before the query (usually even in the same call, by embedding multiple
statements into the execute-query call).
E.g. set random_page_cost=1; select something that needs the index; set
random_page_cost to default;. Yes, this is as ugly as a hack may look and
can't be used on a per-table basis in a complex statement, but you have it.
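
A minimal sketch of that hack (the middle query is just a placeholder):

set random_page_cost = 1;        -- make index scans look cheap
select * from some_table where indexed_col = 42;
set random_page_cost to default; -- restore normal planner costing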

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [HACKERS] [PERFORM] Slow count(*) again...

2011-02-03 Thread Віталій Тимчишин
On February 4, 2011, at 09:32, da...@lang.hm wrote:



 when a copy command is issued, I assume that there is some indication of
 how much data is going to follow. I know that it's not just 'insert
 everything until the TCP connection terminates' because that would give you
 no way of knowing if the copy got everything in or was interrupted part way
 through. think about what happens with ftp if the connection drops, you get
 a partial file 'successfully' as there is no size provided, but with HTTP
 you get a known-bad transfer that you can abort or resume.

 I don't think so, since you can do 'cat my_large_copy.sql | psql'. AFAIR it
simply looks for an end-of-data marker, either in the protocol or in the
stream itself (run copy from stdin in psql and it will tell you what the
marker is: a backslash-period, \., on its own line).



-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] How to best use 32 15k.7 300GB drives?

2011-01-30 Thread Віталій Тимчишин
2011/1/28 Scott Carey sc...@richrelevance.com



 On 1/28/11 9:28 AM, Stephen Frost sfr...@snowman.net wrote:

 * Scott Marlowe (scott.marl...@gmail.com) wrote:
  There's nothing wrong with whole table updates as part of an import
  process, you just have to know to clean up after you're done, and
  regular vacuum can't fix this issue, only vacuum full or reindex or
  cluster.
 
 Just to share my experiences- I've found that creating a new table and
 inserting into it is actually faster than doing full-table updates, if
 that's an option for you.

 I wonder if postgres could automatically optimize that, if it thought that
 it was going to update more than X% of a table, and HOT was not going to
 help, then just create a new table file for XID's = or higher than the one
 making the change, and leave the old one for old XIDs, then regular VACUUM
 could toss out the old one if no more transactions could see it.


 I was thinking about whether a table file could be deleted if it has not a
single live row, and whether this could be done by vacuum. In this case a
vacuum on a table that was fully updated recently could be almost as good
as cluster - any scan would skip such non-existent files really fast, and
almost no disk space would be wasted.

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] hashed subplan 5000x slower than two sequential operations

2011-01-18 Thread Віталій Тимчишин
2011/1/18 masterchief esi...@theiqgroup.com


  Tom Lane wrote:
 
  The only really effective way the planner knows to optimize an
  IN (sub-SELECT) is to turn it into a semi-join, which is not possible
  here because of the unrelated OR clause.  You might consider replacing
  this with a UNION of two scans of contexts.  (And yes, I know it'd be
  nicer if the planner did that for you.)

 In moving our application from Oracle to Postgres, we've discovered that a
 large number of our reports fall into this category.  If we rewrite them as
 a UNION of two scans, it would be quite a big undertaking.  Is there a way
 to tell the planner explicitly to use a semi-join (I may not grasp the
 concepts here)?  If not, would your advice be to hunker down and rewrite
 the
 queries?


 You can try exists instead of in; postgresql likes exists better.
Alternatively, you can do something like set enable_seqscan=false. Note
that such a set is more like a hammer, so it should be avoided. If it is
the only thing that helps, it can be set right before running the query and
reset to the default afterwards.
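
A minimal sketch of the rewrite (table and column names are hypothetical):

-- IN form:
select * from documents d
 where d.context_id in (select c.id from contexts c where c.owner = 42);

-- EXISTS form, which the planner often turns into a semi-join:
select * from documents d
 where exists (select 1 from contexts c
                where c.id = d.context_id and c.owner = 42);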
--

Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Performance under contention

2010-12-07 Thread Віталій Тимчишин
2010/12/7 Robert Haas robertmh...@gmail.com

 On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras ivo...@freebsd.org wrote:

  I'm not very familiar with PostgreSQL code but if we're
  brainstorming... if you're only trying to protect against a small
  number of expensive operations (like DROP, etc.) that don't really
  happen often, wouldn't an atomic reference counter be good enough for
  the purpose (e.g. the expensive operations would spin-wait until the
  counter is 0)?

 No, because (1) busy-waiting is only suitable for locks that will only
 be held for a short time, and an AccessShareLock on a table might be
 held while we read 10GB of data in from disk, and (2) that wouldn't
 allow for deadlock detection.


As far as I understand this thread, the talk is about contention - where a
large number of processors want to get a single partition lock in order to
take a high-level shared lock.
As far as I can see from the source, there is a lot of code executed under
the partition lock protection, like two hash searches (and possibly
allocations).
What can be done is that the number of locks can be increased - one could
use spin locks for the hash table manipulations, e.g. a lock preventing
rehashing (the number of buckets being changed) and a lock for the required
bucket.
In this case only a small range of code would be protected by the partition
lock. As for me, this would make the locking process more cpu-intensive
(more locks acquired/freed during the execution), but would decrease
contention (since all but one lock can be spin locks working on atomic
counters, and hash searches can be done in parallel), won't it?
The thing I am not sure about is how much spinlocks on atomic counters cost
today.

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Slow query to get last created row using CURRVAL

2010-12-04 Thread Віталій Тимчишин
2010/12/4 Mathieu De Zutter math...@dezutter.org


 For each page load I first create an entry in that table, e.g.:

 INSERT INTO log_event (user_id, ip, action_id, object1_id, object2_id,
 event_timestamp, comments) VALUES (1, '127.0.0.1', 96, null, null,
 NOW(), 'TEST');

 After that, I want to retrieve the data stored in log_event from a
 trigger, e.g.:

 SELECT user_id FROM log_event WHERE id = CURRVAL('log_event_id_seq');

 This way my insert-trigger knows who is creating the new row, while
 using only one pg-user to query the database.

 Please note that you can use the following query to perform both the
insert and the select:

INSERT INTO log_event (user_id, ip, action_id, object1_id, object2_id,
event_timestamp, comments) VALUES (1, '127.0.0.1', 96, null, null,
NOW(), 'TEST') returning user_id;
-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] anti-join chosen even when slower than old plan

2010-11-11 Thread Віталій Тимчишин
2010/11/11 Robert Haas robertmh...@gmail.com


 But thinking over what you've written here, I'm reminded of something
 Peter said years ago, also about the optimizer.  He was discussed the
 ratio of the estimated cost to the actual cost and made an off-hand
 remark that efforts had been made over the years to make that ratio
 more consistent (i.e. improve the quality of the cost estimates) but
 that they'd been abandoned because they didn't necessarily produce
 better plans.  Applying that line of thinking to this problem, maybe
 we should give up on trying to make the estimates truly model reality,
 and focus more on assigning them values which work well in practice.
 For example, in your case, it would be sufficient to estimate the
 amount of data that a given query is going to grovel through and then
 applying some heuristic to choose values for random_page_cost and
 seq_page_cost based on the ratio of that value to, I don't know,
 effective_cache_size.


As for me, the simplest solution would be to allow setting costs on a
per-relation basis. E.g. I know that this relation is most of the time in
memory and the other one (archive) is on disk. This could work like a charm
along with buffer pools (portions of the shared cache) - tables (or
indexes) that are required to be cached can be assigned to a buffer pool
that has enough size to hold all the data, archive ones - to a small buffer
pool. This can guarantee that after a query on the archive data, the cached
tables are still cached.
This solution, however, does not help on tables where only some portion of
the table is actively used. The solution there could be to allow setting
costs via partial indexes - e.g. for any table access using this index, use
these cost values. This, BTW, would make table access via the given index
more preferable.

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Slow count(*) again...

2010-10-10 Thread Віталій Тимчишин
2010/10/10 Neil Whelchel neil.whelc...@gmail.com

 On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
  On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel neil.whelc...@gmail.com
 wrote:
   I know that there haven been many discussions on the slowness of
 count(*)
   even when an index is involved because the visibility of the rows has
 to
   be checked. In the past I have seen many suggestions about using
   triggers and tables to keep track of counts and while this works fine
 in
   a situation where you know what the report is going to be ahead of
 time,
   this is simply not an option when an unknown WHERE clause is to be used
   (dynamically generated). I ran into a fine example of this when I was
   searching this mailing list, Searching in 856,646 pages took 13.48202
   seconds. Site search powered by PostgreSQL 8.3. Obviously at some
 point
   count(*) came into play here because the site made a list of pages (1 2
   3 4 5 6  next). I very commonly make a list of pages from search
   results, and the biggest time killer here is the count(*) portion, even
   worse yet, I sometimes have to hit the database with two SELECT
   statements, one with OFFSET and LIMIT to get the page of results I need
   and another to get the amount of total rows so I can estimate how many
   pages of results are available. The point I am driving at here is that
   since building a list of pages of results is such a common thing to do,
   there need to be some specific high speed ways to do this in one query.
   Maybe an estimate(*) that works like count but gives an answer from the
   index without checking visibility? I am sure that this would be good
   enough to make a page list, it is really no big deal if it errors on
 the
   positive side, maybe the list of pages has an extra page off the end. I
   can live with that. What I can't live with is taking 13 seconds to get
 a
   page of results from 850,000 rows in a table.
 
  99% of the time in the situations you don't need an exact measure, and
  assuming analyze has run recently, select rel_tuples from pg_class for
  a given table is more than close enough.  I'm sure wrapping that in a
  simple estimated_rows() function would be easy enough to do.

 This is a very good approach and it works very well when you are counting
 the
 entire table, but when you have no control over the WHERE clause, it
 doesn't
 help. IE: someone puts in a word to look for in a web form.

 From my perspective, this issue is the biggest problem there is when using
 Postgres to create web pages, and it is so commonly used, I think that
 there
 should be a specific way to deal with it so that you don't have to run the
 same WHERE clause twice.
 IE: SELECT count(*) FROM table WHERE clause; to get the total amount of
 items to make page navigation links, then:
 SELECT columns FROM table WHERE clause LIMIT items_per_page OFFSET
 (page_no-1)*items_per_page; to get the actual page contents.

 How about

select * from (
    select *, count(*) over () as total_count
    from table where clause
) a
LIMIT items_per_page OFFSET (page_no-1)*items_per_page

It will return a total_count column with an equal value in each row. You
may have problems if no rows are returned (e.g. the page number is too
high).
-- 
Best regards,
 Vitalii Tymchyshyn


[PERFORM] Exists, limit and alternate plans

2010-07-12 Thread Віталій Тимчишин
Hello.

Today I've found strange results for the query below.
select version();
 version

--
 PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC)
4.2.1 20070719  [FreeBSD], 64-bit

--Original query:
explain analyze select exists(select * from investor i where i.company_id =
this_.id) from COMPANY this_ order by this_.rank desc, this_.id asc limit
10;
 Limit  (cost=0.00..50.67 rows=10 width=16) (actual time=144.489..144.556
rows=10 loops=1)
   ->  Index Scan using comp_rank_id on company this_
 (cost=0.00..34616009.08 rows=6831169 width=16) (actual
time=144.484..144.524 rows=10 loops=1)
 SubPlan 1
   ->  Index Scan using company_invs on investor i  (cost=0.00..9.52
rows=2 width=0) (never executed)
 Index Cond: ((company_id)::bigint = $0)
 SubPlan 2
   ->  Seq Scan on investor i  (cost=0.00..1836.17 rows=41717
width=8) (actual time=0.006..72.364 rows=41722 loops=1)
 Total runtime: 144.975 ms
(8 rows)

--set enable_seqscan=false;
explain analyze select exists(select * from investor i where i.company_id =
this_.id) from COMPANY this_ order by this_.rank desc, this_.id asc limit
10;
 Limit  (cost=0.00..50.67 rows=10 width=16) (actual time=0.045..0.177
rows=10 loops=1)
   ->  Index Scan using comp_rank_id on company this_
 (cost=0.00..34616009.08 rows=6831169 width=16) (actual time=0.041..0.146
rows=10 loops=1)
 SubPlan 1
   ->  Index Scan using company_invs on investor i  (cost=0.00..9.52
rows=2 width=0) (actual time=0.007..0.007 rows=1 loops=10)
 Index Cond: ((company_id)::bigint = $0)
 SubPlan 2
   ->  Seq Scan on investor i  (cost=100.00..1001836.17
rows=41717 width=8) (never executed)
 Total runtime: 0.253 ms
(8 rows)

--limit inside exists
explain analyze select exists(select * from investor i where i.company_id =
this_.id limit 1) from COMPANY this_ order by this_.rank desc, this_.id asc
limit 10;
 Limit  (cost=0.00..50.67 rows=10 width=16) (actual time=0.052..0.219
rows=10 loops=1)
   ->  Index Scan using comp_rank_id on company this_
 (cost=0.00..34616009.08 rows=6831169 width=16) (actual time=0.049..0.189
rows=10 loops=1)
 SubPlan 1
   ->  Limit  (cost=0.00..4.76 rows=1 width=422) (actual
time=0.011..0.011 rows=1 loops=10)
 ->  Index Scan using company_invs on investor i
 (cost=0.00..9.52 rows=2 width=422) (actual time=0.007..0.007 rows=1
loops=10)
   Index Cond: ((company_id)::bigint = $0)
 Total runtime: 0.291 ms
(7 rows)

So, my questions:
1) Do we really have alternative plans for a SubPlan that are selected at
runtime? Wow.
2) Why is the seq scan plan selected by default? Is it because the outer
limit is not applied when calculating costs for subplans at runtime?
3) Why does the limit inside exists help? Is it simply because the new
alternative-plan logic is not applied for the complex case?

-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Benchmark with FreeBSD 8.0 and pgbench

2010-05-17 Thread Віталій Тимчишин
Hi.

Not strictly connected to your tests, but:
As for ZFS, our experience is that it degrades over time after random
updates because files become non-linear and sequential reads become random.
Also there are questions about the ZFS block size - setting it to 8K makes
the first problem worse; setting it to higher values means that an 8K write
will need a read to recreate the whole block in a new place.

Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Slow Bulk Delete

2010-05-17 Thread Віталій Тимчишин
2010/5/17 Jon Nelson jnelson+pg...@jamponi.net


 On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote:
  - or use a JOIN delete with a virtual VALUES table
  - or fill a temp table with ids and use a JOIN DELETE

 What is a virtual VALUES table? Can you give me an example of using a
 virtual table with selects, joins, and also deletes?



delete from a using (values (1),(2),(5),(8)) b(x) where a.id=b.x

See http://www.postgresql.org/docs/8.4/static/sql-values.html

-- 
Best regards,
Vitalii Tymchyshyn


[PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Віталій Тимчишин
Hello.

I have a query that performs very poorly because there is a limit on the
join column that is not applied to the other columns:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id left outer join company_descr ces3_ on
this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000
and this_1_.company_id>5000
order by this_.id asc limit 1000;

(plan1.txt)
Total runtime: 7794.692 ms

At the same time, if I apply the limit (>5000) to the other columns in the
query itself, it works like a charm:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id left outer join company_descr ces3_ on
this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000
and this_1_.company_id>5000
and companymea2_.company_id>5000 and ces3_.company_id>5000
order by this_.id asc limit 1000;

(plan2.txt)
Total runtime: 27.547 ms

I thought (and someone on this list told me) that this should be done
automatically. But I have a pretty recent server:
PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1
20070719  [FreeBSD], 64-bit
and it still does not work.

Do I misunderstand something, or does this feature not work for such a
query?

Best regards, Vitalii Tymchyshyn
Limit  (cost=497.46..1464.50 rows=1000 width=693) (actual 
time=7767.721..7793.047 rows=1000 loops=1)
  ->  Merge Left Join  (cost=497.46..5521612.64 rows=5709243 width=693) (actual 
time=7767.717..7790.274 rows=1000 loops=1)
Merge Cond: (this_.id = (companymea2_.company_id)::bigint)
->  Merge Left Join  (cost=404.31..4544508.54 rows=5709243 width=625) 
(actual time=4211.501..4227.215 rows=1000 loops=1)
  Merge Cond: (this_.id = (ces3_.company_id)::bigint)
  ->  Merge Join  (cost=37.87..2435536.00 rows=5709243 width=123) 
(actual time=0.069..8.584 rows=1000 loops=1)
Merge Cond: (this_.id = (this_1_.company_id)::bigint)
->  Index Scan using pk_comp_m on company this_  
(cost=0.00..2139324.53 rows=6886598 width=107) (actual time=0.031..1.876 
rows=1054 loops=1)
  Index Cond: (id > 5000)
->  Index Scan using company_tag_company_id_idx on 
company_tag this_1_  (cost=0.00..207678.85 rows=5709243 width=16) (actual 
time=0.027..1.757 rows=1000 loops=1)
  Index Cond: ((this_1_.company_id)::bigint > 5000)
  Filter: ((this_1_.tag_id)::bigint = 7)
  ->  Index Scan using company_descrs on company_descr ces3_  
(cost=0.00..2073526.89 rows=1996612 width=502) (actual time=0.014..2576.013 
rows=1097096 loops=1)
->  Index Scan using pk_comp_ms on company_measures companymea2_  
(cost=0.00..889427.81 rows=6821672 width=68) (actual time=0.020..1946.255 
rows=1097096 loops=1)
Total runtime: 7794.692 ms

Limit  (cost=330.57..4888.97 rows=1000 width=693) (actual time=0.125..26.011 
rows=1000 loops=1)
  ->  Merge Join  (cost=330.57..5260651.29 rows=1153986 width=693) (actual 
time=0.122..23.215 rows=1000 loops=1)
Merge Cond: (this_.id = (this_1_.company_id)::bigint)
->  Merge Join  (cost=292.71..5023728.99 rows=1391960 width=677) 
(actual time=0.090..16.615 rows=1054 loops=1)
  Merge Cond: ((companymea2_.company_id)::bigint = this_.id)
  ->  Merge Join  (cost=0.00..2846769.89 rows=1685017 width=570) 
(actual time=0.063..9.534 rows=1054 loops=1)
Merge Cond: ((companymea2_.company_id)::bigint = 
(ces3_.company_id)::bigint)
->  Index Scan using pk_comp_ms on company_measures 
companymea2_  (cost=0.00..848312.00 rows=565 width=68) (actual 
time=0.033..1.973 rows=1054 loops=1)
  Index Cond: ((company_id)::bigint > 5000)
->  Index Scan using company_descrs on company_descr ces3_  
(cost=0.00..1963172.96 rows=1685017 width=502) (actual time=0.020..2.152 
rows=1054 loops=1)
  Index Cond: ((ces3_.company_id)::bigint > 5000)
  ->  Index Scan using pk_comp_m on company this_  
(cost=0.00..2139324.53 rows=6886598 width=107) (actual time=0.022..1.855 
rows=1054 loops=1)
Index Cond: (this_.id > 5000)
->  Index Scan using company_tag_company_id_idx on company_tag this_1_  
(cost=0.00..207678.85 rows=5709243 width=16) (actual time=0.025..1.742 
rows=1000 loops=1)
  Index Cond: ((this_1_.company_id)::bigint > 5000)
  Filter: ((this_1_.tag_id)::bigint = 7)
Total runtime: 27.547 ms




Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Віталій Тимчишин
On April 16, 2010, at 11:31, Yeb Havinga yebhavi...@gmail.com wrote:

 Віталій Тимчишин wrote:

 Hello.

  I have a query that performs very poorly because there is a limit on the
  join column that is not applied to the other columns:

  select * from company this_ left outer join company_tag this_1_ on
  this_.id=this_1_.company_id left outer join company_measures companymea2_ on
  this_.id=companymea2_.company_id left outer join company_descr ces3_ on
  this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and
  this_1_.company_id>5000
  order by this_.id asc limit 1000;

 (plan1.txt)
 Total runtime: 7794.692 ms

  At the same time, if I apply the limit (>5000) to the other columns in
  the query itself, it works like a charm:

  select * from company this_ left outer join company_tag this_1_ on
  this_.id=this_1_.company_id left outer join company_measures companymea2_ on
  this_.id=companymea2_.company_id left outer join company_descr ces3_ on
  this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and
  this_1_.company_id>5000
  and companymea2_.company_id>5000 and ces3_.company_id>5000
  order by this_.id asc limit 1000;

 (plan2.txt)
 Total runtime: 27.547 ms

 I've thought and someone in this list've told me that this should be done
 automatically.

 Yes, if you have in a query a=b and b=c, then the optimizer figures out
 that a=c as well. (a,b and c are then member of the same equivalence class).

 However both queries are not the same, since the joins you're using are
 outer joins. In the first it's possible that records are returned for
 company records with no matching ces3_ records, the ces3_ records is null in
 that case. In the second query no NULL ces3_ information may be returned.


OK, but when I move the limit to the join condition the query is still fast:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id
left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id and companymea2_.company_id>5000
left outer join company_descr ces3_ on this_.id=ces3_.company_id and
ces3_.company_id>5000
where this_1_.tag_id = 7 and this_.id>5000
and this_1_.company_id>5000
order by this_.id asc limit 1000;

(plan3.txt),
Total runtime: 26.327 ms
BTW: Changing the slow query to inner joins does not make it fast



 Another thing is it seems that the number of rows guessed is far off from
 the actual number of rows, is the number 500 artificial or are your
 statistics old or too small histogram/mcv's?


Nope, I suppose this is because of the limit. If I remove the limit, the
estimations are quite correct. There are ~6 million rows in each table.
Limit  (cost=293.40..1028.60 rows=1000 width=1209) (actual time=0.128..24.751 
rows=1000 loops=1)
  ->  Merge Left Join  (cost=293.40..4197731.11 rows=5709243 width=1209) 
(actual time=0.124..21.968 rows=1000 loops=1)
Merge Cond: (this_.id = (companymea2_.company_id)::bigint)
->  Merge Left Join  (cost=246.59..3681230.10 rows=5709243 width=1141) 
(actual time=0.099..15.284 rows=1000 loops=1)
  Merge Cond: (this_.id = (ces3_.company_id)::bigint)
  ->  Merge Join  (cost=37.87..2435536.00 rows=5709243 width=639) 
(actual time=0.074..8.487 rows=1000 loops=1)
Merge Cond: (this_.id = (this_1_.company_id)::bigint)
->  Index Scan using pk_comp_m on company this_  
(cost=0.00..2139324.53 rows=6886598 width=623) (actual time=0.037..1.860 
rows=1054 loops=1)
  Index Cond: (id > 5000)
->  Index Scan using company_tag_company_id_idx on 
company_tag this_1_  (cost=0.00..207678.85 rows=5709243 width=16) (actual 
time=0.027..1.758 rows=1000 loops=1)
  Index Cond: ((this_1_.company_id)::bigint > 5000)
  Filter: ((this_1_.tag_id)::bigint = 7)
  ->  Index Scan using cws_company_descr_unique on company_descr 
ces3_  (cost=0.00..1169511.92 rows=5757068 width=502) (actual time=0.020..1.788 
rows=1054 loops=1)
Index Cond: ((ces3_.company_id)::bigint > 5000)
->  Index Scan using pk_comp_ms on company_measures companymea2_  
(cost=0.00..440945.79 rows=565 width=68) (actual time=0.019..1.729 
rows=1054 loops=1)
  Index Cond: ((companymea2_.company_id)::bigint > 5000)
Total runtime: 26.327 ms




Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Віталій Тимчишин
On April 16, 2010, at 11:25, Hannu Krosing ha...@2ndquadrant.com wrote:

 On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:
  Hello.
 
 
  I have a query that performs very poor because there is a limit on
  join column that is not applied to other columns:
 
 
  select * from company this_ left outer join company_tag this_1_ on
  this_.id=this_1_.company_id left outer join company_measures
  companymea2_ on this_.id=companymea2_.company_id left outer join
  company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
   = 7 and this_.id>5000
   and this_1_.company_id>5000
  order by this_.id asc limit 1000;
 
 
  (plan1.txt)
  Total runtime: 7794.692 ms
 
 
  At the same time if I apply the limit (5000) to other columns in
  query itself it works like a charm:
 
 
  select * from company this_ left outer join company_tag this_1_ on
  this_.id=this_1_.company_id left outer join company_measures
  companymea2_ on this_.id=companymea2_.company_id left outer join
  company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
   = 7 and this_.id>5000
   and this_1_.company_id>5000
   and companymea2_.company_id>5000 and ces3_.company_id>5000
  order by this_.id asc limit 1000;

 The queries are not the same.

 2nd variant will not return the rows where there are no matching rows
 inthis_1_ , companymea2_ or ces3_.company_id

 A query equivalent to first one would be:


 select * from company this_
  left outer join company_tag this_1_
   on (this_.id=this_1_.company_id
and this_1_.company_id>5000)
   left outer join company_measures companymea2_
   on (this_.id=companymea2_.company_id
and companymea2_.company_id>5000)
   left outer join company_descr ces3_
   on (this_.id=ces3_.company_id
and ces3_.company_id>5000)
  where this_1_.tag_id = 7
   and this_.id>5000
  order by this_.id asc
  limit 1000;


And it's still fast (see the plan in another mail), while the inner join
variant of the original query is still slow.




 I'm not sure that planner considers the above form of plan rewrite, nor
 that it would make much sense to do so unless there was a really small
 number of rows where x_.company_id>5000

 Actually no:
select id > 5000, count(*) from company group by 1;
 f,1096042
 t,5725630

I don't know why the planner wants to perform a few merges of 1000 rows
against a million records (and the merges are what takes the time) instead
of taking 1000 records from the main table and then doing a nested loop.
And it must read all the records that do NOT match the criteria for the
secondary tables before getting to the correct records, if it does not
filter the secondary tables with an index on retrieval.

set enable_mergejoin=false helps the original query, but that is another
problem. The first solution is simpler and could be applied by the planner
automatically, while the second requires rethinking/rewriting the LIMIT
estimation logic. (Plan of the nested loop attached.)
Limit  (cost=0.00..2369.36 rows=1000 width=1209) (actual time=0.179..41.155 
rows=1000 loops=1)
  ->  Nested Loop  (cost=0.00..15727940.41 rows=6638046 width=1209) (actual 
time=0.174..38.312 rows=1000 loops=1)
->  Nested Loop  (cost=0.00..11165483.75 rows=6701224 width=1141) 
(actual time=0.134..26.421 rows=1000 loops=1)
  ->  Nested Loop  (cost=0.00..5763844.03 rows=6765004 width=639) 
(actual time=0.066..14.389 rows=1000 loops=1)
->  Index Scan using pk_comp_m on company this_  
(cost=0.00..1152936.77 rows=6886598 width=623) (actual time=0.038..1.908 
rows=1054 loops=1)
  Index Cond: (id > 5000)
->  Index Scan using company_tag_company_id_idx on 
company_tag this_1_  (cost=0.00..0.66 rows=1 width=16) (actual 
time=0.004..0.006 rows=1 loops=1054)
  Index Cond: ((this_1_.company_id)::bigint = this_.id)
  Filter: ((this_1_.tag_id)::bigint = 7)
  ->  Index Scan using cws_company_descr_unique on company_descr 
ces3_  (cost=0.00..0.79 rows=1 width=502) (actual time=0.004..0.006 rows=1 
loops=1000)
Index Cond: ((ces3_.company_id)::bigint = this_.id)
->  Index Scan using pk_comp_ms on company_measures companymea2_  
(cost=0.00..0.67 rows=1 width=68) (actual time=0.004..0.006 rows=1 loops=1000)
  Index Cond: ((companymea2_.company_id)::bigint = this_.id)
Total runtime: 42.940 ms




Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Віталій Тимчишин
On 16 April 2010 at 16:21, Yeb Havinga yebhavi...@gmail.com wrote:

 Віталій Тимчишин wrote:


 BTW: Changing the slow query to inner joins does not make it fast

 I'm interested to see the query and plan of the slow query with inner joins.


 Here you are. The query:

select * from company this_ inner join company_tag this_1_ on
this_.id=this_1_.company_id
inner join company_measures companymea2_ on
this_.id=companymea2_.company_id
inner join company_descr ces3_ on this_.id=ces3_.company_id
where this_1_.tag_id = 7 and this_.id>50000000
order by this_.id asc
limit 1000
;
Total runtime: 14088.942 ms
(plan is attached)

Best regards, Vitalii Tymchyshyn
Limit  (cost=227.15..883.22 rows=1000 width=1209) (actual 
time=14062.106..14087.375 rows=1000 loops=1)
  ->  Merge Join  (cost=227.15..4355277.70 rows=6638046 width=1209) (actual 
time=14062.101..14084.577 rows=1000 loops=1)
        Merge Cond: (this_.id = (this_1_.company_id)::bigint)
        ->  Merge Join  (cost=185.14..4025978.59 rows=6757358 width=1193) 
(actual time=10692.975..10708.923 rows=1054 loops=1)
              Merge Cond: ((companymea2_.company_id)::bigint = this_.id)
              ->  Merge Join  (cost=0.00..1784574.44 rows=6821672 width=570) 
(actual time=0.111..9138.804 rows=1097096 loops=1)
                    Merge Cond: ((companymea2_.company_id)::bigint = 
(ces3_.company_id)::bigint)
                    ->  Index Scan using pk_comp_ms on company_measures 
companymea2_  (cost=0.00..456350.36 rows=6821672 width=68) (actual 
time=0.066..1747.291 rows=1097096 loops=1)
                    ->  Index Scan using cws_company_descr_unique on 
company_descr ces3_  (cost=0.00..1225899.00 rows=6821672 width=502) (actual 
time=0.033..1822.085 rows=1097096 loops=1)
              ->  Index Scan using pk_comp_m on company this_  
(cost=0.00..2139324.53 rows=6886598 width=623) (actual time=0.026..1.853 
rows=1054 loops=1)
                    Index Cond: (this_.id > 50000000)
        ->  Index Scan using company_tag_company_id_idx on company_tag this_1_  
(cost=0.00..229167.56 rows=6765004 width=16) (actual time=0.028..1859.967 
rows=1075634 loops=1)
              Filter: ((this_1_.tag_id)::bigint = 7)
Total runtime: 14088.942 ms




Re: [PERFORM] Deferred constraint and delete performance

2010-02-14 Thread Віталій Тимчишин
2010/2/10 Tom Lane t...@sss.pgh.pa.us

 Franck Routier franck.rout...@axege.com writes:
  I am wondering if deferring foreign key constraints (instead of
  disableing them) would increase performance, compared to non deferred
  constraints

 No, it wouldn't make any noticeable difference AFAICS.  It would
 postpone the work from end-of-statement to end-of-transaction,
 but not make the work happen any more (or less) efficiently.

 What about disc access? Won't working with one table, then another be
faster than working with both at the same time?


Re: [PERFORM] Queries within a function

2010-02-03 Thread Віталій Тимчишин
2010/2/2 Mridula Mahadevan mmahade...@stratify.com

  Hi,

  I am running a bunch of queries within a function, creating some temp
 tables and populating them. When the data exceeds say, 100k the queries
 start getting really slow and timeout (30 min). when these are run outside
 of a transaction(in auto commit mode), they run in a few seconds. Any ideas
 on what may be going on and any postgresql.conf parameters etc that might
 help?

 Thanks

Have you tried to analyze the temp tables after you've populated them? Because
AFAIK it won't do that automatically for tables created, filled and then used
in the same transaction.
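Roughly what I mean inside such a function, as a sketch (table and column
names are hypothetical):

    CREATE TEMPORARY TABLE tmp_stage ON COMMIT DROP AS
        SELECT id, payload FROM big_table WHERE payload IS NOT NULL;
    -- collect statistics so the queries that follow get sane plans
    ANALYZE tmp_stage;
    SELECT count(*) FROM tmp_stage t JOIN other_table o ON o.id = t.id;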


Re: [PERFORM] Constraint propagating for equal fields

2010-02-01 Thread Віталій Тимчишин
On 30 January 2010 at 04:30, Greg Stark gsst...@mit.edu wrote:

 2010/1/28 Віталій Тимчишин tiv...@gmail.com
 
  I've always thought that PostgreSQL would propagate constraint from
 field1 to field2 if condition says field1=field2, but this does not seem the
 case:

 version?


PostgreSQL 8.3.7 on amd64-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1
20070719  [FreeBSD]


[PERFORM] Constraint propagating for equal fields

2010-01-28 Thread Віталій Тимчишин
Hello.

I've always thought that PostgreSQL would propagate a constraint from field1
to field2 if a condition says field1=field2, but this does not seem to be the case:
dict=# explain select * from domain_list,title.domains where processed_at is
not null and key=groupid and key < 1000000 and groupid < 1000000;
QUERY PLAN

--
 Hash Join  (cost=2179918.87..4529994.61 rows=4616 width=318)
   Hash Cond: (domain_list.key = domains.groupid)
   ->  Bitmap Heap Scan on domain_list  (cost=26253.02..2310541.55
rows=870759 width=123)
         Recheck Cond: (key < 1000000)
         ->  Bitmap Index Scan on domain_list_new_pkey  (cost=0.00..26035.33
rows=870759 width=0)
               Index Cond: (key < 1000000)
   ->  Hash  (cost=2119232.34..2119232.34 rows=864201 width=195)
         ->  Bitmap Heap Scan on domains  (cost=16674.34..2119232.34
rows=864201 width=195)
               Recheck Cond: (groupid < 1000000)
               Filter: (processed_at IS NOT NULL)
               ->  Bitmap Index Scan on dgroup  (cost=0.00..16458.29
rows=890154 width=0)
                     Index Cond: (groupid < 1000000)
(12 rows)

dict=# explain select * from domain_list,title.domains where processed_at is
not null and key=groupid and key < 1000000 ;
   QUERY PLAN


 Hash Join  (cost=2337583.04..18222634.81 rows=845372 width=318)
   Hash Cond: (domains.groupid = domain_list.key)
   ->  Seq Scan on domains  (cost=0.00..5423788.20 rows=158280964 width=195)
         Filter: (processed_at IS NOT NULL)
   ->  Hash  (cost=2310541.55..2310541.55 rows=870759 width=123)
         ->  Bitmap Heap Scan on domain_list  (cost=26253.02..2310541.55
rows=870759 width=123)
               Recheck Cond: (key < 1000000)
               ->  Bitmap Index Scan on domain_list_new_pkey
 (cost=0.00..26035.33 rows=870759 width=0)
                     Index Cond: (key < 1000000)
(9 rows)

dict=# explain select * from domain_list,title.domains where processed_at is
not null and key=groupid and groupid < 1000000;
 QUERY PLAN


 Hash Join  (cost=2153665.85..16943819.35 rows=862710 width=318)
   Hash Cond: (domain_list.key = domains.groupid)
   ->  Seq Scan on domain_list  (cost=0.00..6887257.54 rows=162753054
width=123)
   ->  Hash  (cost=2119232.34..2119232.34 rows=864201 width=195)
         ->  Bitmap Heap Scan on domains  (cost=16674.34..2119232.34
rows=864201 width=195)
               Recheck Cond: (groupid < 1000000)
               Filter: (processed_at IS NOT NULL)
               ->  Bitmap Index Scan on dgroup  (cost=0.00..16458.29
rows=890154 width=0)
                     Index Cond: (groupid < 1000000)
(9 rows)


The first query is the fastest one, but it is equivalent to both 2 and 3, and
I thought PostgreSQL could perform such propagation by itself.

Best regards, Vitalii Tymchyshyn.


Re: [PERFORM] Should the optimiser convert a CASE into a WHERE if it can?

2010-01-27 Thread Віталій Тимчишин
2010/1/26 Matthew Wakeling matt...@flymine.org

 On Tue, 26 Jan 2010, Richard Neill wrote:

 SELECT SUM (case when id > 1200000 and id < 1210000 then 1 else 0 end)
 from tbl_tracker;

 Explain shows that this does a sequential scan.


 I'd defer to Tom on this one, but really, for Postgres to work this out, it
 would have to peer deep into the mysterious SUM function, and realise that
 the number zero is a noop. I suppose it would be possible, but you'd have to
 define noops for each of the different possible functions, *and* make the
 planner clever enough to spot the noop-matching number in the else and
 convert the WHEN into a WHERE.

 Hello.

How about SELECT SUM (case when id > 1200000 and id < 1210000 then 1 end)
from tbl_tracker;
It gives the same result (except maybe when there are no records at all), and
the optimizer already knows it need not call the function for null input. Such
an optimization would cover many more cases. It would look like:
 * Check only for aggregate subselects
 * All the functions should be noops for null input
 * Add an ORed constraint that every function input is not null (in this
example: (case when id > A1 and id < B1 then 1 end is not null) or (case when
id > A2 and id < B2 then 1 end is not null) or ... or (case when id > An and
id < Bn then 1 end is not null))
 * Know the special case (case when id > A1 and id < B1 then 1 end is not
null) = (id > A1 and id < B1)
by ORing all the when conditions: case when C1 then D1 when C2 then D2 ...
when Cm then Dm end is not null = C1 or C2 or ... or Cm.
Even without the last part it may give bonuses, e.g. for select count(field)
from table transformed into select count(field) from table where field is
not null and using [partial] indexes.
As for the last *, replacing COUNT with SUM(CASE()) is used often enough when
multiple count calculations are needed.
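To illustrate that last point, a sketch of the multiple-counts pattern (the
ranges are made up):

    SELECT sum(CASE WHEN id > 1200000 AND id < 1210000 THEN 1 END) AS range_a,
           sum(CASE WHEN id > 2200000 AND id < 2210000 THEN 1 END) AS range_b
    FROM tbl_tracker;
    -- each CASE yields NULL outside its range and SUM skips NULLs,
    -- so two counts are computed in a single scan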

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Should the optimiser convert a CASE into a WHERE if it can?

2010-01-27 Thread Віталій Тимчишин
On 27 January 2010 at 19:01, Matthew Wakeling matt...@flymine.org wrote:

 On Wed, 27 Jan 2010, Віталій Тимчишин wrote:

 How about SELECT SUM (case when id > 1200000 and id < 1210000 then 1 end)
 from tbl_tracker;


 That is very interesting.


  * All the functions should be noop for null input


 Alas, not true for COUNT(*), AVG(), etc.

 select avg(b), count(b), count(*) from (values (2),(null)) a(b)
gives (2.0, 1, 2) for me, so AVG is in the game. Sure, it won't work for
count(*), but the optimizer already knows which aggregates are strict and
which are not, so no new information is needed.

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] View based upon function won't use index on joins

2009-11-20 Thread Віталій Тимчишин
How about

CREATE OR REPLACE VIEW value_codes_view AS
select * from (
 SELECT value_codes.id_nbr,
  value_codes.id_qfr,
  (ARRAY[val_1_cd_1, ... , val_2_cd_12])[i] as value_code,
  (ARRAY[val_1_amt_1, ... , val_2_amt_12])[i] as value_amount
   FROM value_codes, generate_series(1,24) i) a
where value_code is not null and value_code != '';
?


Re: [PERFORM] View based upon function won't use index on joins

2009-11-20 Thread Віталій Тимчишин
On 20 November 2009 at 17:01, Jonathan Foy the...@gmail.com wrote:

 This seems to result in the same problem; should I attempt to pull for a
 specific id_nbr/id_qfr, postgres uses the index without a problem. If I try
 to join the two tables/views however, it insists on doing a sequential scan
 (actually two in this case) and will not use the index.  Any other
 ideas/explanations?


Have you tried to do the same (join) when not using the views or converting
columns into records? Maybe the problem is not in the conversion, but in
something simpler, like statistics or index bloat?

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Modeling a table with arbitrary columns

2009-10-30 Thread Віталій Тимчишин
2009/10/29 Andreas Hartmann andr...@apache.org

 Hi everyone,

 I want to model the following scenario for an online marketing application:

 Users can create mailings. The list of recipients can be uploaded as
 spreadsheets with arbitrary columns (each row is a recipient). I expect the
 following maximum quantities the DB will contain:

 I see basically two approaches to store the recipients:

 A) A single table with a fixed number of generic columns. If the
 spreadsheet has less columns than the table, the values will be null.

 B) Two tables, one for the recipients and one for the values:


One more option is to use arrays (and a single table).
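A rough sketch of the array variant (all names are hypothetical):

    CREATE TABLE recipient (
        id         serial PRIMARY KEY,
        mailing_id integer NOT NULL,
        email      text NOT NULL,
        extra      text[]  -- element i holds the value of spreadsheet column i
    );

    -- the uploaded spreadsheet's column names can live in a small side table
    CREATE TABLE recipient_column (
        mailing_id integer NOT NULL,
        position   integer NOT NULL,
        name       text NOT NULL,
        PRIMARY KEY (mailing_id, position)
    );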


Re: [PERFORM] Getting a random row

2009-10-14 Thread Віталій Тимчишин
2009/10/14 Scott Marlowe scott.marl...@gmail.com

 On Wed, Oct 14, 2009 at 1:20 AM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  2009/10/14 Thom Brown thombr...@gmail.com:
  2009/10/14 Scott Marlowe scott.marl...@gmail.com:
  Why not just do something like:
 
  SELECT thisfield, thatfield
  FROM my_table
  WHERE thisfield IS NOT NULL
  ORDER BY RANDOM()
  LIMIT 1;
 
 
  this works well on small tables. On large tables this query is extremely
 slow.

 Exactly.  If you're running that query over and over your performance
 test is on how well pgsql can run that very query. :)  Anything else
 you do is likely to be noise by comparison.


What I am using often to get a set of random rows is
SELECT thisfield, thatfield
FROM my_table
WHERE random() < rowsneeded::float8/(select count(*) from my_table);
Of course it does not give the exact number of rows, but it is close enough
for me.
As for taking one row, I'd try:
select * from (
SELECT thisfield, thatfield
FROM my_table
WHERE random() < 100.0/(select count(*) from my_table))
a order by random() limit 1

I'd say the probability of returning no rows is quite low, and the query can
be extended even more by returning the first row from the table in this rare
case.
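One way to read that extension, as a rough sketch (same hypothetical table):

    select * from (
        (SELECT thisfield, thatfield
         FROM my_table
         WHERE random() < 100.0/(select count(*) from my_table))
        union all
        (SELECT thisfield, thatfield FROM my_table limit 1)  -- fallback row
    ) a order by random() limit 1;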


Re: [PERFORM] possible wrong query plan on pg 8.3.5,

2009-09-15 Thread Віталій Тимчишин
Maybe you have very bad disk access times (e.g. slow random access)? In this
case everything should be OK while the data is in cache, and awful when it is
not. Could you check disk IO speed and IO wait while doing the slow and the
fast query?

BTW: In this case, increasing shared buffers may help. At least this will
prevent other applications and (AFAIK) sequential scans from pushing your
index data out of the cache.

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Query tuning

2009-08-20 Thread Віталій Тимчишин
2009/8/19 Kevin Kempter kev...@consistentstate.com


 We do have an index on url_hits.time

 not sure why timestamps were not used, I was not here for the design phase.


What's the type of the time column? I don't like that it is cast to double in
the explain output. If it is integer, maybe you need to change

and time >= extract ('epoch' from timestamp '2009-08-12')
and time < extract ('epoch' from timestamp '2009-08-13' )

to

and time >= extract ('epoch' from timestamp '2009-08-12')::int4
and time < extract ('epoch' from timestamp '2009-08-13' )::int4

for the index to be used?


Re: [PERFORM] Can Postgres use an INDEX over an OR?

2009-07-27 Thread Віталій Тимчишин
On 27 July 2009 at 13:53, Robert Haas robertmh...@gmail.com wrote:


 Hmm.  What you're suggesting here is that we could consider
 implementing OR conditions by rescanning the inner side for each index
 qual and then unique-ifying the results on the index column.  That's
 probably possible, but it doesn't sound easy, especially since our
 selectivity-estimation code for OR conditions is not very good, so we
 might choose to do it this way when that's not actually the best plan.

 ...Robert


Actually what I am talking about is to implement OR via UNION (or UNION-like,
because it's a little different depending on input row uniqueness) as an
option. All of the OR parts can use or not use different strategies (including
multiple different indexes or hash joins).
In cases when the conditions are complex this can drastically increase
performance by winning over a sequence scan.

As for selectivity, I'd say this is a general problem - sometimes it is
estimated OK, sometimes not, but this should not prevent trying different
plans. (From my current work: it makes wrong estimations of filter
selectivity, introduces a HASH join and kills the server with OOM.)

Best regards, Vitaliy Tymchyshyn.


Re: [PERFORM] Can Postgres use an INDEX over an OR?

2009-07-27 Thread Віталій Тимчишин
On 27 July 2009 at 15:02, Robert Haas robertmh...@gmail.com wrote:


 The problem, though, is that it won't ALWAYS be right to implement OR
 using UNION, so you have to have some way of deciding which is better.


That's easy - you propose both ways to the planner and it's up to it to
decide. Yes, it can decide the wrong way, but then we are back to the
statistics problem. At least one can tune costs and enable_ settings. Now one
has to rewrite the query, which may be impossible or too complex.


Re: [PERFORM] Can Postgres use an INDEX over an OR?

2009-07-27 Thread Віталій Тимчишин
On 27 July 2009 at 17:18, Tom Lane t...@sss.pgh.pa.us wrote:

 Віталій Тимчишин tiv...@gmail.com writes:
  Actually what I am talking about is to make OR with UNION (or UNION-like
  because it's a little different depending on input rows uniqueness) as an
  option. All of OR parts can use/not use different strategies (including
  multiple different idexes or hash joins).

 AFAICS you're proposing re-inventing the old implementation of OR'd
 indexscans.  We took that out when we added bitmap scans because it
 didn't have any performance advantage over BitmapOr.


It's not tied to index scans at all. Different parts can use (as in UNION)
totally different strategies - e.g. perform two hash joins, or a merge join
for one part and a nested loop for another, or ...

As for performance - see above in this thread. UNION now often provides much
better performance when different parts of the OR expression involve different
additional tables.


Re: [PERFORM] Can Postgres use an INDEX over an OR?

2009-07-20 Thread Віталій Тимчишин
2009/7/20 Robert James srobertja...@gmail.com


 Hi. I notice that when I do a WHERE x, Postgres uses an index, and when I
 do WHERE y, it does so as well, but when I do WHERE x OR y, it doesn't. Why
 is this so?


It's not clever enough.

And how can I shut this off?


Use UNION/UNION ALL if possible in your case.
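The generic shape of that rewrite, as a sketch (columns are placeholders):

    -- instead of: SELECT * FROM t WHERE a = 1 OR b = 2;
    SELECT * FROM t WHERE a = 1
    UNION
    SELECT * FROM t WHERE b = 2;
    -- each branch can use its own index on t(a) / t(b); UNION ALL is
    -- cheaper when the two branches can never return the same row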


Re: [PERFORM] Can Postgres use an INDEX over an OR?

2009-07-20 Thread Віталій Тимчишин
On 20 July 2009 at 11:02, Chris dmag...@gmail.com wrote:

 Віталій Тимчишин wrote:



 2009/7/20 Robert James srobertja...@gmail.com


Hi. I notice that when I do a WHERE x, Postgres uses an index, and
when I do WHERE y, it does so as well, but when I do WHERE x OR y,
it doesn't. Why is this so?

 It's not clever enough.


 Of course it is.


For simple cases



 I'm running 8.3.7.

 create table t1(id int primary key);
 insert into t1(id) select a from generate_series(1, 500000) as s(a);
 analyze t1;


explain analyze select * from t1 where
id < 10000

Index Scan using t1_pkey on t1  (cost=0.00..322.51 rows=9612 width=4)
(actual time=0.030..3.700 rows=9999 loops=1)
  Index Cond: (id < 10000)
Total runtime: 4.835 ms

explain analyze select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))

Nested Loop  (cost=32.50..1341.49 rows=200 width=4) (actual
time=15.353..67.014 rows=11 loops=1)
  ->  HashAggregate  (cost=32.50..34.50 rows=200 width=4) (actual
time=0.028..0.043 rows=11 loops=1)
        ->  Function Scan on generate_series  (cost=0.00..20.00 rows=1000
width=0) (actual time=0.014..0.020 rows=11 loops=1)
  ->  Index Scan using t1_pkey on t1  (cost=0.00..6.52 rows=1 width=4)
(actual time=6.083..6.084 rows=1 loops=11)
        Index Cond: (t1.id = (((random() * 500000::double
precision))::integer))
Total runtime: 67.070 ms

explain analyze select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))
or
id < 10000

Seq Scan on t1  (cost=22.50..9735.50 rows=254806 width=4) (actual
time=0.049..148.947 rows=10010 loops=1)
  Filter: ((hashed subplan) OR (id < 10000))
  SubPlan
    ->  Function Scan on generate_series  (cost=0.00..20.00 rows=1000
width=0) (actual time=0.014..0.019 rows=11 loops=1)
Total runtime: 150.123 ms

explain analyze
select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))
union
select * from t1 where
id < 10000

Unique  (cost=2412.68..2461.74 rows=9812 width=4) (actual
time=89.190..95.014 rows=10010 loops=1)
  ->  Sort  (cost=2412.68..2437.21 rows=9812 width=4) (actual
time=89.189..91.167 rows=10010 loops=1)
        Sort Key: public.t1.id
        Sort Method:  quicksort  Memory: 854kB
        ->  Append  (cost=32.50..1762.13 rows=9812 width=4) (actual
time=16.641..76.338 rows=10010 loops=1)
              ->  Nested Loop  (cost=32.50..1341.49 rows=200 width=4)
(actual time=16.641..70.051 rows=11 loops=1)
                    ->  HashAggregate  (cost=32.50..34.50 rows=200 width=4)
(actual time=0.033..0.049 rows=11 loops=1)
                          ->  Function Scan on generate_series
(cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.026 rows=11
loops=1)
                    ->  Index Scan using t1_pkey on t1  (cost=0.00..6.52
rows=1 width=4) (actual time=6.359..6.361 rows=1 loops=11)
                          Index Cond: (public.t1.id = (((random() *
500000::double precision))::integer))
              ->  Index Scan using t1_pkey on t1  (cost=0.00..322.51
rows=9612 width=4) (actual time=0.023..4.075 rows=9999 loops=1)
                    Index Cond: (id < 10000)
Total runtime: 112.694 ms

So, as soon as it encounters anything complex, it sadly falls back to a
sequential scan.


Re: [PERFORM] - Slow Query

2009-07-03 Thread Віталій Тимчишин
Sorry, it was an error in previous letter.

On 3 July 2009 at 14:22, Віталій Тимчишин tiv...@gmail.com wrote:



 2009/7/1 Mike Ivanov mi...@activestate.com



  LEFT outer JOIN ville ON ville.uid = bien.ref_ville
  LEFT outer JOIN freguesia_ville ON freguesia_ville.ref_ville =ville.uid

 This is not enough. You have to add this condition as well:

 AND bien.ref_ville = freguesia_ville.ref_ville

 In other words, when you link three tables by a common field, all three
 relationships should be explicitly expressed, otherwise you'll have this
 type of explosive row multiplication.


 Why so? Isn't changing freguesia_ville.ref_ville = ville.uid to
 freguesia_ville.ref_ville = bien.ref_ville enough (to prevent cases when
 ville.uid is null as a result of the join)?





Re: [PERFORM] Pointers needed on optimizing slow SQL statements

2009-06-09 Thread Віталій Тимчишин
I'd prefer ALTER VIEW name SET ANALYZE=true; or a CREATE/DROP ANALYZE
statement. Also it should be possible to change the statistics target for the
analyzed columns.

Such a statement would allow analyzing multi-table correlations. Note that for
a view the planner should be able to use the correlation information even for
queries that do not use the view but may benefit from the information.


Re: [PERFORM] Unexpected query plan results

2009-06-02 Thread Віталій Тимчишин
2009/6/2 Robert Haas robertmh...@gmail.com

 On Mon, Jun 1, 2009 at 4:53 PM, Anne Rosset aros...@collab.net wrote:
  On Mon, Jun 1, 2009 at 2:14 PM, Anne Rosset aros...@collab.net wrote:
  SELECT SUM(1) FROM item WHERE is_deleted = 'f'; sum - 1824592
 (1
  row)
  SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
  /sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701'; sum
 
  122412 (1 row)
  SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
  /sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701' AND
 is_deleted
  =
  'f'; sum - 71 (1 row)
  SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
  /sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701' AND
 is_deleted
  =
  't'; sum  122341 (1 row)
 
  The item table has 2324829 rows

 So 1824592/2324829 = 78.4% of the rows have is_deleted = false, and
 5.27% of the rows have the relevant folder_id.  Therefore the
 planner assumes that there will be 2324829 * 78.4% * 5.27% =~
 96,000 rows that satisfy both criteria (the original explain had
 97,000; there's some variability due to the fact that the analyze only
 samples a random subset of pages), but the real number is 71, leading
 it to make a very bad decision.  This is a classic hidden
 correlation problem, where two columns are correlated but the planner
 doesn't notice, and you get a terrible plan.

 Unfortunately, I'm not aware of any real good solution to this
 problem.  The two obvious approaches are multi-column statistics and
 planner hints; PostgreSQL supports neither.


How about a partial index (create index idx on item(folder_id) where not
is_deleted)? Won't it have the required statistics (even if it is not used in
the plan)?


Re: [PERFORM] Very inefficient query plan with disjunction in WHERE clause

2009-06-01 Thread Віталій Тимчишин
2009/6/1 Koen Martens pg...@metro.cx


 Now, when I split up the OR in two distinct queries, everything is nice and
 fast. Both queries run in sub-second time.


Hi.

PostgreSQL simply does not like ORs (it can't use indexes in this case), so
UNION/UNION ALL is your friend.

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Nested query performance issue

2009-04-09 Thread Віталій Тимчишин
OK, got to my postgres. Here you are:

create or replace function explode_array(in_array anyarray) returns setof
anyelement as
$$
select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
$$
language sql immutable;

SELECT s.* FROM score s
WHERE s.id IN (
  select
  -- Get the high scoring score ID for each game:
  explode_array(ARRAY(
  -- Get the high score for game g:
  SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 5
  ))
  FROM game g
);

It takes ~64ms for me

Best regards, Vitaliy Tymchyshyn


Re: [PERFORM] Nested query performance issue

2009-04-08 Thread Віталій Тимчишин
2009/4/9 Glenn Maynard glennfmayn...@gmail.com

 (This is related to an earlier post on -sql.)

 I'm querying for the N high scores for each game, with two tables:
 scores and games.

 CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY);
 CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
 game_id INTEGER REFERENCES game (id));
 -- test data: 1000 games, 100000 scores
 INSERT INTO game (id) select generate_series(1,1000);
 INSERT INTO score (game_id, score) select game.id, random() from game,
 generate_series(1,100);
 CREATE INDEX score_idx1 ON score (game_id, score desc);
 ANALYZE;


How about

select s1.*
from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >=
s1.score
group by s1.id, s1.game_id, s1.score
having count(s2.*) <= N

Note: you can have problems if you have equal scores - you will lose the last
groups that overlap N

In any case, you don't need to join game since all you need is game_id you
already have in score.
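As a side note, if 8.4's windowing functions are available to you, the same
top-N per game can be written directly (a sketch on the thread's tables, N=5):

    SELECT id, game_id, score
    FROM (
        SELECT id, game_id, score,
               rank() OVER (PARTITION BY game_id ORDER BY score DESC) AS r
        FROM score
    ) ranked
    WHERE r <= 5;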

P.S. EXPLAIN ANALYZE could help

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Very specialised query

2009-03-31 Thread Віталій Тимчишин


 The outer nested join has the VALUES as the main loop, and the complicated
 join as the leaf. So, the complicated overlap-finding join gets run twice.


That's weird. What do you have as the statistics target? The planner is off by
a few orders of magnitude, so increasing it may help.
BTW: One of the constraints is redundant: l1.start <= l2.start implies
l1.start <= l2.end, so the latter can be removed as far as I can see.




 Oh, there's also the great big sort and unique, but I think I can get rid
 of that.


As far as I can see, duplicates will occur if and only if l1.start = l2.start
and l1.end = l2.end.
That can be easily filtered by adding where n=1 or l1.start != l2.start or
l1.end != l2.end to the outer select.


Re: [PERFORM] Very specialised query

2009-03-30 Thread Віталій Тимчишин
Hi.

Look, what I meant by symmetric is that you don't need to make the second
part of the query, because you will get just the same results simply by

select
case when n = 1 then id1 else id2 end,
case when n = 2 then id1 else id2 end

from (
SELECT
   l1.id AS id1,
   l2.id AS id2
FROM
   location l1,
   location l2
WHERE
   l1.objectid = 22893
   AND l2.objectid = 22893
   AND l1.id <> l2.id
   AND l1.start < l2.end
   AND l1.end > l2.start
   AND l1.start < l2.start) a, (values (1),(2)) b(n)

(I may miss some border cases like when l1.start=l2.start and/or
l1.end=l2.end, but this can be fixed by adding <= to the query).

Look,  You can have 4 types of intersections:
a)  1s 2s 2e 1e - 2 inside 1
b)  2s 1s 1e 2e - 1 inside 2 (symmetric to (a), if you have 1,2 from (a) you
can generate 2,1 for (b))
c)  1s 2s 1e 2e - 1 to the left of 2
d)  2s 1s 2e 1e - 2 to the left of 1 (symmetric to (c), if you have 1,2 from
(c) you can generate 2,1 for (d))

The query above gives you results for (a) and (c), and you don't need any
second part - simply add the symmetric results.

Correct me if I miss something.

Best Regards, Vitalii Tymchyshyn


Re: [PERFORM] Very specialised query

2009-03-30 Thread Віталій Тимчишин


 Yeah, that's nice.

 However, it is still the case that we can't trust the database to choose
 the correct plan. It is currently only choosing the correct plan now by
 chance, and some time later it may by chance switch to one that takes 40
 minutes.


What is the bad plan? Is it like the first plan from your first message?
You can sometimes tweak the optimizer to make sure it uses the correct plan.
E.g. when your database fits in memory, you can tweak page access costs. Also
don't forget to raise the statistics target.

BTW: About aggregates: they can return arrays, but I can't imagine what you
could group by on... Maybe windowing functions from 8.4 could help.

Also, if your maximum length (select max(end-start) from location) is low
enough, you can try adding some more constraints to make the optimizer happy
(giving it a more precise row count so it selects the correct plan).


Re: [PERFORM] Very specialised query

2009-03-27 Thread Віталій Тимчишин
Hello.

You could try adding AND l2.start > l1.start to the first query. This will
drop the symmetric half of the intersections (the ones that will remain are l2
inside or to the left of l1), but you can rebuild the full result by
id1,id2 union all id2,id1, and it may allow using the start index for between;
for my similar test this looks like the next:

  ->  Index Scan using location__start on location l2  (cost=0.00..756.34
rows=37787 width=12)
        Index Cond: ((l2.start < l1.end) AND (l2.start > l1.start))

also an index on (objectid, start) would help, resulting in:

  ->  Index Scan using lt on location l2  (cost=0.00..0.84 rows=20
width=16)
        Index Cond: ((l2.objectid = l1.objectid) AND (l2.start < l1.end)
AND (l2.start > l1.start))

Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] Query much slower when run from postgres function

2009-03-16 Thread Віталій Тимчишин
2009/3/14 decibel deci...@decibel.org

 On Mar 10, 2009, at 12:20 PM, Tom Lane wrote:

 f...@redhat.com (Frank Ch. Eigler) writes:

 For a prepared statement, could the planner produce *several* plans,
 if it guesses great sensitivity to the parameter values?  Then it
 could choose amongst them at run time.


 We've discussed that in the past.  Choose at runtime is a bit more
 easily said than done though --- you can't readily flip between plan
 choices part way through, if you've already emitted some result rows.


 True, but what if we planned for both high and low cardinality cases,
 assuming that pg_stats indicated both were a possibility? We would have to
 store multiple plans for one prepared statement, which wouldn't work well
 for more complex queries (if you did high and low cardinality estimates for
 each table you'd end up with 2^r plans, where r is the number of relations),
 so we'd need a way to cap it somehow. Of course, whether that's easier than
 having the ability to throw out a current result set and start over with a
 different plan is up for debate...

 On a related note, I wish there was a way to tell plpgsql not to pre-plan a
 query. Sure, you can use EXECUTE, but building the query plan is a serious
 pain in the rear.


I'd say it would be great for PostgreSQL to replan each execution of a query
automatically if the execution plan says it would take some factor (say, x100,
configurable) more time to execute the query than to plan it. In this case it
would not spend much time planning for small queries, but would use the most
efficient plan possible for long queries. And even if a query can't be run any
better, it would spend only 1/factor more time (1% more time for a factor of
100).


Re: [PERFORM] PostgreSQL NOT IN performance

2008-11-19 Thread Віталій Тимчишин
2008/11/19 Stephan Szabo [EMAIL PROTECTED]


 On Wed, 19 Nov 2008, Віталій Тимчишин wrote:

  Query 1:
  select * from t1 where id not in (select id from t2);
 
  Query 2 (gives same result as Q1):
  select * from t1 except all (select id from t2);

 It gives the same result as long as no nulls are in either table. If
 either table can have a null, the conversion changes the results.

 In addition, a conversion like the above only happens to work because t1
 only has an id column. If t1 had two columns you'd get an error because
 the two sides of except all must have the same number of columns.


Actually it can be done even in the multi-column case if the selection is done
on a unique key. It would look like:

select * from t1 inner join (
select id from t1 except select id from t2) talias on t1.id = talias.id

And it would produce better results than NOT IN for large counts in t1 and
t2.


Re: [PERFORM] PostgreSQL NOT IN performance

2008-11-19 Thread Віталій Тимчишин
2008/11/19 DANIEL CRISTIAN CRUZ [EMAIL PROTECTED]

 Something weird with your example which doesn't have the same result, see
 row count with explain analyze:

My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be
used only when the operation is done on a unique key of t1 or the result is
going to be made unique.

 cruz=# SELECT version();
   version
 
  PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 
 4.3.2-1) 4.3.2
 (1 registro)

 cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);
 QUERY PLAN
 --
  Seq Scan on t1  (cost=1643.00..4928.00 rows=100000 width=4) (actual 
 time=256.687..585.774 rows=73653 loops=1)
    Filter: (NOT (hashed subplan))
    SubPlan
      ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual 
 time=0.052..86.867 rows=100000 loops=1)
  Total runtime: 625.471 ms
 (5 registros)

 cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);
QUERY PLAN
 -
  SetOp Except All  (cost=34469.90..35969.90 rows=30000 width=4) (actual 
 time=2598.574..3663.712 rows=126733 loops=1)
    ->  Sort  (cost=34469.90..35219.90 rows=300000 width=4) (actual 
 time=2598.550..3178.387 rows=300000 loops=1)
          Sort Key: *SELECT* 1.id
          Sort Method:  external merge  Disk: 5864kB
          ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual 
 time=0.037..1026.367 rows=300000 loops=1)
                ->  Subquery Scan *SELECT* 1  (cost=0.00..4785.00 
 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                      ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 
 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
                ->  Subquery Scan *SELECT* 2  (cost=0.00..2393.00 
 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                      ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 
 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
  Total runtime: 3790.831 ms
 (10 registros)

 Sometimes I got a better result (on older versions) with this kind of
 query, but in this case it doesn't:

 cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE 
 t2.id IS NULL;
   QUERY PLAN
 ---
  Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual 
 time=850.232..1671.091 rows=73653 loops=1)
    Merge Cond: (t2.id = t1.id)
    Filter: (t2.id IS NULL)
    ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual 
 time=266.501..372.560 rows=100000 loops=1)
          Sort Key: t2.id
          Sort Method:  quicksort  Memory: 4392kB
          ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual 
 time=0.029..78.087 rows=100000 loops=1)
    ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual 
 time=583.699..855.427 rows=273364 loops=1)
          Sort Key: t1.id
          Sort Method:  quicksort  Memory: 8784kB
          ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual 
 time=0.087..155.665 rows=200000 loops=1)
  Total runtime: 1717.062 ms
 (12 registros)


Yes, your method is even better on the 8.3.3 I have. I will try to update to
8.3.5 to see if there were optimizer improvements. You could try increasing
the values, say, by 10 in the table filling to see if NOT IN switches to the
slow version (for me it starts being slow from some magic row count in t2). I
suppose it is the moment it switches from a hashed subplan to a plain subplan.
For me, with 10000 values it is a hashed subplan (and it is momentarily fast);
with 100000 it is a subplan and it is sloow.
BTW: Which (memory?) configuration variable can affect such a switch?


Re: [PERFORM] PostgreSQL OR performance

2008-11-17 Thread Віталій Тимчишин
2008/11/15 Tom Lane [EMAIL PROTECTED]

 Віталій Тимчишин [EMAIL PROTECTED] writes:
  I am not. I can't see how materialize can multiply number of rows it gets
  from sort by 100.

 Is it the right-hand input of a merge join?  If so you're looking at
 mark/restore rescans, ie, repeated fetches of the same tuples.  There
 must be a huge number of duplicate join keys in that relation to make
 for such an increase though.  Normally the planner avoids putting a
 table with lots of duplicates as the RHS of a merge, but if it doesn't
 have good statistics for the join key then it might not realize the
 problem.


OK, thanks for clearing up some of the mystery.
But, returning to the original question: does anyone know why it chooses the
plan from *OR-plan.txt* instead of *union-plan.txt*? The first is
cost=4588.13..960900482668.95, the latter is cost=266348.42..272953.14
according to the statistics postgres has, so I suppose the planner would
select it if it could evaluate it.


Re: [PERFORM] PostgreSQL OR performance

2008-11-15 Thread Віталій Тимчишин
Sorry for the delayed response - it was a very busy week.

2008/11/7 David Wilson [EMAIL PROTECTED]

 On Fri, Nov 7, 2008 at 4:14 AM, Віталій Тимчишин [EMAIL PROTECTED] wrote:
  Merge Join  (cost=518771.07..62884559.80 rows=1386158171 width=32)
 (actual
  time=30292.802..755751.242 rows=34749 loops=1)

 Have you tried increasing the default_statistics_target? The planner
 is expecting 1.3 billion rows to be produced from a query that's only
 actually producting 35k, which probably indicates some very bad
 statistics.


 The planner seems to think that every second pair from the company-company
join will succeed with this join expression (1386158171 ~ 52648^2 / 2).
That is not true.
Anyway, I've tried to set default_statistics_target to 1000, then analyze.
Nothing changed.

At the same time, the materialize step produces 242
 million rows when the planner only expects to produce 2.3, indicating
 a similar problem in the opposite direction. This probably means that
 the planner is choosing plans that would be optimal if it was making
 good guesses but are decidedly sub-optimal for your actual data.


That is even stranger, because the materialize step must produce exactly the
rows it takes from the sort, that is 2316503, so I don't get how table scan +
sort + materialize can multiply the number of rows by 100.


Re: [PERFORM] PostgreSQL OR performance

2008-11-06 Thread Віталій Тимчишин
My main message is that I can see this in many queries and many times. But OK,
I can present an exact example.

2008/11/5 Jeff Davis [EMAIL PROTECTED]

 On Wed, 2008-11-05 at 13:12 +0200, Віталій Тимчишин wrote:
  For a long time already I can see very poor OR performance in
  postgres.
  If one have query like select something from table where condition1
  or condition2 it may take ages to execute while
  select something from table where condition1 and select something
  from table where condition2 are executed very fast and
  select something from table where condition1 and not condition2 union
  all select something from table where condition2 gives required
  results fast
 

 What version are you using?


Server version 8.3.3




 Have you run VACUUM ANALYZE?


I have autovacuum, but for this example I did vacuum analyze of the whole
DB.
The real-life query (autogenerated) looks like the next:
select t0.id as pk1,t1.id as pk2 ,t0.run_id as f1_run_id,t1.run_id as
f2_run_id
from tmpv_unproc_null_production_company_dup_cons_company as t0, (select *
from production.company where run_id in (select id from production.run where
name='test')) as t1
where
t0.name = t1.name
or
(t0.name,t1.name) in (select s1.name, s2.name from atom_match inner join
atoms_string s1 on atom_match.atom1_id = s1.id  inner join atoms_string s2
on atom_match.atom2_id = s2.id where s1.atom_type_id = -1 and
match_function_id = 2)

with tmpv_unproc_null_production_company_dup_cons_company:

create temporary view tmpv_unproc_null_production_company_dup_cons_company
as select * from production.company where 1=1 and status='unprocessed' and
run_id in (select id from production.run where name='test')



 Next, do:

 EXPLAIN ANALYZE select something from table where condition1 or
 condition2;


The plan without analyze is in OR-plan.txt.
Plans for only condition1, only condition2 and the union are attached as well.
Nested Loop  (cost=4588.13..960900482668.95 rows=1386158171 width=32)
  Join Filter: (((production.company.name)::text = 
(production.company.name)::text) OR (subplan))
  ->  Hash IN Join  (cost=1.56..73814.22 rows=52648 width=30)
        Hash Cond: ((production.company.run_id)::bigint = production.run.id)
        ->  Seq Scan on company  (cost=0.00..64599.29 rows=2316503 width=30)
              Filter: ((status)::text = 'unprocessed'::text)
        ->  Hash  (cost=1.55..1.55 rows=1 width=8)
              ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8)
                    Filter: ((name)::text = 'test'::text)
  ->  Nested Loop  (cost=1183.27..39219.67 rows=52648 width=30)
        ->  HashAggregate  (cost=1.55..1.56 rows=1 width=8)
              ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8)
                    Filter: ((name)::text = 'test'::text)
        ->  Bitmap Heap Scan on company  (cost=1181.72..38435.51 rows=62608 
width=30)
              Recheck Cond: ((production.company.run_id)::bigint = 
production.run.id)
              ->  Bitmap Index Scan on comp_run  (cost=0.00..1166.07 
rows=62608 width=0)
                    Index Cond: ((production.company.run_id)::bigint = 
production.run.id)
  SubPlan
    ->  Materialize  (cost=3403.29..4005.74 rows=35745 width=28)
          ->  Hash Join  (cost=928.57..3122.55 rows=35745 width=28)
                Hash Cond: ((atom_match.atom1_id)::integer = s1.id)
                ->  Hash Join  (cost=445.80..1880.19 rows=35745 width=18)
                      Hash Cond: ((atom_match.atom2_id)::integer = s2.id)
                      ->  Seq Scan on atom_match  (cost=0.00..674.81 
rows=35745 width=8)
                            Filter: ((match_function_id)::integer = 2)
                      ->  Hash  (cost=260.91..260.91 rows=14791 width=18)
                            ->  Seq Scan on atoms_string s2  
(cost=0.00..260.91 rows=14791 width=18)
                ->  Hash  (cost=297.89..297.89 rows=14791 width=18)
                      ->  Seq Scan on atoms_string s1  (cost=0.00..297.89 
rows=14791 width=18)
                            Filter: ((atom_type_id)::integer = (-1))

Merge Join  (cost=89373.23..97526.15 rows=525975 width=32) (actual 
time=276.869..523.669 rows=34749 loops=1)
  Merge Cond: ((production.company.name)::text = 
(production.company.name)::text)
  ->  Sort  (cost=44764.87..44896.49 rows=52648 width=30) (actual 
time=120.036..144.925 rows=15507 loops=1)
        Sort Key: production.company.name
        Sort Method:  external merge  Disk: 704kB
        ->  Nested Loop  (cost=1183.27..39376.19 rows=52648 width=30) (actual 
time=1.898..72.693 rows=15507 loops=1)
              ->  HashAggregate  (cost=1.55..1.56 rows=1 width=8) (actual 
time=0.020..0.022 rows=1 loops=1)
                    ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8) 
(actual time=0.010..0.012 rows=1 loops=1)
                          Filter: ((name)::text = 'test'::text)
              ->  Bitmap Heap Scan on company  (cost=1181.72..38592.03 
rows=62608 width=30) (actual time=1.873..29.251 rows=15507 loops=1)
                    Recheck

Re: [PERFORM] PostgreSQL OR performance

2008-11-06 Thread Віталій Тимчишин
2008/11/6 Helio Campos Mello de Andrade [EMAIL PROTECTED]

 For what I see in your OR-plan.txt you are doing too much sequential
 scanning. Create some indexes for those tables using the fields that you use
 and it may help you.

 OBS: If you already have lots of indexes in your tables it may be a good
 time for you to re-think your strategy because it's not working.
 Tips:
   1 - create indexes for the tables with the fields that you will use in
 the query if it is your most important query. If you have other queries that
 are used please post those here and we can help you to design a better plan.


As you can see from the other plans, it does have all the indexes to perform
its work fast (when given part by part). It simply does not wish to use them.
My question: is this a configuration problem, or can the postgresql optimizer
simply not do such a query rewrite?

Actually I did rewrite the query to work properly, as you can see from
union-plan.txt. My question is whether postgresql can do this automatically,
because such a rewrite is not always easy/possible (esp. for generated
queries).


Re: [PERFORM] PostgreSQL OR performance

2008-11-06 Thread Віталій Тимчишин
2008/11/6 Richard Huxton [EMAIL PROTECTED]

 Віталій Тимчишин wrote:
  As you can see from other plans, it do have all the indexes to perform
 it's
  work fast (when given part by part). It simply do not wish to use them.
 My
  question: Is this a configuration problem or postgresql optimizer simply
  can't do such a query rewrite?

 I must admit, I haven't managed to figure out what your query is trying
 to do, but then that's a common problem with autogenerated queries.


That's easy - I am looking for duplicates within a subset of companies. Two
companies are equal when their names are simply equal or there is an entry in
the match table for the names.




 The main question that needs answering is why the planner thinks you're
 going to get 1.3 billion rows in the or query:

 Nested Loop  (cost=4588.13..960900482668.95 rows=1386158171 width=32)

 You don't show explain analyse for this query, so there's no way of
 knowing how many rows get returned but presumably you're expecting
 around 88000. What does explain analyse return?


Yes, the query should output exactly the same result as in the Union plan. I
will run the slow explain analyze now and will repost after it completes
(tomorrow?).
BTW: I'd say the planner should estimate the rows as the sum of the ORs'
estimations minus the intersection, but no more than the sum of the ORs (if
the intersection is 0). For the first condition it has rows=525975, for the
second it has rows=2403 (with other plans, of course), so it's strange it has
such a high estimation. It's exactly 50% of the full cartesian join of the
merge, so it does think that every second pair would succeed, which is not
true.


[PERFORM] PostgreSQL OR performance

2008-11-05 Thread Віталій Тимчишин
Hello.

For a long time already I have seen very poor OR performance in postgres.
If one has a query like select something from table where condition1 or
condition2 it may take ages to execute, while
select something from table where condition1 and select something from
table where condition2 are executed very fast, and
select something from table where condition1 and not condition2 union all
select something from table where condition2 gives the required results fast.

For example, in my current query for condition1 the optimizer gives 88252, for
condition1 and not condition2 it is 88258, for condition2 it is 99814.
And for condition1 or condition2 it is 961499627680. And it does perform
this way.

All is more or less good when the select part is easy and the query can be
easily rewritten. But for complex queries it looks ugly, and if the query is
autogenerated, moving the autogeneration mechanism from creating a simple
clean where to unions is not an easy task.

So the question is: do I miss something? Can this be optimized?


Re: [PERFORM] Index bloat, reindex weekly, suggestions etc?

2008-10-18 Thread Віталій Тимчишин
2008/10/17 Tory M Blue [EMAIL PROTECTED]


 The real issue is my index growth and my requirement for weekly
 re-indexing  (which blocks and therefore is more or less a manual
 process in a live production environment (fail over, changing vips
 etc).


BTW: Can't you simply recreate the indexes online? Since postgresql accepts
multiple indexes with the same definition, this may look like:
1) create index concurrently index_alt
2) analyze index_alt
3) drop index_orig
with both index_alt and index_orig having the same definition
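In concrete SQL, a sketch (index, table and column names are hypothetical):

    CREATE INDEX CONCURRENTLY idx_alt ON my_table (my_col);  -- no write lock
    ANALYZE my_table;
    DROP INDEX idx_orig;  -- the bloated original with the same definition
    ALTER INDEX idx_alt RENAME TO idx_orig;  -- optional: keep the old name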


Re: [PERFORM] Delete performance again

2008-10-10 Thread Віталій Тимчишин
BTW: I have just tried the clean (without any foreign key constraints)
performance of

delete from tbl where field not in (select);

vs

create temporary table tmp(id) as select distinct field from tbl;
delete from tmp where id in (select);
delete from tbl where field in (select id from tmp);

Both tbl and the select are huge.
tbl cardinality is ~5 million, the select is ~1 million. The number of records
to delete is small.
The select is simply select id from table2.

The first (simple) one could not finish in a night; the second did it in a few
seconds.


Re: [PERFORM] Delete performance again

2008-10-09 Thread Віталій Тимчишин
OK, I did try your proposal and a correlated subselect.
I have a database of ~900000 companies.
The first try was to remove 1000 randomly selected companies:
Uncorrelated subselect: 65899ms
Correlated subselect: 97467ms
using: 9605ms
my way: 104979ms (without constraints recreate)
Mine is the worst because it is oriented on massive deletes.
So I thought USING would perform better, and I tried 10000 companies:
my way: 190527ms (without constraints recreate)
using: 694144ms
I was a little shocked, but I checked the plans and found out that it had
switched from Nested Loop to Hash Join.
I disabled Hash Join, and it then showed Merge Join. This was also disabled,
and I got 747253ms.
Then I tried combinations: without hash join it was the best result of
402629ms; without merge join it was 1096116ms.

My conclusion: until the optimizer takes into account the additional actions
needed (like constraint checks/cascade deletes/triggers), it can not make a
good plan.


[PERFORM] Delete performance again

2008-10-02 Thread Віталій Тимчишин
Hello.

I have a database with a company table that has a bunch of related
(delete=cascade) tables.
It also has a 1-M relation to the company_descr table.
At some point we found that ~half of our companies do not have any description
and we would like to remove them.
The first thing I tried was
delete from company where id not in (select company_id from company_descr);
I've tried to analyze the command, but unlike other RDBMSes I've used, it did
not include the cascade deletes/checks in the query plan. That is the first
problem.
It was SLOW. To make it faster I did the next thing:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company where id in (select id from comprm);

That was much better. So the question is why postgresql can't do such a
thing itself.
But it was better only while the dataset to remove was small (~5% of the
whole table).
As soon as I tried to remove 50% I got speed problems. I've ensured I have
all the indexes on both ends of the foreign keys.
I've tried to remove all the cascaded entries myself, e.g.:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company_alias where company_id in (select id from comprm);
...
delete from company where id in (select id from comprm);

It did not help until I dropped all the constraints before and recreated them
after.
Now it runs in 15 minutes, while previously it could not finish in a day.

Is this OK? I'd say some (if not all) of these optimizations could be done by
the postgresql optimizer.


Re: [PERFORM] Delete performance again

2008-10-02 Thread Віталій Тимчишин
2008/10/2 Tom Lane [EMAIL PROTECTED]

 =?ISO-8859-5?B?svbi0Nv22SDC2Nzn2OjY3Q==?= [EMAIL PROTECTED] writes:
  delete from company where id not in (select company_id from
 company_descr);
  I've tried to analyze command, but unlike to other RDBM I've used it did
 not
  include cascade deletes/checks into query plan. That is first problem.
  It was SLOW.

 Usually the reason for that is having forgotten to make an index on the
 referencing column(s) ?


Not at all. As you can see below in the original message, simply extending the
query to what should have been done by the optimizer helps. I'd say the
optimizer always uses a fixed plan, not taking into account that this is a
massive update, and does an index lookup of child records for each parent
record, while it would be much more effective to perform the removal of all
child records in a single table scan.

It's like a per-record trigger instead of a per-statement one.