Re: [PERFORM] postgres performance

2013-12-06 Thread Richard Huxton

On 06/12/13 17:36, chidamparam muthusamy wrote:

I rather think Alan is right - you either want a lot more RAM or faster 
disks. Have a look at your first query...



Query:
EXPLAIN (analyze, buffers) SELECT text(client) as client, text(gateway)
as gateway,count(*)::bigint as total_calls, (avg(duration)/1000.0)
::numeric(10,2) as acd, (avg(pdd)) ::numeric(10,2) as pdd,
sum(call_duration_recv)/1000.0 as duration_recv,
sum(call_duration_pay)/1000.0 as duration_pay, sum(call_amount_recv) as
call_amount_recv, sum(call_amount_pay) as call_amount_
pay FROM detailed_report WHERE end_time >= '2013-05-01 00:00' and
end_time < '2013-07-01 00:00' and group_id='admin' and client ='CHOICE'
GROUP by client, gateway ORDER BY call_amount_recv DESC;



QUERY PLAN
--
Sort (cost=3422863.06..3422868.69 rows=2254 width=44) (actual
time=137852.474..137852.474 rows=5 loops=1)
Sort Key: (sum(call_amount_recv))
Sort Method: quicksort Memory: 25kB
Buffers: shared read=2491664



->  HashAggregate (cost=3422664.28..3422737.53 rows=2254 width=44)
(actual time=137852.402..137852.454 rows=5 loops=1)
Buffers: shared read=2491664



->  Bitmap Heap Scan on detailed_report (cost=644828.11..3399506.87
rows=1029218 width=44) (actual time=4499.558..125443.122 rows=5248227
loops=1)
Recheck Cond: ((end_time >= '2013-05-01 00:00:00+00'::timestamp with
time zone) AND (end_time < '2013-07-01 00:00:00+00'::timestamp with time
zone) AND ((group_id)::text = 'admin'::text) AND ((client)::text = 'CHOICE'::text))
Buffers: shared read=2491664



->  Bitmap Index Scan on endtime_groupid_client_tsidx_detail_report
(cost=0.00..644570.81 rows=1029218 width=0) (actual
time=3418.754..3418.754 rows=5248227 loops=1)
Index Cond: ((end_time >= '2013-05-01 00:00:00+00'::timestamp with time
zone) AND (end_time < '2013-07-01 00:00:00+00'::timestamp with time
zone) AND ((group_id)::text = 'admin'::text) AND ((client)::text = 'CHOICE'::text))
Buffers: shared read=95055



Total runtime: 137868.946 ms
(13 rows)


The index is being used, but most of your time is going on the Bitmap 
Heap Scan. You're processing 5.2 million rows in about 120 seconds - 
that's about 43 rows per millisecond - not too bad. It's not getting any 
cache hits though, it's having to read all the blocks. Looking at the 
number of blocks, that's ~2.5 million at 8KB each or about 20GB. You 
just don't have the RAM to cache that.


If you have lots of similar reporting queries to run, you might get away 
with dropping the index and letting them run in parallel. Each 
individual query would be slow but they should be smart enough to share 
each other's sequential scans - the disks would basically be looping 
through your data continuously.
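
That relies on synchronized sequential scans, which have been on by 
default since 8.3. A quick check, just to be sure:

SHOW synchronize_seqscans;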


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.2.4 very slow on laptop with windows 8

2013-08-23 Thread Richard Huxton

On 22/08/13 12:30, girish subbaramu wrote:


I am running PostgreSQL 9.2.4 on windows 8  , 64 bit operating system ,
4GB RAM.



A laptop with i3 - 3110M , 2.4 GHZ .
The database  came bundled with wapp stack 5.4.17-0. We have an php
application that serves data from PostgreSQL 9.2.4.

The configuration runs with very good performance (3 sec response php +
db ) on windows 7   32, 64 bit OS , 4GB RAM (desktops with i3-2100 3.10
GHZ ) .
But take around 25 seconds to render on windows 8 , the laptop.

I was able to eliminate php , as the performance was as expected.
(without DB calls)
On the other part the database calls take more than 100 ms for simple
queries (Example: a table with just 10 rows sometimes takes around 126
ms).  This information I was able to collect from the pg_log.


First step - check the antivirus / security tools aren't interfering. 
That can slow you down immensely.


Second step - have a quick look in your performance monitoring (you can 
get to it through


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to properly index hstore tags column to faster search for keys

2013-07-08 Thread Richard Huxton

On 07/07/13 08:28, Radu-Stefan Zugravu wrote:

Each node has a geometry column called geom and a hstore column
called tags. I need to extract nodes along a line that have certain
keys in the tags column. To do that I use the following query:



SELECT id, tags
FROM nodes
WHERE ST_DWithin(nodes.geom, ST_MakeLine('{$geom1}', '{$geom2}'), 0.001)
AND tags ? '{$type}';



CREATE  INDEX  nodes_tags_idx ON nodes USING GIN(tags);

After creating the index I searched again for nodes using the same first
query but there is no change in performance.

How can I properly use GIN and GIST to index tags column so I can faster
search for nodes that have a certain key in tags column?


Your index definition looks OK. Try showing the output of EXPLAIN 
ANALYSE for your query - that way we'll see if the index is being used. 
You can always paste explain output to: http://explain.depesz.com/ if 
it's too long for the email.
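
For example (a sketch reusing the query from your mail - substitute real 
values for the placeholders; the BUFFERS option needs 9.0 or later):

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, tags
FROM nodes
WHERE ST_DWithin(nodes.geom, ST_MakeLine('{$geom1}', '{$geom2}'), 0.001)
  AND tags ? '{$type}';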


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to properly index hstore tags column to faster search for keys

2013-07-08 Thread Richard Huxton

On 08/07/13 10:20, Radu-Stefan Zugravu wrote:

Any improvement is welcomed. The overall performance of the application
is not very good. It takes about 200 seconds to compute a path for start
and end points that are not so far apart.


So you have to call this query 1000 times with different start and end 
points?


 I want to improve this query as much as I can.

There's only two ways I can see to get this much below 20ms. This will 
only work if you want a very restricted range of tags.


Drop the tag index and create multiple geometry indexes instead:

CREATE INDEX node_geo_tourist_idx ON nodes USING GIST (geom) WHERE tags ? 'tourist';
CREATE INDEX node_geo_history_idx ON nodes USING GIST (geom) WHERE tags ? 'history';
etc.

This will only work if you have a literal WHERE clause that checks the 
tag. It should be fast though.
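
For instance (a sketch, assuming the nodes table from your earlier mail), 
writing the key as a literal lets the planner pick the matching partial index:

SELECT id, tags
FROM nodes
WHERE ST_DWithin(nodes.geom, ST_MakeLine('{$geom1}', '{$geom2}'), 0.001)
  AND tags ? 'tourist';   -- literal key, so node_geo_tourist_idx can be used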



The second way would be to delete all the nodes that aren't tagged 
tourist or history. That assumes you are never interested in them of course.



How exactly should I post the explain without the index? Do I have to
drop all created indexes for the tags column? It takes some time to
create them back.


Not important - I was just curious.

--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] WHERE 1 = 2 OR ... makes planner choose a very inefficient plan

2013-04-18 Thread Richard Huxton

On 18/04/13 15:20, dmitry potapov wrote:

Hello,

I recently stumbled upon on what could be a planner bug or a corner
case. If false condition OR ... is added to WHERE clause of SELECT
query, then the planner chooses a very inefficient plan. Consider a query:



If I comment out 1=2 OR, then the plan changes dramatically:


What happens if you substitute:
1. 1=3   OR
2. false OR

--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PsqL8.3

2013-04-12 Thread Richard Huxton

On 11/04/13 22:09, Thiyagarajan, Palaniappan wrote:

All,

We have performance issue in Psql8.3 vacuum run.  The CPU usage going
300+ and application getting slow.  How do we avoid high cpu and
smooth vacuum tables.


I'm afraid this isn't nearly enough information for anyone to help.

1. Full version details, some idea of hardware and database size might 
be useful. Exactly when this happens etc.


2. Is this a manual vacuum or autovacuum?

3. Are you familiar with the manual page regarding the vacuum settings?
http://www.postgresql.org/docs/8.3/static/runtime-config-autovacuum.html

4. If so, what changes have you made?

5. You are aware that 8.3 is end-of-life? And you are running 8.3.23 
until you upgrade, aren't you?

http://www.postgresql.org/support/versioning/


Typically you'd expect disk i/o to be the limiting factor with vacuum 
rather than CPU. However, it might just be that I've misunderstood your 
description. More details please.


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Fighting the planner :-(

2013-02-01 Thread Richard Huxton

On 01/02/13 17:54, Casey Allen Shobe wrote:
My apologies - I included the wrong version of the query 
before...during  testing I had tried deparameterizing a few of the 
input parameters.  I also accidentally left out the schema for the 
network_config_tot2 table from the initial paste.


Here is an updated paste, which shows the correct query in a prepare 
statements.  The explain plans are from explain execute hewitt_test (...):

http://pgsql.privatepaste.com/00c582c840

Here is the correct explain plan for this statement (still bad):
http://explain.depesz.com/s/c46


Three quick observations before the weekend.

1. You said config_id was now smallint in your email, but it reads 
int in the pastes above.

   Doesn't matter much which, but just checking we've got the right pastes.

2. The total estimated cost of both queries is about the same 
(477,225.19 for the varchar, 447,623.86 for the int).
   This suggests something about your configuration doesn't match the 
performance of your machine, since presumably the int version is taking 
at least twice as long as the varchar one.


3. Interestingly, the config_id search on both plans seems to be using a 
Bitmap Index, so I'm not sure that's the root cause. However, the 
varchar version seems to have a literal string it's matching against. If 
you've manually substituted in a literal value, that could be skewing 
the tests.


And two things for you to try if you would:

1. Can you just check and see if any of the row estimates are horribly 
off for any particular clause in the query?


2. You mention your config settings are mostly at default. What's your 
work_mem and can you increase it? You can issue a SET for the current 
session, no need to change it globally. If you've got the RAM try 
doubling it, then double it again. See what happens to your plan then.
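
For example (only a sketch - pick values that suit your RAM, and keep an 
eye on the total across all sessions):

SET work_mem = '256MB';   -- affects just this session
EXPLAIN EXECUTE hewitt_test (...);
SET work_mem = '512MB';
EXPLAIN EXECUTE hewitt_test (...);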


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Triggers and transactions

2013-01-28 Thread Richard Huxton

On 28/01/13 18:54, Craig James wrote:
If I drop and then recreate a trigger inside of a single transaction, 
how does it affect other processes trying to use the same table?  Can 
they just merrily go along their way using the table, or will they be 
blocked by an exclusive lock?



I *think* it blocks, but in any case, read on...

We have a trigger that detects illegal drugs and dangerous chemicals 
(such as explosives and flammable compounds that can't be shipped by air).


<pedantry mode=full>detects a reference to illegal... (unless you've 
hooked your RDBMS up to some sort of x-ray scanner, in which case I 
salute you sir)</pedantry>


Unfortunately, the trigger's function is necessarily heavyweight and 
slow.


The drop-and-restore-trigger operation is needed when we're copying 
data from one server to another. 


Run the copy as a different user than ordinary applications (a good idea 
anyway). Then the function can just check current_user and exit for the 
copy.
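
Something along these lines (just a sketch - the function and role names 
here are made up):

CREATE OR REPLACE FUNCTION check_hazardous() RETURNS trigger AS $$
BEGIN
    -- bulk copies run as a dedicated role, so skip the expensive checks for them
    IF current_user = 'bulk_copy' THEN
        RETURN NEW;
    END IF;
    -- ... the existing heavyweight checks go here ...
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;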


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Savepoints in transactions for speed?

2012-11-27 Thread Richard Huxton

On 27/11/12 22:04, Mike Blackwell wrote:
I need to delete about 1.5 million records from a table and reload it 
in one transaction.


The data to reload the table is coming from a Perl DBI connection to a 
different database (not PostgreSQL) so I'm not sure the COPY 
alternative applies here.

No reason why it shouldn't.

https://metacpan.org/module/DBD::Pg#COPY-support

--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] transactions start time

2012-07-24 Thread Richard Huxton

On 24/07/12 12:14, Aleksei Arefjev wrote:

Hi,

In statistical reports gathered by PgBadger on our PostgreSQL databases
almost always we have in Queries that took up the most time report
table information about transactions start time ('BEGIN;' command).
Something like that in example below:

23h34m52.26s | 48,556,167 | 0.00s | BEGIN;

 0.82s | BEGIN;
 0.82s | BEGIN;
 0.82s | BEGIN;
 0.81s | BEGIN;
 0.81s | BEGIN;
 0.81s | BEGIN;
 0.80s | BEGIN;
 0.80s | BEGIN;
 0.79s | BEGIN;
 0.79s | BEGIN;


I'm not sure if I'm reading this right, but are there more than 48 
million BEGINs that took 0s each (presumably rounded down) and then a 
handful taking about 0.8s?


If so, then it's likely nothing to do with the BEGIN and just that the 
machine was busy doing other things when you started a transaction.



Databases placed on different hardware, OS - Debian GNU/Linux,
PostgreSQL 9.1

So, questions are:
1. Is this a normal situation with transactions start time ( BEGIN method) ?


See above


2. How can we reduce transactions start time if it's possible in principle?


Below 0.00? Probably not


3. What happens in PostgreSQL on transaction starting time? Can someone
describe this process in detail? (of course, I saw in PostgreSQL source
code, for example, definition such kind functions, like StartTransaction
function, but it's not so easy to understand for third-party researcher,
that all of these operations mean in real for performance)


Well there are two important things to understand:
1. All* commands run in a transaction
2. I think most of the work in getting a new snapshot etc gets pushed 
back until it's needed.


So - the overall impact of issuing BEGIN should be close to zero.

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Create tables performance

2012-07-06 Thread Richard Huxton

On 06/07/12 16:15, Sylvain CAILLET wrote:

Hi to all,

I run Postgresql 8.3.9 on a dedicated server running with Debian 5.0.4,
a strong bi quad-proc with RAM 16Go. My biggest db contains at least 100
000 tables.


That is a *lot* of tables and it's probably going to be slow whatever 
you do.



Last time, I started a Java process I use to make some
change on it, it created 170 new tables and it took one full minute.


What are you using all these tables for? I'm assuming most of them have 
identical structure.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] scale up (postgresql vs mssql)

2012-04-16 Thread Richard Huxton

On 15/04/12 13:43, Eyal Wilde wrote:


version;PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit


You've probably checked this, but if not it's worth making sure your 
disk I/O is roughly equivalent for the two operating systems. It might 
be poor drivers on the CentOS system.


Do you have two equivalent machines, or are you dual-booting?

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Richard Huxton

On 23/02/12 08:39, Reuven M. Lerner wrote:
(4) I tried chunking the deletes, such that instead of trying to 
delete all of the records from the B table, I would instead delete 
just those associated with 100 or 200 rows from the R table.  On a 1 
GB subset of the data, this seemed to work just fine.  But on the 
actual database, it was still far too slow.


This is the approach I'd take. You don't have enough control / access to 
come up with a better solution. Build a temp table with 100 ids to 
delete. Time that, and then next night you can increase to 200 etc until 
it takes around 3 hours.
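
Something like this, say (a sketch - the table and column names are made 
up, adjust to your schema):

BEGIN;
CREATE TEMP TABLE ids_to_delete AS
    SELECT r_id FROM r_table ORDER BY r_id LIMIT 100;   -- 200, 400... on later nights
DELETE FROM b_table b USING ids_to_delete t WHERE b.r_id = t.r_id;
DROP TABLE ids_to_delete;
COMMIT;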


Oh - and get the Windows admins to take a look at disk activity - the 
standard performance monitor can tell you more than enough. If it is 
swapping constantly, performance will be atrocious but even if the disks 
are just constantly busy then updates and deletes can be very slow.


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] unlogged tables

2011-11-14 Thread Richard Huxton

On 14/11/11 08:10, Sergey Konoplev wrote:

Hi,

On 12 November 2011 00:18, Stephen Frost sfr...@snowman.net  wrote:

In a crash, unlogged tables are automatically truncated.


BTW I wonder what for they are truncated in a crash case?


Because they bypass the transaction-log (WAL), hence unlogged.

There's no way to know whether there were partial updates applied when 
the system restarts.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] unlogged tables

2011-11-14 Thread Richard Huxton

On 14/11/11 10:08, Sergey Konoplev wrote:

On 14 November 2011 12:58, Richard Huxton d...@archonet.com  wrote:

Because they bypass the transaction-log (WAL), hence unlogged.
There's no way to know whether there were partial updates applied when the
system restarts.


I probably did not understand the truncate meaning correct. It
truncates all the records of the table or several recent records only?


All.

Let's say you were doing something like UPDATE unlogged_table SET x=1 
WHERE y=2. If a crash occurs during this command, there's no guarantee 
that the affected disk pages were all updated. Worse, a single page 
might be partially updated or even have rubbish in it (depending on the 
nature of the crash).


Without the WAL there's no way to check whether the table is good or 
not, or even to know what the last updates were. So - the only safe 
thing to do is truncate the unlogged tables.


In the event of a normal shutdown, we can flush all the writes to disk 
so we know all the data has been written, so there is no need to truncate.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] where clause + function, execution order

2011-11-11 Thread Richard Huxton

On 11/11/11 15:54, Julius Tuskenis wrote:

On 2011.11.11 17:38, Sorin Dudui wrote:

I have the following function:

CREATE OR REPLACE FUNCTION xxx(text)

[snip]

LANGUAGE sql STABLE



Function execute plan is prepared when creating it, so the where
clause should check the function result not altering its execution..


Not true for SQL functions. They can be inlined, but I'm not sure if 
this one will be.


What does EXPLAIN ANALYSE show for this query?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] where clause + function, execution order

2011-11-11 Thread Richard Huxton

On 11/11/11 16:28, Sorin Dudui wrote:

Hi,

this is the EXPLAIN ANALYSE output:


Merge Left Join  (cost=0.00..2820.34 rows=23138 width=777) (actual 
time=0.049..317.935 rows=26809 loops=1)
  Merge Cond: ((a.admin10)::text = (b.link_id)::text)
  ->  Index Scan using admin_lookup_admin10 on admin_lookup a  (cost=0.00..845.04 
rows=5224 width=742) (actual time=0.015..40.263 rows=8100 loops=1)
Filter: (((admin40)::text <> '-1'::text) AND (((admin40)::text = 
'ITA10'::text) OR ((admin40)::text = 'ITA15'::text) OR ((admin40)::text = 'ITA19'::text) OR 
((admin40)::text = 'ITA04'::text) OR ((admin40)::text = 'ITA09'::text) OR ((admin40)::text = 
'ITA03'::text) OR ((admin40)::text = 'ITA08'::text) OR ((admin40)::text = 'ITA17'::text) OR 
((admin40)::text = 'ITA02'::text) OR ((admin40)::text = 'ITA18'::text) OR ((admin40)::text = 
'ITA01'::text) OR ((admin40)::text = 'ITA20'::text) OR ((admin40)::text = 'ITA13'::text) OR 
((admin40)::text = 'ITA11'::text) OR ((admin40)::text = 'ITA14'::text) OR ((admin40)::text = 
'ITA16'::text) OR ((admin40)::text = 'ITA07'::text) OR ((admin40)::text = 'ITA06'::text) OR 
((admin40)::text = 'ITA12'::text) OR ((admin40)::text = 'ITA05'::text)))
  ->  Index Scan using reg_data_a08id_copy on registrations_data b  
(cost=0.00..1496.89 rows=24174 width=45) (actual time=0.008..70.408 rows=24174 loops=1)
Total runtime: 372.765 ms


That certainly looks like it's been inlined. You are testing for 
ITA10, ITA15 etc outside the function-call, no? It's pushing those 
tests down, using index admin_lookup_admin10 to test for them then 
joining afterwards.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to boost performance of queries containing pattern matching characters

2011-02-13 Thread Richard Huxton

On 14/02/11 06:59, Gnanakumar wrote:


How can we boost performance of queries containing pattern matching
characters?



QUERY:  DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'



As it is clear from the above query, email is matched partially and
case-insensitively, which my application requirement demands.


Well, for that exact pattern you're not going to find an index that's 
much help. Do you really need something so wide-ranging though? The 
above will match all of the following:


us...@domain.com
us...@sub.domain.com
us...@domain.com.au
us...@unrelated-domain.com
us...@unrelated-domain.com.au
us...@sub.domain.com.au
us...@sub.unrelated-domain.com
us...@sub.unrelated-domain.com.au
us...@sub.unrelated-domain.completely-wrong.com

Is that really what you are after? Or, did you just want to match:
  us...@domain.com
  us...@sub.domain.com

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to boost performance of queries containing pattern matching characters

2011-02-13 Thread Richard Huxton

On 14/02/11 07:28, Gnanakumar wrote:

Is that really what you are after? Or, did you just want to match:
us...@domain.com
us...@sub.domain.com


I understand that because I've (%) at the beginning and end, it's going to
match unrelated domains, etc., which as you said rightly, it is
wide-ranging.  But my point here is that how can I improve performance of
the queries containing pattern matching characters.


If you really need to match all those options, you can't use an index. A 
substring-matching index would need to have multiple entries per 
character per value (since it doesn't know what you will search for). 
The index-size becomes unmanageable very quickly.


That's why I asked what you really wanted to match.

So, I'll ask again: do you really want to match all of those options?

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to boost performance of queries containing pattern matching characters

2011-02-13 Thread Richard Huxton

On 14/02/11 07:38, Artur Zając wrote:

I had almost the same problem.
To resolve it, I created my own text search parser (myftscfg) which divides
text in column into three letters parts, for example:

someem...@domain.com is divided to som, ome,mee,eem,ema,mai,ail,il@,
l@d,@do,dom,oma,mai,ain,in.,n.c,.co,com

There should be also index on email column:

CREATE INDEX email_fts on mytable using gin
(to_tsvector('myftscfg'::regconfig, email))

Every query like email ilike '%domain.com%' should be rewritten to:

WHERE
to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND
to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND
to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND

...

Looks like you've almost re-invented the trigram module:
  http://www.postgresql.org/docs/9.0/static/pgtrgm.html
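
A sketch of how that looks (CREATE EXTENSION and LIKE/ILIKE support for 
trigram indexes need 9.1 or later; on 9.0 you install the contrib script 
and use the % similarity operator instead):

CREATE EXTENSION pg_trgm;
CREATE INDEX mytable_email_trgm_idx ON mytable USING gin (email gin_trgm_ops);
-- patterns of three or more characters can then use the index
SELECT * FROM mytable WHERE email ILIKE '%domain.com%';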

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to boost performance of queries containing pattern matching characters

2011-02-13 Thread Richard Huxton

On 14/02/11 07:46, Gnanakumar wrote:

If you really need to match all those options, you can't use an index. A
substring-matching index would need to have multiple entries per
character per value (since it doesn't know what you will search for).
The index-size becomes unmanageable very quickly.



That's why I asked what you really wanted to match.

To be more specific, in fact, our current application allows to delete
email(s) with a minimum of 3 characters.  There is a note/warning also given
for application Users' before deleting, explaining the implication of this
delete action (partial  case-insensitive, and it could be wide-ranging
too).


So, I'll ask again: do you really want to match all of those options?

Yes, as explained above, I want to match all those.


Then you can't use a simple index. If you did use an index it would 
probably be much slower for com or yah or gma and so on.


The closest you can do is something like Artur's option (or the pg_trgm 
module - handy since you are looking at 3-chars and up) to select likely 
matches combined with a separate search on '%domain.com%' to confirm 
that fact.


P.S. - I'd be inclined to just match the central domain parts, so for 
us...@europe.megacorp.com you would index europe and megacorp and 
only allow matching on the start of each string. Of course if your 
application spec says you need to match on p.c too then that's what 
you have to do.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance libpq vs JDBC

2010-12-16 Thread Richard Huxton

On 16/12/10 09:21, Werner Scholtes wrote:

I assume that the wire protocol of PostgreSQL allows to transmit
multiple rows at once, but libpq doesn't have an interface to access it.
Is that right?


Sounds wrong to me. The libpq client is the default reference 
implementation of the protocol. If there were large efficiencies that 
could be copied, they would be.


Anyway - you don't need to assume what's in the protocol. It's 
documented here:

  http://www.postgresql.org/docs/9.0/static/protocol.html

I'd stick wireshark or some other network analyser on the two sessions - 
see exactly what is different.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance libpq vs JDBC

2010-12-16 Thread Richard Huxton

On 16/12/10 12:28, Werner Scholtes wrote:

Thanks a lot for your advice. I found the difference: My Java program
sends one huge SQL string containing 1000 INSERT statements separated
by ';' (without using prepared statements at all!), whereas my C++
program sends one INSERT statement with parameters to be prepared and
after that 1000 times parameters. Now I refactored my C++ program to
send also 1000 INSERT statements in one call to PQexec and reached
the same performance as my Java program.


So - it was the network round-trip overhead. Like Divakar suggested, 
COPY or VALUES (),(),() would work too.


You mention multiple updates/deletes too. Perhaps the cleanest and 
fastest method would be to build a TEMP table containing IDs/values 
required and join against that for your updates/deletes.
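
For example (a sketch with made-up names):

-- several rows in one round-trip
INSERT INTO target (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- bulk update driven by a temp table
CREATE TEMP TABLE changes (id int, val text);
-- fill changes with COPY or a multi-row INSERT, then:
UPDATE target t SET val = c.val FROM changes c WHERE t.id = c.id;
-- (DELETE FROM target t USING changes c WHERE t.id = c.id; works the same way)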



I just wonder why anyone should use prepared statements at all?


Not everything is a simple INSERT. Preparing saves planning-time on 
repeated SELECTs. It also provides some SQL injection safety since you 
provide parameters rather than building a SQL string.
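
At the SQL level that looks like this (a sketch - the table here is made up):

PREPARE get_item (int) AS
    SELECT * FROM items WHERE id = $1;

EXECUTE get_item(42);   -- planned once, executed with different parameters
EXECUTE get_item(43);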


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to get FK to use new index without restarting the database

2010-12-16 Thread Richard Huxton

On 16/12/10 12:12, Eric Comeau wrote:


The problem surfaced at one of our client installs where a maintenance
DELETE query was running for over 24 hrs. We have since then identified
the missing indexes and have sent the client a script to create them,
but in our testing we could not been able to get postgres to use the new
index for the FK cascade delete without bouncing the database.


Well, an ongoing DELETE isn't going to see a new index. I'd have thought 
a new connection should though.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Using more tha one index per table

2010-07-22 Thread Richard Huxton

On 22/07/10 03:27, Greg Smith wrote:

Steve Atkins wrote:

If http://postgresql.org/docs/9.0/* were to 302 redirect to
http://postgresql.org/docs/current/* while 9.0 is the current release
(and similarly for 9.1 and so on) I suspect we'd find many more links
to current and fewer links to specific versions after a year or two.


True, but this would leave people with no way to bookmark a permanent
link to whatever is the current version, which will represent a
regression for how some people want the site to work.


Having a quick look at the website, a simple change might be to have a 
large CURRENT MANUALS link above all the versioned links. That should 
help substantially.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance Tuning Large PL/PGSQL Stored Procedure

2010-03-26 Thread Richard Huxton

On 26/03/10 03:56, Eliot Gable wrote:


I really am chasing milliseconds here, and I appreciate all your feedback.
You've given me a relatively large number of possible optimizations I can
try out. I will definitely try out the libpqtypes. That sounds like a
promising way to further cut down on execution time. I think most of my
performance penalty is in transfering the results back to the C++
application.


In addition to all of Merlin's good advice, if the client is on a 
different machine to the server then try sticking wireshark or similar 
onto the connection. That should make it pretty clear where the main 
costs are in getting your data back.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Multiple data base on same server

2010-02-26 Thread Richard Huxton

On 26/02/10 09:37, elias ghanem wrote:

Hi,

I'm using postgresql 8.4

I need to install multiple postgresql dbs on one server but I have some
questions:

-Is there any problems (performance wise or other) if I have 10 to 15 DBs on
the same server?


Clearly that's going to depend on what they're all doing and how big a 
server you have. There's no limitation in PostgreSQL that stops you though.



-Each DB needs 10 tablespaces, so if I create 10 different tablespaces for
each DB I will have 100 to 150 table space on the same server. So can this
also cause any problems?


Do you have 200-300+ disks to put these tablespaces on? If not, I'm not 
clear what you are trying to do. Why does each DB need 10 tablespaces?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Multiple data base on same server

2010-02-26 Thread Richard Huxton

On 26/02/10 12:45, elias ghanem wrote:

Hi,
Thanks for your answer,
Concerning the second point, each db have different table that are logically
related (for ex, tables for configuration, tables for business...) plus I'm
planning to put the indexes on their own tablespaces.
Concerning the disks I will maybe stored on multiple disks (but surely not
200-300). So I'm just wondering If this big number of tablespaces on a same
db server may cause problems,


If the tablespaces aren't on different disks, I'm not sure what the 
point is.


Do you perhaps mean schemas? So you have e.g. a system schema with 
tables users, activity_log etc? There's no problem with 20-30 
schemas per database.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Extracting superlatives - SQL design philosophy

2010-02-25 Thread Richard Huxton

On 24/02/10 23:37, Dave Crooke wrote:

1. The city temps table is a toy example, not meant to be realistic :-)


You knew that and I guessed it, but it's worth stating these things for 
people who read the archives a year from now.



2. Yes, my (Java) algorithm is deterministic ... it will return
exactly one row per city, and that will be the row (or strictly, *a*
row) containing the highest temp. Temp value ties will break in favour
of earlier rows in Guinness Book of Records tradition :-) It's
equivalent to a HashAggregate implementation.


But not when you add in other columns (which is what you're trying to do).


The following two query plans (from my real schema) illustrate the
itch I am trying to scratch  I want the functionality of the 2nd
one, but with the execution plan structure of the first:

# explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
   QUERY
PLAN
--
  HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual
time=10861.845..10863.008 rows=1023 loops=1)
->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139
loops=1)
  Total runtime: 10863.856 ms
(3 rows)

Time: 10864.817 ms
# explain analyse select distinct on (a) * from perf_raw_2010_02_23
order by a, b desc ;


One big bit of the cost difference is going to be the ordering you need 
to get a repeatable result.



  QUERY
PLAN
-
  Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual
time=46011.204..58428.210 rows=1023 loops=1)
->  Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28)
(actual time=46011.200..53561.112 rows=5276139 loops=1)
  Sort Key: a, b
  Sort Method:  external merge  Disk: 247584kB
-- actually OS RAM buffers


Even if the sort never actually reaches a physical disk, you should 
still see an increase by increasing sort_mem for the duration of the one 
query. It's not going to be the magnitude you want, but probably worth 
doing.



  ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139
loops=1)
  Total runtime: 58516.185 ms
(6 rows)

Time: 58517.233 ms

The only difference between these two is that the second query returns
the whole row. The *ratio* in cost between these two plans increases
in proportion to log(n) of the table size ... at 5.5m rows its
livable, at 500m it's probably not :-!


If performance on this query is vital to you, and the table doesn't 
change after initial population (which I'm guessing is true) then try an 
index on (a asc, b desc) and CLUSTER that index. Depending on the ratio 
of distinct a:b values that could be what you're after.
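
Something like this (a sketch - the index name is made up, and the DESC 
column and this CLUSTER syntax need 8.3 or later):

CREATE INDEX perf_raw_a_b_idx ON perf_raw_2010_02_23 (a ASC, b DESC);
CLUSTER perf_raw_2010_02_23 USING perf_raw_a_b_idx;
ANALYZE perf_raw_2010_02_23;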


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Extracting superlatives - SQL design philosophy

2010-02-24 Thread Richard Huxton

On 24/02/10 22:47, Dave Crooke wrote:

I'd imagine it would be possible to have a query planner optimization
that would convert Garrett's DISTINCT ON syntax to do what I was
trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
going to return the the one row for each X which has the highest value
of Y, and so use a MAX-structured accumulation instead of a sort.


Why is there only one row? For city temperatures, that seems unlikely.

In the event of more than one row does your algorithm give repeatable 
results?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL

2010-01-25 Thread Richard Huxton

On 22/01/10 19:06, Tory M Blue wrote:

 Here is the explain plan for the query. Actual rows that the query
 returns is 6369

Actually, it processes 19,799 rows (see the actual rows= below).


SLOW



  ->  Bitmap Heap Scan on userstats  (cost=797.69..118850.46
rows=13399 width=8) (actual time=281.604..31190.290 rows=19799
loops=1)



Total runtime: 31219.536 ms



FAST



  ->  Bitmap Heap Scan on userstats a  (cost=802.66..118855.43
rows=33276 width=23) (actual time=55.400..3807.908 rows=2606 loops=1)



Total runtime: 3813.626 ms


OK - so the first query processes 19,799 rows in 31,219 ms (about 1.5ms 
per row)


The second processes 2,606 rows in 3,813 ms (about 1.5ms per row).

You are asking for DISTINCT user-ids, so it seems reasonable that it 
will take slightly longer to check a larger set of user-ids.


Otherwise, both queries are the same. I'm still a little puzzled by the 
bitmap scan, but the planner probably knows more about your data than I do.


The main time is spent in the bitmap heap scan which is where it's 
grabbing actual row data (and presumably building a hash over the uid 
column). You can see how long in the actual time figures - the first number 
(e.g. 281.604) is the time spent before it starts returning rows, and the second is the 
total time at finish (31190.290). If loops was greater than 1 you 
would multiply the times by the number of loops to get a total.


So - there's nothing wrong in the sense that the second query does the 
same as the first. Let's take a step back. What you really want is your 
reports to be faster.


You mentioned you were running this query thousands of times with a 
different makeid each time. Running it once for all possible values 
and stashing the results in a temp table will probably be *much* faster. 
The planner can just scan the whole table once and build up its results 
as it goes.
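
Roughly like this (a sketch - I'm guessing at the column names from your plans):

CREATE TEMP TABLE makeid_counts AS
    SELECT makeid, count(DISTINCT uid) AS uids
    FROM userstats
    WHERE tagged
    GROUP BY makeid;
-- the individual reports then just read from makeid_counts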


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Data Set Growth causing 26+hour runtime, on what we believe to be very simple SQL

2010-01-22 Thread Richard Huxton

On 22/01/10 18:03, Tory M Blue wrote:

On Fri, Jan 22, 2010 at 1:42 AM, Richard Huxton d...@archonet.com  wrote:


On 21/01/10 22:15, Tory M Blue wrote:



2. If it's mostly tagged=true you are interested in you can always use a
partial index: CREATE INDEX ... (makeid) WHERE tagged
This might be a win even if you need a second index with WHERE NOT tagged.



Partial index doesn't seem to fit here due to the fact that there are 35-40%
Marked True.

Didn't think about creating a second index for false, may give that a shot.


If you're mostly searching tagged=true, try the partial index - it'll mean 
the planner is just scanning the index for the one term.



Also, either I've not had enough coffee yet, or a bitmap scan is an odd
choice for only ~ 13000 rows out of 100 million.

  ->  Bitmap Index Scan on idx_retargetuserstats_makeidtag  (cost=0.00..360.20 rows=13175 width=0)
        Index Cond: ((makeid = 'b1mw-ferman'::text) AND (tagged = true))



Otherwise, see what Craig said.

I'm assuming this isn't the query that is CPU bound for a long time. Unless
your table is horribly bloated, there's no reason for that judging by this
plan.


It is, but not always, only when there are 10K or more matches. And the explain
unfortunately is sometimes way high or way low, so the explain is hit and
miss.

But the same sql that returns maybe 500 rows is pretty fast, it's the return
of 10K+ rows that seems to stall and is CPU Bound.


Hmm - might be able to push that cross-over point up a bit by tweaking 
various costs, but you've got to be careful you don't end up making all 
your other queries worse. It'd be good to figure out what the problem is 
first.


Looking at the query there are four stages:
 1. Scan the index, build a bitmap of heap pages with matching rows
 2. Scan those pages, find the rows that match
 3. Run DISTINCT on the uids
 4. Count them
I wonder if it could be the DISTINCT. What happens with a count(*) or 
count(uid) instead? Also - you might find EXPLAIN ANALYZE more useful 
than straight EXPLAIN here. That will show actual times for each stage.


On Craig's branch of this thread, you say you call it 6000 times with 
different makeids. Any reason why you can't join to a temp table and 
just do it in one query?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Inserting 8MB bytea: just 25% of disk perf used?

2010-01-19 Thread Richard Huxton

On 19/01/10 10:50, fka...@googlemail.com wrote:

However, the deeper question is (sounds ridiculous): Why am
I I/O bound *this much* here. To recall: The write
performance in pg is about 20-25% of the worst case serial
write performance of the disk (and only about 8-10% of the
best disk perf) even though pg_xlog (WAL) is moved to
another disk, only 10 simple INSERT commands, a simple table
of 5 columns (4 unused, one bytea) and one index for OID, no
compression since STORAGE EXTERNAL, ntfs tweaks (noatime
etc), ...


I'm no Windows expert, but the sysinternals tools (since bought by 
Microsoft) have always proved useful to me.


Diskmon should show you what's happening on your machine:
http://technet.microsoft.com/en-us/sysinternals/bb896646.aspx

Be aware that this will generate a *lot* of data very quickly and you'll 
need to spend a little time analysing it. Try it without PG running to 
see what your system is up to when idle first to get a baseline.


Unfortunately it doesn't show disk seek times (which is probably what 
you want to measure) but it should let you decode what reads/writes are 
taking place when. If two consecutive disk accesses aren't adjacent then 
that implies a seek of course. Worst case you get two or more processes 
each accessing different parts of the disk in an interleaved arrangement.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-10-23 Thread Richard Huxton
jes...@krogh.cc wrote:
 That structure isn't exposed to the planner though, so it doesn't
 benefit from any re-ordering the planner would normally do for normal
 (exposed) AND/OR clauses.

 Now, to_tsquery() can't re-order the search terms because it doesn't
 know what column it's being compared against. In fact, it might not be a
 simple column at all.
 
 I cant follow this logic based on explain output, but I may have
 misunderstood something. The only difference in these two query-plans is
 that we have an additional or'd term in the to_tsquery().

Hmm - I've had a poke through the source. I've slightly misled you...

 What we see is that, the query-planner indeed has knowledge about changes
 in the row estimates based on changes in the query to to_tsquery(). 

Yes, new in 8.4 - sorry, thought that hadn't made it in.

The two plan-nodes in question are in:
  backend/executor/nodeBitmapIndexscan.c
  backend/executor/nodeBitmapHeapscan.c
The actual tsearch stuff is in
  src/backend/utils/adt/ts*.c

It looks like TS_execute (tsvector_op.c) is the bit of code that handles
the tsquery tree. That uses a callback to actually check values
(checkcondition_gin). The gin_extract_tsquery function is presumably the
extractQuery function as described in the manuals (Ch 52).

So, I'm guessing what you would want to do is generate a reduced query tree
for the indexscan (A & B & C => A if A is an uncommon word) and use the
full query tree for the heap check. Now, what isn't clear to me on first
glance is how to determine which phase of the bitmap scan we are in.

HTH

Just checking, because I don't think it's useful in this case. But, you
do know about gin_fuzzy_search_limit?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-10-23 Thread Richard Huxton
Jeff Davis wrote:
 On Fri, 2009-10-23 at 09:26 +0100, Richard Huxton wrote:
 That structure isn't exposed to the planner though, so it doesn't
 benefit from any re-ordering the planner would normally do for normal
 (exposed) AND/OR clauses.
 
 I don't think that explains it, because in the second plan you only see
 a single index scan with two quals:
 
Index Cond: ((ftsbody_body_fts @@
  to_tsquery('commonterm'::text)) AND (ftsbody_body_fts @@
  to_tsquery('spellerror'::text)))
 
 So it's entirely up to GIN how to execute that.

http://www.postgresql.org/docs/8.4/static/gin-extensibility.html
Datum *extractQuery(...)
Returns an array of keys given a value to be queried; that is, query is
the value on the right-hand side of an indexable operator whose
left-hand side is the indexed column

So - that is presumably two separate arrays of keys being matched
against, and the AND means if the first fails it'll never check the second.

What I'm not sure about is if tsquery('commonterm & spellerror')
produces two sets of keys or if it just produces one.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance problems with DISTINCT ON

2009-09-29 Thread Richard Huxton
Sgarbossa Domenico wrote:
 I need to retrieve the most recent prices per products from a price list 
 table:

 select distinct on (articolo) articolo,data_ent,prezzo from 
 listini_anagrafici order by articolo, data_ent desc
 
 but it seems that this query runs slowly... about 5/6 seconds.
 the table contains more or less 500K records, PostgreSQL version is 8.1.11 
 and the server has 4gb of RAM entirely dedicate to the db.

 'Unique  (cost=73893.89..76551.25 rows=88312 width=24) (actual 
 time=4022.578..5076.206 rows=193820 loops=1)'
'  ->  Sort  (cost=73893.89..75222.57 rows=531472 width=24) (actual 
 time=4022.574..4505.538 rows=531472 loops=1)'
 'Sort Key: articolo, data_ent'
'->  Seq Scan on listini_anagrafici  (cost=0.00..16603.72 rows=531472 
 width=24) (actual time=0.009..671.797 rows=531472 loops=1)'
 'Total runtime: 5217.452 ms'

You've got 531472 rows in the table and the query is going to output
193820 of them. Scanning the whole table is almost certainly the way to go.

If the table doesn't change much, you could try running a CLUSTER on the
index you've created. That will lock the table while it re-orders the
physical layout of the rows based on your index though, so it's no good
if the table is updated much.

Failing that, you could try issuing set work_mem = ... before the
query with increasing sizes for work_mem. That might make the sort
faster too.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Possible causes of sometimes slow single-row UPDATE with trivial indexed condition?

2009-09-17 Thread Richard Huxton
Vlad Romascanu wrote:
 Problem occurs when running (in production) Postgres 8.3.7 64-bit (from
 RPM) on Ubuntu 8.04.2, on an Amazon EC2 (xen) Large instance (8GB
 RAM), with the DB on a 50GB EC2 block device.

Hmm - don't know what the characteristics of running PG on EC2 are. This
might be something peculiar to that.

 Problem does not occur when running (in staging/pre-production) Postgres
 8.3.5 32-bit (from RPM) on Ubuntu 8.04.1, on a less beefy Amazon EC2
 (xen) Small instance, with the DB on a 5GB EC2 block device.
 
 I am running with slow query logging on, and somewhat inexplicably I've
 been getting the following slow UPDATE query several times in the past
 weeks (I'm also including some context lines above and below):

 2009-09-14 08:12:30.363 UTC u...@database pid=26474
 ip=127.0.0.1(58364) sid=4aadf58d.676a:1 LOG:  duration: 13472.892 ms
   statement: UPDATE document_sets SET status = E'rejected',

 This is one of the faster occurrences; at times the query has been
 logged as having taken 100+ seconds:

That's *very* slow indeed, and clearly the query itself is simple enough.

Typically in a situation like this you might suspect checkpointing was
the problem. Lots of dirty disk pages being flushed to disk before a
checkpoint. The stats for disk activity you give don't support that
idea, although 10 minute intervals is quite far apart.

Your table-stats show this is a small table. If it's updated a lot then
it might be that your autovacuum settings aren't high enough for this
table. The log_autovacuum_min_duration setting might be worth enabling
too - to see if autovacuum is taking a long time over anything.

Another thing that can cause irregular slowdowns is if you have a
trigger with some custom code that takes an unexpectedly long time to
run (takes locks, runs a query that plans badly occasionally). I don't
know if that's the case here.

Oh, if you don't have indexes on status or updated_at then you might
want to read up on HOT and decrease your fill-factor on the table too.
That's unrelated to this though.
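
For reference, that would just be something like this (70 here is only an 
example figure):

ALTER TABLE document_sets SET (fillfactor = 70);
-- only affects newly written pages; existing ones stay full until rewritten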


It looks like the problem is common enough that you could have a small
script check pg_stat_activity once every 10 seconds and dump a snapshot
of pg_locks, vmstat etc. If you can catch the problem happening that
should make it easy to diagnose.
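
The snapshot queries themselves are simple, e.g. (a sketch - column names 
as of 8.3):

SELECT now(), procpid, waiting, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';

SELECT now(), locktype, relation::regclass, mode, granted, pid
FROM pg_locks;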

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Under the hood of views

2009-08-13 Thread Richard Huxton

David Kerr wrote:


create view test as
select a,b,c,d,e,f,g from testtable;

select a from test;

(does the engine retrieve b-g?)


Shouldn't - the query just gets rewritten macro-style. I don't think it 
eliminates joins if you don't need any columns, but that's not possible 
without a bit of analysis.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Greenplum MapReduce

2009-08-03 Thread Richard Huxton

Suvankar Roy wrote:

Hi all,

Has anybody worked on Greenplum MapReduce programming ?

I am facing a problem while trying to execute the below Greenplum 
Mapreduce program written in YAML (in blue). 


The other poster suggested contacting Greenplum and I can only agree.


The error is thrown in the 7th line as:
Error: YAML syntax error - found character that cannot start any token 
while scanning for the next token, at line 7 (in red)


There is no red, particularly if viewing messages as plain text (which 
most people do on mailing lists). Consider indicating a line some other 
way next time (commonly below the line you put something like this is 
line 7 ^)


The most common problem I get with YAML files though is when a tab is 
accidentally inserted instead of spaces at the start of a line.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Greenplum MapReduce

2009-08-03 Thread Richard Huxton

Suvankar Roy wrote:

Hi Richard,

I sincerely regret the inconvenience caused.


No big inconvenience, but the lists can be very busy sometimes and the 
easier you make it for people to answer your questions the better the 
answers you will get.



%YAML 1.1
---
VERSION: 1.0.0.1 
DATABASE: test_db1

USER: gpadmin
DEFINE: 
- INPUT: #** This the line which is causing the error **#

 NAME: doc
 TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out 
all the rest of the file, really.


I have learnt that unnecessary TABs can be the cause of this, so trying to 
overcome that - hopefully the problem will subside then


I'm always getting this. It's easy to accidentally introduce a tab 
character when reformatting YAML. It might be worth checking if your 
text editor has an option to always replace tabs with spaces.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incr/Decr Integer

2009-07-16 Thread Richard Huxton

William Scott Jordan wrote:

Hey all!

Is there a better way to increase or decrease the value of an integer 
than doing something like:



UPDATE the_table SET the_int = the_int + 1 WHERE the_id = 123 ;


No.

We seem to be getting a lot of deadlocks using this method under heavy 
load.  Just wondering if we should be doing something different.


You can't get deadlocks with that - it only references one table.

What is the purpose of this query - how are you using it?

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Very big insert/join performance problem (bacula)

2009-07-14 Thread Richard Huxton

Marc Cousin wrote:



Your effective_cache_size is really small for the system you seem to have -
its the size of IO caching your os is doing and uses no resources itself.
And 800MB of that on a system with that amount of data seems a bit unlikely
;-)

Using `free` you can see the amount of io caching your OS is doing atm. in
the 'cached' column.

That possibly might tip some plans in a direction you prefer.

What kind of machine are you running this on?


I played with this parameter too, and it didn't influence the plan. Anyway, the 
doc says it's the OS cache available for one query,


No they don't. I'm guessing you're getting mixed up with work_mem.

 and there may be a lot of
insert queries at the same time, so I chose to be conservative with this 
value. I tried it with 8GB too, the plans were the same.


The OS cache is around 8-10GB by the way.


That's what you need to set effective_cache_size to then.

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Very big insert/join performance problem (bacula)

2009-07-14 Thread Richard Huxton

Marc Cousin wrote:


Temporarily I moved the problem at a bit higher sizes of batch by changing 
random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like an 
apprentice sorcerer with this, as I told postgreSQL that fetching rows from 
disk are much cheaper than they are. These values are, I think, completely 
abnormal.


They certainly don't have anything to do with reality. Try putting them 
back to (say) seq_page_cost=1 and random_page_cost=2.



So, finally, to my questions :
- Is it normal that PostgreSQL is this off base on these queries (sorry I 
don't have the plans, if they are required I'll do my best to get some, but 
they really are the two obvious plans for this kind of query). What could 
make it choose the hash join for too small batch tables ?


No point in speculating without plans.


- Is changing the 2 costs the way to go ?


Not the way you have.

- Is there a way to tell postgreSQL that it's more costly to sort than it 
thinks ? (instead of telling it that fetching data from disk doesn't cost 
anything).


That's what the configuration settings do. But if you put a couple of them way 
off from reality it'll be pure chance if it gets any estimates right.



Here are the other non-default values from my configuration :

shared_buffers = 2GB
work_mem = 64MB


Set this *much* higher when you are running your bulk imports. You can 
do it per-connection. Try 256MB, 512MB, 1GB (but keep an eye on total 
memory used).
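
For example, just for the session doing the import (the value is only a
suggestion):

  SET work_mem = '512MB';

It reverts to the postgresql.conf value when the session ends, or earlier
with RESET work_mem.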



maintenance_work_mem = 256MB
max_fsm_pages = 1500 # There are quite big deletes with bacula ...
effective_cache_size = 800MB


See other emails on this one.


default_statistics_target = 1000


Probably don't need this for all columns, but it won't cause problems 
with these queries.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Data caching

2009-07-09 Thread Richard Huxton

Martin Chlupac wrote:

Hello everybody,
I have a simple query which selects data from not very large table (
434161 rows) and takes far more time than I'd expect. I believe it's
due to a poor disk performance because when I execute the very same
query for a second time I get much better results (caching kicks in?).
Can you please confirm my theory or do you see any other possible
explanation?


Yep - it's the difference between fetching from memory and from disk.


   ->  Bitmap Heap Scan on records_f4f23ca0-9c35-43ac-bb0d-1ef3784399ac
         (cost=76.75..3819.91 rows=1912 width=206)
         (actual time=329.416..3677.521 rows=2161 loops=1)



   ->  Bitmap Heap Scan on records_f4f23ca0-9c35-43ac-bb0d-1ef3784399ac
         (cost=76.75..3819.91 rows=1912 width=206)
         (actual time=1.616..10.369 rows=2161 loops=1)


The plan scans the index, and builds up a bitmap of which disk-blocks 
contain (potential) matches. It then has to read the blocks (the heap 
scan above), confirm they match and then return the rows. If you look at 
the actual time above you can see about 90% of the slow query is spent 
doing this.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] slow DELETE on 12 M row table

2009-06-25 Thread Richard Huxton

Greg Stark wrote:

waiting means it's blocked trying to acquire a lock. Some open
transaction has the table you're trying to index locked. Look in
pg_locks and pg_stat_activity to find out who.


Or you might find CREATE INDEX CONCURRENTLY fits your situation.

http://www.postgresql.org/docs/8.3/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY
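
For example (table, column and index names here are only placeholders):

  CREATE INDEX CONCURRENTLY my_big_table_col_idx ON my_big_table (some_col);

It builds the index without taking the lock that blocks concurrent
INSERT/UPDATE/DELETE, at the price of taking longer overall.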

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] EXPLAIN understanding? (restarted from Censorship)

2009-06-10 Thread Richard Huxton

On Wed, 10 Jun 2009, Richard Huxton wrote:
Send it to the list again, and cc: me directly if you like. If it doesn't 
show up in the next 20 minutes, I'll try sending it.


Okay, here we go. I have (per Tom's advice) found some acknowledgement
knobs on Majordomo. Here follows my original rejected mail:

Hi. I thought by now I would be fairly good at understanding EXPLAIN
ANALYSE results, but I can't quite figure this one out. Perhaps someone
could help me.

EXPLAIN ANALYSE SELECT *
FROM GeneGoAnnotation a1, GOAnnotation a2, OntologyTermRelations a3
WHERE a1.GoAnnotation = a2.id AND a2.ontologyTermId = a3.OntologyTerm;

  QUERY PLAN
---
 Nested Loop  (cost=0.00..673587.67 rows=330437962 width=95)
              (actual time=0.056..1924645.797 rows=344491124 loops=1)
   ->  Merge Join  (cost=0.00..28369.58 rows=361427 width=87)
                   (actual time=0.039..4620.912 rows=361427 loops=1)
         Merge Cond: (a1.goannotation = a2.id)
         ->  Index Scan using genegoannotation__goannotation on genegoannotation a1
               (cost=0.00..9710.32 rows=361427 width=8)
               (actual time=0.015..840.547 rows=361427 loops=1)
         ->  Index Scan using goannotation_pkey on goannotation a2
               (cost=0.00..13133.12 rows=403323 width=79)
               (actual time=0.014..1427.179 rows=403323 loops=1)
   ->  Index Scan using ontologytermrelations__ontologyterm on ontologytermrelations a3
         (cost=0.00..1.20 rows=47 width=8)
         (actual time=0.022..1.908 rows=953 loops=361427)
  Index Cond: (a3.ontologyterm = a2.ontologytermid)
  Total runtime: 2524647.064 ms
(8 rows)

If I look at the actual results of the outer-most join, the nested loop,
then I can take the number rows=344491124 and divide it by loops=361427 to
get rows=953. Clearly this means that on average each index scan on a3
returned 953 rows.

However, if I apply the same logic to the estimated results, it all falls
apart. The total estimated number of rows is remarkably accurate, as is
the estimated number of loops (results from the merge join). However the
average number of rows expected to be returned from the index scan is only
47. I don't know how the planner is getting its accurate final estimate of
rows=330437962, because it is not from multiplying rows=361427 by rows=47.
That would only give 16987069 rows.

Any ideas/explanations?

Matthew

--
Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers.-- Computer Science Lecturer


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] EXPLAIN understanding? (restarted from Censorship)

2009-06-10 Thread Richard Huxton

Please ignore - Matthew has discovered what was blocking this message.
Use his thread instead.

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Using index for bitwise operations?

2009-06-01 Thread Richard Huxton

Shaul Dar wrote:

Hi,

I have a column that is a bit array of 16, each bit specifying if a certain
property, out of 16, is present or not. Our typical query selects 300
random rows (could be located in different blocks) from the table based on
another column+index, and then filters them down to ~50 based on this
bit field.

[snip]
 W/o an index this might be overly expensive,
 even as a filter (on selected 300 rows).

Have you _tried_ just not having an index at all? Since you are only 
accessing a relatively small number of rows to start with, even an 
infinitely efficient index isn't going to make that much difference. 
Combine that with the fact that you're going to have the indexes 
competing with the table for cache space and I'd see how much difference 
it makes just not having it.


Failing that, perhaps have an index on a single bit if there is one you 
always/mostly check against.


The relational way to do this would be one or more property tables 
joined to your main table. If the majority of your properties are not 
set then this could be faster too.
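
A minimal sketch of the relational version (all names invented):

  CREATE TABLE item_property (
      item_id  integer  NOT NULL,
      property smallint NOT NULL,
      PRIMARY KEY (item_id, property)
  );

  -- items that have property 3 set
  SELECT i.*
  FROM item i
  JOIN item_property p ON p.item_id = i.id
  WHERE p.property = 3;

Only the properties that are actually set get a row, so sparse properties
stay cheap.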


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Any better plan for this query?..

2009-05-06 Thread Richard Huxton

Dimitri wrote:

Hi Chris,

the only problem I see here is it's 2 times slower vs InnoDB


How do you know? This isn't just based on the explain values reported, 
is it?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Any better plan for this query?..

2009-05-06 Thread Richard Huxton

Dimitri wrote:

Hi Richard,

no, of course it's not based on explain :-)
I've run several tests before and now going in depth to understand if
there is nothing wrong. Due such a single query time difference InnoDB
is doing 2-3 times better TPS level comparing to PostgreSQL..


And you are satisfied that it is the planned query time that is the 
dominant factor here, and not parsing time, connection time, data 
transport, disk bandwidth etc?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Any better plan for this query?..

2009-05-06 Thread Richard Huxton

Dimitri wrote:

I'll try to answer all mails at once :-))

- query is running fully in RAM, no I/O, no network, only CPU time

- looping 100 times the same query gives 132ms total time (~1.32ms per
query), while it's 44ms on InnoDB (~0.44ms per query)


Well, assuming you're happy that PG is tuned reasonably for your machine 
and that MySQL's query cache isn't returning the results here it looks 
like MySQL is faster for this particular query.


The only obvious place there could be a big gain is with the hashing 
algorithm. If you remove the ORDER BY and the query-time doesn't fall by 
much then it's the hash phase.


The other thing to try is to alter the query to be a SELECT count(*) 
rather than returning rows - that will let you measure the time to 
transfer the result rows.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance for high-volume log insertion

2009-04-21 Thread Richard Huxton

da...@lang.hm wrote:

On Tue, 21 Apr 2009, Stephen Frost wrote:


* da...@lang.hm (da...@lang.hm) wrote:

while I fully understand the 'benchmark your situation' need, this isn't
that simple.


It really is.  You know your application, you know it's primary use
cases, and probably have some data to play with.  You're certainly in a
much better situation to at least *try* and benchmark it than we are.


rsyslog is a syslog server. it replaces (or for debian and fedora, has 
replaced) your standard syslog daemon. it receives log messages from 
every app on your system (and possibly others), filters, manipulates 
them, and then stores them somewhere. among the places that it can store 
the logs are database servers (native support for MySQL, PostgreSQL, and 
Oracle. plus libdbi for others)


Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth 
is a long-term admin thing.


And possibly
5. Have two connections, one for fatal/error etc and one for info/debug 
level log statements (configurable split?). Then you can use the 
synchronous_commit setting on the less important ones. Might buy you 
some performance on a busy system.


http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
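
A rough sketch of points 3 and 5 (table and column names are just guesses
at this stage):

  PREPARE log_ins (timestamptz, text, text, int, text) AS
    INSERT INTO logs (ts, host, facility, priority, message)
    VALUES ($1, $2, $3, $4, $5);

  EXECUTE log_ins (now(), 'host1', 'daemon', 6, 'example message');

and, on the connection carrying the less important messages only:

  SET synchronous_commit = off;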

other apps then search and report on the data after it is stored. what 
apps?, I don't know either. pick your favorite reporting tool and you'll 
be a step ahead of me (I don't know a really good reporting tool)


as for sample data, you have syslog messages, just like I do. so you 
have the same access to data that I have.


how would you want to query them? how would people far less experienced 
than you want to query them?


I can speculate that some people would do two columns (time, everything 
else), others will do three (time, server, everything else), and others 
will go further (I know some who would like to extract IP addresses 
embedded in a message into their own column). some people will index on 
the time and host, others will want to do full-text searches of everything.


Well, assuming it looks much like traditional syslog, I would do 
something like: (timestamp, host, facility, priority, message). It's 
easy enough to stitch back together if people want that.


PostgreSQL's full-text indexing is quite well suited to logfiles I'd 
have thought, since it knows about filenames, urls etc already.


If you want to get fancy, add a msg_type column and one subsidiary table 
for each msg_type. So - you might have smtp_connect_from (hostname, 
ip_addr). A set of perl regexps can match and extract the fields for 
these extra tables, or you could do it with triggers inside the 
database. I think it makes sense to do it in the application. Easier for 
users to contribute new patterns/extractions. Meanwhile, the core table 
is untouched so you don't *need* to know about these extra tables.


If you have subsidiary tables, you'll want to partition those too and 
perhaps stick them in their own schema (logs200901, logs200902 etc).
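
Roughly, something like this (all names invented):

  CREATE SCHEMA logs200901;
  CREATE TABLE logs200901.smtp_connect_from (
      log_id   bigint,
      hostname text,
      ip_addr  inet
  );

with log_id pointing back at the row in the core table, and a fresh schema
created for each month.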


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Need help with one query

2009-03-20 Thread Richard Huxton
Anne Rosset wrote:
 EXPLAIN ANALYZE
 SELECT
   audit_change.id AS id,
   audit_change.audit_entry_id AS auditEntryId,
   audit_entry.object_id   AS objectId,
   audit_change.property_name  AS propertyName,
   audit_change.property_type  AS propertyType,
   audit_change.old_value  AS oldValue,
   audit_change.new_value  AS newValue,
   audit_change.flexfield  AS flexField
 FROM
   audit_entry audit_entry, audit_change audit_change
 WHERE
   audit_change.audit_entry_id = audit_entry.id
   AND audit_entry.object_id = 'artf414029';
[query reformatted to make it more readable]

Not quite clear why you are aliasing the tables to their own names...

 ------------------------------------------------------------------------------
 Hash Join  (cost=8.79..253664.55 rows=4 width=136)
            (actual time=4612.674..6683.158 rows=4 loops=1)
   Hash Cond: ((audit_change.audit_entry_id)::text = (audit_entry.id)::text)
   ->  Seq Scan on audit_change  (cost=0.00..225212.52 rows=7584852 width=123)
                                 (actual time=0.009..2838.216 rows=7584852 loops=1)
   ->  Hash  (cost=8.75..8.75 rows=3 width=45) (actual time=0.049..0.049 rows=4 loops=1)
         ->  Index Scan using audit_entry_object on audit_entry
               (cost=0.00..8.75 rows=3 width=45) (actual time=0.033..0.042 rows=4 loops=1)
               Index Cond: ((object_id)::text = 'artf414029'::text)
 Total runtime: 6683.220 ms

Very odd. It knows the table is large and that the seq-scan is going to
be expensive.

Try issuing set enable_seqscan = off and run the explain analyse
again. That should show the cost of using the indexes.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] LIMIT confuses the planner

2009-02-23 Thread Richard Huxton
Kouber Saparev wrote:

 db=# EXPLAIN ANALYZE
 SELECT
   *
 FROM
   login_attempt
 WHERE
   username='kouber'
 ORDER BY
   login_attempt_sid DESC;
 
 QUERY PLAN
 --
 
  Sort  (cost=1415.15..1434.93 rows=7914 width=38) (actual
 time=0.103..0.104 rows=2 loops=1)
Sort Key: login_attempt_sid
Sort Method:  quicksort  Memory: 25kB
   ->  Index Scan using login_attempt_username_idx on login_attempt
 (cost=0.00..902.71 rows=7914 width=38) (actual time=0.090..0.091 rows=2
 loops=1)
  Index Cond: ((username)::text = 'kouber'::text)
  Total runtime: 0.140 ms

It's expecting 7914 rows returned and is getting only 2. That is
probably the root of the problem.

 However when I add a LIMIT clause to the same query the planner no
 longer uses the right index, hence the query becomes very slow:
 
 
 db=# EXPLAIN ANALYZE
 SELECT
   *
 FROM
   login_attempt
 WHERE
   username='kouber'
 ORDER BY
   login_attempt_sid DESC LIMIT 20;

Since it's expecting 7914 rows for kouber it thinks it will find the
20 rows you want fairly quickly by just looking backward through the
login_attempt_pkey index.

Try increasing the stats on the username column.

ALTER TABLE login_attempt ALTER COLUMN username SET STATISTICS 100;
ANALYZE login_attempt;

You can try different values of statistics up to 1000, but there's no
point in setting it too high.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] col1 ILIKE 'foo%' not behaving the same as lower(col1) LIKE 'foo%'

2009-02-12 Thread Richard Huxton
milos d wrote:
 Hello,
 
 I have a table 'foo_bar' with a column 'col1' defined as
 'col1 varchar(512)'. This column is indexed using an expression index
 defined as
 
 CREATE INDEX ix_foo_bar_by_col1 ON foo_bar(lower(col1) col1 
 varchar_pattern_ops)
 
 The
 problem is when I try matching using ILIKE, (col1 ILIKE 'foo%') 
 PostgreSQL does not use an index scan but a Seq scan of the whole
 table, but when I try (lower(col1) LIKE 'foo%')
 PostgreSQL uses an index scan.

Why should it use the index? They're not even equivalent queries:

SELECT ... WHERE lower(col1) LIKE 'FOO%'

SELECT ... WHERE col1 ILIKE 'FOO%'

One is guaranteed to return no rows, the other not.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Can't locate Test/Parser/Dbt2.pm in DBT2 tests

2009-02-06 Thread Richard Huxton
Rohan Pethkar wrote:
 Hi All,
 
 I am conductingDBT2 tests on PostgreSQL. After completing the test while
 analyzing and creating the results I am getting following error:
 
 ./dbt2-run-workload: line 514:   731 Terminated  dbt2-client
 ${CLIENT_COMMAND_ARGS} -p ${PORT} -o ${CDIR} >
 ${CLIENT_OUTPUT_DIR}/`hostname`/client-${SEG}.out 2>&1
 waiting for server to shut down done
 server stopped
 Can't locate Test/Parser/Dbt2.pm in @INC (@INC contains:
 /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
 /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi
 /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
 /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi
 /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl/5.8.7
 /usr/lib/perl5/site_perl/5.8.6 /usr/lib/perl5/site_perl/5.8.5
 /usr/lib/perl5/site_perl
 /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
 /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi
 /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
 /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi
 /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.7
 /usr/lib/perl5/vendor_perl/5.8.6 /usr/lib/perl5/vendor_perl/5.8.5
 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi
 /usr/lib/perl5/5.8.8 .) at
 /home/rohan/NEW_DBT2/Installer/DBT2_SETUP/bin/dbt2-post-process line 13.

Well, if Test::Parser::Dbt2 isn't somewhere in that list of
directories, you'll need to tell perl where to look. Simplest is
probably just to:
  export PERL5LIB=/path/to/extra/libs
before running your tests.

 Can't exec gnuplot: No such file or directory at
 /home/rohan/NEW_DBT2/Installer/DBT2_SETUP/bin/dbt2-pgsql-analyze-stats line
 113.

It also looks like you're missing gnuplot for your charts.

 I am not sure why it doesn't find Test/Parser/Dbt2.pm even if I have
 installed DBT2 completely. Did I miss any steps? Do I need to install some
 extra packages? If any then please let me know.

You can always perldoc perlrun for more info (google it if you don't
have docs installed locally).

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PG performance in high volume environment (many INSERTs and lots of aggregation reporting)

2009-01-28 Thread Richard Huxton
Phoenix Kiula wrote:
 [Posted similar note to PG General but I suppose it's more appropriate
 in this list. Apologies for cross-posting.]
 
 Hi. Further to my bafflement with the count(*) queries as described
 in this thread:
 
 http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
 
 It seems that whenever this question has come up, Postgresql comes up
 very short in terms of count(*) functions.

Sorry - I'm confused. That thread doesn't seem to contain a slow
count(*) query. You seem to be saying you're having problems with the
query taking 10-15 seconds, but the example takes less than half a
second. How have you identified the count() as being the problem here?

 The performance is always slow, because of the planner's need to guess
 and such. I don't fully understand how the statistics work (and the
explanation on the PG website is way too geeky) but the columns I work
 with already have a stat level of 100. Not helping at all.

But your own email says it's slow sometimes:
  My queries are fast in general *except* the first time
I'm not sure how the planner comes into this.

 We are now considering a web based logging functionality for users of
 our website. This means the table could be heavily INSERTed into. We
 get about 10 million hits a day, and I'm guessing that we will have to
 keep this data around for a while.
 
 My question: with that kind of volume and the underlying aggregation
 functions (by product id, dates, possibly IP addresses or at least
 countries of origin..) will PG ever be a good choice?

A good choice compared to what?

 Or should I be
 looking at some other kind of tools? I wonder if OLAP tools would be
 overkill for something that needs to look like a barebones version of
 google analytics limited to our site..

Typically you'd summarise the data by hour/day via triggers / a
scheduled script if you weren't going towards a pre-packaged OLAP
toolkit. Otherwise you're going to have to scan the hundreds of millions
of rows you've accumulated.
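
For example, a nightly job could run something like this (table and column
names invented):

  INSERT INTO hits_daily (day, product_id, country, hits)
  SELECT date_trunc('day', hit_time), product_id, country, count(*)
  FROM hits_raw
  WHERE hit_time >= current_date - 1 AND hit_time < current_date
  GROUP BY 1, 2, 3;

Reports then read the small summary table instead of the raw hits.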

 Appreciate any thoughts. If possible I would prefer to tone down any
 requests for MySQL and such!

I'm not sure MySQL is going to help you here - if you were running lots
of small, simple queries it might make sense. If you want to aggregate
data by varying criteria I don't think there is any sensible
optimisation (other than pre-calculating summaries).

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Sort causes system to freeze

2008-12-02 Thread Richard Huxton
Don't reply to another message when starting a new thread. People will
miss your message.

Craig James wrote:
 Maybe this is an obviously dumb thing to do, but it looked reasonable to
 me.

Looks reasonable here too - except I'm not sure what I'd do with 2
million rows of sorted table in my console. I'm guessing you're piping
the output into something.

  The problem is, the seemingly simple sort below causes a fairly
 powerful computer to completely freeze for 5-10 minutes.  During the
 sort, you can't login, you can't use any shell sessions you already have
 open, the Apache server barely works, and even if you do nice -20 top
 before you start the sort, the top(1) command comes to a halt while the
 sort is proceeding!  As nearly as I can tell, the sort operation is
 causing a swap storm of some sort -- nothing else in my many years of
 UNIX/Linux experience can cause a nice -20 process to freeze.

Nothing should cause that to your machine. I've never seen top just
freeze unless you set up some sort of fork-bomb and ramp the load up so
fast it can't cope. Oh, and nice-ing the client isn't going to do
anything to the backend actually doing the sorting.

 The sort operation never finishes -- it's always killed by the system. 
 Once it dies, everything returns to normal.

You're running out of memory then. It'll be the out-of-memory killer
(assuming you're on Linux).

 This is 8.3.0.  (Yes, I'll upgrade soon.)

Make soon more urgent than it has been up to now - no point in risking
all your data to some already fixed bug is there? Unless you've been
carefully tracking the release notes and have established that there's
no need in your precise scenario.

 Is this a known bug, or do I
 have to rewrite this query somehow?  Maybe add indexes to all four
 columns being sorted?

Indexes won't necessarily help if you're sorting the whole table. Maybe
if you had one on all four columns.

 = explain select * from plus order by supplier_id, compound_id, units,
 price;

 max_connections = 1000
 shared_buffers = 2000MB
 work_mem = 256MB

So can you support (1000 * 256 * 2) + 2000 MB of RAM?

 effective_cache_size = 4GB

...while leaving 4GB free for disk caching?

 Machine: Dell, 8x64-bit CPUs, 8GB ram, Perc6i battery-backed RAID
 controller, 8 disks as RAID10

It appears not. Remember that work_mem is not only per-connection, a
single query can use multiples of it (hence the *2 above). If you
genuinely have a lot of connections I'd drop it down to (say) 4MB to
make sure you don't swap on a regular basis (should probably be even
lower to be truly safe).

Then, for the odd case when you need a large value, issue a SET work_mem
before the query.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
 
 So pattern index is 10 .. 20 times (!) faster always.
 No idea why.

Because you don't have a normal index on the product_id column? You
can't use xxx_pattern_ops indexes for non-pattern tests.

 Test data creation script:

The only change to the script was the obvious char(nn) = varchar(nn)
and I didn't use TEMP tables (so I could see what I was doing). Then, I
created the standard index on order_products.product_id.
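
That is, something like (the index name is my own invention, the table name
is as it appears in the plans below):

  CREATE INDEX orders_products_product_id_idx ON orders_products (product_id);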

EXPLAIN ANALYSE from my slow dev box are listed below. Database is in
LATIN9 encoding with locale=C.


 QUERY PLAN

 Aggregate  (cost=2993.69..2993.70 rows=1 width=0) (actual time=2.960..2.960 rows=1 loops=1)
   ->  Nested Loop  (cost=10.81..2993.23 rows=182 width=0) (actual time=0.972..2.901 rows=189 loops=1)
         ->  Index Scan using products_pkey on products
               (cost=0.00..8.27 rows=1 width=18) (actual time=0.017..0.019 rows=1 loops=1)
               Index Cond: ((product_id)::text = '3370'::text)
         ->  Nested Loop  (cost=10.81..2983.14 rows=182 width=18) (actual time=0.951..2.785 rows=189 loops=1)
               ->  Bitmap Heap Scan on orders_products
                     (cost=10.81..942.50 rows=251 width=22) (actual time=0.296..0.771 rows=261 loops=1)
                     Recheck Cond: ((product_id)::text = '3370'::text)
                     ->  Bitmap Index Scan on order_product_pattern_eq_idx
                           (cost=0.00..10.75 rows=251 width=0) (actual time=0.230..0.230 rows=261 loops=1)
                           Index Cond: ((product_id)::text = '3370'::text)
               ->  Index Scan using orders_pkey on orders
                     (cost=0.00..8.12 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=261)
                     Index Cond: (orders.order_id = orders_products.order_id)
                     Filter: (orders.order_date  '2006-01-01'::date)
 Total runtime: 3.051 ms
(13 rows)


QUERY PLAN
---
 Aggregate  (cost=25.56..25.57 rows=1 width=0) (actual time=8.244..8.245 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..25.55 rows=1 width=0) (actual time=1.170..8.119 rows=378 loops=1)
         ->  Nested Loop  (cost=0.00..17.17 rows=1 width=4) (actual time=0.043..4.167 rows=522 loops=1)
               ->  Index Scan using order_product_pattern_eq_idx on orders_products
                     (cost=0.00..8.88 rows=1 width=22) (actual time=0.029..1.247 rows=522 loops=1)
                     Index Cond: (((product_id)::text >= '3370'::text) AND ((product_id)::text < '3371'::text))
                     Filter: ((product_id)::text ~~ '3370%'::text)
               ->  Index Scan using products_pkey on products
                     (cost=0.00..8.27 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=522)
                     Index Cond: ((products.product_id)::text = (orders_products.product_id)::text)
         ->  Index Scan using orders_pkey on orders
               (cost=0.00..8.37 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=522)
               Index Cond: (orders.order_id = orders_products.order_id)
               Filter: (orders.order_date  '2006-01-01'::date)
 Total runtime: 8.335 ms
(12 rows)


-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
 Richard and Mario,
 
 You can't use xxx_pattern_ops indexes for non-pattern tests.
 
 I missed regular index. Sorry for that. Now issue with testcase is
 solved. Thank you very much.
 
 I researched issue in live 8.1.4 db a bit more.
 Performed vacuum and whole db reindex.
 Tried several times to run two same pattern queries in quiet db.

And the results were?

 additonal condition

One problem at a time. Let's get the pattern-matching speed problems on
your live server sorted, then we can look at different queries.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
 Richard,
 
 And the results were?
 
 Results are provided in bottom of the message to which you replied.

No - the explains there were contrasting a date test BETWEEN versus =.

 One problem at a time. Let's get the pattern-matching speed problems on
 your live server sorted, then we can look at different queries.
 
 First message in this thread described the issue with query having
 additional condition
 
 AND dok.kuupaev BETWEEN '2008-11-21' AND  '2008-11-21'

Ah, I think I understand. The test case was *missing* this clause.

 It seems that this problem occurs when pattern matching and BETWEEN
 conditions are used in same query.
 
 According to Scott Garey great recommendation I added
 
 ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000;
 
 This fixes testcase in live server, see my other message.
 Is it OK to run
 
 ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000
 
 in prod database or should I try to decrease 1000 to smaller value ?
 rid is big increasing table and is changed frequently, mostly by adding
 rows.

This will try to track the 1000 most-common values of toode, whereas
the default is to try to track the most common 10 values. Tracking more
values means the planner has more accurate information but makes ANALYSE
take longer to run, and also makes planning each query take slightly longer.

Try 100, 200, 500 and see if they work *for a range of queries* -
there's no point in having it much higher than it needs to be.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
 Richard,
 
 Results are provided in bottom of the message to which you replied.

 No - the explains there were contrasting a date test BETWEEN versus =.
 
 I changed rid.toode statitics target to 100:
 
 ALTER TABLE firma2.rid ALTER COLUMN toode SET STATISTICS 100;
 analyze firma2.rid;
 
 Analyze takes 3 seconds and testcase rans fast.
 I'm planning to monitor results by looking log file for queries which
 take longer than 10 seconds.

Sensible. I don't know if 10 seconds is the right value for your
database, but there will be a point that filters out most of your
traffic but still gives enough to find problems.

 Do you still need results ?
 If yes, which query and how many times should I run?

If changing the statistics seems to help, you're not going to want to go
back just to repeat tests.

 Ah, I think I understand. The test case was *missing* this clause.
 
 I added  this clause to testcase. Also added char(70) colums containing
 padding characters to all three tables. Cannot still reproduce this issue
 in testcase in fast devel 8.3 notebook.
 In testcase order_products contains product_id values in a very regular
 order, maybe this affects the results. No idea how to use random() to
 generate random
 products for every order.

Ideally you don't even want random products. You want a distribution of
products that matches the same shape as you have in your production
database.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Increasing pattern index query speed

2008-11-24 Thread Richard Huxton
Andrus wrote:
 Both queries return same result (19) and return same data.
 Pattern query is a much slower (93 sec) than  equality check (13 sec).
 How to fix this ?
 Using 8.1.4, utf-8 encoding, et-EE locale

They're different queries. The fact that they return the same results is
a coincidence.

This

   WHERE rid.toode = '9910' 

Is a different condition to this

   WHERE rid.toode like '9910%'

You aren't going to get the same plans.

Anyway, I think the problem is in the dok JOIN rid bit look:

 Aggregate  (cost=43.09..43.10 rows=1 width=0) (actual time=12674.675..12674.679 rows=1 loops=1)
   ->  Nested Loop  (cost=29.57..43.08 rows=1 width=0) (actual time=2002.045..12673.645 rows=19 loops=1)
         ->  Nested Loop  (cost=29.57..37.06 rows=1 width=24) (actual time=2001.922..12672.344 rows=19 loops=1)

 Aggregate  (cost=15.52..15.53 rows=1 width=0) (actual time=92966.501..92966.505 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..15.52 rows=1 width=0) (actual time=24082.032..92966.366 rows=19 loops=1)
         ->  Nested Loop  (cost=0.00..9.50 rows=1 width=24) (actual time=24081.919..92965.116 rows=19 loops=1)

These are the same but the times are different. I'd be very surprised if
you can reproduce these times reliably.


Can I give you some wider-ranging suggestions Andrus?
1. Fix the vacuuming issue in your hash-join question.
2. Monitor the system to make sure you know if/when disk activity is high.
3. *Then* start to profile individual queries and look into their plans.
Change the queries one at a time and monitor again.

Otherwise, it's very difficult to figure out whether changes you make
are effective.

HTH
-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Hash join on int takes 8..114 seconds

2008-11-21 Thread Richard Huxton
Andrus wrote:
 - what's the size of the dataset relative to the RAM ?
 
 Db size is 7417 MB
 relevant table sizes in desc by size order:
 
  140595 dok 2345 MB


  2 1214 pg_shdepend 2259 MB
  6 1232 pg_shdepend_depender_index  795 MB
  7 1233 pg_shdepend_reference_index 438 MB

These three are highly suspicious. They track dependencies between
system objects (so you can't drop function F because trigger T depends on
it).

http://www.postgresql.org/docs/8.3/static/catalog-pg-shdepend.html

You've got 3.5GB of data there, which is a *lot* of dependencies.

Try SELECT count(*) FROM pg_shdepend.

If it's not a million rows, then the table is bloated. Try (as postgres
or some other db superuser) vacuum full pg_shdepend and a reindex
pg_shdepend.
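
That is, roughly:

  VACUUM FULL pg_shdepend;
  REINDEX TABLE pg_shdepend;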

If it is a million rows, you'll need to find out why. Do you have a lot
of temporary tables that aren't being dropped or something similar?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Hash join on int takes 8..114 seconds

2008-11-20 Thread Richard Huxton
Andrus wrote:
 Query below seems to use indexes everywhere in most optimal way.
 dokumnr column is of type int
 
 Speed of this query varies rapidly:
 
 In live db fastest response I have got is 8 seconds.
 Re-running same query  after 10 seconds may take 60 seconds.
 Re-running it again after 10 seconds may take 114 seconds.
 
 Any idea how to speed it up ?
 
 Is it possible to optimize it, will upgrading to 8.3.5 help or should I
 require to add more RAM, disk or CPU speed ?

At a quick glance, the plans look the same to me. The overall costs are
certainly identical. That means whatever is affecting the query times, it
isn't the query plan.

 Aggregate  (cost=234278.53..234278.54 rows=1 width=0) (actual
 time=62164.496..62164.500 rows=1 loops=1)
 Total runtime: 62164.789 ms

 Aggregate  (cost=234278.53..234278.54 rows=1 width=0) (actual
 time=40185.499..40185.503 rows=1 loops=1)
 Total runtime: 40186.102 ms

 Aggregate  (cost=234278.53..234278.54 rows=1 width=0) (actual
 time=29650.398..29650.402 rows=1 loops=1)
 Total runtime: 29650.696 ms

 Aggregate  (cost=234278.53..234278.54 rows=1 width=0) (actual
 time=11131.392..11131.396 rows=1 loops=1)
 Total runtime: 11131.694 ms

So - what other activity is happening on this machine? Either other
queries are taking up noticeable resources, or some other process is (it
might be disk activity from checkpointing, logging or some other application).

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Hash join on int takes 8..114 seconds

2008-11-20 Thread Richard Huxton
Andrus wrote:
 Richard,
 
 At a quick glance, the plans look the same to me. The overall costs are
 certainly identical. That means whatever is affecting the query times it
 isn't the query plan.

 So - what other activity is happening on this machine? Either other
 queries are taking up noticeable resources, or some other process is (it
 might be disk activity from checkpointing, logging some other
 application).
 
 Thank you.
 This is dedicated server running only PostgreSql which serves approx 6
 point of sales at this time.
 
 Maybe those other clients make queries which invalidate lot of data
 loaded into server cache.
 In next time server must read it again from disk which causes those
 perfomance differences.

In addition to top below, you'll probably find vmstat 5 useful.

 top output is currently:
 
 top - 13:13:10 up 22 days, 18:25,  1 user,  load average: 0.19, 0.12, 0.19
 Tasks:  53 total,   2 running,  51 sleeping,   0 stopped,   0 zombie
 Cpu(s): 13.7% us,  2.0% sy,  0.0% ni, 78.3% id,  6.0% wa,  0.0% hi, 
 0.0% si
 Mem:   2075828k total,  2022808k used,53020k free,0k buffers
 Swap:  3911816k total,   88k used,  3911728k free,  1908536k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 5382 postgres  15   0  144m  43m  40m S 15.0  2.2   0:00.45 postmaster
 5358 postgres  15   0  152m  87m  75m S  0.3  4.3   0:00.97 postmaster
1 root  16   0  1480  508  444 S  0.0  0.0   0:01.35 init

Looks pretty quiet.

 in few seconds later:
 
 top - 13:14:01 up 22 days, 18:26,  1 user,  load average: 1.72, 0.53, 0.32
 Tasks:  52 total,   2 running,  50 sleeping,   0 stopped,   0 zombie
 Cpu(s):  5.3% us,  3.0% sy,  0.0% ni,  0.0% id, 91.0% wa,  0.0% hi, 
 0.7% si
 Mem:   2075828k total,  2022692k used,53136k free,0k buffers
 Swap:  3911816k total,   88k used,  3911728k free,  1905028k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 1179 postgres  18   0  155m 136m 122m D  6.7  6.7   1:32.52 postmaster
 4748 postgres  15   0  145m 126m 122m D  1.3  6.2   0:14.38 postmaster
 5358 postgres  16   0  160m  98m  81m D  0.7  4.9   0:01.21 postmaster
1 root  16   0  1480  508  444 S  0.0  0.0   0:01.35 init

Here you're stuck waiting for disks (91.0% wa). Check out vmstat and
iostat to see what's happening.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Difference in query plan

2008-11-14 Thread Richard Huxton
Patrice Beliveau wrote:
 I have a database in a production server (8.1.9) with two schemas
 containing the same tables, same indexes, same everything, but with
 different data. When I execute a query in one schema, it takes much more
 time to execute than in the other schema.
[snip]
 I'm wondering where to start searching to fix this problem

 Production server schema 1 query plan:
 Nested Loop  (cost=569.23..634.43 rows=1 width=121) (actual
 time=1032.811..1032.811 rows=0 loops=1)
[snip]
 Total runtime: 1034.204 ms

 Production server schema 2 query plan:
 Nested Loop  (cost=133.42..793.12 rows=1 width=123) (actual
 time=0.130..0.130 rows=0 loops=1)
[snip]
 Total runtime: 0.305 ms

Well, there's something strange - the estimated costs are fairly similar
(634.43 vs 793.12) but the times are clearly very different (1034ms vs 0.3ms)

The suspicious line from the first plan is:
   ->  Seq Scan on mrp m  (cost=0.00..119.92 rows=5892 width=39)
                          (actual time=0.343..939.462 rows=5892 loops=1)

This is taking up almost all the time in the query and yet only seems to
be scanning 5892 rows.

Run a vacuum verbose against table mrp and see if it's got a lot of
dead rows. If it has, run VACUUM FULL and REINDEX against it and see if
that solves your problem.
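
That is:

  VACUUM VERBOSE mrp;
  -- and if that reports a lot of removable/dead rows:
  VACUUM FULL mrp;
  REINDEX TABLE mrp;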

I'm guessing you have / had a long-running transaction interfering with
vacuum on this table, or perhaps a bulk update/delete?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Difference in query plan

2008-11-14 Thread Richard Huxton
Patrice Beliveau wrote:
 Thanks,
 
 I'm already doing a vacuum full every night on all database, but the
 REINDEX fix it and now it's working fine

Are you sure it was the REINDEX? The plan was using a sequential scan.

 But this raise a question
 
 1) This table is cleared every night and recomputed, does this mean that
 I should REINDEX every night also

Looks like you should. Or drop the indexes, load the data, re-create the
indexes, that can be quicker.
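
Roughly (the index definition here is invented - use your real ones):

  DROP INDEX mrp_some_idx;
  -- ... nightly reload of mrp here ...
  CREATE INDEX mrp_some_idx ON mrp (some_column);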

 2) Why this thing didn't happen in the other schema

Have you re-loaded schema1 more often? It might even be the particular
order that rows are loaded - a btree can become unbalanced sometimes.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] slow full table update

2008-11-12 Thread Richard Huxton
[EMAIL PROTECTED] wrote:
 Hi,
 
 I've changed settings, 
 but with no effect on speed.
 
 I try explain query with this result
 for 10.000 rows  update songs set views = 0 where sid  2 and sid  3
 
 Bitmap Heap Scan on songs  (cost=151.59..6814.29 rows=8931 width=526) (actual 
 time=4.848..167.855 rows=8945 loops=1)

This query says it is taking 167 milliseconds, not 10 minutes as your
first message said. Is this query actually slow?

 
   Recheck Cond: ((sid  2) AND (sid  3))
 
   ->  Bitmap Index Scan on pk_songs2  (cost=0.00..151.59 rows=8931 width=0)
         (actual time=4.071..4.071 rows=9579 loops=1)
 
 Index Cond: ((sid  2) AND (sid  3))
 
 Is there a way to run this query in a single pass with no Recheck Cond?

Only a sequential scan.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Oddity with view

2008-11-10 Thread Richard Huxton
Jim 'Decibel!' Nasby wrote:
 On Nov 10, 2008, at 7:06 AM, Tom Lane wrote:
 Jim 'Decibel!' Nasby [EMAIL PROTECTED] writes:
 loan_tasks effectively does SELECT * FROM loan_tasks_committed UNION
 ALL SELECT * FROM loan_tasks_pending;.

 You seem to have neglected to mention a join or two.
 
 
 Yeah, though I did show them at the end of the message...
 
  SELECT true AS committed, loan_tasks_committed.id, ...,
 loan_tasks_committed.task_amount
FROM loan_tasks_committed
 UNION ALL
  SELECT false AS committed, ltp.id, ..., NULL::unknown AS task_amount
FROM loan_tasks_pending ltp
JOIN loan_task_codes ltc ON ltp.loan_task_code_id = ltc.id;
 
 Thing is, there's no data to be had on that side. All of the time is
 going into the seqscan of loan_tasks_committed. But here's what's really
 disturbing...

   ->  Seq Scan on loan_tasks_committed  (cost=0.00..929345.35 rows=26112135 width=0)
         (actual time=0.012..5116.776 rows=26115689 loops=1)

   ->  Seq Scan on loan_tasks_committed  (cost=0.00..929345.35 rows=26112135 width=162)
         (actual time=0.014..22531.902 rows=26115689 loops=1)

It's the width - the view is fetching all the rows. Is the "true AS
committed" bit confusing it?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] server space increasing very fast but transaction are very low

2008-11-06 Thread Richard Huxton
1. Don't email people directly to start a new thread (unless you have a
support contract with them of course).

2. Not much point in sending to two mailing lists simultaneously. You'll
just split your responses.


brahma tiwari wrote:
 Hi all
 
 My database server db01 is on linux environment and size of base
 folder increasing very fast unexpectedly(creating renamed files of 1
 GB in base folder like 1667234568.10) details as below

These are the files containing your tables / indexes.
When a file gets larger than 1GB the file gets split and you get .1, .2 etc.
on the end.

 what is significance of these files and how can i avoid it.can i
 delete these renamed files from base folder or any thing else. Please
 help

NEVER delete any files in .../data/base.

Since these files all seem to have the same number they are all the same
object (table or index). You can see which by looking in pg_class.
You'll want to use the number 1662209326 of course.

= SELECT relname,relpages,reltuples,relfilenode FROM pg_class WHERE
relfilenode=2336591;
 relname | relpages | reltuples | relfilenode
-+--+---+-
 outputs |3 |   220 | 2336591
(1 row)

This is the table outputs on mine which occupies 3 pages on disk and has
about 220 rows. You can find out the reverse (size of any table by name)
with some useful functions:
  select pg_size_pretty(pg_total_relation_size('my_table_name'));

I'm guessing what you've got is a table that's not being vacuumed
because you've had a transaction that's been open for weeks.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL OR performance

2008-11-06 Thread Richard Huxton
Віталій Тимчишин wrote:
 As you can see from other plans, it does have all the indexes to perform its
 work fast (when given part by part). It simply does not wish to use them. My
 question: Is this a configuration problem or postgresql optimizer simply
 can't do such a query rewrite?

I must admit, I haven't managed to figure out what your query is trying
to do, but then that's a common problem with autogenerated queries.

The main question that needs answering is why the planner thinks you're
going to get 1.3 billion rows in the OR query:

Nested Loop  (cost=4588.13..960900482668.95 rows=1386158171 width=32)

You don't show explain analyse for this query, so there's no way of
knowing how many rows get returned but presumably you're expecting
around 88000. What does explain analyse return?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 7.4 - basic tuning question

2008-10-03 Thread Richard Huxton
Simon Waters wrote:

With performance questions and 7.4, the best advice is to upgrade at your
earliest convenience - you're missing a *lot* of improvements.
You say you're planning to anyway, and I'd recommend putting effort into
the upgrade rather than waste effort on tuning a system you're leaving.

 I assume that the histogram_bounds for strings are alphabetical in order, so 
 that DEMOSTART falls between DELETE and IDEMAIL. Even on a worst case 
 of including both these common values, the planner ought to have assumed that 
 less than 10% of records were likely covered by the value selected, so it 
 seems unlikely to me that not using the index would be a good idea.

Well, the real question is how many blocks need to be read to find those
DEMOSTART rows. At some point around 5-10% of the table it's easier just
to read the whole table than go back and forth between index and table.
The precise point will depend on how much RAM you have, disk speeds etc.

 = SELECT COUNT(*) FROM log WHERE event='DEMOSTART';
 (...lots of time passes...)
  count
 ---
   1432
 (1 row)

OK, not many. The crucial bit is below though. These are the 10 values
it will hold stats on, and all it knows is that DEMOSTART has less than
57000 entries. OK, it's more complicated than that, but basically there
are the values it tracks and then everything else. So - it assumes that all
other values have the same chance of occurring.

 = SELECT COUNT(*), event FROM log GROUP BY event ORDER BY count;
 
   count |   event
 +---
[snip]
   57022 | NEWUSR
   64907 | PUBREC0
   65449 | UNPUBLISH
   92843 | LOGOUT
   99018 | KILLSESS
  128900 | UPLOAD
  134994 | LOGIN
  137608 | NEWPAGE
  447556 | PUBREC1
  489572 | PUBLISH

Which is why it guesses 20436 rows below. If you'd done SET
enable_seqscan = off and then run the explain again, it should have
estimated a cost for the index that was more than 54317.14

 = EXPLAIN SELECT * FROM log WHERE event='DEMOSTART';
  QUERY PLAN
 
  Seq Scan on log  (cost=0.00..54317.14 rows=20436 width=93)
Filter: (event = 'DEMOSTART'::text)
 (2 rows)
 
 
 = ALTER TABLE log ALTER COLUMN events SET STATISTICS 50; ANALYSE
 LOG(event);
 ALTER TABLE
 ANALYZE
 
 
 = EXPLAIN SELECT COUNT(*) FROM log WHERE event='DEMOSTART';
 QUERY PLAN
 
 ---
  Aggregate  (cost=5101.43..5101.43 rows=1 width=0)
   ->  Index Scan using log_event on log  (cost=0.00..5098.15 rows=1310 width=0)
  Index Cond: (event = 'DEMOSTART'::text)
 (3 rows)

Not bad - now it knows how many rows it will find, and it sees that the
index is cheaper. It's not completely accurate - it uses a statistical
sampling (and of course it's out of date as soon as you update the table).

HTH

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bizarre query performance question

2008-10-02 Thread Richard Huxton
H. William Connors II wrote:
 fa_assignment has 44184945 records
 fa_assignment_detail has 82196027 records
 
 explain select * from fa_assignment fa JOIN fa_assignment_detail fad ON
 (fad.assignment_id = fa.assignment_id) where fa.scenario_id = 0;
 
  QUERY
 PLAN 
 ---
 
 Hash Join  (cost=581289.72..4940729.76 rows=9283104 width=91)

Are you really expecting 9 million rows in the result? If so, this is
probably a reasonable plan.

   Hash Cond: (fad.assignment_id = fa.assignment_id)
   ->  Seq Scan on fa_assignment_detail fad  (cost=0.00..1748663.60 rows=82151360 width=61)
   ->  Hash  (cost=484697.74..484697.74 rows=4995439 width=30)
         ->  Bitmap Heap Scan on fa_assignment fa  (cost=93483.75..484697.74 rows=4995439 width=30)
               Recheck Cond: (scenario_id = 0)
               ->  Bitmap Index Scan on fa_assignment_idx2  (cost=0.00..92234.89 rows=4995439 width=0)
                     Index Cond: (scenario_id = 0)

It's restricting on scenario_id, building a bitmap to identify which
disk-blocks will contain one or more matching rows and then scanning
those. If those 5 million scenario_id=0 rows are spread over 10% of the
blocks then that's a good idea.

If it was expecting only a handful of rows with scenario_id=0 then I'd
expect it to switch to a standard index scan.

If your work_mem is small try something like:
  set work_mem = '50MB';
before running the query - maybe even larger.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] dedicated server postgresql 8.1 conf tunning

2008-10-02 Thread Richard Huxton
[EMAIL PROTECTED] wrote:
 Thanks,
 
 Unfortunately, i can't update pgsql to 8.3 since it's not in debian stable.

That's why backports.org was invented :-)
Or does "can't" mean "not allowed to"?

 So i'm going to play with work_mem  shared_buffers.
 
 With big shared_buffers pgsql tells me 
 shmget(cle=5432001, taille=11183431680, 03600).
 so i do echo 13183431680 > /proc/sys/kernel/shmmax ( 10Go + 2Go just
 in case)
 
 but pgsql tells me again that it there's not enought shm..
 How can i compute the go shmmax for my server ?

I'm not seeing anything terribly wrong there. Are you hitting a limit
with shmall?

Oh - and I'm not sure there's much point in having more shared-buffers
than you have data.

Try much larger work_mem first, I think that's the biggest gain for you.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] dedicated server postgresql 8.1 conf tunning

2008-10-02 Thread Richard Huxton
Tommy Gildseth wrote:
 Richard Huxton wrote:
 [EMAIL PROTECTED] wrote:
 Thanks,

 Unfornatly, i can't update pgsql to 8.3 since it's not in debian stable.

 That's why backports.org was invented :-)
 Or does can't mean not allowed to?
 
 Well, running production servers from backports can be a risky
 proposition too, and can land you in situations like the one discussed
 in Debian packages for Postgres 8.2 from the General list.

Well, there's a reason why stable is a popular choice for production
servers. I must admit that I build from source for my PostgreSQL
packages (because I care which version I run). I was reading one of the
Perl fellows recommending the same.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] dedicated server postgresql 8.1 conf tunning

2008-10-01 Thread Richard Huxton
[EMAIL PROTECTED] wrote:
 Hello
 
 I'm running pgsql 8.1.11 (from debian stable) on a server with 16GB RAM
 (Linux helios 2.6.18-6-amd64 #1 SMP Tue Aug 19 04:30:56 UTC 2008 x86_64
 GNU/Linux).

Unless you're committed to this version, I'd seriously look into 8.3
from backports (or compiled yourself). I'd expect some serious
performance improvements for the workload you describe.

 I have a table tickets with 1 000 000 insert by month ( ~2600 each 2hours
 ) (for the moment 1300 rows for 5GB )
 and i have to extract statistics ( number of calls, number of calls less
 than X seconds, number of news calles, number of calls from the new
 callers, ...)

OK, so not a lot of updates, but big aggregation queries. You might want
to pre-summarise older data as the system gets larger.

 1°) The server will handle max 15 queries at a time.
 So this is my postgresql.conf
 
 max_connections = 15

Well, I'd allow 20 - just in case.

 shared_buffers =  995600 # ~1Go
 temp_buffers = 1000
 work_mem = 512000 # ~512Ko

I'd be tempted to increase work_mem by a lot, possibly even at the
expense of shared_buffers. You're going to be summarising large amounts
of data so the larger the better, particularly as your database is
currently smaller than RAM. Start with 5MB then try 10MB, 20MB and see
what difference it makes.

 maintenance_work_mem = 1048576 # 1Mo
 
 max_fsm_pages = 41522880 # ~40Mo
 max_fsm_relations = 8000 

See what a vacuum full verbose says for how much free space you need to
track.

 checkpoint_segments = 10
 checkpoint_timeout = 3600

With your low rate of updates shouldn't matter.

 effective_cache_size = 13958643712 # 13Go

Assuming that's based on what top or free say, that's fine. Don't
forget it will need to be reduced if you increase work_mem or
shared_buffers.

 stats_start_collector = on
 stats_command_string = on
 stats_block_level = on
 stats_row_level = on
 autovacuum = off

Make sure you're vacuuming if autovacuum is off.

 How can i optimize the configuration?

Looks reasonable, so far as you can tell from an email. Try playing with
work_mem though.

 2°) My queries look like
 SELECT tday AS n,
 COUNT(DISTINCT(a.appelant)) AS new_callers,
 COUNT(a.appelant) AS new_calls
 FROM cirpacks.tickets AS a
 WHERE LENGTH(a.appelant)  4
 AND a.service_id IN ( 95, 224, 35, 18 )
 AND a.exploitant_id = 66
 AND a.tyear = 2008
 AND a.tmonth = 08

Index on (tyear,tmonth) might pay off, or one on exploitant_id perhaps.

 AND EXISTS ( SELECT 1 FROM cirpacks.clients AS b WHERE b.appelant =
 a.appelant AND b.service_id IN ( 95, 224, 35, 18 ) AND b.heberge_id = 66
 HAVING to_char(MIN(b.premier_appel), 'MMDD') = to_char(a.date,
 'MMDD') )

It looks like you're comparing two dates by converting them to text.
That's probably not the most efficient way of doing it. Might not be an
issue here.
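For example (an untested sketch - it assumes premier_appel is a date or timestamp and a.date is a date), comparing the values as dates directly instead of as text:

  AND EXISTS ( SELECT 1 FROM cirpacks.clients AS b
               WHERE b.appelant = a.appelant
                 AND b.service_id IN ( 95, 224, 35, 18 )
                 AND b.heberge_id = 66
               HAVING MIN(b.premier_appel)::date = a.date::date )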

 GROUP BY n
 ORDER BY n;
 
 or select ... SUM( CASE WHEN condition THEN value ELSE 0) ... FROM
 cirpacks.tickets WHERE tyear = ... and tmonth = ... and tday = ... AND
 audiotel IN ( '...', '...' );
 or select ... SUM( CASE WHEN condition THEN value ELSE 0) ... FROM
 cirpacks.tickets WHERE '2007-01-01' <= date AND date <= '2008-08-31' AND
 audiotel IN ( '...', '...' );
 
 
 which indexes are the best ?

The only way to find out is to test. You'll want to run EXPLAIN after
adding each index to see what difference it makes. Then you'll want to
see what impact this has on overall workload.
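As a starting point I'd test the two mentioned above, something like this (index names are mine, and I've added tday since some of your queries filter on it):

  CREATE INDEX tickets_period_idx     ON cirpacks.tickets (tyear, tmonth, tday);
  CREATE INDEX tickets_exploitant_idx ON cirpacks.tickets (exploitant_id);

Re-run EXPLAIN ANALYSE on the reporting queries after adding each one and drop whatever doesn't get used - unused indexes just cost you on insert.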

Mostly though, I'd try out 8.3 and see if that buys you a free
performance boost.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Restoration of datas

2008-08-08 Thread Richard Huxton

dforums wrote:
Could you advise me on which restoration method is the faster. To 
upgrade from postgresql 8.1.11 to 8.3.3.


Using the pg_dump from your 8.3 package, dump the database using -Fc to 
get a nicely compressed dump. Then use pg_restore to restore it. If you 
add a --verbose flag then you will be able to track it.


You might want to set fsync=off while doing the restore. This is safe 
since if the machine crashes during restore you just start again. Oh, 
and increase work_mem too - there's only going to be one process.


What will take the most time is the creating of indexes etc.

It will take a long time to do a full restore though - you've got 64GB 
of data and slow disks.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpectedly Long DELETE Wait

2008-08-07 Thread Richard Huxton

Volkan YAZICI wrote:

Hi,

Below command has been running since ~700 minutes in one of our
PostgreSQL servers.

  DELETE FROM mugpsreglog
   WHERE NOT EXISTS (SELECT 1
                       FROM mueventlog
                      WHERE mueventlog.eventlogid = mugpsreglog.eventlogid);

   Seq Scan on mugpsreglog  (cost=0.00..57184031821394.73 rows=6590986 width=6)
     Filter: (NOT (subplan))
     SubPlan
       ->  Seq Scan on mueventlog  (cost=0.00..4338048.00 rows=1 width=0)
             Filter: (eventlogid = $0)


Ouch - look at the estimated cost on that!


And there isn't any constraints (FK/PK), triggers, indexes, etc. on any
of the tables. (We're in the phase of a migration, many DELETE commands
similar to above gets executed to relax constraints will be introduced.)


Well there you go. Add an index on eventlogid to mueventlog - that's the table the subplan is seq-scanning for every row.

Alternatively, if you increased your work_mem that might help. Try SET 
work_mem='64MB' (or even higher) before running the explain and see if 
it tries a materialize. For situations like this where you're doing big 
one-off queries you can afford to increase resource limits.
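In SQL terms, something like this (the index name is mine; note the index goes on mueventlog, since that's the table the subplan keeps scanning):

  CREATE INDEX mueventlog_eventlogid_idx ON mueventlog (eventlogid);

  SET work_mem = '64MB';
  EXPLAIN DELETE FROM mugpsreglog
   WHERE NOT EXISTS (SELECT 1
                       FROM mueventlog
                      WHERE mueventlog.eventlogid = mugpsreglog.eventlogid);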


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Plz Heeeelp! performance settings

2008-08-07 Thread Richard Huxton

dforum wrote:

Tx for your reply.

You mean that RAID use fsync method for keeping data's copy.


No, Merlin means PostgreSQL will issue a sync to force WAL to actual disk.


So you invite me to desactivate fsync to increase the performance ?


He means you might have to if you can't afford new hardware. Is disk 
activity the problem? Have you looked at the output of vmstat to check?


Desactivating fsync. my second disk will not be uptodate, 


No - the RAID stuff is happening in the operating-system.

 so if the
machine crash, I wont be able to get the server working quickly??? 


Not quickly, perhaps not at all.

 But
if I use a second machine to replicate the database, I escape this 
problem isn't it ?


You reduce the chance of a single failure causing disaster.

If I understand right, could you tell me how to do desactivate fsync 
please ?


There's an fsync = on setting in your postgresql.conf, but don't 
change it yet.


 I have a database of 38Go and take 6Go per week.

What do you mean by "take 6Go per week"? You update/delete that much 
data? It's growing by that amount each week?


 I have a lot of update and insert, especially in 8 tables. 2 tables are
 using for temporary storage, so I right something like 15000 request per
 2 minutes and empty it into 10 min.

I'm not sure what "15000 request per 2 minutes and empty it into 10 min" 
means.


Do you have 7500 requests per minute?
Are these updates?
To the temporary storage?
What is this temporary storage - an ordinary table?

 I'm making some update or select on tables including more than 20
 millions of entrance.

Again, I'm not sure what this means.


Oh - *important* - which version of PostgreSQL are you running?
Is an upgrade practical?


Looking at your postgresql.conf settings:

  max_connections = 624

That's an odd number.
Do you usually have that many connections?
What are they doing? They can't all be active, the machine you've got 
wouldn't cope.


  shared_buffers = 25
  work_mem = 9000
  temp_buffers = 500

These three are important. The shared_buffers are workspace shared 
between all backends, and you've allocated about 2GB. You've also set 
work_mem=9MB, which is how much each backend can use for a single sort. 
That means it can use double or triple that in a complex query. If 
you're using temporary tables, then you'll want to make sure the 
temp_buffers setting is correct.


I can't say whether these figures are good or bad without knowing how 
the database is being used.


  effective_cache_size = 625000

That's around 5GB - is that roughly the amount of memory used for 
caching (what does free -m say for buffers/cache)?


  max_prepared_transactions = 200

Do you use a lot of prepared transactions in two-phase commit?
I'm guessing that you don't.

 I'm sure that it could be more optimised. I don't know any thing on
 WAL,
 autovacuum, fsm, bgwriter, kernel process, geqo or planner cost
 settings.

If you run a vacuum verbose it will recommend fsm settings at the end 
of its output. I think you probably need to make your autovacuum more 
aggressive, but that's something you'll be able to tell by monitoring 
your database.


It's quite likely that Merlin's right, and you need better hardware to 
cope with the number of updates you're making - that's something where 
you need fast disks. However, he's just guessing because you've not told 
us enough to tell where the problem really lies.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Plan choice with timestamps

2008-08-07 Thread Richard Huxton

Giorgio Valoti wrote:
Hi, I have a timestamptz field that I want to use with a query, but I 
don’t need the full timestamp resolution, so I’ve created a 
day_trunc(timestamptz) immutable function which I’ll use with the query 
and with a new index:


logs= create index test_idx on blackbox (day_trunc(ts));

However, the query plan doesn’t use the index:


Does it use it ever? e.g. with
  SELECT * FROM blackbox WHERE day_trunc(ts) = '...'
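In other words, check whether the plan changes when the query repeats exactly the indexed expression - an expression index is only considered when the query uses the same expression:

  -- can use test_idx (same expression as the index):
  EXPLAIN SELECT * FROM blackbox WHERE day_trunc(ts) = day_trunc(now());

  -- cannot use test_idx (different expression on ts):
  EXPLAIN SELECT * FROM blackbox WHERE ts::date = current_date;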

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Plz Heeeelp! performance settings

2008-08-07 Thread Richard Huxton

dforums wrote:

vmstat is giving :
procs ---memory-- ---swap-- -io --system-- ---cpu---
 r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa
 0  2   1540  47388  41684 757897600   131   2590 1  9  3 82  7


This system is practically idle. Either you're not measuring it at a 
useful time, or there isn't a performance problem.



But
  if I use a second machine to replicate the database, I escape this
  problem isn't it ?
  You reduce the chance of a single failure causing disaster.
Not clear this reply. It's scare me 


If server A fails, you still have server B. If server A fails so that 
replication stops working and you don't notice, server B won't help any 
more.



  What do you mean by take 6Go per week? You update/delete that much
  data? It's growing by that amount each week?
YES


That wasn't a yes/no question. Please choose one of:
Are you updating 6Go per week?
Are you adding 6Go per week?


  I'm not sure what 15000 request per 2 minutes and empty it into 10 min
  means.
I insert 15000 datas every 2 min and delete 15000 every  10 min in those 
tables

 
  Do you have 7500 requests per minute?
should be that, But in fact I'm not treating the datas in real time, and 
I buffer the datas and push the data into the database every 2 min

  Are these updates?
during the delete the data are aggregated in other tables which make 
updates


OK, so every 2 minutes you run one big query that adds 15000 rows.
Every 10 minutes you run one big query that deletes 15000 rows.


  To the temporary storage?

  What is this temporary storage - an ordinary table?
Yes, I thied to use temporary tables but I never been able to connect 
this tables over 2 different session/connection, seems that is a 
functionnality of postgresql, or a misunderstanding from me.


That's correct - temporary tables are private to a backend (connection).


I'm making some update or select on tables including more than 20
millions of entrance.
 
  Again, I'm not sure what this means.

To aggregate the data, I have to check the presence of others 
information that are stores in 2 tables which includes 24 millions of 
entrance.


OK. I assume you're happy with the plans you are getting on these 
queries, since you've not provided any information about them.



  Oh - *important* - which version of PostgreSQL are you running?
8.1.11
  Is an upgrade practical?
We are working of trying to upgrade to 8.3.3, but we are not yet ready 
for such migration


OK


  Looking at your postgresql.conf settings:
 
max_connections = 624
 
  That's an odd number.
Now we could decrease this number, it's not so much usefull for now. we 
could decrease is to 350.


I don't believe you've got 350 active connections either. It will be 
easier to help if you can provide some useful information.



effective_cache_size = 625000
 
  That's around 5GB - is that roughly the amount of memory used for
  caching (what does free -m say for buffers/cache)?
              total   used   free  shared  buffers  cached
 Mem:          7984   7828    156       0       38    7349
 -/+ buffers/cache:    440   7544
 Swap:          509      1    508


Not far off - free is showing 7349MB cached. You're not running 350 
clients there though - you're only using 440MB of RAM.



I don't see anything to show a performance problem from these emails.

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Plz Heeeelp! performance settings

2008-08-07 Thread Richard Huxton

dforums wrote:
The performance problem is really only on the insertion and even more on 
the treatment for the aggregation.


To treat the 3000 entrances and to insert, or update the tables it needs 
10 minutes.


As I told you I inject 14000 query every 2 minutes, and it needs 10 
minutes to treat 3000 of those query.


Sorry - I still don't understand. What is this treatment you are doing?


 
  OK. I assume you're happy with the plans you are getting on these
  queries, since you've not provided any information about them.

The plan seems ok as it use index as well.
here is the plan :

explain analyse SELECT insertUpdateTracks(137,2605, 852, ('2008-08-06 
19:28:54'::text)::date,3,'dailydisplay',2,NULL);

INFO:  method 1
 QUERY PLAN
 

 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=1.151..1.151 
rows=1 loops=1)

 Total runtime: 1.160 ms


There's nothing to do with an index here - this is a function call.


 Has you can see the runtime processs for an update in this table.

multiplying this per 1, it is too long.


So - are you calling this function 14000 times to inject your data? 
You're doing this in one transaction, yes?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Plz Heeeelp! performance settings

2008-08-07 Thread Richard Huxton

dforums wrote:
 The delete is global, the procedure is called for each line/tracks.
   So - are you calling this function 14000 times to inject your data?
   You're doing this in one transaction, yes?
 NO I have to make it 14000 times cause, I use some inserted information
 for other insert to make links between data.

Why does that stop you putting all 14000 calls in one transaction?
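Something along these lines (a sketch, re-using the call from your earlier EXPLAIN) - one COMMIT for the whole batch rather than one per call:

  BEGIN;
  SELECT insertUpdateTracks(137, 2605, 852, ('2008-08-06 19:28:54'::text)::date, 3, 'dailydisplay', 2, NULL);
  -- ... the other calls in the batch ...
  COMMIT;

Later calls still see the rows inserted by earlier ones, so your links between data still work, but you only pay for one WAL flush at the COMMIT instead of 14000.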


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Nls sorting in Postgresql-8.3.3

2008-08-01 Thread Richard Huxton

Praveen wrote:

Hello,
I installed postgresql-8.3.3 in our local server  with option  --enable-nls . 
After successful installion , I create database and import data.But I am not 
aware how to use nls sort in this postgresql-8.3.3 .
Please tell me  syntax how to use nls sort in query , if some one know.


What is nls sort? What do you expect --enable-nls to do? It looks like 
 it's for multi-language message display rather than sorting.


The locale options are already built-in.

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] how to fix problem then when two queries run at the same time, it takes longer to complete then if run in sequence

2008-07-31 Thread Richard Huxton

Miernik wrote:

Might be worth turning off autovacuum and running a manual vacuum full
overnight if your database is mostly reads.


I run autovacum, and the database has a lot of updates all the time,
also TRUNCATING tables and refilling them, usually one or two
INSERTS/UPDATES per second.


OK


Finally, I think it's worth looking at pgpool or pgbouncer (as Alvaro
said) and set them to allow only one connection in the pool. I know
that pgbouncer offers per-transaction connection sharing which will
make this more practical. Even so, it will help if your application
can co-operate by closing the connection as soon as possible.


I just installed pgpool2 and whoaaa! Everything its like about 3 times
faster! My application are bash scripts using psql -c "UPDATE ..."


Probably spending most of their time setting up a new connection, then 
clearing it down again.



I plan to rewrite it in Python, not sure if it would improve
performance, but will at least be a cleaner implementation.


Careful of introducing any more overheads though. If libraries end up 
using another 2.5MB of RAM then that's 10% of your disk-cache gone.



In /etc/pgpool.conf I used:

# number of pre-forked child process
num_init_children = 1

# Number of connection pools allowed for a child process
max_pool = 1


Might need to increase that to 2 or 3.


Wanted to install pgbouncer, but it is broken currently in Debian. And
why is it in contrib and not in main (speaking of Debian location)?


Not well known enough on the Debian side of the fence? It's simple 
enough to install from source though. Takes about one minute.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] how to fix problem then when two queries run at the same time, it takes longer to complete then if run in sequence

2008-07-31 Thread Richard Huxton

Miernik wrote:

Richard Huxton [EMAIL PROTECTED] wrote:

I just installed pgpool2 and whoaaa! Everything its like about 3 times
faster! My application are bash scripts using psql -c "UPDATE ..."

Probably spending most of their time setting up a new connection, then
clearing it down again.


If I do it in Python it could do all queries in the same connection, so
should be faster? Besides that 'psql' is written in perl, so its also
heavy, by not using psql I get rid of perl library in RAM.


Nope - C all through.

 Also the

script uses wget to poll some external data sources a lot, also
needlessly opening new connection to the webserver, so I want to make
the script save the http connection, which means I must get rid of wget.
Maybe I should write some parts in C?

BTW, doesn't there exist any tool does what psql -c does, but is
written in plain C, not perl? I was looking for such psql replacement,
but couldn't find any


Well ECPG lets you embed SQL directly in your C.


# Number of connection pools allowed for a child process
max_pool = 1

Might need to increase that to 2 or 3.


Why? The website says:

max_pool

The maximum number of cached connections in pgpool-II children
processes. pgpool-II reuses the cached connection if an incoming
connection is connecting to the same database by the same username.

But all my connections are to the same database and the same username,
and I only ever want my application to do 1 connection to the database
at a time, so why would I want/need 2 or 3 in max_pool?


From the subject line of your question: "how to fix problem then when 
two queries run at the same time..."


Of course if you don't actually want to run two simultaneous queries, 
then max_pool=1 is what you want.



Not well known enough on the Debian side of the fence? It's simple
enough to install from source though. Takes about one minute.


But is there any advantage for me compared to pgpool2, which works
really nice?


Can't say. Given your limited RAM, it's probably worth looking at both 
and seeing which leaves you more memory. Your main concern has got to be 
to reduce wasted RAM.


 In some parts, like doing some count(*) stuff, it now does

things in about one second, which took a few minutes to finish before (if
the other part of the scripts where doing something else on the database
at the same time).


That will be because you're only running one query, I'd have thought. 
Two queries might be sending you into swap.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] how to fix problem then when two queries run at the same time, it takes longer to complete then if run in sequence

2008-07-31 Thread Richard Huxton

Miernik wrote:

Theo Kramer [EMAIL PROTECTED] wrote:

file `which psql`
/usr/bin/psql: ELF 32-bit LSB executable, Intel 80386, version 1  
(SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9,  
stripped


[EMAIL PROTECTED]:~$ file `which psql`
/usr/bin/psql: symbolic link to `../share/postgresql-common/pg_wrapper'
[EMAIL PROTECTED]:~$ file /usr/share/postgresql-common/pg_wrapper
/usr/share/postgresql-common/pg_wrapper: a /usr/bin/perl -w script text 
executable


That's not psql though, that's Debian's wrapper around it which lets you 
install multiple versions of PostgreSQL on the same machine. Might be 
worth bypassing it and calling it directly.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Database size Vs performance degradation

2008-07-30 Thread Richard Huxton

Dave North wrote:

Morning folks,
Long time listener, first time poster.


Hi Dave


Postgres 8.1.8
shared_buffers = 2000
max_fsm_pages = 40
Redhat Enterprise 4
Running on HP DL380 w/ 4GB RAM, dual 10K HDDs in RAID 0+1
Also running on the server is a tomcat web server and other ancillaries


The value of 2000 seems a bit low for shared_buffers perhaps. Oh, and 
8.1.13 seems to be the latest bugfix for 8.1 too.



Now, the problem.  We have an application that continually writes a
bunch of data to a few tables which is then deleted by a batch job each
night.  We're adding around 80,000 rows to one table per day and
removing around 75,000 that are deemed to be unimportant.

[snip]

We had this problem around a month ago and again yesterday.  Because the
application needs reasonably high availability, we couldn't full vacuum
so what we did was a dump and load to another system.  What I found here
was that after the load, the DB size was around 2.7GB - a decrease of
5GB.  Re-loading this back onto the main system, and the world is good.


Well, that's pretty much the definition of bloat. Are you sure you're 
vacuuming enough? I don't have an 8.1 to hand at the moment, but a 
vacuum verbose in 8.2+ gives some details at the end about how many 
free-space slots need to be tracked. Presumably you're not tracking 
enough of them, or your vacuuming isn't actually taking place.


Check the size of your database every night. It will rise from 2.7GB, 
but it should stay roughly static (apart from whatever data you add of 
course). If you can keep it so that most of the working-set of your 
database fits in RAM speed will stay just fine.
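For example, something you could log every night (these functions are available from 8.1 onwards):

  SELECT pg_size_pretty(pg_database_size(current_database()));

  SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
   WHERE relkind = 'r'
   ORDER BY pg_total_relation_size(oid) DESC
   LIMIT 10;

The second query shows which tables (including their indexes and toast data) are doing the growing.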



Yes, I know we need to upgrade to 8.3 but that's going to take some time
:)


I think you'll like some of the improvements, but it's probably more 
important to get 8.1.13 installed soon-ish.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Database size Vs performance degradation

2008-07-30 Thread Richard Huxton

Dave North wrote:

-Original Message-
From: Richard Huxton [mailto:[EMAIL PROTECTED] 




Well, that's pretty much the definition of bloat. Are you sure you're
vacuuming enough?

DN: Well, the auto-vac is kicking off pretty darn frequently...around
once every 2 minutes.  However, you just made me think of the obvious -
is it actually doing anything?!  The app is pretty darn write intensive
so I wonder if it's actually able to vacuum the tables?


If you've got a big batch delete, it can't hurt to manually vacuum that 
table immediately afterwards.
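Something like this at the end of the nightly batch job - the table name here is made up, use whichever table the job clears out:

  VACUUM VERBOSE ANALYZE batch_data;

The VERBOSE output also tells you how many dead row versions were removed, which is a quick way to check the vacuuming is keeping up.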


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Less rows - better performance?

2008-07-21 Thread Richard Huxton

Andreas Hartmann wrote:

Dear PostgreSQL community,

first some info about our application:

- Online course directory for a University
- Amount of data: complete dump is 27 MB
- Semester is part of primary key in each table
- Data for approx. 10 semesters stored in the DB
- Read-only access from web application (JDBC)

Our client has asked us if the performance of the application could be
improved by moving the data from previous years to a separate archive
application. 


If you had 27GB of data maybe, but you've only got 27MB - that's 
presumably all sitting in memory.


What in particular is slow?

--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Less rows - better performance?

2008-07-21 Thread Richard Huxton

Andreas Hartmann wrote:


Here's some info about the actual amount of data:

SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database where pg_database.datname = 'vvz_live_1';

  datname   |  size
------------+---------
 vvz_live_1 | 2565 MB

I wonder why the actual size is so much bigger than the data-only dump - 
is this because of index data etc.?


I suspect Guillame is right and you've not been vacuuming. That or 
you've got a *LOT* of indexes. If the database is only 27MB dumped, I'd 
just dump/restore it.


Since the database is read-only it might be worth running CLUSTER on the 
 main tables if there's a sensible ordering for them.
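For example, with made-up table/index names (this is the CLUSTER index_name ON table_name form used by 8.x), followed by an ANALYZE since CLUSTER rewrites the table:

  CLUSTER course_semester_idx ON course;
  ANALYZE course;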



What in particular is slow?


There's no particular bottleneck (at least that we're aware of). During 
the first couple of days after the beginning of the semester the 
application request processing tends to slow down due to the high load 
(many students assemble their schedule). The customer upgraded the 
hardware (which already helped a lot), but they asked us to find further 
approaches to performance optimiziation.


1. Cache sensibly at the application (I should have thought there's 
plenty of opportunity here).
2. Make sure you're using a connection pool and have sized it reasonably 
(try 4,8,16 see what loads you can support).
3. Use prepared statements where it makes sense. Not sure how you'll 
manage the interplay between this and connection pooling in JDBC. Not a 
Java man I'm afraid.


If you're happy with the query plans you're looking to reduce overheads 
as much as possible during peak times.


4. Offload more of the processing to clients with some fancy ajax-ed 
interface.
5. Throw in a spare machine as an app server for the first week of term. 
Presumably your load is 100 times average at this time.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Posible planner improvement?

2008-05-21 Thread Richard Huxton

Albert Cervera Areny wrote:

I've got a query similar to this:

select * from t1, t2 where t1.id  158507 and t1.id = t2.id;

That took  84 minutes (the query was a bit longer but this is the part that 
made the difference) after a little change the query took ~1 second:


select * from t1, t2 where t1.id  158507 and t2.id  158507 and t1.id = 
t2.id;


Try posting EXPLAIN ANALYSE SELECT ... for both of those queries and 
we'll see why it's better at the second one.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Posible planner improvement?

2008-05-21 Thread Richard Huxton

Luke Lonergan wrote:

The problem is that the implied join predicate is not being
propagated.  This is definitely a planner deficiency.


IIRC only equality conditions are propagated and gt, lt, between aren't. 
 I seem to remember that the argument given was that the cost of 
checking for the ability to propagate was too high for the frequency 
when it occurred.


Of course, what was true for code and machines of 5 years ago might not 
be so today.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Big O notation for postgres?

2008-05-21 Thread Richard Huxton

Jonah H. Harris wrote:

On Wed, May 21, 2008 at 10:10 AM, H. Hall [EMAIL PROTECTED] wrote:

Does anyone know if there is a source that provides Big O notation for
postgres's aggregate functions and operations?  For example is count(*) =
O(1) or O(n)?


I don't know of any document containing the complexity of each
aggregate, but it's sometimes left as a comment in the souce code.


Recent max() and min() can be O(n) or O(1) depending on the where-clause 
and presence of an index too, just to muddy the waters.
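A hypothetical illustration - assume a table t with an index on col but nothing on othercol:

  -- the planner turns this into a single probe at one end of the index:
  SELECT max(col) FROM t;

  -- no usable index on othercol, so this scans the whole table: O(n)
  SELECT max(othercol) FROM t;

  -- with a WHERE clause the index can't satisfy it may walk much of the
  -- index, or fall back to a scan, depending on selectivity:
  SELECT max(col) FROM t WHERE othercol = 42;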


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Join runs for 10 hours and then fills up 1.3TB of disk space

2008-05-16 Thread Richard Huxton

kevin kempter wrote:

Hi List;

I have a table with 9,961,914 rows in it (see the describe of 
bigtab_stats_fact_tmp14 below)


I also have a table with 7,785 rows in it (see the describe of 
xsegment_dim below)


I'm running the join shown below and it takes  10 hours and eventually 
runs out of disk space on a 1.4TB file system



QUERY PLAN
--- 


Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)


Dumb question Kevin, but are you really expecting 3.2 billion rows in 
the result-set? Because that's approaching 400GB of result-set without 
any overheads.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Join runs for 10 hours and then fills up 1.3TB of disk space

2008-05-16 Thread Richard Huxton

kevin kempter wrote:
I'm expecting 9,961,914 rows returned. Each row in the big table should 
have a corresponding key in the smaller tale, I want to basically 
expand the big table column list by one, via adding the appropriate 
key from the smaller table for each row in the big table. It's not a 
cartesion product join.


Didn't seem likely, to be honest.

What happens if you try the query as a cursor, perhaps with an order-by 
on customer_id or something to encourage index use? Do you ever get a 
first row back?


In fact, what happens if you slap an index over all your join columns on 
xsegment_dim? With 7,000 rows that should make it a cheap test.
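Something like this (untested - join columns taken from your query, select list trimmed to f14.* just for the test):

  CREATE INDEX xsegment_dim_join_idx
      ON xsegment_dim (customer_srcid, show_srcid, season_srcid, episode_srcid);

  BEGIN;
  DECLARE c CURSOR FOR
    SELECT f14.*
      FROM bigtab_stats_fact_tmp14 f14, xsegment_dim segdim
     WHERE f14.customer_id = segdim.customer_srcid
       AND f14.show_id     = segdim.show_srcid
       AND f14.season_id   = segdim.season_srcid
       AND f14.episode_id  = segdim.episode_srcid
       AND segdim.segment_srcid IS NULL
     ORDER BY f14.customer_id;
  FETCH 100 FROM c;
  COMMIT;

If the first FETCH comes back quickly you know the join itself is fine and it's the sheer size of the result that hurts.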


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Join runs for 10 hours and then fills up 1.3TB of disk space

2008-05-16 Thread Richard Huxton

kevin kempter wrote:

Hi List;

I have a table with 9,961,914 rows in it (see the describe of 
bigtab_stats_fact_tmp14 below)


I also have a table with 7,785 rows in it (see the describe of 
xsegment_dim below)


Something else is puzzling me with this - you're joining over four fields.


from
bigtab_stats_fact_tmp14 f14,
xsegment_dim segdim
where
f14.customer_id = segdim.customer_srcid
and f14.show_id = segdim.show_srcid
and f14.season_id = segdim.season_srcid
and f14.episode_id = segdim.episode_srcid
and segdim.segment_srcid is NULL;


--- 


Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)



->  Sort  (cost=1570.35..1579.46 rows=3643 width=40)



->  Sort  (cost=1755323.26..1780227.95 rows=9961874 width=126)


Here it's still expecting 320 matches against each row from the large 
table. That's ~ 10% of the small table (or that fraction of it that PG 
expects) which seems very high for four clauses ANDed together.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] db size

2008-04-17 Thread Richard Huxton

Adrian Moisey wrote:

Hi


Running VACUUM VERBOSE will give you a detailed view of space usage of
each individual table.


I did that.

Not too sure what I'm looking for, can someone tell me what this means:

INFO:  blahxxx: scanned 27 of 27 pages, containing 1272 live rows and 
0 dead rows; 1272 rows in sample, 1272 estimated total rows


This is a small table that takes up 27 pages and it scanned all of them. 
You have 1272 rows in it and none of them are dead (i.e. deleted/updated 
but still taking up space).



INFO:  free space map contains 4667977 pages in 1199 relations
DETAIL:  A total of 4505344 page slots are in use (including overhead).
4505344 page slots are required to track all free space.
Current limits are:  15537488 page slots, 1200 relations, using 91172 kB.


You are tracking ~ 4.6 million pages and have space to track ~ 15.5 
million, so that's fine. You are right up against your limit of 
relations (tables, indexes etc) being tracked though - 1200. You'll 
probably want to increase max_fsm_relations - see manual for details 
(server configuration / free space map).
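To see where you stand (the new values themselves go in postgresql.conf and need a server restart):

  SHOW max_fsm_relations;    -- 1200 at the moment
  SHOW max_fsm_pages;

  -- a rough count of the relations (tables, indexes, toast tables) to be tracked:
  SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 't');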


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] db size

2008-04-17 Thread Richard Huxton

Adrian Moisey wrote:

Hi

INFO:  blahxxx: scanned 27 of 27 pages, containing 1272 live rows 
and 0 dead rows; 1272 rows in sample, 1272 estimated total rows


This is a small table that takes up 27 pages and it scanned all of 
them. You have 1272 rows in it and none of them are dead (i.e. 
deleted/updated but still taking up space).


I had a look through a few other tables...:

INFO:  table1: scanned 22988 of 22988 pages, containing 2713446 live rows and 895662 dead rows; 45000 rows in sample, 2713446 estimated total rows

INFO:  table2: scanned 24600 of 24600 pages, containing 270585 live rows and 65524 dead rows; 45000 rows in sample, 270585 estimated total rows


Is that dead rows an issue?  Should I try clean it out?  Will it improve 
performance ?


What you're hoping to see is that figure remain stable. The point of the 
free-space-map is to track these and allow the space to be re-used. If 
you find that the number of dead rows is increasing then either you are:

1. Just deleting rows
2. Not vacuuming enough - check your autovacuum settings

The effect on performance is that when you read in a page from disk 
you're reading dead rows along with the data you are after. Trying to 
keep 0 dead rows in a constantly updated table isn't worth the effort 
though - you'd end up wasting your disk I/O on maintenance rather than 
queries.


The figures above look high to me - roughly 900,000 out of 2,700,000 and 65,000 out 
of 270,000. Of course, if these tables have just had bulk 
updates/deletes then that's fine. If there's a steady stream of updates 
though, you probably want to up your autovacuum settings.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

