Re: [PERFORM] scale up (postgresql vs mssql)

2012-06-20 Thread Andy Colson

On 6/20/2012 1:01 AM, Eyal Wilde wrote:

Hi, all.

this is an obligation from the past:
http://archives.postgresql.org/pgsql-performance/2012-05/msg00017.php

the same test, that did ~230 results, is now doing ~700 results. that
is, BTW even better than mssql.

the ultimate solution for that problem was to NOT to do ON COMMIT
DELETE ROWS for the temporary tables. instead, we just do DELETE FROM
temp_table1.

doing TRUNCATE temp_table1 is definitely the worst case (~100 results
in the same test). this is something we knew for a long time, which is
why we did ON COMMIT DELETE ROWS, but eventually it turned out to be far
from the best.

another minor issue is that when configuring
  temp_tablespaces='other_tablespace', the sequences of the temporary
tables remain on the 'main_tablespace'.

i hope that will help making postgres even better :)



Did you ever try re-writing some of the temp table usage to use 
subselects/views/CTEs/etc.?
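For reference, a minimal sketch of the pattern Eyal describes above (a session-lifetime temp table cleared with DELETE FROM rather than TRUNCATE or ON COMMIT DELETE ROWS); the column list and the some_source table are hypothetical:

-- created once per session/connection, kept across transactions
CREATE TEMP TABLE temp_table1 (id integer, val text);

-- inside each transaction: clear, repopulate, use
DELETE FROM temp_table1;
INSERT INTO temp_table1 SELECT id, val FROM some_source WHERE id < 10;
SELECT * FROM temp_table1;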


-Andy




Re: [PERFORM] High load average in 64-core server , no I/O wait and CPU is idle

2012-05-24 Thread Andy Colson

On 05/24/2012 12:26 AM, Rajesh Kumar. Mallah wrote:


- Claudio Freire <klaussfre...@gmail.com> wrote:

| From: Claudio Freire <klaussfre...@gmail.com>
| To: Rajesh Kumar. Mallah <mal...@tradeindia.com>
| Cc: pgsql-performance@postgresql.org
| Sent: Thursday, May 24, 2012 9:23:43 AM
| Subject: Re: [PERFORM] High load average in 64-core server , no I/O wait and 
CPU is idle
|
| On Thu, May 24, 2012 at 12:39 AM, Rajesh Kumar. Mallah
| <mal...@tradeindia.com> wrote:
|  The problem is that sometimes there are spikes of load avg which
|  jumps to 50 very rapidly (ie from 0.5 to 50 within 10 secs) and
|  it remains there for sometime and slowly reduces to normal value.
|
|  During such times of high load average we observe that there is no
| IO wait
|  in system and even CPU is 50% idle. In any case the IO Wait always
| remains < 1.0 % and
|  is mostly 0. Hence the load is not due to high I/O wait which was
| generally
|  the case with our previous hardware.
|
| Do you experience decreased query performance?


Yes we do experience substantial application performance degradations.




Maybe you are hitting some locks?   If it's not IO and not CPU then maybe 
something is getting locked and queries are piling up.
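A quick way to check for that is to look for lock requests that have not been granted; a minimal sketch (table names resolve via regclass):

SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
WHERE NOT granted;

If that returns rows during a spike, the blocking sessions can then be tracked down in pg_stat_activity.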

-Andy



Re: [PERFORM] Tuning Postgres 9.1 on Windows

2012-05-01 Thread Andy Colson

On 5/1/2012 8:06 AM, Merlin Moncure wrote:

On Tue, May 1, 2012 at 7:51 AM, Walker, James Les <jawal...@cantor.com> wrote:

Exactly, if turning off fsync gives me 100 commits/sec then I know where my 
bottleneck is and I can attack it. Keep in mind though that I already turned 
off synchronous commit -- *really* dangerous -- and it didn't have any effect.


well synchronous commit is not as dangerous:
fsync off + power failure = corrupt database
synchronous commit off + power failure = some lost transactions

still waiting on the ssd model #.  worst case scenario is that your tps
rate is in fact sync bound and you have a ssd without capacitor backed
buffers (for example, the intel 320 has them); the probable workaround
would be to set the drive cache from write through to write back but
it would be unsafe in that case.  in other words, tps rates in the triple
digits would be physically impossible.

another less likely scenario is you are having network issues
(assuming you are connecting to the database through tcp/ip).  20
years in, microsoft is still figuring out how to properly configure a
network socket.

merlin



Even if it's all local, Windows doesn't have domain sockets (correct?), so 
all that traffic still has to go through some bit of the network stack, yes?


-Andy



Re: [PERFORM] Tuning Postgres 9.1 on Windows

2012-04-30 Thread Andy Colson

On 4/30/2012 8:49 AM, Walker, James Les wrote:

I’m trying to benchmark Postgres vs. several other databases on my
workstation. My workstation is running 64 bit Windows 7. It has 12 gb of
RAM and a W3550 @ 3 Ghz. I installed Postgres 9.1 using the windows
installer. The data directory is on a 6Gb/s SATA SSD.

My application is multithreaded and uses pooled connections via JDBC.
It’s got around 20 threads doing asynchronous transactions against the
database. It’s about 70% read/30% write. Transactions are very small.
There are no long-running transactions. I start with an empty database
and I only run about 5,000 business transactions in my benchmark. That
results in 10,000 – 15,000 commits.

When I first installed Postgres I did no tuning at all and was able to
get around 40 commits per-second which is quite slow. I wanted to
establish a top-end so I turned off synchronous commit and ran the same
test and got the same performance of 40 commits per second. I turned on
the “large system cache” option on Windows 7 and got the same results.
There seems to be some resource issues that’s limiting me to 40 commits
per second but I can’t imagine what it could be or how to detect it.

I’m not necessarily looking for advice on how to increase performance,
but I at least need to know how to find the bottleneck.

-- Les Walker



One thing I'd look at is your hardware and determine if you are CPU 
bound or IO bound.  I use Linux so I don't know how you'd do that on Windows.


Have you checked your sql statements with explain analyze?

I don't know anything about config file settings on Windows, but on 
Linux it's really important.  Google could probably help you there.


Knowing if you are CPU bound or IO bound, and if you have any bad plans, 
will tell you what config file changes to make.
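For the per-statement side of that, a minimal sketch (the table and column names are hypothetical); on 9.0+ the BUFFERS option also shows how much of the work hit shared buffers versus disk:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE created_at > now() - interval '1 day';

Comparing the actual times and buffer counts across your hottest statements is usually enough to tell whether the time goes to I/O, to CPU, or to a bad plan.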


-Andy




Re: [PERFORM] scale up (postgresql vs mssql)

2012-04-18 Thread Andy Colson

On 4/18/2012 2:32 AM, Eyal Wilde wrote:

hi all,

i ran vmstat during the test :

[yb@centos08 ~]$ vmstat 1 15
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free    buff   cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 6118620 160572 1120288    0    0     0 13792 4548 3054 63  6 25  6  0

the temp-tables normally don't populate more than 10 rows. they are
being created in advance. we don't drop them, we use ON COMMIT DELETE
ROWS. i believe temp-tables are in the RAM, so no disk-i/o, right? and
also: no writing to the system catalogs, right?


Temp tables are not 100% ram, they might spill to disk. The vmstat shows 
there is disk io.  The BO column (blocks out) shows you are writing to 
disk.  And you have wait time (which means one or more of the cpus is 
stopped waiting for disk).


I don't know if the disk io is because of the temp tables (I've never 
used them myself), or something else (stats, vacuum, logs, other sql, etc).


I'd bet, though, that a derived table would be faster than create temp 
table...; insert into temp  ; select .. from temp;


Of course it may not be that much faster... and it might require a lot 
of code change.  Might be worth a quick benchmark though.




about returning multiple refcursors, we checked this issue in the past,
and we concluded that returning many small refcursors (all have the same
structure), is faster than returning 1 big refcursor. does it sound
weird (maybe it's worth more tests)?  that's why we took that path.



No, if you tried it out, I'd stick with what you have.  I've never used 
them myself, so I was just wondering aloud.


-Andy



Re: [PERFORM] scale up (postgresql vs mssql)

2012-04-16 Thread Andy Colson

On 4/15/2012 7:43 AM, Eyal Wilde wrote:

hi,

thanks a lot to all of you for your help.

(i'm sorry i did not know how to reply to a certain message)

i found that the best number of active connections is indeed 8-10. with
8-10 active connections postgresql did ~170 account-ids. this is still
only half of what mssql did, but it now makes sense, considering that
mssql works close to twice as fast.

i played with work_mem, shared_buffers, temp_buffers. i ran the tests
with both of the following configurations, but no significant difference
was found.

thanks again for any more help.



We'd need to know if you are CPU bound or IO bound before we can help. 
Watch vmstat 2 while the server is busy (and maybe post a few rows).




i had a stored procedure in ms-sql server. this stored procedure gets a 
parameter (account-id), does about 20 queries, fills some temporary tables, and 
finally, returns a few result-sets. this stored procedure was converted to a stored 
function in postgresql (9.1). the result-sets are being returned using 
refcursors. this stored function is, logically, almost identical to the ms-sql 
stored procedure.


I think that may be a problem.  Treating PG like it's mssql won't work 
well, I'd bet.  Things that work well in one database may not work well 
in another.


Instead of temp tables, have you tried derived tables?  Instead of:

insert into temptable select * from stuff;
select * from temptable;

try something like:

select * from (
  select * from stuff
) as subq

Another thing you might try to remove temp tables is to use views.
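For illustration, a minimal sketch of the view approach (table and column names are hypothetical); the view is just a named version of the derived table, so the planner can still push conditions into it:

CREATE VIEW account_stuff AS
SELECT account_id, sum(amount) AS total
FROM stuff
GROUP BY account_id;

SELECT * FROM account_stuff WHERE account_id = 42;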

I don't know if it'll be faster, I'm just guessing.  Pulling out 
individual parts and running explain analyze on them will help you 
find the slow ones.  Finding which is faster (temp tables, derived 
tables, or views) might help you decide what needs to be re-written.


Also, I'm not sure how well PG does returning multiple refcursors.  There 
may be a lot of round trips from client to server to fetch next.  How 
hard would it be to re-do your single procedure that returns a bunch of 
refcursors into multiple procedures each returning one resultset?
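A minimal sketch of that alternative, with hypothetical table and column names: one function per resultset, returning rows directly instead of a refcursor, so the client gets each set in a single round trip:

CREATE OR REPLACE FUNCTION account_orders(_account_id integer)
RETURNS TABLE (order_id integer, total numeric) AS $$
BEGIN
  RETURN QUERY
    SELECT o.order_id, o.total
    FROM orders o
    WHERE o.account_id = _account_id;
END;
$$ LANGUAGE plpgsql STABLE;

-- client side: SELECT * FROM account_orders(42);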


Or maybe it would be something you can speed test to see if it would 
even make a difference.


-Andy



Re: [PERFORM] DBD-Pg prepared statement versus plain execution

2012-03-21 Thread Andy Colson

On 3/21/2012 6:21 AM, Rafael Martinez wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello

We are having some performance problems with an application that uses
prepared statement heavily.

We have found out that it creates-executes-destroys a prepared statement
*per* statement it sends to the database (pg-9.1) via DBD-Pg.

A normal log entry for a sql-statement looks e.g. like this:
- --
[2012-03-15 14:49:12.484 CET]  LOG:  duration: 8.440 ms  parse
dbdpg_p32048_3:

SELECT DISTINCT ACL.RightName FROM ACL, Principals, CachedGroupMembers
WHERE Principals.id = ACL.PrincipalId AND Principals.PrincipalType =
'Group' AND Principals.Disabled = 0 AND CachedGroupMembers.GroupId  =
ACL.PrincipalId AND CachedGroupMembers.GroupId  = Principals.id AND
CachedGroupMembers.MemberId = 19312 AND CachedGroupMembers.Disabled = 0
  AND ((ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1) OR
(ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1))

[2012-03-15 14:49:12.485 CET]  LOG:  duration: 0.087 ms  bind
dbdpg_p32048_3:

SELECT DISTINCT ACL.RightName FROM ACL, Principals, CachedGroupMembers
WHERE Principals.id = ACL.PrincipalId AND Principals.PrincipalType =
'Group' AND Principals.Disabled = 0 AND CachedGroupMembers.GroupId  =
ACL.PrincipalId AND CachedGroupMembers.GroupId  = Principals.id AND
CachedGroupMembers.MemberId = 19312 AND CachedGroupMembers.Disabled = 0
  AND ((ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1) OR
(ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1))


[2012-03-15 14:49:12.487 CET] LOG:  duration: 1.692 ms  execute
dbdpg_p32048_3:

SELECT DISTINCT ACL.RightName FROM ACL, Principals, CachedGroupMembers
WHERE Principals.id = ACL.PrincipalId AND Principals.PrincipalType =
'Group' AND Principals.Disabled = 0 AND CachedGroupMembers.GroupId  =
ACL.PrincipalId AND CachedGroupMembers.GroupId  = Principals.id AND
CachedGroupMembers.MemberId = 19312 AND CachedGroupMembers.Disabled = 0
  AND ((ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1) OR
(ACL.ObjectType = 'RT::System' AND ACL.ObjectId = 1))


[2012-03-15 14:49:12.488 CET] LOG:  duration: 0.029 ms  statement:
DEALLOCATE dbdpg_p32048_3
- --

As you can see, the parse+bind+deallocate part uses much more time than
the execution part. This is the same for many of the statements send to
the database.

My question is:

Is the parse+bind time reported, a time (not reported) that the planer
will use anyway when running a sql-statement in a normal way or the
parse+bind+deallocate time is *extra* time needed by the prepared statement?

Can we assume that running this application without using prepared
statements will do that it runs faster the time used by
parse+bind+deallocate?

Thanks in advance.

regards,
- --
  Rafael Martinez Guerrero
  Center for Information Technology
  University of Oslo, Norway

  PGP Public Key: http://folk.uio.no/rafael/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9pubAACgkQBhuKQurGihTYkwCcCFYQRDGWD0yaR+f2FFwKs7gN
RfgAoJdPrAzUhfBfsXmst7/l7LVLisHy
=l7Fl
-END PGP SIGNATURE-



What does your perl look like?  This would be wrong:

for $key (@list)
{
  my $q = $db->prepare('select a from b where c = $1');
  $q->execute($key);
  $result = $q->fetch;
}


This would be right:

my $q = $db->prepare('select a from b where c = $1');
for $key (@list)
{
  $q->execute($key);
  $result = $q->fetch;
}

-Andy



Re: [PERFORM] Shared memory for large PostGIS operations

2012-03-17 Thread Andy Colson

On 03/16/2012 05:30 PM, Kevin Grittner wrote:

Brian Hamlin <mapl...@light42.com> wrote:

On Mar 16, 2012, at 7:17 AM, Kevin Grittner wrote:

Andy Colson <a...@squeakycode.net> wrote:


I tried shared_buffers at both 2400M and 18000M, and it took 4.5
hours both times.  ... (weak attempts at humor omitted) 


Ah, I didn't pick up on the attempts at humor; perhaps that's why
you mistook something I said as an attempt at an insult.


It wasn't you Kevin, it was me that insulted him.  (Although I was trying to be 
funny, and not mean).

Sorry again Brian.

-Andy



Re: [PERFORM] Shared memory for large PostGIS operations

2012-03-16 Thread Andy Colson



So let me clean that up for you:

 On 3/14/2012 11:29 PM, mapl...@light42.com wrote:

Hello list, my name is Brian Hamlin, but I prefer to go by darkblue, its 
mysterious and dangerous!


I run PG 9.1, PostGIS 1.5.3, Linux 64 on Dual Xeons, OS on a single 
drive, and db is on 3-disk raid 5.  I'm the only user.


work_mem = 640M

I do these huge monolithic postGIS queries on an otherwise idle linux 
machine.  python driven analysis.. 15 steps.. some, unusual for me, are 
multiple queries running at once on the same data ... and others are 
just one labor intensive thing then the next (one result table is 1.8M 
rows for 745M on disk, others are smaller)


I tried shared_buffers at both 2400M and 18000M, and it took 4.5 hours 
both times.  I dont know if I am CPU bound or IO bound, but since giving 
PG more ram didnt help much, I'll assume I'm CPU bound.  I heard of this 
program called vmstat that I'll read up on and post some results for.


I don't know how much memory my box has, and I've never run explain 
analyze, but I'll try it out and post some.  I just learned about 
http://explain.depesz.com/ and figure it might help me.


This is the best list ever!  Thanks all!  (especially that poetic Dave 
Fetter, and that somewhat mean, but helpful, Andy Colson)


Shout outs to my friends Garlynn, Nick and Rush (best band ever!). 
Party, my house, next week!



==
(Virtually) Brian Hamlin
GeoCal
OSGeo California Chapter
415-717-4462 cell



-Andy



Re: [PERFORM] Advice sought : new database server

2012-03-04 Thread Andy Colson

On 03/04/2012 03:58 AM, Rory Campbell-Lange wrote:

I'd be grateful for advice on specifying the new server

providing about 230GB of usable storage, 150GB of which is on an LV
providing reconfigurable space for the databases which are served off an
XFS formatted volume.



Do you mean LVM?  I've heard that LVM limits IO, so if you want full speed you 
might wanna drop LVM.  (And XFS supports increasing fs size, and when are you 
ever really gonna want to decrease fs size?).

-Andy



Re: [PERFORM] PG as in-memory db? How to warm up and re-populate buffers? How to read in all tuples into memory?

2012-02-26 Thread Andy Colson

On 02/25/2012 06:16 PM, Stefan Keller wrote:


1. How can I warm up or re-populate shared buffers of Postgres?
2. Are there any hints on how to tell Postgres to read in all table
contents into memory?

Yours, Stefan



How about after you load the data, vacuum freeze it, then do something like:

SELECT count(*) FROM osm_point WHERE tags @> 'tourism=>junk'

-Andy





Re: [PERFORM] PG as in-memory db? How to warm up and re-populate buffers? How to read in all tuples into memory?

2012-02-26 Thread Andy Colson

On 02/26/2012 01:11 PM, Stefan Keller wrote:

2012/2/26 Andy Colson <a...@squeakycode.net> wrote:

On 02/25/2012 06:16 PM, Stefan Keller wrote:

1. How can I warm up or re-populate shared buffers of Postgres?
2. Are there any hints on how to tell Postgres to read in all table
contents into memory?

Yours, Stefan


How about after you load the data, vacuum freeze it, then do something like:

SELECT count(*) FROM osm_point WHERE tags @> 'tourism=>junk'

-Andy


That good idea is what I proposed elsewhere on one of the PG lists and
got told that this doesn't help.

I can accept this approach, that users shouldn't directly interfere
with the optimizer. But I think it's still worth discussing a
configuration option (per table) or so which tells PG that this table's
contents should fit into memory, so that it tries to load the table into
memory and keeps it there. This option probably only makes sense in
combination with unlogged tables.

Yours, Stefan



I don't buy that.  Did you test it?  Who/where did you hear this?  And... how 
long does it take after you replace the entire table until things are good and 
cached?  One or two queries?

After a complete reload of the data, do you vacuum freeze it?

After a complete reload of the data, how long until its fast?

-Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 6:34 AM, Thom Brown wrote:

On 22 February 2012 23:50, Alessandro Gagliardi <alessan...@path.com> wrote:

I have a database where I virtually never delete and almost never do
updates. (The updates might change in the future but for now it's okay to
assume they never happen.) As such, it seems like it might be worth it to
set autovacuum=off or at least make it so vacuuming hardly ever occurs.
Actually, the latter is probably the more robust solution, though I don't
know how to do that (hence me writing this list). I did try turning
autovacuum off but got:

ERROR: parameter autovacuum cannot be changed now
SQL state: 55P02

Not sure what, if anything, I can do about that.


Autovacuum is controlled by how much of a table has changed, so if a
table never changes, it never gets vacuumed (with the exceptional case
being a forced vacuum freeze to mitigate the transaction id
wrap-around issue).  The settings which control this are
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
Therefore it isn't necessary to disable autovacuum.

But if you are adamant about disabling it, you need to change it in
your postgresql.conf file and restart the server.



Agreed, don't disable autovacuum.  It's not that demanding, and if you 
do need it and forget to run it, it might cause you more problems.


I have a db that's on a VM that doesn't get hit very much.  I've noticed 
IO is a little busy (we are talking small fractions of a percent, less than 
one) but still more than I thought should be happening on a db with next 
to no usage.


I found setting autovacuum_naptime = 6min made the IO all but vanish.

And if I ever get a wild hair and blow some stuff away, the db will 
clean up after me.


-Andy



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Andy Colson

On 2/23/2012 12:05 PM, Shaun Thomas wrote:

On 02/23/2012 11:56 AM, Greg Spiegelberg wrote:


I know there are perils in using ctid but with the LOCK it should be
safe. This transaction took perhaps 30 minutes and removed 100k rows
and once the table was VACUUM'd afterward it freed up close to 20 GB
on the file system.


It took *30 minutes* to delete 100k rows? And 100k rows were using 20GB?
Is that off by an order of magnitude?

Using the ctid is a cute trick, though. :)



And I'm not sure the LOCK is necessary; while googling for "delete from 
table limit 10" I ran across this thread:


http://archives.postgresql.org/pgsql-hackers/2010-11/msg02028.php

They use it without locks.
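For reference, a minimal sketch of the ctid-limited delete pattern discussed in that thread (table and condition are hypothetical); each pass deletes at most 10,000 matching rows, so the work can be done in small batches with a VACUUM in between:

DELETE FROM big_table
WHERE ctid IN (
    SELECT ctid FROM big_table
    WHERE created_at < '2011-01-01'
    LIMIT 10000
);

Without a LOCK, a concurrently updated row can move to a new ctid and survive a pass, which is why the batch is simply repeated until no rows are affected.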

-Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 12:38 PM, Alessandro Gagliardi wrote:


Does analyze increase the efficiency of inserts or just selects? (I
assumed the latter.) Obviously, I will need to analyze sometimes, but


That depends on whether you have triggers that are doing selects.  But in 
general you are correct, analyze won't help inserts.


checkpoint_segments can help insert speed, what do you have that set to?

Also how you insert can make things faster too. (insert vs prepared vs COPY)

Also, if you have too many indexes on a table that can cause things to 
slow down.


Your IO layer needs to be fast too.  Have you watched vmstat and iostat?

Have you read up on synchronous_commit?

-Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 2:40 PM, Alessandro Gagliardi wrote:


checkpoint_segments can help insert speed, what do you have that set to?

40. Checking
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server it looks
like setting that as high as 256 would not necessarily be unreasonable.
What do you think?


I'd say go slow.  Try a little bit and see if it helps.  I don't 
actually have high insert rate problems, so I don't actually know from 
experience.




Also how you insert can make things faster too. (insert vs prepared
vs COPY)

I'm doing this all with INSERT. Is COPY that much faster? I don't know
anything about prepared.


If you can batch multiple records then COPY is the fastest method.  (Of 
course your triggers might be the cause for the slowness and not insert 
speed).


Depending on the language you are using to insert records, you can 
prepare a query and only send the arguments vs sending the entire sql 
statement every time.


In pseudo-perl code I'd:
my $q = $db->prepare('insert into mytable(col1, col2) values ($1, $2)');

$q->execute('one', 'two');
$q->execute('three', 'four');
$q->execute('five', 'six');

This is faster because the insert... is only sent over the wire and 
parsed once.  Then only the arguments are sent for each execute.


Speed wise, I think it'll go:
1) slowest: individual insert statements
2) prepared statements
3) fastest: COPY

Again.. assuming the triggers are not the bottleneck.
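A minimal sketch of the two faster options, with a hypothetical table; the multi-row INSERT sends many rows per statement, and COPY (here via psql's \copy, reading a tab-separated file) streams them in bulk:

BEGIN;
INSERT INTO events (col1, col2) VALUES
    ('one', 'two'),
    ('three', 'four'),
    ('five', 'six');
COMMIT;

-- from psql, load a whole file client-side in one COPY:
\copy events (col1, col2) from 'events.tsv'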

Have you run an insert by hand with 'EXPLAIN ANALYZE'?

-Andy




Have you read up on synchronous_commit?

Only a tiny bit. A couple people suggested disabling it since my
database is being hosted on AWS so I did that. It seems a bit risky but
perhaps worth it.



I would think they are running on battery backed IO, with boxes on UPS, 
so I'd guess it's pretty safe.  It would also depend on your commit size. 
 If you are batching a million records into one commit, you might lose 
all of them.
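Note that synchronous_commit can also be relaxed per transaction rather than globally, so only specific, re-loadable batches are at risk; a minimal sketch (hypothetical table):

BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO clicks (user_id, url) VALUES (1, 'http://example.com/');
COMMIT;

Everything else on the connection keeps the default durability.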


-Andy



Re: [PERFORM] Inserts or Updates

2012-02-07 Thread Andy Colson

On 2/7/2012 4:18 AM, Ofer Israeli wrote:

Hi all,

We are currently “stuck” with a performance bottleneck in our server
using PG and we are thinking of two potential solutions which I would be
happy to hear your opinion about.

Our system has a couple of tables that hold client generated
information. The clients communicate *every* minute with the server and
thus we perform an update on these two tables every minute. We are
talking about ~50K clients (and therefore records).

These constant updates have made the table sizes to grow drastically and
index bloating. So the two solutions that we are talking about are:



You don't give any table details, so I'll have to guess.  Maybe you have 
too many indexes on your table?  Or, you don't have a good primary index, 
which means your updates are changing the primary key?


If you only have a primary index, and you are not changing it, Pg should 
be able to do HOT updates.
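You can check whether HOT is actually kicking in from the statistics views; a minimal sketch (substitute your table name):

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'your_table';

If n_tup_hot_upd stays near zero while n_tup_upd climbs, every update is touching an indexed column (or the pages have no free space), and you pay the full index-maintenance cost each time.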


If you have lots of indexes, you should review them, you probably don't 
need half of them.



And like Kevin said, try the simple one first.  Won't hurt anything, and 
if it works, great!


-Andy



Re: [PERFORM] Inserts or Updates

2012-02-07 Thread Andy Colson

-Original Message-
From: Andy Colson [mailto:a...@squeakycode.net]
Sent: Tuesday, February 07, 2012 4:47 PM
To: Ofer Israeli
Cc: pgsql-performance@postgresql.org; Olga Vingurt; Netta Kabala
Subject: Re: [PERFORM] Inserts or Updates

On 2/7/2012 4:18 AM, Ofer Israeli wrote:

Hi all,

We are currently stuck with a performance bottleneck in our server
using PG and we are thinking of two potential solutions which I would be
happy to hear your opinion about.

Our system has a couple of tables that hold client generated
information. The clients communicate *every* minute with the server and
thus we perform an update on these two tables every minute. We are
talking about ~50K clients (and therefore records).

These constant updates have made the table sizes to grow drastically and
index bloating. So the two solutions that we are talking about are:



You dont give any table details, so I'll have to guess.  Maybe you have
too many indexes on your table?  Or, you dont have a good primary index,
which means your updates are changing the primary key?

If you only have a primary index, and you are not changing it, Pg should
be able to do HOT updates.

If you have lots of indexes, you should review them, you probably don't
need half of them.


And like Kevin said, try the simple one first.  Wont hurt anything, and
if it works, great!

-Andy




On 2/7/2012 11:40 AM, Ofer Israeli wrote:
 Hi Andy,

 The two tables I am referring to have the following specs:
 Table 1:
 46 columns
 23 indexes on fields of the following types:
 INTEGER - 7
 TIMESTAMP - 2
 VARCHAR - 12
 UUID - 2

 Table 2:
 23 columns
 12 indexes on fields of the following types:
 INTEGER - 3
 TIMESTAMP - 1
 VARCHAR - 6
 UUID - 2

 All indexes are default indexes.

 The primary index is INTEGER and is not updated.

 The indexes are used for sorting and filtering purposes in our UI.


 I will be happy to hear your thoughts on this.

 Thanks,
 Ofer


Fixed that top post for ya.

Wow, so out of 46 columns, half of them have indexes?  That's a lot. 
I'd bet you could drop a bunch of them.  You should review them and see 
if they are actually helping you.  You already found out that maintaining 
all those indexes is painful.  If they are not speeding up your SELECTs 
by a huge amount, you should drop them.


Sounds like you went through your sql statements and added an index for 
any field that was either in the where or order by clause?


You need to find the columns that are the most selective.  An index 
should be useful at cutting the number of rows down.  Once you have it 
cut down, an index on another field won't really help that much.  And 
after a result set has been collected, an index may or may not help for 
sorting.


Running some queries with EXPLAIN ANALYZE would be helpful.  Give it a 
run, drop an index, try it again to see if it's about the same, or if 
that index made a difference.


-Andy



Re: [PERFORM] Inserts or Updates

2012-02-07 Thread Andy Colson

Oh, I knew I'd seen index usage stats someplace.

give this a run:

select * from pg_stat_user_indexes where relname = 'SuperBigTable';

http://www.postgresql.org/docs/current/static/monitoring-stats.html

-Andy



Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Andy Colson

On 1/30/2012 3:27 AM, Saurabh wrote:

Hi all,

I am using the Postgresql database for our project and doing some
performance testing. We need to insert millions of records with indexed
columns. We have 5 columns in the table. I created an index on the integer column
only and performance was good, but when I created an index on the text column as
well, the performance dropped to 1/8th. My question is how I
can improve performance when inserting data using an index on a text
column?

Thanks,
Saurabh



Do it in a single transaction, and use COPY.

-Andy



Re: [PERFORM] Postgress is taking lot of CPU on our embedded hardware.

2012-01-27 Thread Andy Colson

On 1/27/2012 10:47 AM, Heikki Linnakangas wrote:

On 27.01.2012 15:34, Jayashankar K B wrote:

Hi,

We are having an embedded system with a freescale m68k architecture
based micro-controller, 256MB RAM running a customized version of
Slackware 12 linux.
It's a relatively modest Hardware.


Fascinating!


We have installed postgres 9.1 as our database engine. While testing,
we found that the Postgres operations take more than 70% of CPU and
the average also stays above 40%.
This is suffocating the various other processes running on the system.
Couple of them are very critical ones.
The testing involves inserting bulk number of records (approx. 1
records having between 10 and 20 columns).
Please let us know how we can reduce CPU usage for the postgres.


The first step would be to figure out where all the time is spent. Are
there unnecessary indexes you could remove? Are you using INSERT
statements or COPY? Sending the data in binary format instead of text
might shave some cycles.



Do you have triggers on the table?





Re: [PERFORM] Can lots of small writes badly hamper reads from other tables?

2012-01-24 Thread Andy Colson

On 1/24/2012 2:16 PM, Dave Crooke wrote:

Hi folks

This could be a sheer volume issue, but I thought I would ask the wisdom
of this forum as to next investigative steps.



We use PostgreSQL 8.4.4 which is bundled with our application as a
VMware virtual appliance. The bulk of the app's database activity is
recording performance data points which arrive in fairly large sustained
bursts of perhaps 10,000 rows a minute at a medium sized customer, each
of which are logically separate items and being committed as individual
transactions (JDBC auto-commit mode). Our offshore QA team was assigned
to track an intermittent issue with speed of some large queries on other
tables, and they believe based on correlation the two activities may be
contending.


You have 10 connections, all doing:

begin
insert into PERF_RAW_2012_01_24  -- one record
commit


If that's what you're doing, yes, I'd say that's the slowest way possible.

Doing this would be faster:

begin
insert into PERF_RAW_2012_01_24  -- one record
insert into PERF_RAW_2012_01_24  -- one record
...
insert into PERF_RAW_2012_01_24  -- one record
commit

Doing this would be even faster:


begin
-- one insert, multiple rows
insert into PERF_RAW_2012_01_24 values (...), (...), (...), ..., (...);
insert into PERF_RAW_2012_01_24 values (...), (...), (...), ..., (...);
commit

And, fastest of all fastest, use COPY.  But be careful, it's so fast 
it'll melt your face off :-)



I didn't even bother trying to pick out the uncommented settings from 
your .conf file.  Way too much work.


VMs usually have pretty slow IO, so you might wanna watch vmstat and 
iostat to see if you are IO bound or CPU bound.


Also watching iostat before and after the change might be interesting.

If you keep having lots and lots of transactions, look into 
commit_delay, it'll help batch commits out to disk  (if I remember 
correctly).


-Andy



Re: [PERFORM] Cursor fetch performance issue

2012-01-24 Thread Andy Colson

On Tue, 2012-01-24 at 21:47 +0100, Pavel Stehule wrote:

Hello

2012/1/24 Tony Capobianco <tcapobia...@prospectiv.com>:

We are migrating our Oracle warehouse to Postgres 9.

This function responds well:

pg=# select public.getMemberAdminPrevious_sp2(247815829, 
1,'test.em...@hotmail.com', 'email', 'test');
  getmemberadminprevious_sp2
 ----------------------------
  <unnamed portal 1>
(1 row)

Time: 7.549 ms

However, when testing, this fetch takes upwards of 38 minutes:

BEGIN;
select public.getMemberAdminPrevious_sp2(247815829, 1,'test.em...@hotmail.com', 
'email', 'test');
FETCH ALL IN "<unnamed portal 2>";

How can I diagnose any performance issues with the fetch in the cursor?



Cursors are optimized to return a small subset of the result - if you plan
to read the complete result, then set

set cursor_tuple_fraction to 1.0;

this is a session config value, you can set it before selected cursor queries

Regards

Pavel Stehule


Thanks.
Tony











On 1/24/2012 2:57 PM, Tony Capobianco wrote:
 Running just the sql of the function returns only 10 rows:

 pg=# SELECT m.memberid, m.websiteid, m.emailaddress,
 pg-#m.firstname, m.lastname, m.regcomplete, m.emailok
 pg-# FROM   members m
 pg-# WHERE  m.emailaddress LIKE 'test.em...@hotmail.com'
 pg-# AND    m.changedate_id > 5868 ORDER BY m.emailaddress, m.websiteid;
   memberid  | websiteid |      emailaddress      | firstname | lastname | regcomplete | emailok
  -----------+-----------+------------------------+-----------+----------+-------------+---------
   247815829 |         1 | test.em...@hotmail.com | email     | test     |           1 |       1
   300960335 |        62 | test.em...@hotmail.com |           |          |           1 |       1
   300959937 |       625 | test.em...@hotmail.com |           |          |           1 |       1
   260152830 |      1453 | test.em...@hotmail.com |           |          |           1 |       1
   300960163 |      1737 | test.em...@hotmail.com | email     | test     |           1 |       1
   300960259 |      1824 | test.em...@hotmail.com | email     | test     |           1 |       1
   300959742 |      1928 | test.em...@hotmail.com | email     | test     |           1 |       1
   368122699 |      2457 | test.em...@hotmail.com | email     | test     |           1 |       1
   403218613 |      2464 | test.em...@hotmail.com | email     | test     |           1 |       0
   378951994 |      2656 | test.em...@hotmail.com |           |          |           1 |       1
  (10 rows)

 Time: 132.626 ms

 So, it would seem that's a small enough number of rows. 
Unfortunately, issuing:


 set cursor_tuple_fraction to 1.0;

 Did not have an effect on performance.  Is it common to modify this
 cursor_tuple_fraction parameter each time we execute the function?




So, is getMemberAdminPrevious_sp2() preparing a statement with wildcards?

SELECT m.memberid, m.websiteid, m.emailaddress,
   m.firstname, m.lastname, m.regcomplete, m.emailok
   FROM   members m
   WHERE  m.emailaddress LIKE $1
   AND    m.changedate_id > $2
  ORDER BY m.emailaddress, m.websiteid;

Or is it creating the string and executing it:

sql = 'SELECT m.memberid, m.websiteid, m.emailaddress, '
||  ' m.firstname, m.lastname, m.regcomplete, m.emailok '
||  ' FROM   members m '
||  ' WHERE  m.emailaddress LIKE ' || arg1
||  ' AND    m.changedate_id > ' || arg2
||  ' ORDER BY m.emailaddress, m.websiteid ';
execute(sql);

Maybe it's that the planner doesn't plan so well with $1 arguments vs 
actual arguments.


-Andy





Re: [PERFORM] Performance question 83 GB Table 150 million rows, distinct select

2011-11-16 Thread Andy Colson

On 11/16/2011 04:53 PM, Tory M Blue wrote:

Linux F12 64bit
Postgres 8.4.4
16 proc / 32GB
8 disk 15KRPM SAS/Raid 5 (I know!)


shared_buffers = 6000MB 
#temp_buffers = 8MB 
max_prepared_transactions = 0   
work_mem = 250MB
maintenance_work_mem = 1000MB   





We now have about 180mill records in that table. The database size is
about 580GB and the userstats table which is the biggest one and the
one we query the most is 83GB.

Just a basic query takes 4 minutes:

For e.g. select count(distinct uid) from userstats where log_date > '11/7/2011'



How'd you feel about keeping a monthly summary table?  Update it daily, with 
only a day's worth of stats, then you could query the summary table much faster.

That's what I do for my website stats.  I log details for a month, then 
summarize everything into a summary table, and blow away the details.  You 
wouldn't have to delete the details if you wanted them, just keeping the 
summary table updated would be enough.
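A minimal sketch of that kind of rollup, assuming the uid and log_date columns from the query above (the summary table name is hypothetical); a nightly job inserts one row per day:

CREATE TABLE userstats_daily (
    day           date PRIMARY KEY,
    distinct_uids bigint NOT NULL
);

INSERT INTO userstats_daily (day, distinct_uids)
SELECT log_date::date, count(DISTINCT uid)
FROM userstats
WHERE log_date >= current_date - 1
  AND log_date <  current_date
GROUP BY log_date::date;

Queries for per-day counts then read a handful of summary rows instead of scanning the 83GB table.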

-Andy



Re: [PERFORM] Optimize the database performance

2011-10-17 Thread Andy Colson

On 10/17/2011 04:48 AM, Micka wrote:

Hi,

I've a postgres 9.1 database used for map generating ( tiles ).
The system has 24Go RAM and 5 processors.
I'm using geoserver to generate the tiles.

My data used 8486 MB  = psql -d gis -c SELECT 
pg_size_pretty(pg_database_size('gis'))

I've carefully indexed the table by the the_geom column.

Here is my database config :

-- change :
-- listen_addresses = '*'
-- max_connections = 50
-- tcp_keepalives_idle = 60# TCP_KEEPIDLE, in seconds;
-- shared_buffers = 1024MB # 10% of available RAM
-- work_mem = 256MB# min 64kB
-- maintenance_work_mem = 256MB# min 1MB
-- effective_cache_size = 5120MB
-- autovacuum = off

sudo nano /etc/sysctl.conf
--   kernel.shmmax=5368709120
--   kernel.shmall=5368709120

I wanted to have your opinion about this config ?  What can I do to optimize 
the performance ?

Thank you,



Yeah... We're gonna need some more details.  What's slow?  Are you CPU bound or 
IO bound?  How many concurrent db connections?  What does vmstat look like?  
And 10% of 24 gig is 2.4 gig, not 1 gig.

Or is this box doing something else?  I notice effective_cache_size is only 5 
gig, so you must be doing other things on this box.


-- autovacuum = off


Are you vacuuming by hand!?   If not this is a really bad idea (tm)(c)(r)

-Andy



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-12 Thread Andy Colson

On 9/12/2011 12:15 PM, Robert Klemme wrote:

On 11.09.2011 19:02, Marti Raudsepp wrote:

On Sun, Sep 11, 2011 at 17:23, Andy Colson <a...@squeakycode.net> wrote:

On 09/11/2011 08:59 AM, Igor Chudov wrote:

By the way, does that INSERT UPDATE functionality or something like
this exist in Postgres?

You have two options:
1) write a function like:
create function doinsert(_id integer, _value text) returns void as
2) use two sql statements:


Unfortunately both of these options have caveats. Depending on your
I/O speed, you might need to use multiple loader threads to saturate
the write bandwidth.

However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and thus cause unique violations. If
you could somehow partition their operation by some key, so threads
are guaranteed not to conflict each other, then that would be perfect.
The 2nd option given by Andy is probably faster.

You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.


Wouldn't it be sufficient to reverse order for race condition safety?
Pseudo code:

begin
insert ...
catch
update ...
if not found error
end

Speed is another matter though...

Kind regards

robert





No, I don't think so; if you had two loaders, both would start a 
transaction, then neither could see what the other was doing.  There are 
transaction isolation levels, but they are like playing with fire (in 
my opinion).
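For reference, a minimal sketch of the subtransaction-per-row upsert Marti alludes to above, assuming a table thetable(id integer primary key, value text); the exception block is what costs a subtransaction (and a transaction ID) per call:

create or replace function upsert_row(_id integer, _value text) returns void as $$
begin
  begin
    -- try the insert first; a concurrent duplicate raises unique_violation
    insert into thetable(id, value) values (_id, _value);
  exception when unique_violation then
    update thetable set value = _value where id = _id;
  end;
end;
$$ language plpgsql;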


-Andy



Re: [PERFORM] Allow sorts to use more available memory

2011-09-12 Thread Andy Colson

On 9/12/2011 12:33 PM, Robert Schnabel wrote:

The recent data warehouse thread made me think about how I use
work_mem for some of my big queries. So I tried SET work_mem = '4GB' for
a session and got

ERROR: 4194304 is outside the valid range for parameter work_mem (64
.. 2097151)

A bit of searching turned up the "Allow sorts to use more available
memory" section of the to-do list. Am I correct in reading that the
max_val is 2GB and regardless of how much RAM I have in the box I'm
stuck with only using 2GB? Am I missing something?

I'm using: PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit
Windows 2008 Server Enterprise

Thanks,
Bob


work_mem is not the total a query can use.  I believe each step can use 
that much, and each backend can use it for multiple bits.  So if you had 
two backends, each doing 2 sorts, you'd use 2*2 = 4 sorts * 2GB = 8GB.
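Since it is just a GUC, it can also be raised for a single session or query rather than in postgresql.conf; a minimal sketch (the value, table, and query are examples):

SET work_mem = '1GB';           -- per-session override, under the 2GB cap
SELECT account_id, sum(amount)  -- hypothetical big reporting query
FROM payments
GROUP BY account_id
ORDER BY 2 DESC;
RESET work_mem;

That keeps the high setting away from the regular connection pool, which matters because every concurrent sort can grab that much memory.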


-Andy



Re: [PERFORM] Allow sorts to use more available memory

2011-09-12 Thread Andy Colson

On 9/12/2011 12:57 PM, Shaun Thomas wrote:

On 09/12/2011 12:47 PM, Andy Colson wrote:


work_mem is not the total a query can use. I believe each step can
use that much, and each backend can use it for multiple bits. So if
you had two backends, each doing 2 sorts, you'd use 2*2 = 4 * 2GB =
8GB.


Exactly. Find a big query somewhere in your system. Use EXPLAIN to
examine it.


Yeah, and even better, on PG 9, if you EXPLAIN ANALYZE it'll show you 
just how much memory is actually being used.


-Andy



Re: [PERFORM] Allow sorts to use more available memory

2011-09-12 Thread Andy Colson

On 9/12/2011 1:22 PM, Robert Schnabel wrote:


On 9/12/2011 12:57 PM, Shaun Thomas wrote:

On 09/12/2011 12:47 PM, Andy Colson wrote:


work_mem is not the total a query can use. I believe each step can
use that much, and each backend can use it for multiple bits. So if
you had two backends, each doing 2 sorts, you'd use 2*2 = 4 * 2GB =
8GB.

Exactly. Find a big query somewhere in your system. Use EXPLAIN to
examine it. Chances are, that one query has one or more sorts. Each one
of those gets its own work_mem. Each sort. The query have four sorts? It
may use 4*work_mem. On a whim a while back, I doubled our 8MB setting to
16MB on a test system. During a load test, the machine ran out of
memory, swapped out, and finally crashed after the OOM killer went nuts.

Set this value *at your own risk* and only after *significant* testing.
Having it too high can have rather unexpected consequences. Setting it
to 1 or 2GB, unless you have VERY few threads, or a TON of memory, is a
very, very bad idea.


Yep, I know. But in the context of the data warehouse where *I'm the
only user* and I have a query that does, say 4 large sorts like
http://explain.depesz.com/s/BrAO and I have 32GB RAM I'm not worried
about using 8GB or 16GB in the case of work_mem = 4GB. I realize the
query above only used 1.9GB for the largest sort but I know I have other
queries with 1 or 2 sorts that I've watched go to disk.

Bob






Wow, you are getting close to the limits there.  Another thing you can 
do is mount tmpfs in ram and then just let it spill.


-Andy




Re: [PERFORM] Allow sorts to use more available memory

2011-09-12 Thread Andy Colson

On 9/12/2011 1:22 PM, Robert Schnabel wrote:


On 9/12/2011 12:57 PM, Shaun Thomas wrote:

On 09/12/2011 12:47 PM, Andy Colson wrote:


work_mem is not the total a query can use. I believe each step can
use that much, and each backend can use it for multiple bits. So if
you had two backends, each doing 2 sorts, you'd use 2*2 = 4 * 2GB =
8GB.

Exactly. Find a big query somewhere in your system. Use EXPLAIN to
examine it. Chances are, that one query has one or more sorts. Each one
of those gets its own work_mem. Each sort. The query have four sorts? It
may use 4*work_mem. On a whim a while back, I doubled our 8MB setting to
16MB on a test system. During a load test, the machine ran out of
memory, swapped out, and finally crashed after the OOM killer went nuts.

Set this value *at your own risk* and only after *significant* testing.
Having it too high can have rather unexpected consequences. Setting it
to 1 or 2GB, unless you have VERY few threads, or a TON of memory, is a
very, very bad idea.


Yep, I know. But in the context of the data warehouse where *I'm the
only user* and I have a query that does, say 4 large sorts like
http://explain.depesz.com/s/BrAO and I have 32GB RAM I'm not worried
about using 8GB or 16GB in the case of work_mem = 4GB. I realize the
query above only used 1.9GB for the largest sort but I know I have other
queries with 1 or 2 sorts that I've watched go to disk.

Bob






Huge guess here, can't see the select or DDL, but it looks like all the tables 
are sequential scans.  It might help to add an index or two, then the 
table joins could be done much more efficiently with a lot less 
memory.


-Andy



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 07:35 AM, Igor Chudov wrote:

I have been a MySQL user for years, including owning a few
multi-gigabyte databases for my websites, and using it to host
algebra.com (about 12 GB database).

I have had my ups and downs with MySQL. The ups were ease of use and
decent performance for small databases such as algebra.com. The downs
were things like twenty hour REPAIR TABLE operations on a 35 GB
table, etc.

Right now I have a personal (one user) project to create a 5-10
Terabyte data warehouse. The largest table will consume the most space
and will take, perhaps, 200,000,000 rows.

I want to use it to obtain valuable business intelligence and to make
money.

I expect it to grow, never shrink, and to be accessed via batch
queries. I do not care for batch queries to be super fast, for example
an hour per query would be just fine.

However, while an hour is fine, two weeks per query is NOT fine.

I have a server with about 18 TB of storage and 48 GB of RAM, and 12
CPU cores.

My initial plan was to use MySQL, InnoDB, and deal with problems as
they arise. Perhaps, say, I would implement my own joining
procedures.

After reading some disparaging stuff about InnoDB performance on large
datasets, however, I am getting cold feet. I have a general feeling
that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
with either MySQL and Postgres.

I do not know much about Postgres, but I am very eager to learn and
see if I can use it for my purposes more effectively than MySQL.

I cannot shell out $47,000 per CPU for Oracle for this project.

To be more specific, the batch queries that I would do, I hope,
would either use small JOINS of a small dataset to a large dataset, or
just SELECTS from one big table.

So... Can Postgres support a 5-10 TB database with the use pattern
stated above?

Thanks!

i



That is a scale or two larger than I have experience with.  I converted my 
website database from mysql to PG, and it has several db's between 1 and 10 
gig.  There are parts of the website that were faster with mysql, and there are 
parts faster with PG.  One spot, because PG has superior join support on select 
statements, I was able to change the code to generate a single more complicated 
sql statement vs. mysql that had to fire off several simpler statements.  It's a 
search screen where you can type in 15-ish different options.  I was able to 
generate a single sql statement which joins 8-some-odd tables and plenty of 
where clauses.  PG runs it in the blink of an eye.  It's astonishing compared 
to the pain of mysql.  If you ever have to write your own join, or your own 
lookup function, that's a failure of your database.

One spot that was slower was a batch insert of data.  It's not so much slower 
that it was a problem.  I use COPY on PG vs prepared inserts on mysql.  It was 
pretty close, but mysql still won.

Seeing as you can setup and test both databases, have you considered a trial 
run?

Things to watch for:


I think the same amount of data will use more disk space in PG than in mysql.

Importing data into PG should use COPY and multiple connections at the same 
time.

PG will only use multi-core if you use multiple connections (each connection 
uses one core).

Huge result sets (like a select statement that returns 1,000,000 rows) will be 
slow.

PG is a much fuller database than mysql, and as such you can influence its join 
types, and function calls. (table scan vs index, immutable function vs stable, 
perl function vs sql).  So if at first it appears slow, you have a million 
options.  I think the only option you have in mysql is to pull the data back 
and code it yourself.

Upgrading to major versions of PG may or may not be painful.  (mysql sometimes 
works seamlessly between versions, it appears brilliant.  But I have had 
problems with an update, and when it goes bad, you dont have a lot of options). 
 In the past PG's only method of upgrade was a full backup of old, restore in 
new.  Things have gotten better, there is new pg_upgrade support (still kinda 
new though), and there is some 3rd party replication support where you 
replicate your 9.0 database to a new 9.1 database, and at some point you 
promote the new 9.1 database as the new master.  Or something like that.  I've 
only read posts about it, never done it.  But with that much data, you'll need 
an upgrade plan.

All in all, if I can summarize my personal view: mysql is fast at the expense 
of safety and usability (mysql still cannot do updates with subselects).  
PG is safe and usable at the expense of speed, and you won't be disappointed by 
the speed.

-Andy



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 08:59 AM, Igor Chudov wrote:



I do not plan to do a lot of random writing. My current design is that my perl 
scripts write to a temporary table every week, and then I do INSERT..ON 
DUPLICATE KEY UPDATE.

By the way, does that INSERT UPDATE functionality or something like this exist 
in Postgres?

i


You have two options:

1) write a function like:
create function doinsert(_id integer, _value text) returns void as $$
begin
  update thetable set value = _value where id = _id;
  if not found then
 insert into thetable(id, value) values (_id, _value);
   end if;
end;
$$ language plpgsql;

2) use two sql statements:
-- update the existing
update realTable set value = (select value from tmp where tmp.id = realTable.id)
where exists (select value from tmp where tmp.id = realTable.id);

-- insert the missing
insert into realTable(id, value)
select id, value from tmp where not exists(select 1 from realTable where tmp.id 
= realTable.id);


-Andy



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 09:21 AM, Igor Chudov wrote:



On Sun, Sep 11, 2011 at 9:16 AM, Claudio Freire <klaussfre...@gmail.com> wrote:

On Sun, Sep 11, 2011 at 3:59 PM, Igor Chudov <ichu...@gmail.com> wrote:
  Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a 
RAID-6
  configuration.
  They are managed by a 3WARE 9750 RAID CARD.
 
  I would say that I am not very concerned with linear relationship of read
  speed to disk speed. If that stuff is somewhat slow, it is OK with me.

With Raid 6 you'll have abysmal performance on write operations.
In data warehousing, there's lots of writes to temporary files, for
sorting and stuff like that.

You should either migrate to raid 10, or set up a separate array for
temporary files, perhaps raid 0.


Thanks. I will rebuild the RAID array early next week and I will see if I have 
a Raid 10 option with that card.

Quantitatively, what would you say is the write speed difference between RAID 
10 and RAID 6?



Note that using RAID 10, while faster, cuts your usable space in half. 12 2TB 
drives in raid 10 == 6 drives * 2TB == 12 TB total space.  That's not big 
enough, is it?

-Andy



Re: [PERFORM] should i expected performance degradation over time

2011-09-10 Thread Andy Colson

On 09/10/2011 11:55 AM, Anibal David Acosta wrote:

Sometimes I read that postgres performance is degraded over the time and 
something people talk about backup and restore database solve the problem.

It is really true?

I have postgres 9.0 on a windows machine with The autovacuum is ON

I have some configuration tables

And a couple of transactional table.

Transactional table has about 4 millions of rows inserted per day.

In the midnight all rows are moved to a historical table and in the historical 
table rows are about 2 months, any transaction older than 2 months are deleted 
daily.

So, my question is, if Should I expect same performance over time (example: 
after 1 year) or should I expect a degradation and must implements come 
technics like backup restore every certain time?

Thanks!!



Yes.  And no.  Things have changed over that last few versions.  In older 
version of PG I recall hearing about table bloat problems that were really bad, 
and backup/restore would fix it.  (Vacuum full would probably also have fixed 
it).

Vacuum full, in older versions, was a last-chance, bring-a-gun-to-a-knife-fight, 
nothing-else-has-worked, fix-table-bloat solution.  It's not dissimilar from 
backup/restore.

In newer versions of PG, autovacuum, vacuum and vacuum full are all much nicer 
and work better.  I really doubt you'll need to resort to backup/restore to fix 
problems.

Just remember:  the harder you hit a table, the less chance autovacuum will 
have to clean it up.  So you might need manual vacuum.  Autovacuum will cancel 
itself if the table is getting hit, whereas manual vacuum won't.
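
A manual pass is just one statement (table name here is hypothetical; use your 
busy transactional table), e.g. run from cron right after the midnight move:

  VACUUM ANALYZE my_transactional_table;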

Keeping on top of vacuum will keep your tables slim and trim.  If things get 
out of hand, they'll balloon into problems.  Vacuum full at that point should 
clean it up.  But, if you ignore the problem for two years, and have super 
really bad table bloat, well, maybe backup/restore is best.


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Rather large LA

2011-09-05 Thread Andy Colson

On 09/05/2011 05:28 AM, Richard Shaw wrote:


Hi,

I have a database server that's part of a web stack and is experiencing 
prolonged load average spikes of up to 400+ when the db is restarted and first 
accessed by the other parts of the stack and has generally poor performance on 
even simple select queries.



Is the slowness new?  Or has it always been a bit slow?  Have you checked for 
bloat on your tables/indexes?

When you start up, does it peg a cpu or sit around doing IO?

Have you reviewed the server logs?


autovacuum | off

Why?  I assume that's a problem.

fsync  | off

Seriously?


-Andy




There are 30 DBs in total on the server coming in at 226GB.  The one that's 
used the most is 67GB and there are another 29 that come to 159GB.

I'd really appreciate it if you could review my configurations below and make 
any suggestions that might help alleviate the performance issues.  I've been 
looking more into the shared buffers to the point of installing the contrib 
module to check what they're doing, possibly installing more RAM as the most 
used db @ 67GB might appreciate it, or moving the most used DB onto another set 
of disks, possible SSD.


PostgreSQL 9.0.4
Pgbouncer 1.4.1

Linux 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:13 EDT 2011 x86_64 x86_64 
x86_64 GNU/Linux

CentOS release 5.6 (Final)

4 x Intel Xeon-NehalemEX E7540-HexCore [2GHz] ( 24 physical cores )
32GB DDR3 RAM
1 x Adaptec 5805 Z  SATA/SAS RAID with battery backup
4 x Seagate Cheetah ST3300657SS 300GB 15RPM SAS drives in RAID 10
1 x 500GB 7200RPM SATA disk

Postgres and the OS reside on the same ex3 filesystem, whilst query and archive 
logging go onto the SATA disk which is also ext3.


   name  |  
current_setting
+---
  version| PostgreSQL 9.0.4 on 
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 
4.1.2-48), 64-bit
  archive_command| tar jcf /disk1/db-wal/%f.tar.bz2 %p
  archive_mode   | on
  autovacuum | off
  checkpoint_completion_target   | 0.9
  checkpoint_segments| 10
  client_min_messages| notice
  effective_cache_size   | 17192MB
  external_pid_file  | /var/run/postgresql/9-main.pid
  fsync  | off
  full_page_writes   | on
  lc_collate | en_US.UTF-8
  lc_ctype   | en_US.UTF-8
  listen_addresses   |
  log_checkpoints| on
  log_destination| stderr
  log_directory  | /disk1/pg_log
  log_error_verbosity| verbose
  log_filename   | postgresql-%Y-%m-%d_%H%M%S.log
  log_line_prefix| %m %u %h
  log_min_duration_statement | 250ms
  log_min_error_statement| error
  log_min_messages   | notice
  log_rotation_age   | 1d
  logging_collector  | on
  maintenance_work_mem   | 32MB
  max_connections| 1000
  max_prepared_transactions  | 25
  max_stack_depth| 4MB
  port   | 6432
  server_encoding| UTF8
  shared_buffers | 8GB
  superuser_reserved_connections | 3
  synchronous_commit | on
  temp_buffers   | 5120
  TimeZone   | UTC
  unix_socket_directory  | /var/run/postgresql
  wal_buffers| 10MB
  wal_level  | archive
  wal_sync_method| fsync
  work_mem   | 16MB


Pgbouncer config

[databases]
* = port=6432
[pgbouncer]
user=postgres
pidfile = /tmp/pgbouncer.pid
listen_addr =
listen_port = 5432
unix_socket_dir = /var/run/postgresql
auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
stats_users = postgres
pool_mode = session
server_reset_query = DISCARD ALL;
server_check_query = select 1
server_check_delay = 10
server_idle_timeout = 5
server_lifetime = 0
max_client_conn = 4096
default_pool_size = 100
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
client_idle_timeout = 30
reserve_pool_size = 800


Thanks in advance

Richard





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Sudden drop in DBb performance

2011-09-05 Thread Andy Colson

On 09/05/2011 02:48 AM, Tomas Vondra wrote:

On 3 Září 2011, 9:26, Gerhard Wohlgenannt wrote:

Dear list,

we are encountering serious performance problems with our database.
Queries which took around 100ms or less last week now take several
seconds.



Results of Bonnie++

Version  1.96   --Sequential Output-- --Sequential Input- --Random-
Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine    Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
voyager     95G  1400  93 27804   3 16324   2  2925  96 41636   3 374.9  4
Latency         7576us    233s      164s      15647us   13120ms   3302ms
Version  1.96   --Sequential Create-- --Random Create--
voyager         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
             16   141   0    + +++    146   0   157   0    + +++    172   0
Latency          1020ms     128us     9148ms    598ms     37us      485ms



That seems a bit slow ... 27MB/s for writes and 41MB/s for reads is a bit
slow with 8 drives.

Tomas




Agreed, that's really slow.  A single SATA drive will get 60 MB/s.  Did you run 
Bonnie while the VMs were up and running?

  root@host:~# vmstat
  procs ---memory-- ---swap-- -io -system-- cpu
   r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
   0  0  0 308024 884812 4051293200   464   168  353   92  4  2 84  
9


Only one line?  That does not help much.  Can you run it as 'vmstat 2' and let 
it run while a few slow queries are performed?  Then paste all the lines?


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Sudden drop in DBb performance

2011-09-05 Thread Andy Colson

On 09/05/2011 09:39 AM, Tomas Vondra wrote:

On 5 Září 2011, 16:08, Gerhard Wohlgenannt wrote:

Below please find the results of vmstat 2 over some periode of time ..
with normal database / system load.


What does a normal load mean? Does that mean a time when the queries are
slow?

Are you sure the machine really has 48GB of RAM? Because from the vmstat
output it seems like there's just 32GB.

procs ---memory-- ---swap-- -io -system--
cpu
   r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
id wa
   0  0 1342168 336936 107636 313535120014   183 1911 3426
2  1 93  4


1342168 + 336936 + 107636 + 31353512 = 33140252 ~ 31GB

BTW there's 1.3GB of swap, although it's not used heavily (according to
the vmstat output).

Otherwise I don't see anything wrong in the output. What is the size of
the database (use pg_database_size to get it)? Did it grow significantly
recently?

Tomas



Yeah, it's interesting that it swapped memory in, but never out.  Looking at this 
vmstat, it does not look like a hardware problem.  (Assuming normal load means 
slow queries.)


Did it grow significantly recently?


That's a good thought: maybe the stats are old and you have bad plans?  It 
could also be major updates to the data (as opposed to growth).

Gerhard, have you done an 'explain analyze' on any of your slow queries?   Have 
you done an analyze lately?

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Rather large LA

2011-09-05 Thread Andy Colson

On 09/05/2011 08:57 AM, Richard Shaw wrote:


Hi Andy,

It's not a new issue no, It's a legacy system that is in no way ideal but is 
also not in a position to be overhauled.  Indexes are correct, tables are up to 
25 million rows.

On startup, it hits CPU more than IO, I'll provide some additional stats after 
I restart it tonight.

Server logs have been reviewed and where possible, slow queries have been fixed.

Autovacuum has been disabled and set to run manually via cron during a quiet 
period and fsync has recently been turned off to gauge any real world 
performance increase, there is battery backup on the raid card providing some 
level of resilience.

Thanks

Richard




So I'm guessing that setting fsync off did not help your performance problems.  
And you say CPU is high, so I think we can rule out disk IO problems.


possibly installing more RAM as the most used db @ 67GB might appreciate it


That would only be if every row of that 67 gig is being used.  If it's history 
stuff that never gets looked up, then adding more RAM won't help because none 
of that data is being loaded anyway.  Out of that 67 gig, what is the working 
size?  (Not really a number you can look up; I'm looking for more of an 
empirical little/some/lots/most.)

pgbouncer:

max_client_conn = 4096
reserve_pool_size = 800

I've not used pgbouncer, but these seem really high.  Does that mean pgbouncer 
will create 4K connections to the backend?  Or does it mean it'll allow 4K 
connections to pgbouncer but only 800 connections to the backend?

I wonder... when you start up, if you watch vmstat for a bit, do you see tons of 
context switches?  If it's not IO, and you don't say OMG, CPU is pegged!, then I 
assume it's not CPU bound; I wonder if there are so many processes fighting for 
resources that they are stepping on each other.

When you get up and running (and its slow), what does this display:

ps ax|grep postgr|wc --lines

That and a minute of 'vmstat 2' would be neat to see as well.

-Andy




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Sudden drop in DBb performance

2011-09-05 Thread Andy Colson

On 09/05/2011 01:45 PM, Scott Marlowe wrote:

On Mon, Sep 5, 2011 at 8:08 AM, Gerhard Wohlgenanntwo...@ai.wu.ac.at  wrote:

Below please find the results of vmstat 2 over some periode of time .. with
normal database / system load.


2  1 1344204 240924 104156 31462484  3500  1906   234 3687 4512 12  3 77  9


Your IO Wait is actually pretty high.  On an 8 core machine, 12.5%
means one core is doing nothing but waiting for IO.



My server is 2-core, so these numbers looked fine to me.  I need to remember 
core count when I look at these.

So the line above, for 2 cores, would not worry me a bit, but on 8 cores it 
pretty much means one core was pegged (with 9% wait?  Or is it one core was 
pegged, and another was 72% io wait?).

I have always loved the vmstat output, but it's starting to get confusing when 
you have to take cores into account.  (And my math was never strong in the 
first place :-) )

Good catch, thanks Scott.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] IN or EXISTS

2011-08-31 Thread Andy Colson

On 8/30/2011 8:33 PM, Craig Ringer wrote:

On 31/08/2011 4:30 AM, Andy Colson wrote:

Hi all,

I have read things someplace saying not exists was better than not
in... or something like that. Not sure if that was for in/exists and
not in/not exists, and for a lot of records or not.


`EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary
to build a list of values then iterate over them to check for a match.
By contrast, `EXISTS' may use a simple index lookup or the like to test
for the presence of a value.

On the other hand, the `IN' subquery is uncorrelated needs only run
once, where the `EXISTS' subquery is correlated and has to run once for
every outer record. That means that the `IN' list approach can be a lot
faster where the subquery in question is relatively time consuming for
the number of values it returns. For example, if the `IN' query returns
only 5 values and takes 100ms, you're scanning 1 million records in the
outer query, and the subquery `EXISTS' version would take 50ms, using
`IN' is a no-brainer since 1 million times 50ms will be a lot slower
than 1 times 100ms plus the time required to scan 5 elements 1 million
times.

Another complication is the possible presence of NULL in an IN list.
Getting NULLs in `IN' lists is a common source of questions on this
list, because people are quite surprised by how it works. EXISTS avoids
the NULL handling issue (and in the process demonstrates how woefully
inconsistent SQL's handling of NULL really is).

Theoretically the query planner could transform:

SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE
z.y_id IS NOT NULL);

into:

SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)

... or vice versa depending on which it thought would be faster. AFAIK
it doesn't currently do this. To be able to do it the planner would need
to know how to estimate the cost of scanning an `IN' result list. It'd
also need to be able to use constraints on the target table to prove
that the result of the `IN' may not contain nulls. To transform the
EXISTS version into the IN version where it'd be more efficient, it'd
also have to be able to use constraints on the target table to prove
that results of a SELECT would be unique without explicit deduplication.

All this makes me wonder ... does Pg currently support sorting IN lists
and using a binary search? It'd be pretty nice to be able to prove that:

SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);

is equvalent to:

SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id
IS NOT NULL)

... and either transform it to an EXISTS test or add an ORDER BY z_id
and flag the resultset as sorted so a binary search could be done on it
whenever a row hits the IN test.

--
Craig Ringer



Yeah... my current code uses IN.  Most of my updates are small, so my 
inner list is < 500 integers.  It runs fine.  What I'm worried about is 
when I update the entire table, so my inner list is 60k integers.  Maybe 
I'm just worrying for naught.  I tested a table with 100k rows, ran both 
with explain analyze, and they look the same:


 Delete  (cost=11186.26..20817.60 rows=25911 width=12) (actual time=408.138..408.138 rows=0 loops=1)
   ->  Hash Semi Join  (cost=11186.26..20817.60 rows=25911 width=12) (actual time=61.997..182.573 rows=105434 loops=1)
         Hash Cond: (public.general.gid = upd.general.gid)
         ->  Seq Scan on general  (cost=0.00..9113.11 rows=25911 width=10) (actual time=0.004..42.364 rows=105434 loops=1)
         ->  Hash  (cost=9868.34..9868.34 rows=105434 width=10) (actual time=61.958..61.958 rows=105434 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 4531kB
               ->  Seq Scan on general  (cost=0.00..9868.34 rows=105434 width=10) (actual time=0.003..34.372 rows=105434 loops=1)


With or without an index (even if I ANALYZE it), it still does a table 
scan and builds a hash.  Both IN and EXISTS act the same way.


I assume:
Buckets: 16384  Batches: 1  Memory Usage: 4531kB

That means a total of 4.5 meg of ram was used for the hash, so if my 
work_mem was lower than that it would swap?  (or choose a different plan?)
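
One way to test that without guessing (a sketch; work_mem can be set per session):

  set work_mem = '2MB';
  explain analyze delete from public.general
  where gid in (select gid from upd.general);

If the hash no longer fits in work_mem, the plan should show Batches greater than 
1 (the hash spills to temp files on disk) rather than the backend swapping, or 
the planner may pick a different plan entirely.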


I'll only ever be running one update at a time, so I'm not worried about 
multiple connections running at once.


Anyway, I'll just leave it alone (and stop optimizing things that don't 
need it).


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow performance

2011-08-31 Thread Andy Colson
When you ran it, did it really feel like 30 seconds?  Or did it come 
right back real quick?


Because your report says:

 35.833 ms

That's ms, or milliseconds, or 0.035 seconds.

-Andy


On 8/31/2011 8:04 AM, Kai Otto wrote:

Hi all,

I am running a simple query:

SELECT * FROM public.“Frame”

Time taken:

35.833 ms (i.e. roughly 35 seconds)

Number of rows:

121830

Number of columns:

38

*This is extremely slow for a database server.*

*Can anyone help me in finding the problem?*

Thanks,

KOtto

*Client:* pgAdmin III

*_Information:_*

*Table definition for “Frame”:*

CREATE TABLE "Frame"
(
  "ID" bigint NOT NULL,
  "Series.ID" bigint NOT NULL,
  filename text NOT NULL,
  "Frame UID" text NOT NULL,
  "Instance Number" integer,
  "Image Type" text,
  "Scanning Sequence" text,
  "Sequence Variant" text,
  "Scan Options" text,
  "MR Acquisition Type" text,
  "Sequence Name" text,
  "Angio Flag" text,
  "Repetition Time" double precision,
  "Echo Time" double precision,
  "Inversion Time" double precision,
  "Number of Averages" double precision,
  "Imaging Frequency" double precision,
  "Imaged Nucleus" text,
  "Echo Number" text,
  "Magnetic Field Strength" double precision,
  "Spacing Between Slices" double precision,
  "Number of Phase Encoding Steps" integer,
  "Echo Train Length" integer,
  "Protocol Name" text,
  "Trigger Time" double precision,
  "Nominal Interval" integer,
  "Cardiac Number of Images" integer,
  "SAR" double precision,
  "Image Position Patient" text,
  "Image Orientation Patient" text,
  "Slice Location" double precision,
  "Rows" integer,
  "Columns" integer,
  "Pixel Spacing" text,
  "Transfer Syntax UID" text,
  "SOP Instance UID" text,
  "Temporal Position Identifier" integer,
  "Number Of Temporal Positions" integer,
  CONSTRAINT "Frame_pkey" PRIMARY KEY ("ID"),
  CONSTRAINT "Frame_ID_key" UNIQUE ("ID")
)
WITH (
  OIDS=FALSE
);

ALTER TABLE "Frame" OWNER TO MDDBClient;

GRANT ALL ON TABLE "Frame" TO MDDBClient;

GRANT ALL ON TABLE "Frame" TO public;

*PostGreSQL :* 9.0

*History:* Query has always been slow

*Hardware: *Win 7 enterprise 64bit with SP1, 3.0GB RAM, Intel Xeon 3050
@ 2.13Ghz dual, 500GB HD (/WD5000AAKS/).

*Explain:*

Seq Scan on Frame (cost=0.00..9537.30 rows=121830 width=541) (actual
time=0.047..93.318 rows=121830 loops=1)

Total runtime: 100.686 ms

*Auto Vacuum: *Vacuum just performed.


*GUC:*

version;PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit

bytea_output;escape

client_encoding;UNICODE

effective_cache_size;2GB

lc_collate;English_United States.1252

lc_ctype;English_United States.1252

listen_addresses;*

log_destination;stderr

log_line_prefix;%t 

logging_collector;on

max_connections;100

max_stack_depth;2MB

port;5432

server_encoding;UTF8

shared_buffers;32MB

TimeZone;CET

work_mem;16MB




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow performance

2011-08-31 Thread Andy Colson

On 8/31/2011 1:51 PM, Alan Hodgson wrote:

On August 31, 2011 11:26:57 AM Andy Colson wrote:

When you ran it, did it really feel like 30 seconds?  Or did it come
right back real quick?

Because your report says:
35.833 ms

Thats ms, or milliseconds, or 0.035 seconds.



I think the . is a thousands separator in some locales, possibly the reason
for confusion.



D'oh.  I'm zero for two today.  Guess I'll call it a day.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] IN or EXISTS

2011-08-30 Thread Andy Colson

Hi all,

I have read things someplace saying not exists was better than not in... 
or something like that.  Not sure if that was for in/exists and not 
in/not exists, and for a lot of records or not.


Here is my setup:

My website has a general table, let's say 60k rows.  It's mostly read-only. 
 Every once in a while we get updated data, so I:

create schema upd;
create table upd.general(like public.general);

Then I dump the new data into upd.general.  (This has many tables and 
steps; I'm simplifying it here.)


For the last step, I want to:

begin;
delete from public.general where gid in (select gid from upd.general);
insert into public.general select * from upd.general;
... 7 other tables same way ...
commit;


Most of the time upd.general will be < 500 rows.  Every once in a while 
things get messed up and we just update the entire database, so count(*) of 
upd.general == count(*) of public.general.


My question is:
fast is nice, but safe and less resource intensive is better, so which 
would I probably like better:


delete from public.general where gid in (select gid from upd.general);

or

-- currently don't have an index, so
create index general_pk on upd.general(gid);
delete from public.general a where exists(select 1 from upd.general b 
where a.gid=b.gid);



Thanks for any suggestions,

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Variable versus constrant size tuples

2011-08-19 Thread Andy Colson

On 8/19/2011 4:03 AM, Krzysztof Chodak wrote:

Is there any performance benefit of using constant size tuples?



If you are referring to varchar(80) vs text, then no, there is no benefit.
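
If you want to convince yourself, a quick check (a sketch; both expressions 
should report the same size):

  select pg_column_size('abc'::varchar(80)) as vc, pg_column_size('abc'::text) as txt;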

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tuning Tips for a new Server

2011-08-17 Thread Andy Colson

On 8/16/2011 8:35 PM, Ogden wrote:

Hope all is well. I have received tremendous help from this list prior and 
therefore wanted some more advice.

I bought some new servers and instead of RAID 5 (which I think greatly hindered 
our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is 
dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different 
virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting 
Postgres' xlog directory on the OS virtual drive. Does this even make sense to 
do?

The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 
6-core each).

It is a dedicated PostgreSQL box and needs to support heavy read and moderately 
heavy writes.

Currently, I have this for the current system which as 16Gb Ram:

  max_connections = 350

work_mem = 32MB
maintenance_work_mem = 512MB
wal_buffers = 640kB

# This is what I was helped with before and made reporting queries blaze by
seq_page_cost = 1.0
random_page_cost = 3.0
cpu_tuple_cost = 0.5
effective_cache_size = 8192MB

Any help and input is greatly appreciated.

Thank you

Ogden


What seems to be the problem?  I mean, if nothing is broke, then don't 
fix it :-)


You say reporting queries are fast, and the disks should take care of 
your slow write problem from before.  (Did you test the write 
performance?)  So, what's wrong?



-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DBT-5 Postgres 9.0.3

2011-08-17 Thread Andy Colson

On 8/17/2011 10:29 AM, bobbyw wrote:

Hi, I know this is an old thread, but I wanted to chime in since I am having
problems with this as well.

I too am trying to run dbt5 against Postgres.  Specifically I am trying to
run it against Postgres 9.1beta3.

After jumping through many hoops I ultimately was able to build dbt5 on my
debian environment, but when I attempt to run the benchmark with:

dbt5-run-workload -a pgsql -c 5000 -t 5000 -d 60 -u 1 -i ~/dbt5-0.1.0/egen
-f 500 -w 300 -n dbt5 -p 5432 -o /tmp/results

it runs to completion but all of the dbt5 log files contain errors like:

terminate called after throwing an instance of 'pqxx::broken_connection'
   what():  could not connect to server: No such file or directory
 Is the server running locally and accepting
 connections on Unix domain socket
/var/run/postgresql/.s.PGSQL.5432?

I'm lead to believe that this is an error I would receive if the Postgres db
were not running, but it is.  In fact, the way dbt5-run-workload works it
starts the database automatically.  I have also confirmed it is running by
manually connecting while this benchmark is in progress (and after it has
already started the database and logged the above error).

Any thoughts on why I might be getting this error?

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/DBT-5-Postgres-9-0-3-tp4297670p4708692.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.



It's trying to connect to unix socket /var/run/postgresql/.s.PGSQL.5432,

but your postgresql.conf file probably has:
unix_socket_directory = '/tmp'


Change it to:
unix_socket_directory = '/var/run/postgresql'

and restart PG.
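
Once it's back up you can double-check what the server is actually using (a 
quick sanity check):

  show unix_socket_directory;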


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Raid 5 vs Raid 10 Benchmarks Using bonnie++

2011-08-17 Thread Andy Colson

On 8/17/2011 1:35 PM, k...@rice.edu wrote:

On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote:


On Aug 17, 2011, at 1:31 PM, k...@rice.edu wrote:


On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote:

I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with 
the new one we have, which I have configured with RAID 10. The drives are the 
same (SAS 15K). I tried the new system with ext3 and then XFS but the results 
seem really outrageous as compared to the current system, or am I reading 
things wrong?

The benchmark results are here:

http://malekkoheavyindustry.com/benchmark.html


Thank you

Ogden


That looks pretty normal to me.

Ken


But such a jump from the current db01 system to this? Over 20 times difference 
from the current system to the new one with XFS. Is that much of a jump normal?

Ogden


Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get 
similar
results with EXT4 as well, I suspect, although you did not test that.

Regards,
Ken



A while back I tested ext3 and xfs myself and found xfs performs better 
for PG.  However, I also have a photos site with 100K files (split into 
a small subset of directories), and xfs sucks bad on it.


So my db is on xfs, and my photos are on ext4.

The numbers between raid5 and raid10 don't really surprise me either.  I 
went from 100 Meg/sec to 230 Meg/sec going from 3 disk raid 5 to 4 disk 
raid 10.  (I'm, of course, using SATA drives with 4 gig of ram... 
and 2 cores.  Everyone with more than 8 cores and 64 gig of ram is off 
my Christmas list! :-) )


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Raid 5 vs Raid 10 Benchmarks Using bonnie++

2011-08-17 Thread Andy Colson

On 8/17/2011 1:55 PM, Ogden wrote:


On Aug 17, 2011, at 1:48 PM, Andy Colson wrote:


On 8/17/2011 1:35 PM, k...@rice.edu wrote:

On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote:


On Aug 17, 2011, at 1:31 PM, k...@rice.edu wrote:


On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote:

I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with 
the new one we have, which I have configured with RAID 10. The drives are the 
same (SAS 15K). I tried the new system with ext3 and then XFS but the results 
seem really outrageous as compared to the current system, or am I reading 
things wrong?

The benchmark results are here:

http://malekkoheavyindustry.com/benchmark.html


Thank you

Ogden


That looks pretty normal to me.

Ken


But such a jump from the current db01 system to this? Over 20 times difference 
from the current system to the new one with XFS. Is that much of a jump normal?

Ogden


Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get 
similar
results with EXT4 as well, I suspect, although you did not test that.

Regards,
Ken



A while back I tested ext3 and xfs myself and found xfs performs better for PG. 
 However, I also have a photos site with 100K files (split into a small subset 
of directories), and xfs sucks bad on it.

So my db is on xfs, and my photos are on ext4.



What about the OS itself? I put the Debian linux sysem also on XFS but haven't 
played around with it too much. Is it better to put the OS itself on ext4 and 
the /var/lib/pgsql partition on XFS?

Thanks

Ogden


I doubt it matters.  The OS is not going to batch-delete thousands of 
files.  Once it's set up, it's pretty constant.  I would not worry about it.


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Parameters for PostgreSQL

2011-08-04 Thread Andy Colson



On 8/3/2011 11:03 PM, Craig Ringer wrote:

great gobs of battery backed write cache DRAM.


Now I know what I'm asking Santa for Christmas this year!

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres performance on Linux and Windows

2011-08-03 Thread Andy Colson

On 8/3/2011 11:37 AM, Dusan Misic wrote:

I had done some testing for my application (WIP) and I had executed same
SQL script and queries on real physical 64-bit Windows 7 and on
virtualized 64-bit CentOS 6.

Both database servers are tuned with real having 8 GB RAM and 4 cores,
virtualized having 2 GB RAM and 2 virtual cores.

Virtualized server crushed real physical server in performance in both
DDL and DML scripts.

My question is simple. Does PostgreSQL perform better on Linux than on
Windows and how much is it faster in your tests?

Thank you for your time.



Given the exact same hardware, I think PG will perform better on Linux.

Your question of how much faster really depends on usage.  If you're 
cpu bound then I'd bet they perform the same.  You are cpu bound after 
all, and on the exact same hardware, it should be the same.


If you have lots of clients, with lots of IO, I think linux would 
perform better, but it's hard to say how much.  I can't recall anyone posting 
benchmarks from the exact same hardware.


Comparing windows on metal vs linux on vm is like comparing apples to 
Missouri.  If your test was io bound, and the vmserver was write 
caching, that's why your vm won so well... but I'd hate to see a power 
failure.


It would be interesting to compare windows on metal vs windows on vm 
though.  (Which, I have done linux on metal vs linux on vm, but the 
hardware specs where different (dual amd64 4 sata software raid10 vs 
intel 8-core something with 6-disk scsi hardware raid), but linux on 
metal won every time.)


I think in the long run, running the system you know best will be a 
win.  If you don't know linux much, and run into problems, how much 
time/money will you spend fixing it, compared to windows?


If you have to have the absolute fastest system, Linux on metal is 
the way to go.


(This is all speculation and personal opinion, I have no numbers to back 
anything up)


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance

2011-04-29 Thread Andy Colson

On 4/29/2011 1:55 PM, Greg Smith wrote:

James Mansion wrote:

Does the server know which IO it thinks is sequential, and which it
thinks is random? Could it not time the IOs (perhaps optionally) and
at least keep some sort of statistics of the actual observed times?


It makes some assumptions based on what the individual query nodes are
doing. Sequential scans are obviously sequential; index lookupss random;
bitmap index scans random.

The measure the I/O and determine cache state from latency profile has
been tried, I believe it was Greg Stark who ran a good experiment of
that a few years ago. Based on the difficulties of figuring out what
you're actually going to with that data, I don't think the idea will
ever go anywhere. There are some really nasty feedback loops possible in
all these approaches for better modeling what's in cache, and this one
suffers the worst from that possibility. If for example you discover
that accessing index blocks is slow, you might avoid using them in favor
of a measured fast sequential scan. Once you've fallen into that local
minimum, you're stuck there. Since you never access the index blocks,
they'll never get into RAM so that accessing them becomes fast--even
though doing that once might be much more efficient, long-term, than
avoiding the index.

There are also some severe query plan stability issues with this idea
beyond this. The idea that your plan might vary based on execution
latency, that the system load going up can make query plans alter with
it, is terrifying for a production server.



How about if the stats were kept, but had no effect on plans, the 
optimizer, or anything else?


It would be a diag tool.  When someone wrote the list saying AH! It 
used the wrong index!, you could say: please post your config 
settings, and the stats from 'select * from pg_stats_something'.


We (or, you really) could compare the seq_page_cost and random_page_cost 
from the config to the stats collected by PG, determine that they are way 
off, and tell you to edit your config a little and restart PG.
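
The cost-settings half of that comparison is already easy to grab (the stats 
view above is hypothetical, but pg_settings is not):

  select name, setting from pg_settings
  where name in ('seq_page_cost', 'random_page_cost');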


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance on AIX

2011-03-19 Thread Andy Colson

On 03/19/2011 04:00 AM, phb07 wrote:

Hi all,

At Bull company, we want to answer a call for tender from a large company. And 
we are asked for information about PostgreSQL performance under AIX on Power 7 
servers.

By chance, has someone some data about this ?
Has someone performed a benchmark using AIX quite recently ?

Are there any reasons for having performance level significantly different 
between AIX and, let say, Linux, on a given platform ?

Thanks by advance for any help.

Philippe BEAUDOIN



Dunno, never gotten to play with AIX or a Power 7... If you sent me one I'd be 
more than happy to benchmark it and send it back :-)

Or, more seriously, even remote ssh would do.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Fastest pq_restore?

2011-03-18 Thread Andy Colson

On 3/18/2011 9:38 AM, Kevin Grittner wrote:

Andy Colsona...@squeakycode.net  wrote:

On 03/17/2011 09:25 AM, Michael Andreasen wrote:



I've been looking around for information on doing a pg_restore as
fast as possible.



bgwriter_lru_maxpages = 0


I hadn't thought much about that last one -- do you have benchmarks
to confirm that it helped with a bulk load?



Nope, I got it from the running with scissors thread (I think), maybe from 
Greg Smith,



or here:

http://rhaas.blogspot.com/2010/06/postgresql-as-in-memory-only-database_24.html

I don't recall exactly.  I saw it, and added a comment to my .conf just 
in case I ever needed it.


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Fastest pq_restore?

2011-03-17 Thread Andy Colson

On 03/17/2011 09:25 AM, Michael Andreasen wrote:

Hi,

I've been looking around for information on doing a pg_restore as fast as 
possible. It is for a backup machine so I am not interested in anything like 
crash recovery or anything else that would impact speed of load. I just want to 
go from no database to database there as fast as possible. The server is for 
postgresql only and this is the only database, sp both system at postgres can 
be set however is required for the fast load.

Currently I am using a twin processor box with 2GB of memory and raid 5 disk.

I start postgres before my load with these settings, which have been suggested.


shared_buffers = 496MB
maintenance_work_mem = 160MB
checkpoint_segments = 30
autovacuum = false
full_page_writes=false

maintenance_work_mem and checkpoint_segments were advised to be increased, which I have 
done, but these are just guess values as I couldn't see any advise for values, other than 
bigger.


I restore like this;

pg_restore -Fc -j 4 -i -O -d my_db my_db_dump.tbz


Even as this, it is still slower than I would like.

Can someone suggest some optimal settings (for postgresql 9) that will get this 
as quick as it can be?

Thanks.







autovacuum = off
fsync = off
synchronous_commit = off
full_page_writes = off
bgwriter_lru_maxpages = 0



-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2011-03-08 Thread Andy Colson
I have seen really complex geometries cause problems.  If you have 
thousands of points, when 10 would do, try ST_Simplify and see if it 
doesn't speed things up.


-Andy


On 3/8/2011 2:42 AM, Andreas Forø Tollefsen wrote:

Hi. Thanks for the comments. My data is right, and the result is exactly
what i want, but as you say i think what causes the query to be slow is
the ST_Intersection which creates the intersection between the vector
grid (fishnet) and the country polygons.
I will check with the postgis user list if they have any idea on how to
speed up this query.

Best,
Andreas

2011/3/8 Tom Lane t...@sss.pgh.pa.us mailto:t...@sss.pgh.pa.us

=?ISO-8859-1?Q?Andreas_For=F8_Tollefsen?= andrea...@gmail.com
mailto:andrea...@gmail.com writes:
  This is a query i am working on now. It creates an intersection
of two
  geometries. One is a grid of 0.5 x 0.5 decimal degree sized
cells, while the
  other is the country geometries of all countries in the world for
a certain
  year.

Hm, are you sure your data is right?  Because the actual rowcounts imply
that each country intersects about half of the grid cells, which doesn't
seem right.

  priogrid=# EXPLAIN ANALYZE SELECT priogrid_land.gid, gwcode,
  ST_Intersection(pri
  ogrid_land.cell, cshapeswdate.geom) FROM priogrid_land,
cshapeswdate WHERE
  ST_Intersects(priogrid_land.cell, cshapeswdate.geom);
 
                                QUERY PLAN
  --------------------------------------------------------------------------
   Nested Loop  (cost=0.00..12644.85 rows=43351 width=87704) (actual time=1.815..7074973.711 rows=130331 loops=1)
     Join Filter: _st_intersects(priogrid_land.cell, cshapeswdate.geom)
     ->  Seq Scan on cshapeswdate  (cost=0.00..14.42 rows=242 width=87248) (actual time=0.007..0.570 rows=242 loops=1)
     ->  Index Scan using idx_priogrid_land_cell on priogrid_land  (cost=0.00..7.15 rows=1 width=456) (actual time=0.069..5.604 rows=978 loops=242)
           Index Cond: (priogrid_land.cell && cshapeswdate.geom)
   Total runtime: 7075188.549 ms
  (6 rows)

AFAICT, all of the runtime is going into calculating the ST_Intersects
and/or ST_Intersection functions.  The two scans are only accounting for
perhaps 5.5 seconds, and the join infrastructure isn't going to be
terribly expensive, so it's got to be those functions.  Not knowing much
about PostGIS, I don't know if the functions themselves can be expected
to be really slow.  If it's not them, it could be the cost of fetching
their arguments --- in particular, I bet the country outlines are very
large objects and are toasted out-of-line.  There's been some past
discussion of automatically avoiding repeated detoastings in scenarios
like the above, but nothing's gotten to the point of acceptance yet.
Possibly you could do something to force detoasting in a subquery.

regards, tom lane





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2011-03-08 Thread Andy Colson

On 3/8/2011 10:58 AM, Andreas Forø Tollefsen wrote:

Andy. Thanks. That is a great tips. I tried it but i get the error:
NOTICE: ptarray_simplify returned a 2 pts array.

Query:
SELECT ST_Intersection(priogrid_land.cell,
ST_Simplify(cshapeswdate.geom,0.1)) AS geom,
priogrid_land.gid AS divider, gwcode, gwsyear, gweyear, startdate,
enddate, capname, caplong, caplat, col, row, xcoord, ycoord
FROM priogrid_land, cshapeswdate WHERE ST_Intersects(priogrid_land.cell,
ST_Simplify(cshapeswdate.geom,0.1)) AND cshapeswdate.gwsyear =1946 AND
cshapeswdate.gweyear =1946 AND cshapeswdate.startdate = '1946/1/1';


2011/3/8 Andy Colson a...@squeakycode.net mailto:a...@squeakycode.net

I have seen really complex geometries cause problems.  If you have
thousands of points, when 10 would do, try ST_Simplify and see if it
doesnt speed things up.

-Andy



On 3/8/2011 2:42 AM, Andreas Forř Tollefsen wrote:

Hi. Thanks for the comments. My data is right, and the result is
exactly
what i want, but as you say i think what causes the query to be
slow is
the ST_Intersection which creates the intersection between the
vector
grid (fishnet) and the country polygons.
I will check with the postgis user list if they have any idea on
how to
speed up this query.

Best,
Andreas

2011/3/8 Tom Lane t...@sss.pgh.pa.us mailto:t...@sss.pgh.pa.us
mailto:t...@sss.pgh.pa.us mailto:t...@sss.pgh.pa.us


=?ISO-8859-1?Q?Andreas_For=F8_Tollefsen?=
andrea...@gmail.com mailto:andrea...@gmail.com
mailto:andrea...@gmail.com mailto:andrea...@gmail.com writes:
  This is a query i am working on now. It creates an intersection
of two
  geometries. One is a grid of 0.5 x 0.5 decimal degree sized
cells, while the
  other is the country geometries of all countries in the world for
a certain
  year.

Hm, are you sure your data is right?  Because the actual
rowcounts imply
that each country intersects about half of the grid cells,
which doesn't
seem right.

  priogrid=# EXPLAIN ANALYZE SELECT priogrid_land.gid, gwcode,
  ST_Intersection(pri
  ogrid_land.cell, cshapeswdate.geom) FROM priogrid_land,
cshapeswdate WHERE
  ST_Intersects(priogrid_land.cell, cshapeswdate.geom);
 
   QUERY
  PLAN

 

  

 
--
   Nested Loop  (cost=0.00..12644.85 rows=43351 width=87704)
(actual
  time=1.815..7
  074973.711 rows=130331 loops=1)
 Join Filter: _st_intersects(priogrid_land.cell,
cshapeswdate.geom)
 -  Seq Scan on cshapeswdate  (cost=0.00..14.42 rows=242
width=87248)
  (actual
   time=0.007..0.570 rows=242 loops=1)
 -  Index Scan using idx_priogrid_land_cell on priogrid_land
   (cost=0.00..7.1
  5 rows=1 width=456) (actual time=0.069..5.604 rows=978 loops=242)
   Index Cond: (priogrid_land.cell  cshapeswdate.geom)
   Total runtime: 7075188.549 ms
  (6 rows)

AFAICT, all of the runtime is going into calculating the
ST_Intersects
and/or ST_Intersection functions.  The two scans are only
accounting for
perhaps 5.5 seconds, and the join infrastructure isn't going
to be
terribly expensive, so it's got to be those functions.  Not
knowing much
about PostGIS, I don't know if the functions themselves can
be expected
to be really slow.  If it's not them, it could be the cost
of fetching
their arguments --- in particular, I bet the country
outlines are very
large objects and are toasted out-of-line.  There's been
some past
discussion of automatically avoiding repeated detoastings in
scenarios
like the above, but nothing's gotten to the point of
acceptance yet.
Possibly you could do something to force detoasting in a
subquery.

regards, tom lane








ew... that's not good.  Seems like it simplified it down to a single 
point?  (Not 100% sure that's what the error means, just a guess.)


Try getting some info about it:

select
  ST_Npoints(geom) As before,
  ST_NPoints(ST_Simplify(geom,0.1)) as after
from cshapeswdate


Also try things like ST_IsSimple and ST_IsValid.  I seem to recall sometimes 
needing ST_Points or ST_NumPoints instead of ST_NPoints.
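
A rough validity check over the whole table might look like this (a sketch; 
gwcode is just used as a row identifier here):

  select gwcode from cshapeswdate where not ST_IsValid(geom);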


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Performance trouble finding records through related records

2011-03-03 Thread Andy Colson

On 3/3/2011 3:19 AM, sverhagen wrote:


Andy Colson wrote:


For your query, I think a join would be the best bet, can we see its
explain analyze?




Here is a few variations:


SELECT events_events.* FROM events_events WHERE transactionid IN (
SELECT transactionid FROM events_eventdetails customerDetails
WHERE customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='1957'
AND transactionid IS NOT NULL
) ORDER BY id LIMIT 50;

-- http://explain.depesz.com/s/Pnb


explain analyze SELECT events_events.* FROM events_events,
events_eventdetails customerDetails
WHERE events_events.transactionid = customerDetails.transactionid
AND customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='1957'
AND customerDetails.transactionid IS NOT NULL
ORDER BY id LIMIT 50;

-- http://explain.depesz.com/s/rDh


explain analyze SELECT events_events.* FROM events_events
JOIN events_eventdetails customerDetails
ON events_events.transactionid = customerDetails.transactionid
AND customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='1957'
AND customerDetails.transactionid IS NOT NULL
ORDER BY id LIMIT 50;

-- http://explain.depesz.com/s/6aB


Thanks for your efforts!



Huh.  Pretty much exactly the same.  I'm sorry but I think I'm at my 
limit.  I'm not sure why the nested loop takes so long, or how to get it 
to use something different.


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance trouble finding records through related records

2011-03-02 Thread Andy Colson

On 03/02/2011 06:12 PM, sverhagen wrote:

Thanks for your help already!
Hope you're up for some more :-)


Andy Colson wrote:


First off, excellent detail.

Second, your explain analyze was hard to read... but since you are not
really interested in your posted query, I wont worry about looking at
it... but... have you seen:

http://explain.depesz.com/



Thanks for that. Using it below :-)


Andy Colson wrote:


If you run the individual queries, without the union, are the part's slow
too?



Only problem is the second part. So that part can safely be isolated. Also
the following does not play a role at this point: WHERE events2.eventtype_id
IN
(100,103,105,...

Then I went ahead and denormalized the transactionId on both ends, so that
both events_events records and events_eventdetails records have the
transactionId (or NULL). That simplifies the query to this:

SELECT events_events.* FROM events_events WHERE transactionid IN (
SELECT transactionid FROM events_eventdetails customerDetails
WHERE customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='1957'
AND transactionid IS NOT NULL
) ORDER BY id LIMIT 50;

To no avail. Also changing the above WHERE IN into implicit/explicit JOIN's
doesn't make more than a marginal difference. Should joining not be very
efficient somehow?

http://explain.depesz.com/s/Pnb

The above link nicely shows the hotspots, but I am at a loss now as how to
approach them.


Andy Colson wrote:


Looked like your row counts (the estimate vs the actual) were way off,
have you analyzed lately?



Note sure what that means.
Isn't all the maintenance nicely automated through my config?




In the explain analyze you'll see stuff like:
Append  (cost=0.00..325645.93 rows=115469434145 width=52) (actual 
time=0.304..58763.738 rows=222 loops=1)

This is taken from your first email.  Red flags should go off when the row 
counts are not close.  The first set is the planner's guess.  The second set is 
what actually happened.  The planner thought there would be 115,469,434,145 
rows, but it turned out to be only 222.  That's usually caused by bad stats.
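
Refreshing the stats by hand is cheap enough to just try (autovacuum normally 
does this, but it never hurts before re-running the explain):

  ANALYZE events_events;
  ANALYZE events_eventdetails;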


Isn't all the maintenance nicely automated through my config?



I'd never assume.  But the numbers in the plan you posted:


http://explain.depesz.com/s/Pnb


look fine to me (well, the row counts), and I didn't look too much at that plan 
in the first email, so we can probably ignore it.



Andy Colson wrote:


I could not tell from the explain analyze if an index was used, but I
notice you have a ton of indexes on events_events table.



Yes, a ton of indexes, but still not the right one :-)


But... many indexes will slow down update/inserts.  And an index on an 
unselective field can cause more problems than it would help.  Especially if 
the stats are off.  If PG has lots and lots of options, it'll take longer to 
plan queries too.  If it picks an index that it thinks is selective, but 
in reality is not, you are in for a world of hurt.

For your query, I think a join would be the best bet, can we see its explain 
analyze?

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance trouble finding records through related records

2011-03-01 Thread Andy Colson

On 03/01/2011 06:14 PM, sverhagen wrote:

Hi, appreciated mailing list. Thanks already for taking your time for my
performance question. Regards, Sander.


===POSTGRESQL VERSION AND ORIGIN===

PostgreSQL 8.3.9 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.2.4
(Ubuntu 4.2.4-1ubuntu3)
Installed using apt-get install postgresql-8.3


===A DESCRIPTION OF WHAT YOU ARE TRYING TO ACHIEVE===

Query involving tables events_events and events_eventdetails. There is any
number of events_eventdetails records for each events_events record.

There may be multiple records in events_events that have the same value for
their transactionId, which is available in one of their events_eventdetails
records.

We want a total query that returns events_events records that match
condition I. or II., sorted by datetime descending, first 50.

Condition I.
All events_events records for which an events_eventdetails records that
matches the following conditions:
- Column keyname (in events_eventdetails) equals customerId, and
- Column value (in events_eventdetails) equals 598124, or more precisely
substring(customerDetails.value,0,32)='598124'

Condition II.
All events_events records that have a same value for in one of their
events_eventdetails records with keyname 'transactionId' as any of the
resulting events_events records of condition I.

In other words: I want all events for a certain customerId, and all events
with the same transactionId as those.

The total query's time should be of the magnitude 100ms, but currently is of
the magnitude 1min.

JUST FOR THE PURPOSE OF EXPERIMENT I've now a denormalized copy of
transactionId as a column in the events_events records. Been trying queries
on those, with no improvements.

I am not seeking WHY my query is too slow, rather trying to find a way to
get it faster :-)



much snippage

First off, excellent detail.

Second, your explain analyze was hard to read... but since you are not 
really interested in your posted query, I won't worry about looking at it... 
but... have you seen:

http://explain.depesz.com/

It's nice.

And last, to my questions:

SELECT events1.id, events1.transactionId, events1.dateTime FROM
events_events events1
JOIN events_eventdetails customerDetails
ON events1.id = customerDetails.event_id
AND customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='598124'
WHERE events1.eventtype_id IN
(100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)
UNION
SELECT events2.id, events2.transactionId, events2.dateTime FROM
events_events events2
JOIN events_eventdetails details2_transKey
ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
JOIN events_eventdetails details2_transValue
ON substring(details2_transKey.value,0,32) =
substring(details2_transValue.value,0,32)
AND details2_transValue.keyname='transactionId'
JOIN events_eventdetails customerDetails
ON details2_transValue.event_id = customerDetails.event_id
AND customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='598124'
WHERE events2.eventtype_id IN
(100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)
ORDER BY dateTime DESC LIMIT 50


If you run the individual queries, without the union, are the parts slow too?

Looked like your row counts (the estimate vs the actual) were way off, have you 
analyzed lately?

I could not tell from the explain analyze if an index was used, but I notice 
you have a ton of indexes on events_events table.  You have two indexes on the 
same fields but in reverse order:

events_events_eventtype_id_datetime_ind (datetime, eventtype_id);
events_events_datetime_eventtype_id_ind (eventtype_id, datetime);

AND both eventtype_id and datetime are in other indexes!  I think you need to 
review your indexes.  Drop all of them and add one or two that are actually 
useful.
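
Before dropping anything, it's worth checking which indexes actually get used 
(a sketch; pg_stat_user_indexes keeps a per-index scan counter):

  select indexrelname, idx_scan
  from pg_stat_user_indexes
  where relname = 'events_events'
  order by idx_scan;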


A useful tool I have found for complex queries is to break them down into 
smaller subsets, write sql that gets me just those sets, and then add them 
all back into one main query with subselects:

select a,b,c,...
from events_events
where
 id in ( select id from details where some subset is needed )
and id not in ( select id from details where some set is bad )
and id in ( select anotherid from anothertable  where ... )


It's the subselects you need to think about.  Find one that gets you a small set 
that's interesting somehow.  Once you get all your little sets, it's easy to 
combine them.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Get master-detail relationship metadata

2011-02-03 Thread Andy Colson

On 2/3/2011 5:40 AM, Laszlo Nagy wrote:


Hi All,

I'm working on a client program that iterates over master-detail
relationships in a loop chain.

Pseudo code:

for row_1 in table_1:
table_2 = get_details(row_1,table2)
for row_2 in table_2:
row_3 = get_details(row_2,table3)
 etc.
process_data(row1,row_2,row_3,)

My task is to write the get_details iterator effectively. The obvious
way to do it is to query details in every get_details() call, but that
is not efficient. We have relationships where one master only has a few
details. For 1 million master rows, that would result in execution of
millions of SQL SELECT commands, degrading the performance by
magnitudes. My idea was that the iterator should pre-fetch and cache
data for many master records at once. The get_details() would use the
cached rows, thus reducing the number of SQL SELECT statements needed.
Actually I wrote the iterator, and it works fine in some cases. For
example:

producers = get_rows(producer)
for producer in producers:
products = get_getails(producer,product)
for product in products:
prices = get_details(product,prices)
for price in prices:
process_product_price(producer,product,price)

This works fine if one producer has not more than 1000 products and one
product has not more than 10 prices. I can easly keep 10 000 records in
memory. The actual code executes about 15 SQL queries while iterating
over 1 million rows. Compared to the original obvious method,
performance is increased to 1500%

But sometimes it just doesn't work. If a producer has 1 million
products, and one product has 100 prices, then it won't work, because I
cannot keep 100 million prices in memory. My program should somehow
figure out, how much rows it will get for one master, and select between
the cached and not cached methods.

So here is the question: is there a way to get this information from
PostgreSQL itself? I know that the query plan contains information about
this, but I'm not sure how to extract. Should I run an ANALYZE command
of some kind, and parse the result as a string? For example:

EXPLAIN select * from product where producer_id=1008;
QUERY PLAN
--
Seq Scan on product (cost=0.00..1018914.74 rows=4727498 width=1400)
Filter: (producer_id = 1008)
(2 rows)


Then I could extract rows=4727498 to get an idea about how much detail
rows I'll get for the master.

Is there any better way to do it? And how reliable is this?


Thanks,

Laszlo




One way would be to join the master to the detail, and write your code 
expecting duplicates.


q = get_rows("select * from product inner join price ... order by productid, priceid")

lastprodid = ''
for x in q:
    prodid = x.prodid
    if prodid != lastprodid:
        # we saw the last product, prepare to move to the next product
        lastprodid = prodid

    ... etc

 Is there any better way to do it? And how reliable is this?

It makes the sql really easy, but the code complex... so pick your poison.
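
For the producer/product/price case from your post, the fully written out join 
might look something like this (a sketch; the join column names are guesses, 
adjust to your schema):

select pr.id as producer_id, p.id as product_id, x.id as price_id
from producer pr
inner join product p on p.producer_id = pr.id
inner join prices x on x.product_id = p.id
order by pr.id, p.id, x.id;

Then the client loop just watches producer_id and product_id change, the same 
way the prodid check above does.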

-Andy



Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements

2011-02-03 Thread Andy Colson

On 2/3/2011 9:08 AM, Mark Stosberg wrote:


Each night we run over a 100,000 saved searches against PostgreSQL
9.0.x. These are all complex SELECTs using cube functions to perform a
geo-spatial search to help people find adoptable pets at shelters.

All of our machines in development in production have at least 2 cores
in them, and I'm wondering about the best way to maximally engage all
the processors.

Now we simply run the searches in serial. I realize PostgreSQL may be
taking advantage of the multiple cores some in this arrangement, but I'm
seeking advice about the possibility and methods for running the
searches in parallel.

One naive I approach I considered was to use parallel cron scripts. One
would run the odd searches and the other would run the even
searches. This would be easy to implement, but perhaps there is a better
way.  To those who have covered this area already, what's the best way
to put multiple cores to use when running repeated SELECTs with PostgreSQL?

Thanks!

 Mark




1) I'm assuming this is all server side processing.
2) One database connection will use one core.  To use multiple cores you 
need multiple database connections.
3) If your jobs are IO bound, then running multiple jobs may hurt 
performance.


Your naive approach is the best.  Just spawn off two jobs (or three, or 
whatever).  I think it's also the only method.  (If there is another 
method, I don't know what it would be.)


-Andy



Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements

2011-02-03 Thread Andy Colson

On 02/03/2011 04:56 PM, Greg Smith wrote:

Scott Marlowe wrote:

On Thu, Feb 3, 2011 at 8:57 AM,gnuo...@rcn.com  wrote:


Time for my pet meme to wiggle out of its hole (next to Phil's, and a day 
later).  For PG to prosper in the future, it has to embrace the 
multi-core/processor/SSD machine at the query level.  It has to.  And



I'm pretty sure multi-core query processing is in the TODO list.  Not
sure anyone's working on it tho.  Writing a big check might help.



Work on the exciting parts people are interested in is blocked behind completely mundane 
tasks like coordinating how the multiple sessions are going to end up with a consistent 
view of the database. See "Export snapshots to other sessions" at 
http://wiki.postgresql.org/wiki/ClusterFeatures for details on that one.

Parallel query works well for accelerating CPU-bound operations that are 
executing in RAM. The reality here is that while the feature sounds important, 
these situations don't actually show up that often. There are exactly zero 
clients I deal with regularly who would be helped out by this. The ones running 
web applications whose workloads do fit into memory are more concerned about 
supporting large numbers of users, not optimizing things for a single one. And 
the ones who have so much data that single users running large reports would 
seemingly benefit from this are usually disk-bound instead.

The same sort of situation exists with SSDs. Take out the potential users whose 
data can fit in RAM instead, take out those who can't possibly get an SSD big 
enough to hold all their stuff anyway, and what's left in the middle is not 
very many people. In a database context I still haven't found anything better 
to do with a SSD than to put mid-sized indexes on them, ones a bit too large 
for RAM but not so big that only regular hard drives can hold them.

I would rather strongly disagree with the suggestion that embracing either of 
these fancy but not really as functional as they appear at first approaches is 
critical to PostgreSQL's future. They're specialized techniques useful to only 
a limited number of people.

--
Greg Smith   2ndQuadrant usg...@2ndquadrant.comBaltimore, MD
PostgreSQL Training, Services, and 24x7 Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance:http://www.2ndQuadrant.com/books



4 cores is cheap and popular now, 6 in a bit, 8 next year, 16/24 cores in 5 
years.  You can do 16 cores now, but it's a bit expensive.  I figure hundreds of 
cores will be expensive in 5 years, but possible, and available.

CPUs won't get faster, but HDs and SSDs will.  To have one database 
connection, which runs one query, run fast, it's going to need multi-core 
support.

That's not to say we need parallel queries.  Or that we need multiple backends to 
work on one query.  We need one backend, working on one query, using mostly the same 
architecture, to just use more than one core.

You'll notice I used _mostly_ and _just_, and have no knowledge of PG 
internals, so I fully expect to be wrong.

My point is, there must be levels of threading, yes?  If a backend has data to 
sort, has it collected, nothing locked, what would it hurt to use multi-core 
sorting?

-- OR --

Threading (and multicore), to me, always means queues.  What if new types of backends 
were created that did simple things, that normal backends could distribute 
work to, then go off and do other things, and come back to collect the results?

I thought I read a paper someplace that said shared-cache (L1/L2/etc) multicore 
CPUs would start getting really slow at 16/32 cores, and that message passing 
was the way forward past that.  If PG started aiming for 128 core support right 
now, it should use some kinda message passing with queues thing, yes?

-Andy



Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements

2011-02-03 Thread Andy Colson

On 02/03/2011 10:00 PM, Greg Smith wrote:

Andy Colson wrote:

CPUs won't get faster, but HDs and SSDs will. To have one database 
connection, which runs one query, run fast, it's going to need multi-core 
support.


My point was that situations where people need to run one query on one database 
connection that aren't in fact limited by disk I/O are far less common than people 
think. My troublesome database servers aren't ones with a single CPU at its max 
but wishing there were more workers, they're the ones that have 25% waiting 
for I/O. And even that crowd is still a subset, distinct from people who don't 
care about the speed of any one core, they need lots of connections to go at once.



Yes, I agree... for today.  If you gaze into 5 years... double the core count 
(but not the speed), double the IO rate.  What do you see?



My point is, there must be levels of threading, yes? If a backend has data to 
sort, has it collected, nothing locked, what would it hurt to use multi-core 
sorting?


Optimizer nodes don't run that way. The executor pulls rows out of the top of 
the node tree, which then pulls from its children, etc. If you just blindly ran off and 
executed every individual node to completion in parallel, that's not always going to be 
faster--could be a lot slower, if the original query never even needed to execute 
portions of the tree.

When you start dealing with all of the types of nodes that are out there it gets very 
messy in a hurry. Decomposing the nodes of the query tree into steps that can be executed 
in parallel usefully is the hard problem hiding behind the simple idea of "use all 
the cores!"




What if... the nodes were run in separate threads, and interconnected via 
queues?  A node would not have to run to completion either.  A queue could be 
set up with a max number of items.  When a node adds 5 out of 5 items it would go to 
sleep.  Its parent node, by removing one of the items, could wake it up.

-Andy



Re: [PERFORM] High load,

2011-01-27 Thread Andy Colson

On 1/27/2011 4:31 AM, Michael Kohl wrote:

Hi all,

we are running a fairly big Ruby on Rails application on Postgres 8.4.
Our traffic grew quite a bit lately, and since then we are facing DB
performance issues. System load occasionally explodes (around 170
yesterday on a 16 core system), which seems to be caused by disk I/O
(iowait in our Munin graphs goes up significantly during these
periods). At other times the load stays rather low under pretty much
the same circumstances.

There are 6 application servers with 18 unicorns each, as well as 12
beanstalk workers talking to the DB. I know the problem description is
very vague, but so far we haven't consistently managed to reproduce
the problem. Turning off the beanstalk workers usually leads to a great
decrease in writes and system loads, but during yesterday's debugging
session they obviously ran fine (thanks, Murphy).

Below you'll find our system information and Postgres config, maybe
someone could be so kind as to point out any obvious flaws in our
current configuration while I'm trying to get a better description of
the underlying problem.


SNIP

If the suggestions below are not enough, you might have to check some of 
your sql statements and make sure they are all behaving.  You may not 
notice a table scan when the user count is low, but you will when it 
gets higher.


Have you run each of your queries through explain analyze lately?

Have you checked for bloat?

You are vacuuming/autovacuuming, correct?
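
A quick way to eyeball dead-row buildup (just a rough check, not a real bloat 
report) is something like:

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;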

-Andy



Re: [PERFORM] High load,

2011-01-27 Thread Andy Colson

On 1/27/2011 9:09 AM, Michael Kohl wrote:

On Thu, Jan 27, 2011 at 4:06 PM, Andy Colsona...@squeakycode.net  wrote:

Have you run each of your queries through explain analyze lately?


A code review including checking of queries is on our agenda.


You are vacuuming/autovacuuming, correct?


Sure :-)

Thank you,
Michael



Oh, also, when the box is really busy, have you watched vmstat to see if 
you start swapping?


-Andy



Re: [PERFORM] Queries becoming slow under heavy load

2011-01-25 Thread Andy Colson

On 1/25/2011 3:37 PM, Anne Rosset wrote:

Hi,

We are running some performances tests. With a lot of concurrent access,
queries get very slow. When there is no load, those queries run fast.

We kind of see a trend about these queries: it seems like the ones that
become very slow have an ORDER BY or MAX in them.

Here are our config settings:



SNIP


It seems to me that we should try increasing shared_buffers. But do you
have any other suggestions? Or do you see anything wrong in our config?

Thanks,

Anne



While I applaud your attempt to get us lots of information, 
unfortunately the one property you ask about (shared_buffers) is the one I 
can't seem to find.


So, maybe you could post a bit more:

1) how many concurrent clients?
2) can we see an explain analyze for a query when it's fast, and then 
again when it's slow?

3) Is this box dedicated to PG or are there other services running?
4) Looks like you have 8 Gig of ram, so I assume this is a 64 bit OS.  Can 
you tell us what you have for the following (a quick query for pulling all 
three is sketched just after #5 below):

shared_buffers
effective_cache_size
work_mem


5) Once many clients start hitting the db, it might not all fit into ram 
and will start hitting the HD.  Can you tell us what sort of IO you have 
(sata, scsi, raid, # of disks, etc)?
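
For #4, something like this pulls all three in one shot:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem');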


The stats from /proc/meminfo:
SwapTotal: 2097112 kB
SwapFree:  2096612 kB

Was this run when the system was busy?  Looks like you are not using any 
swap, so that's good at least.  Oh, wait, there are two cat 
/proc/meminfo's.  Is one from when it's fast and one from when it's slow?


Looks to me, in both cases, like you are not using much memory at all.  (If 
you happen to have 'free', its output is a little more readable; if you 
wouldn't mind posting it, we only really need it for when the box is slow.)


-Andy



Re: [PERFORM] Fun little performance IMPROVEMENT...

2011-01-21 Thread Andy Colson

On 1/21/2011 12:12 PM, gr...@amadensor.com wrote:

I was doing a little testing to see how machine load affected the
performance of different types of queries, index range scans, hash joins,
full scans, a mix, etc.

In order to do this, I isolated different performance hits, spinning only
CPU, loading the disk to create high I/O wait states, and using most of
the physical memory.   This was on a 4 CPU Xen virtual machine running
8.1.22 on CENTOS.


Here is the fun part.   When running 8 threads spinning calculating square
roots (using the stress package), the full scan returned consistently 60%
faster than the machine with no load.   It was returning 44,000 out of
5,000,000 rows.   Here is the explain analyze.   I am hoping that this
triggers something (I can run more tests as needed) that can help us make
it always better.

Idling:
  QUERY PLAN

  Seq Scan on schedule_details  (cost=0.00..219437.90 rows=81386 width=187)
(actual time=0.053..2915.966 rows=44320 loops=1)
Filter: (schedule_type = '5X'::bpchar)
  Total runtime: 2986.764 ms

Loaded:
  QUERY PLAN

  Seq Scan on schedule_details  (cost=0.00..219437.90 rows=81386 width=187)
(actual time=0.034..1698.068 rows=44320 loops=1)
Filter: (schedule_type = '5X'::bpchar)
  Total runtime: 1733.084 ms



Odd.  Did'ja by chance run the select more than once... maybe three or 
four times, and always get the same (or close) results?


Is the stress package running niced?

-Andy



Re: [PERFORM] Migrating to Postgresql and new hardware

2011-01-20 Thread Andy Colson

On 1/19/2011 6:42 PM, Craig Ringer wrote:

On 01/19/2011 05:09 PM, Lars wrote:

Thanks for the reply!



As others have mentioned, how are you going to be doing your shards?

Hmm... shards might not have been a good word to describe it. I'll
paste what I wrote in another reply:
I used sharding as an expression for partitioning data into several
databases.


sharding or shards is pretty much the standard way that setup is
described. It doesn't come up on the Pg list a lot as most people doing
web-oriented horizontally scaled apps use MySQL or fashionable non-SQL
databases, but it's pretty well known in wider circles.

--
Craig Ringer




Or... PG is just so good we've never had to use more than one database 
server!  :-)


-Andy



Re: [PERFORM] Migrating to Postgresql and new hardware

2011-01-18 Thread Andy Colson

On 1/18/2011 4:56 AM, Lars wrote:

Hi,

We are in the process of moving a web based application from a MySql
to Postgresql database. Our main reason for moving to Postgresql is
problems with MySql (MyISAM) table locking. We will buy a new set of
servers to run the Postgresql databases.

The current setup is five Dell PowerEdge 2950 with 2 *  XEON E5410,
4GB RAM. PERC 5/I 256MB NV Cache, 4 * 10K Disks (3 in RAID 5 + 1
spare).

One server is used for shared data. Four servers are used for sharded
data. A user in the system only has data in one of the shards. There
is another server to which all data is replicated but I'll leave that
one out of this discussion. These are dedicated database servers.
There are more or less no stored procedures. The shared database size
is about 20GB and each shard database is about 40GB (total of 20 + 40
* 4 = 180GB). I would expect the size will grow 10%-15% this year.
Server load might increase with 15%-30% this year. This setup is disk
I/O bound. The overwhelming majority of sql statements are fast
(typically single row selects, updates, inserts and deletes on
primary key) but there are some slow long running (10min) queries.



No idea what mysql thinks a shard is, but in PG we have read-only hot
standbys.

The standby database is exactly the same as the master (save a bit of 
data that has not been synced yet).  I assume you know this... but I'd 
really recommend trying out PG's hot standby and making sure it works the 
way you need (because I bet it's different than mysql's).


Assuming the shared and the sharded databases are totally different 
(let's call them database a and c), with the PG setup you'd have database 
a on one computer, then one master with database b on it (where all 
writes go), then several hot standbys mirroring database b (that 
support read-only queries).


As for the hardware, you'd better test it.  Got any old servers you 
could put a real-world workload on?  Or just buy one new server for 
testing?  It's pretty hard to guess what your usage pattern is (70% read, 
 small columns, no big blobs (like photos), etc)... and even then we'd 
still have to guess.


I can tell you, however, having your readers and writers not block each 
other is really nice.


Not only will I not compare apples to oranges, but I really wont compare 
apples in Canada to oranges in Japan. :-)


-Andy



Re: [PERFORM] Migrating to Postgresql and new hardware

2011-01-18 Thread Andy Colson

oops, call them database 'a' and database 'b'.



Re: [PERFORM] Possible to improve query plan?

2011-01-16 Thread Andy Colson

On 01/16/2011 09:21 PM, Jeremy Palmer wrote:

Hi all,

I've come to a dead end in trying to get a commonly used query to perform 
better. The query is against one table with 10 million rows. This table has 
been analysed. The table definition is:

CREATE TABLE version_crs_coordinate_revision
(
   _revision_created integer NOT NULL,
   _revision_expired integer,
   id integer NOT NULL,
   cos_id integer NOT NULL,
   nod_id integer NOT NULL,
   ort_type_1 character varying(4),
   ort_type_2 character varying(4),
   ort_type_3 character varying(4),
   status character varying(4) NOT NULL,
   sdc_status character(1) NOT NULL,
   source character varying(4),
   value1 numeric(22,12),
   value2 numeric(22,12),
   value3 numeric(22,12),
   wrk_id_created integer,
   cor_id integer,
   audit_id integer NOT NULL,
   CONSTRAINT pkey_version_crs_coordinate_revision PRIMARY KEY 
(_revision_created, id),
   CONSTRAINT version_crs_coordinate_revision_revision_created_fkey FOREIGN KEY 
(_revision_created)
   REFERENCES revision (id) MATCH SIMPLE
   ON UPDATE NO ACTION ON DELETE NO ACTION,
   CONSTRAINT version_crs_coordinate_revision_revision_expired_fkey FOREIGN KEY 
(_revision_expired)
   REFERENCES revision (id) MATCH SIMPLE
   ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
   OIDS=FALSE
);
ALTER TABLE version.version_crs_coordinate_revision ALTER COLUMN 
_revision_created SET STATISTICS 1000;
ALTER TABLE version.version_crs_coordinate_revision ALTER COLUMN 
_revision_expired SET STATISTICS 1000;
ALTER TABLE version.version_crs_coordinate_revision ALTER COLUMN id SET 
STATISTICS 1000;

CREATE INDEX idx_crs_coordinate_revision_created ON 
version.version_crs_coordinate_revision USING btree (_revision_created);
CREATE INDEX idx_crs_coordinate_revision_created_expired ON 
version.version_crs_coordinate_revision USING btree (_revision_created, 
_revision_expired);
CREATE INDEX idx_crs_coordinate_revision_expired ON 
version.version_crs_coordinate_revision USING btree (_revision_expired);
CREATE INDEX idx_crs_coordinate_revision_expired_created ON 
version.version_crs_coordinate_revision USING btree (_revision_expired, 
_revision_created);
CREATE INDEX idx_crs_coordinate_revision_expired_id ON 
version.version_crs_coordinate_revision USING btree (_revision_expired, id);
CREATE INDEX idx_crs_coordinate_revision_id ON 
version.version_crs_coordinate_revision USING btree (id);
CREATE INDEX idx_crs_coordinate_revision_id_created ON 
version.version_crs_coordinate_revision USING btree (id, _revision_created);


The distribution of the data is that all but 120,000 rows have null values in 
the _revision_expired column.

The query itself that I'm trying to optimise is below:

EXPLAIN
SELECT * FROM (
 SELECT
 row_number() OVER (PARTITION BY id ORDER BY _revision_created DESC) as 
row_number,
 *
 FROM
 version_crs_coordinate_revision
 WHERE (
     (_revision_created <= 16 AND _revision_expired > 16 AND _revision_expired <= 40) OR
     (_revision_created > 16 AND _revision_created <= 40)
 )
) AS T
WHERE row_number = 1;

Subquery Scan t  (cost=170692.25..175678.27 rows=767 width=205)
  Filter: (t.row_number = 1)
  ->  WindowAgg  (cost=170692.25..173760.57 rows=153416 width=86)
        ->  Sort  (cost=170692.25..171075.79 rows=153416 width=86)
              Sort Key: version_crs_coordinate_revision.id, version_crs_coordinate_revision._revision_created
              ->  Bitmap Heap Scan on version_crs_coordinate_revision  (cost=3319.13..157477.69 rows=153416 width=86)
                    Recheck Cond: (((_revision_expired > 16) AND (_revision_expired <= 40)) OR ((_revision_created > 16) AND (_revision_created <= 40)))
                    Filter: (((_revision_created <= 16) AND (_revision_expired > 16) AND (_revision_expired <= 40)) OR ((_revision_created > 16) AND (_revision_created <= 40)))
                    ->  BitmapOr  (cost=3319.13..3319.13 rows=154372 width=0)
                          ->  Bitmap Index Scan on idx_crs_coordinate_revision_expired  (cost=0.00..2331.76 rows=111041 width=0)
                                Index Cond: ((_revision_expired > 16) AND (_revision_expired <= 40))
                          ->  Bitmap Index Scan on idx_crs_coordinate_revision_created  (cost=0.00..910.66 rows=43331 width=0)
                                Index Cond: ((_revision_created > 16) AND (_revision_created <= 40))


One thought I have is that maybe the 
idx_crs_coordinate_revision_expired_created index could be used instead of 
idx_crs_coordinate_revision_expired.

Does anyone have any suggestions what I could do to improve the plan? Or how I 
could force the use of the idx_crs_coordinate_revision_expired_created index to 
see if that is better.

Thanks
Jeremy


First, wow, those are long names... I had a hard time keeping track.

Second: you have lots of duplicated indexes. I count _revision_created in 4 
indexes?  Not sure what other sql you are using, but have you tried one index 
for one column?  PG will be able to Bitmap them together if it thinks it can 
use more than one.

Re: [PERFORM] Possible to improve query plan?

2011-01-16 Thread Andy Colson

-Original Message-
From: Andy Colson [mailto:a...@squeakycode.net]
Sent: Monday, 17 January 2011 5:22 p.m.
To: Jeremy Palmer
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Possible to improve query plan?


First, wow, those are long names... I had a hard time keeping track.

Second: you have lots of duplicated indexes. I count _revision_created in 4 
indexes?  Not sure what other sql you are using, but have you tried one index 
for one column?  PG will be able to Bitmap them together if it thinks it can 
use more than one.  Was that because you were testing?

Third: any chance we can get an explain analyze?  It give's more info.  
(Also, have you seen http://explain.depesz.com/)

Last: If you wanted to force the index usage, for a test, you could drop the 
other indexes.  I assume this is on a test box so it should be ok.  If its 
live, you could wrap it in a BEGIN ... ROLLBACK (in theory... never tried it 
myself)

-Andy


On 01/16/2011 10:43 PM, Jeremy Palmer wrote:

Hi Andy,

Yeah sorry about the long names, they are all generated by a function as part of 
a table versioning system. And yes, I placed all possible indexes on the table 
to see which would be used by the planner. In production I will drop the unused 
indexes.

Yes, simply drop the extra index :P I have dropped the index and it made the 
query slower :(

Here is the explain analyse:

Subquery Scan t  (cost=170692.25..175678.27 rows=767 width=205) (actual time=13762.783..14322.315 rows=106299 loops=1)'
  Filter: (t.row_number = 1)'
  ->  WindowAgg  (cost=170692.25..173760.57 rows=153416 width=86) (actual time=13762.774..14208.522 rows=149557 loops=1)'
        ->  Sort  (cost=170692.25..171075.79 rows=153416 width=86) (actual time=13762.745..13828.584 rows=149557 loops=1)'
              Sort Key: version_crs_coordinate_revision.id, version_crs_coordinate_revision._revision_created'
              Sort Method:  quicksort  Memory: 23960kB
              ->  Bitmap Heap Scan on version_crs_coordinate_revision  (cost=3319.13..157477.69 rows=153416 width=86) (actual time=70.925..13531.720 rows=149557 loops=1)
                    Recheck Cond: (((_revision_expired > 16) AND (_revision_expired <= 40)) OR ((_revision_created > 16) AND (_revision_created <= 40)))
                    Filter: (((_revision_created <= 16) AND (_revision_expired > 16) AND (_revision_expired <= 40)) OR ((_revision_created > 16) AND (_revision_created <= 40)))
                    ->  BitmapOr  (cost=3319.13..3319.13 rows=154372 width=0) (actual time=53.650..53.650 rows=0 loops=1)
                          ->  Bitmap Index Scan on idx_crs_coordinate_revision_expired  (cost=0.00..2331.76 rows=111041 width=0) (actual time=37.773..37.773 rows=110326 loops=1)
                                Index Cond: ((_revision_expired > 16) AND (_revision_expired <= 40))
                          ->  Bitmap Index Scan on idx_crs_coordinate_revision_created  (cost=0.00..910.66 rows=43331 width=0) (actual time=15.872..15.872 rows=43258 loops=1)
                                Index Cond: ((_revision_created > 16) AND (_revision_created <= 40))
Total runtime: 14359.747 ms

http://explain.depesz.com/s/qpL says that the bitmap heap scan is bad. Not sure 
what to do about it.

Thanks,
Jeremy




Hum.. yeah it looks like it takes no time at all to pull data from the 
individual indexes, and then bitmap them.  I'm not sure what the bitmap heap 
scan is doing, or why it's slow.  Hopefully someone smarter will come along.

Also it's weird that explain.depesz.com didn't parse and show your entire plan.  
Hum.. you seem to have ending quotes on some of the lines?

One other thought: quicksort  Memory: 23960kB
It needs about 24 Meg to sort... if work_mem is smaller than that, your sort could be swapping to disk.

What sort of PG version is this?
What are you using for work_mem?  (You could try bumping it up a little; it's 
possible to set it for the session only, no need for a server restart, and see if that'd 
help.)
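
Something along these lines in a psql session (the 64MB is just a number to 
experiment with, and the ROLLBACK throws the whole experiment away):

BEGIN;
SET LOCAL work_mem = '64MB';   -- only lasts for this transaction
-- optionally: DROP INDEX idx_crs_coordinate_revision_expired;
--             (transactional, but it does lock the table while you test)
EXPLAIN ANALYZE
SELECT * FROM (
    SELECT row_number() OVER (PARTITION BY id ORDER BY _revision_created DESC) AS row_number, *
    FROM version_crs_coordinate_revision
    WHERE (_revision_created <= 16 AND _revision_expired > 16 AND _revision_expired <= 40)
       OR (_revision_created > 16 AND _revision_created <= 40)
) AS t
WHERE row_number = 1;
ROLLBACK;

Since nothing gets committed, the setting and any index you dropped inside the 
transaction go away at the ROLLBACK.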

And sorry, but it's my bedtime; good luck though.

-Andy




Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Andy Colson

On 1/13/2011 4:42 PM, Robert Haas wrote:

On Thu, Jan 13, 2011 at 5:41 PM, Robert Haasrobertmh...@gmail.com  wrote:

On Thu, Jan 13, 2011 at 5:26 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelsonjnelson+pg...@jamponi.net  wrote:

I still think that having UNION do de-duplication of each contributory
relation is a beneficial thing to consider -- especially if postgresql
thinks the uniqueness is not very high.



This might be worth a TODO.


I don't believe there is any case where hashing each individual relation
is a win compared to hashing them all together.  If the optimizer were
smart enough to be considering the situation as a whole, it would always
do the latter.


You might be right, but I'm not sure.  Suppose that there are 100
inheritance children, and each has 10,000 distinct values, but none of
them are common between the tables.  In that situation, de-duplicating
each individual table requires a hash table that can hold 10,000
entries.  But deduplicating everything at once requires a hash table
that can hold 1,000,000 entries.

Or am I all wet?


Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
end.  But then why did the OP get a speedup?  *scratches head*



Because it all fit in memory and didn't swap to disk?

-Andy



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Andy Colson

On 1/13/2011 4:49 PM, Robert Haas wrote:

On Thu, Jan 13, 2011 at 5:47 PM, Andy Colsona...@squeakycode.net  wrote:

I don't believe there is any case where hashing each individual relation
is a win compared to hashing them all together.  If the optimizer were
smart enough to be considering the situation as a whole, it would always
do the latter.


You might be right, but I'm not sure.  Suppose that there are 100
inheritance children, and each has 10,000 distinct values, but none of
them are common between the tables.  In that situation, de-duplicating
each individual table requires a hash table that can hold 10,000
entries.  But deduplicating everything at once requires a hash table
that can hold 1,000,000 entries.

Or am I all wet?


Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
end.  But then why did the OP get a speedup?  *scratches head*


Because it all fit in memory and didn't swap to disk?


Doesn't make sense.  The re-de-duplication at the end should use the
same amount of memory regardless of whether the individual relations
have already been de-duplicated.



Unless I missed something in the thread:

distinctList + distinctList + ... -> [fit in mem] -> last distinct -> [fit in mem]

vs.

fullList + fullList + ... -> [swapped to disk] -> last distinct -> [fit in mem]



-Andy



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-17 Thread Andy Colson

On 12/17/2010 11:08 AM, Tom Polak wrote:

So, I am back on this topic again.
I have a related question, but this might be the correct thread (and
please let me know that).  The boss is pressing the issue because of the
cost of MSSQL.

What kind of performance can I expect out of Postgres compare to MSSQL?
Let's assume that Postgres is running on Cent OS x64 and MSSQL is running
on Windows 2008 x64, both are on identical hardware running RAID 5 (for
data redundancy/security), SAS drives 15k RPM, dual XEON Quad core CPUs,
24 GB of RAM.  I have searched around and I do not see anyone ever really
compare the two in terms of performance.  I have learned from this thread
that Postgres needs a lot of configuration to perform the best.

We provide the MLS service to our members.  Our data goes back to 1997 and
nothing is ever deleted.  Here is a general overview of our current MSSQL
setup.  We have over 10GB of data in a couple of tables (no pictures are
stored in SQL server).  Our searches do a lot of joins to combine data to
display a listing, history, comparables, etc.  We probably do 3 or 4 reads
for every write in the database.

Any comparisons in terms of performance would be great.  If not, how can I
quickly truly compare the two systems myself without coding everything to
work for both?  Thoughts? Opinions?

Thanks,
Tom Polak
Rockford Area Association of Realtors
815-395-6776 x203



Most of the time, the database is not the bottleneck.  So find the spot 
where your current database IS the bottleneck.  Then write a test that 
kinda matches that situation.


Let's say it's 20 people doing an mls lookup at the exact same time, while 
an update is running in the background to copy in new data.


Then write a simple test (I use perl for my simple tests) for both 
databases.  If PG can hold up to your worst case situation, then maybe 
you'll be alright.


Also:  Are you pegged right now?  Do you have slowness problems?  Even 
if PG is a tad slower, will anybody even notice?  Maybe it's not worth 
worrying about?  If your database isn't pegging the box, I'd bet you won't 
even notice a switch.


The others that have answered have sound advice... but I thought I'd 
say: I'm using raid-5!  Gasp!


It's true, I'm hosting maps with PostGIS, and the slowest part of the 
process is the aerial imagery, which is HUGE.  The database queries sit 
around 1% of my cpu.  I needed the disk space for the imagery.  The 
imagery code uses more cpu than PG does.  The database is 98% read, 
though, so my setup is different than yours.


My maps get 100K hits a day.  The CPUs never use more than 20%.  I'm 
running on a $350 computer, AMD Dual core, with 4 IDE disks in software 
raid-5.  On Slackware Linux, of course!


-Andy



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-17 Thread Andy Colson

On 12/17/2010 11:37 AM, Robert Haas wrote:

On Fri, Dec 17, 2010 at 12:08 PM, Tom Polak
t...@rockfordarearealtors.org  wrote:



other direction to get good performance, too.  You're not going to
compare two major database systems across the board and find that one
of them is just twice as fast, across the board.  They have different
advantages and disadvantages.  When you're using one product, you
naturally do things in a way that works well for that product, and
moving to a different product means starting over.  Oh, putting this
in a stored procedure was faster on MS SQL, but it's slower on
PostgreSQL.  Using a view here was terrible on MS SQL, but much faster
under PostgreSQL.



Yeah, totally agree with that.  Every database has its own personality, 
and you have to work with it, its way.  Don't expect one bit of code to 
work great on all the different databases.  You need 5 different bits of 
code, one for each database.


In the end, can PG be fast?  Yes.  Very.  But only when you treat it as 
PG.  If you try to use PG as if it were mssql, you won't be a happy camper.


-Andy



Re: [PERFORM] Help with bulk read performance

2010-12-14 Thread Andy Colson

On 11/1/2010 9:15 AM, Dan Schaffer wrote:

Hello

We have an application that needs to do bulk reads of ENTIRE Postgres
tables very quickly (i.e. select * from table). We have observed that
such sequential scans run two orders of magnitude slower than observed
raw disk reads (5 MB/s versus 100 MB/s). Part of this is due to the
storage overhead we have observed in Postgres. In the example below, it
takes 1 GB to store 350 MB of nominal data. However that suggests we
would expect to get 35 MB/s bulk read rates.

Observations using iostat and top during these bulk reads suggest that
the queries are CPU bound, not I/O bound. In fact, repeating the queries
yields similar response times. Presumably if it were an I/O issue the
response times would be much shorter the second time through with the
benefit of caching.

We have tried these simple queries using psql, JDBC, pl/java stored
procedures, and libpq. In all cases the client code ran on the same box
as the server.
We have experimented with Postgres 8.1, 8.3 and 9.0.

We also tried playing around with some of the server tuning parameters
such as shared_buffers to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following
(there are no indexes):

Table bulk_performance.counts
Column | Type | Modifiers
+-+---
i1 | integer |
i2 | integer |
i3 | integer |
i4 | integer |

There are 22 million rows in this case.

We HAVE observed that summation queries run considerably faster. In this
case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.


Our business logic does operations on the resulting data such that the
output is several orders of magnitude smaller than the input. So we had
hoped that by putting our business logic into stored procedures (and
thus drastically reducing the amount of data flowing to the client) our
throughput would go way up. This did not happen.

So our questions are as follows:

Is there any way using stored procedures (maybe C code that calls SPI
directly) or some other approach to get close to the expected 35 MB/s
doing these bulk reads? Or is this the price we have to pay for using
SQL instead of some NoSQL solution. (We actually tried Tokyo Cabinet and
found it to perform quite well. However it does not measure up to
Postgres in terms of replication, data interrogation, community support,
acceptance, etc).

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson






Whoa... Deja Vu

Is this the same thing Nick is working on?  How'd he get along?

http://archives.postgresql.org/message-id/4cd1853f.2010...@noaa.gov


-Andy



Re: [PERFORM] Help with bulk read performance

2010-12-14 Thread Andy Colson

On 12/14/2010 9:41 AM, Jim Nasby wrote:

On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:

Is this the same thing Nick is working on?  How'd he get along?

http://archives.postgresql.org/message-id/4cd1853f.2010...@noaa.gov


So it is. The one I replied to stood out because no one had replied to it; I 
didn't see the earlier email.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Oh.. I didn't even notice the date... I thought it was a new post.

But still... (and I'll cc Nick on this)  I'd love to hear an update on 
how this worked out.


Did you get it to go fast?  What'd you use?  Did the project go over 
budget and did you all get fired?  COME ON MAN!  We need to know! :-)


-Andy



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-07 Thread Andy Colson

On 12/7/2010 11:34 AM, Tom Polak wrote:

We are in the process of deciding on how to proceed on a database
upgrade.  We currently have MS SQL 2000 running on Windows 2003 (on my
test server).  I was shocked at the cost for MS SQL 2008 R2 for a new
server (2 CPU license).  I started comparing DB’s and came across
postgresql.  It seemed to be exactly what I was after.  All of our
programming is in ASP.net.  Since I am running MSSQL 2000 I have no
benefit for .Net integration, so it is not a concern.

I ran a head to head test of MS SQL 2000 and Postgresql 9.0.  Both are
running on Windows 2003.  What I found was quite surprising and I am
wondering if anyone can point out what is going on here.
Here is the test I ran.
I created 2 tables, the main table had 5 fields with a serial ID field.
The second table linked to table 1 for a state field.

I had ASP.net via MSSQL create 1,000 records in the main table. Took
9.85 seconds to complete.
Next I had ASP.net via Postgresql create 1,000 records.  Took .65625
seconds.
Postgresql smoked MS SQL server on that test.


did you play with the postgresql.conf file?  Maybe turn off fsync?  I'd 
guess the difference above is mssql flushing to disk while PG isn't.




Next test is to use ASP.net and join all 1,000 rows with table 2 and
then display the text out.

MS SQL took 0.76 seconds to display
select name,address,city,state,statename,stateid,other from pgtemp1 left
join pgtemp2 on state=stateid

Then I did the same test via Postgresql and it took 8.85 seconds!  I
tried it again as I thought I did something wrong.  I did a few tweaks
such as increasing the shared buffers.  Still the best I could get it to
was 7.5 seconds.  This is insanely slow compared to MSSQL 2000.  What am
I missing.  Here is my SQL statement for postgresql:
select name,address,city,state,statename,stateid,other from pgtemp1 left
join pgtemp2 on state=stateid

Any ideas on why the Postgres server is so much slower on the
joins?   I am trying to understand what is going on here so please don’t
flame me.  Any advice is appreciated.



Did you create an index?  That'd be my first guess.  Also, can you run 
the sql from the command line client (psql) and see if it takes that 
long?  While you're in psql, stick an 'explain analyze' in front of your 
query, and let's see its output.
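
For example, something like this (the index name is made up, the tables and 
columns come from your post, and whether the index actually helps depends on 
the data):

CREATE INDEX idx_pgtemp2_stateid ON pgtemp2 (stateid);
ANALYZE pgtemp1;
ANALYZE pgtemp2;
EXPLAIN ANALYZE
SELECT name, address, city, state, statename, stateid, other
FROM pgtemp1
LEFT JOIN pgtemp2 ON state = stateid;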


Also, as a fair warning: mssql doesn't really care about transactions, 
but PG really does.  Make sure all your code is properly starting and 
committing transactions.


-Andy



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-07 Thread Andy Colson

On 12/7/2010 1:22 PM, Justin Pitts wrote:


Also, as a fair warning: mssql doesn't really care about transactions, but
PG really does.  Make sure all your code is properly starting and commiting
transactions.

-Andy


I do not understand that statement. Can you explain it a bit better?


In mssql you can write code that connects to the db, fire off updates 
and inserts, and then disconnects.  I believe mssql will keep all your 
changes, and the transaction stuff is done for you.


In PG the first statement you fire off (like an insert into, for 
example) will start a transaction.  If you don't commit before you 
disconnect, that transaction will be rolled back.  Even worse, if your 
program does not commit, but keeps the connection to the db open, the 
transaction will stay open too.


There are differences in the way mssql and pg do transactions.  mssql 
uses a transaction log and keeps current data in the table.  In mssql if 
you open a transaction and write a bunch of stuff, the table contains 
that new stuff.  Everyone can see it.  (I think default transaction 
isolation level is read commited).  But if you set your isolation level 
to something with repeatable read, then your program will block and have 
to wait on every little change to the table.  (or, probably page.. I 
think mssql has page level locking?)


anyway, in PG, multiple versions of the same row are kept, and when you 
open, and keep open a transaction, PG has to keep a version of the row 
for every change that other people make.  So a long lasting transaction 
could create hundreds of versions of one row.  Then when somebody goes 
to select against that table, it has to scan not only the rows, but 
every version of every row!


So my point is, in PG, use transactions as they were meant to be used, 
as single atomic operations.  Start, do some work, commit.
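
In other words (with a made-up accounts table, just to show the shape):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;    -- short and sweet; nothing left sitting idle in transaction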


mssql made it easy to ignore transactions by doing it for you.  Ignoring 
transactions in PG will hurt you.


you can google "MVCC" and "postgres idle in transaction" for more.

-Andy



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-07 Thread Andy Colson

On 12/7/2010 2:10 PM, Kenneth Marshall wrote:

On Tue, Dec 07, 2010 at 11:56:51AM -0800, Richard Broersma wrote:

On Tue, Dec 7, 2010 at 11:43 AM, Andy Colsona...@squeakycode.net  wrote:


In PG the first statement you fire off (like an insert into for example)
will start a transaction. ?If you dont commit before you disconnect that
transaction will be rolled back. ?Even worse, if your program does not
commit, but keeps the connection to the db open, the transaction will stay
open too.


Huh - is this new?  I always thought that every statement was wrapped
in its own transaction unless you explicitly start your own.  So you
shouldn't need to commit before closing a connection if you never
opened a transaction to begin with.


--
Regards,
Richard Broersma Jr.



The default of autocommit unless explicitly starting a transaction with
BEGIN is the normal behavior that I have seen as well.

Cheers,
Ken


Crikey!  You're right.  I need to be more careful with my assumptions.

I maintain that people need to be more careful with pg transactions. 
I've seen several posts about idle in transaction.  But it's not as bad 
as I made out.  My confusion comes from the library I use to hit PG, 
which fires off a begin for me, and if I don't explicitly commit, it 
gets rolled back.


sorry, I was confused between the framework and PG.

-Andy



Re: [PERFORM] Running PostgreSQL as fast as possible no matter the consequences

2010-11-15 Thread Andy Colson

On 11/15/2010 9:06 AM, Robert Haas wrote:

In 9.1, I'm hopeful that we'll have unlogged tables, which will even
better than turning these parameters off, and for which I just posted
a patch to -hackers.  Instead of generating WAL and writing WAL to the
OS and then NOT trying to make sure it hits the disk, we just won't
generate it in the first place.  But if PostgreSQL or the machine it's
running on crashes, you won't need to completely blow away the cluster
and start over; instead, the particular tables that you chose to
create as unlogged will be truncated, and the rest of your data,
including the system catalogs, will still be intact.



if I am reading this right, it means we can run our db safely (with fsync 
and full_page_writes enabled) except for tables of our choosing?


If so, I am very +1 for this!

-Andy



Re: [PERFORM] MVCC performance issue

2010-11-12 Thread Andy Colson

On 11/12/2010 7:47 AM, Kyriacos Kyriacou wrote:


SUGGESTION
--
1) When a raw UPDATE is performed, store all new raw versions either
in separate temporary table space
or in a reserved space at the end of each table (can be allocated
dynamically) etc


Your use of "raw" is confusing.  I'll just ignore the word.  New row 
versions are already stored in a dynamically allocated spot, right along 
with the other row versions in the table.  You are assuming that getting to 
the correct version of the row is very slow?  That's only going to be 
the case if you have lots and lots of versions.  And your solution will 
not actually help if there are lots of versions.  While the one person who 
is hitting the most recent version might be ok, everyone else will still 
have to search for theirs.  Just as they do now.



2) Any SELECT queries within the same session will be again accessing
the new version of the row


I don't see how this is different from what we currently have.  same 
session could have been dropped from your separate table space, and 
then you'd have to go search through previous versions of the row... 
exactly like you do now.


And worse, if you don't want to drop your version of the row from the 
separate table space until you commit/rollback, then no other user can 
start a transaction on that table until you're done!  oh no!  You have 
reads and writes blocking each other.



3) Any SELECT queries from other users will still be accessing the old
version


Again.. the same.


4) When UPDATE transaction is ROLLBACK just release the space used in
new temporary location


current layout makes rollback very very fast.


5) When UPDATE transaction is COMMIT then try to LOCK the old version
and overwrite it at the same physical location (NO FRAGMENTATION).


Not sure what you mean by lock, but a lock requires single user access and 
slows things down.  Right now we just bump the most active transaction 
number, which is very efficient, and requires no locks.  As soon as you 
lock anything, somebody, by definition, has to wait.




6) Similar mechanism can be applied on INSERTS and DELETES
7) In case that transaction was COMMIT, the temporary location can be
either released or archived/cleaned on a pre-scheduled basis. This will
possibly allow the introduction of a TRANSACTION LOG backup mechanism as
a next step.


You are kind of assuming there will only ever be one new transaction, 
and one old transaction.  What about a case where 10 people start a 
transaction, and there are 10 versions of the row?



It seems to me like you are using very long transactions, which is 
causing lots of row versions to show up.  Have you run explain analyze 
on your slow queries to find out the problems?


Have you checked to see if you are cpu bound or io bound?  If you are 
dealing with lots of row versions, I'd assume you are cpu bound.  If you 
check your system though, and see you are io bound, I think that might 
invalidate your assumptions above.


MVCC makes multi-user access very nice because readers and writers don't 
block each other, and there are very few locks.  It does come with some 
kinks (gotta vacuum, keep transactions short, you must commit, etc).


select count(*), for example, is always going to be slow... just expect 
it; let's not destroy what works well about the database just to make it 
fast.  Instead, find a better alternative so you don't have to run it.


Just like any database, you have to work within MVCC's good points and 
try to avoid the bad spots.


-Andy



Re: [PERFORM] Huge overestimation in rows expected results in bad plan

2010-11-09 Thread Andy Colson

On 11/9/2010 3:26 PM, bricklen wrote:

Hi,

I have a query that is getting a pretty bad plan due to a massively
incorrect count of expected rows. All tables in the query were vacuum
analyzed right before the query was tested. Disabling nested loops
gives a significantly faster result (4s vs 292s).
Any thoughts on what I can change to make the planner generate a better plan?


explain analyze
select c.id, c.transactionid, c.clickgenerated, c.confirmed,
c.rejected, cr.rejectedreason
from conversion c
inner join conversionrejected cr on cr.idconversion = c.id
where date = '2010-11-06'
and idaction = 12906
and idaffiliate = 198338
order by transactionid;





  -   Seq Scan on conversionrejected cr  (cost=0.00..191921.82
rows=11012682 width=31) (actual time=0.003..1515.816 rows=11012682
loops=72)
  Total runtime: 292668.992 ms






Looks like the table stats are ok.  But it's doing a sequential scan. 
Are you missing an index?
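
If it is missing, that's the first thing I'd try (index name made up; the table 
and column come from your query):

CREATE INDEX idx_conversionrejected_idconversion
    ON conversionrejected (idconversion);
ANALYZE conversionrejected;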


Also:

http://explain.depesz.com/

is magic.

-Andy



Re: [PERFORM] Simple (hopefully) throughput question?

2010-11-03 Thread Andy Colson

On 11/3/2010 10:52 AM, Nick Matheson wrote:

Hello

We have an application that needs to do bulk reads of ENTIRE
Postgres tables very quickly (i.e. select * from table). We have
observed that such sequential scans run two orders of magnitude slower
than observed raw disk reads (5 MB/s versus 100 MB/s). Part of this is
due to the storage overhead we have observed in Postgres. In the
example below, it takes 1 GB to store 350 MB of nominal data. However
that suggests we would expect to get 35 MB/s bulk read rates.

Observations using iostat and top during these bulk reads suggest
that the queries are CPU bound, not I/O bound. In fact, repeating the
queries yields similar response times. Presumably if it were an I/O
issue the response times would be much shorter the second time through
with the benefit of caching.

We have tried these simple queries using psql, JDBC, pl/java stored
procedures, and libpq. In all cases the client code ran on the same
box as the server. We have experimented with Postgres 8.1, 8.3 and 9.0.

We also tried playing around with some of the server tuning parameters
such as shared_buffers to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following
(there are no indexes):

Table bulk_performance.counts
Column | Type | Modifiers
+-+---
i1 | integer |
i2 | integer |
i3 | integer |
i4 | integer |

There are 22 million rows in this case.

We HAVE observed that summation queries run considerably faster. In this
case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.

Our business logic does operations on the resulting data such that
the output is several orders of magnitude smaller than the input. So
we had hoped that by putting our business logic into stored procedures
(and thus drastically reducing the amount of data flowing to the
client) our throughput would go way up. This did not happen.

So our questions are as follows:

Is there any way using stored procedures (maybe C code that calls
SPI directly) or some other approach to get close to the expected 35
MB/s doing these bulk reads? Or is this the price we have to pay for
using SQL instead of some NoSQL solution. (We actually tried Tokyo
Cabinet and found it to perform quite well. However it does not measure
up to Postgres in terms of replication, data interrogation, community
support, acceptance, etc).

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson




I have no idea if this would be helpful or not, never tried it, but when 
you fire off select * from bigtable pg will create the entire 
resultset in memory (and maybe swap?) and then send it all to the client 
in one big lump.  You might try a cursor and fetch 100-1000 at a time 
from the cursor.  No idea if it would be faster or slower.
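
For example (a rough sketch; table name from your example, fetch size is 
arbitrary):

BEGIN;
DECLARE bulk_cur NO SCROLL CURSOR FOR
    SELECT i1, i2, i3, i4 FROM bulk_performance.counts;
FETCH 1000 FROM bulk_cur;   -- repeat the FETCH until it returns no rows
CLOSE bulk_cur;
COMMIT;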


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Using more tha one index per table

2010-07-21 Thread Andy Colson

On 7/21/2010 2:31 AM, Elias Ghanem wrote:

Hi,
I have a question concerning the uses of indexes in Postgresql.
I read that in PG a query cannot use more than one index per table: "a 
query or data manipulation command can use at most one index per table".
Actually I found this a little weird and unfortunately I could not find
further explanation with my Google searches. But the tests I made proved
that this is true:

If we have table :

*create table test_index(col_1 integer, col_2 integer, col_3 integer,
col_4 integer)*


and we have 2 indexes created on this table:

*create index idx_col_1 on test_index (col_1)*

*create index idx_col_2 on test_index (col_2)*

A query like : *select * from test_index where col_1 = 15 and col_2 =
30* would never use both the indexes. The query plan is:

*Index Scan using idx_col_2 on test_index (cost=0.00..8.27 rows=1
width=16) (actual time=0.092..0.092 rows=0 loops=1)*

* Index Cond: (col_2 = 30)*

* Filter: (col_1 = 15)*

*Total runtime: 0.127 ms*

The query will use *idx_col_2 *only and apply the other condition
ignoring the other index(*idx_col_1*).


So please can you give some more details about this point. Is the above
citation true or I misunderstood it?

A next step is what if a query made a join on two tables table1 and
table2 (for ex: where table1.id = table2.id and table2.col_2 = 3 and
table2.col_3 = 4)?
Will it use, for table2, the index of the join column (table2.id) only
and neglect the indexes of the other two columns(col_2 and col_3)
although they are present in the where clause.

Thanks for your response,

Elias



As others have said, it will use more than one index.  The question you 
may have though, is why didnt it?


Its because an index isn't always faster.  The answer to both your 
questions (does it get used, and how about in a join) comes down to 
selectivity.  If an index can drastically cut down the number of rows 
then it'll be used, otherwise its faster to scan for the ones you need.


In your first example:
select * from test_index where col_1 = 15 and col_2 = 30

the planner will use whichever index has the best selectivity.  If 100's 
of rows have col_1 = 15, but only 5 rows have col_2 = 30, then its much 
faster to pull out the 5 rows with col_2 = 30 and just scan them for 
col_1 = 15.


Lets say both are highly selective (there are only a few rows each). 
Again its not going to be faster to use both indexes:


read the col_1 index for 15
fetch 5 rows from db.
read the col2_ index for 30
fetch different 5 rows from db
scan/bitmap the 10 rows for the both col_1 and col_2 conditions.

vs:
read col_1 index for 15
fetch 5 rows from db.
scan 5 rows for col_2 condition
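
(If you know you will always filter on both columns, a single two-column 
index is another option.  Just a sketch:

CREATE INDEX idx_col_1_2 ON test_index (col_1, col_2);

then one index scan can check both conditions at once.)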

The join case is exactly the same.  If the index can be used to reduce 
the resultset, or make individual row lookups faster, then it'll be used.


I have some description tables, like:  (id integer, descr text)
with maybe 100 rows in it.  PG never uses the unique index on id, it 
always table scans it... because its faster.


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Identical query slower on 8.4 vs 8.3

2010-07-15 Thread Andy Colson

FULL is usually bad.  Stick to vacuum analyze and drop the full.

Do you have indexes on:

test.tid, testresult.fk_tid, questionresult.fk_trid and testresult.trid
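
If any of those are missing, creating them would look something like this 
(a sketch; names taken from the plan, and whether the planner then uses 
them is another question):

CREATE INDEX testresult_fk_tid_idx ON testresult (fk_tid);
CREATE INDEX questionresult_fk_trid_idx ON questionresult (fk_trid);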


-Andy



On 7/15/2010 10:12 AM, Patrick Donlin wrote:

I'll read over that wiki entry, but for now here is the EXPLAIN ANALYZE
output assuming I did it correctly. I have run vacuumdb --full
--analyze, it actually runs as a nightly cron job.

8.4.4 Server:
Unique  (cost=202950.82..227521.59 rows=702022 width=86) (actual time=21273.371..22429.511 rows=700536 loops=1)
  ->  Sort  (cost=202950.82..204705.87 rows=702022 width=86) (actual time=21273.368..22015.948 rows=700536 loops=1)
        Sort Key: test.tid, testresult.trscore, testresult.trpossiblescore, testresult.trstart, testresult.trfinish, testresult.trscorebreakdown, testresult.fk_sid, test.tname, qr.qrscore, qr.qrtotalscore, testresult.trid, qr.qrid
        Sort Method: external merge  Disk: 71768kB
        ->  Hash Join  (cost=2300.82..34001.42 rows=702022 width=86) (actual time=64.388..1177.468 rows=700536 loops=1)
              Hash Cond: (qr.fk_trid = testresult.trid)
              ->  Seq Scan on questionresult qr  (cost=0.00..12182.22 rows=702022 width=16) (actual time=0.090..275.518 rows=702022 loops=1)
              ->  Hash  (cost=1552.97..1552.97 rows=29668 width=74) (actual time=63.042..63.042 rows=29515 loops=1)
                    ->  Hash Join  (cost=3.35..1552.97 rows=29668 width=74) (actual time=0.227..39.111 rows=29515 loops=1)
                          Hash Cond: (testresult.fk_tid = test.tid)
                          ->  Seq Scan on testresult  (cost=0.00..1141.68 rows=29668 width=53) (actual time=0.019..15.622 rows=29668 loops=1)
                          ->  Hash  (cost=2.60..2.60 rows=60 width=21) (actual time=0.088..0.088 rows=60 loops=1)
                                ->  Seq Scan on test  (cost=0.00..2.60 rows=60 width=21) (actual time=0.015..0.044 rows=60 loops=1)
Total runtime: 22528.820 ms

8.3.7 Server:
Unique  (cost=202950.82..227521.59 rows=702022 width=86) (actual time=22157.714..23343.461 rows=700536 loops=1)
  ->  Sort  (cost=202950.82..204705.87 rows=702022 width=86) (actual time=22157.706..22942.018 rows=700536 loops=1)
        Sort Key: test.tid, testresult.trscore, testresult.trpossiblescore, testresult.trstart, testresult.trfinish, testresult.trscorebreakdown, testresult.fk_sid, test.tname, qr.qrscore, qr.qrtotalscore, testresult.trid, qr.qrid
        Sort Method: external merge  Disk: 75864kB
        ->  Hash Join  (cost=2300.82..34001.42 rows=702022 width=86) (actual time=72.842..1276.634 rows=700536 loops=1)
              Hash Cond: (qr.fk_trid = testresult.trid)
              ->  Seq Scan on questionresult qr  (cost=0.00..12182.22 rows=702022 width=16) (actual time=0.112..229.987 rows=702022 loops=1)
              ->  Hash  (cost=1552.97..1552.97 rows=29668 width=74) (actual time=71.421..71.421 rows=29515 loops=1)
                    ->  Hash Join  (cost=3.35..1552.97 rows=29668 width=74) (actual time=0.398..44.524 rows=29515 loops=1)
                          Hash Cond: (testresult.fk_tid = test.tid)
                          ->  Seq Scan on testresult  (cost=0.00..1141.68 rows=29668 width=53) (actual time=0.117..20.890 rows=29668 loops=1)
                          ->  Hash  (cost=2.60..2.60 rows=60 width=21) (actual time=0.112..0.112 rows=60 loops=1)
                                ->  Seq Scan on test  (cost=0.00..2.60 rows=60 width=21) (actual time=0.035..0.069 rows=60 loops=1)
Total runtime: 23462.639 ms


Thanks for the quick responses and being patient with me not providing
enough information.
-Patrick

- Original Message -
From: Kevin Grittner kevin.gritt...@wicourts.gov
To: Patrick Donlin pdon...@oaisd.org, pgsql-performance@postgresql.org
Sent: Thursday, July 15, 2010 10:55:19 AM GMT -05:00 US/Canada Eastern
Subject: Re: [PERFORM] Identical query slower on 8.4 vs 8.3

Patrick Donlin pdon...@oaisd.org wrote:

  Anyone have any ideas on where I should start looking to figure
  this out?

You're going to want to run EXPLAIN ANALYZE for the slow query on
both servers. If you want the rest of us to be able to contribute
ideas, we'll need a little more information -- please read this
page:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

  I didn't perform any special steps when moving to v8.4, I just did
  a pg_dump from the 8.3 server and restored it on the new 8.4
  servers.

A database VACUUM ANALYZE by a superuser is a good idea; otherwise
that's fine technique.

-Kevin



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Queries with conditions using bitand operator

2010-07-13 Thread Andy Colson

On 07/13/2010 06:48 AM, Elias Ghanem wrote:

Hi,
I have a table ARTICLE containing a string field STATUS that represents a 
number in binary format (for ex: 10011101).
My application issues queries with where conditions that use the BITAND 
operator on this field (for ex: select * from article where status & 4 = 4).
Thus i'm facing performance problems with these select queries: the queries 
are too slow.
Since i'm using the BITAND operator in my conditions, creating an index on the 
status field is useless, and since the second operand varies (status & 4 = 4; 
status & 8 = 8; status & 16 = 16...) a functional index is also useless 
(because a functional index requires the use of a function that accepts only 
table columns as input parameters: constants are not accepted).
So is there a way to enhance the performance of these queries?
Thanks,
Elias



How many flags are there?  If its not too many you could make a separate column 
for each... but then that would be lots of indexes too...

One other thought I had was to make it a text column, turn the flags into words 
(space separated) and use full text indexes.

I played around with int's and string's but I couldn't find a way using the & 
operator.
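
A rough sketch of the full text idea (column and flag names are made up, 
and untested):

ALTER TABLE article ADD COLUMN status_words tsvector;

UPDATE article
   SET status_words = to_tsvector('simple',
           case when status & 4  = 4  then 'flag4 '  else '' end ||
           case when status & 8  = 8  then 'flag8 '  else '' end ||
           case when status & 16 = 16 then 'flag16'  else '' end);

CREATE INDEX article_status_words_idx ON article USING gin (status_words);

-- then, instead of "where status & 4 = 4":
SELECT * FROM article WHERE status_words @@ 'flag4'::tsquery;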

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance on new linux box

2010-07-13 Thread Andy Colson

On 07/11/2010 03:02 PM, Ryan Wexler wrote:



Well I got me a new raid card, MegaRAID 8708EM2, fully equipped with
BBU and read and write caching are enabled.  It completely solved my
performance problems.  Now everything is way faster than the previous
server.  Thanks for all the help everyone.

One question I do have is this card has a setting called Read Policy
which apparently helps with sequentially reads.  Do you think that is
something I should enable?





I would think it depends on your usage.  If you use clustered indexes (and 
understand how/when they help) then enabling it would help (cuz clustered is 
assuming sequential reads).

or if you seq scan a table, it might help (as long as the table is stored 
relatively close together).

But if you have a big db, that doesnt fit into cache, and you bounce all over 
the place doing seeks, I doubt it'll help.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance on new linux box

2010-07-07 Thread Andy Colson

On 07/07/2010 06:06 PM, Ryan Wexler wrote:

Postgresql was previously running on a single cpu linux machine with 2 gigs of 
memory and a single sata drive (v8.3).  Basically a desktop with linux on it.  
I experienced slow performance.

So, I finally moved it to a real server.  A dual-Xeon CentOS machine with 6 
gigs of memory and raid 10, postgres 8.4.  But, I am now experiencing even 
worse performance issues.

My system is consistently highly transactional.  However, there is also regular 
complex queries and occasional bulk loads.

On the new system the bulk loads are dramatically slower than on the previous 
machine and so are the more complex queries.  The smaller transactional queries 
seem comparable but i had expected an improvement.  Performing a db import via 
psql -d databas -f dbfile illustrates this problem.  It takes 5 hours to run 
this import.  By contrast, if I perform this same exact import on my crappy 
windows box with only 2 gigs of memory and default postgres settings it takes 1 
hour.  Same deal with the old linux machine.  How is this possible?

Here are some of my key config settings:
max_connections = 100
shared_buffers = 768MB
effective_cache_size = 2560MB
work_mem = 16MB
maintenance_work_mem = 128MB
checkpoint_segments = 7
checkpoint_timeout = 7min
checkpoint_completion_target = 0.5

I have tried varying the shared_buffers size from 128 all the way to 1500mbs 
and got basically the same result.   Is there a setting change I should be 
considering?

Does 8.4 have performance problems or is this unique to me?

thanks



Yeah, I inherited a "server" (the quotes are sarcastic air quotes), with really bad disk 
IO... er.. really "safe" disk IO.  Try the dd test.  On my desktop I get 60-70 meg a second.  On this 
server (I laugh) I got about 20.  I had to go out of my way (way out) to enable the 
disk caching, and even then only got 50 meg a second.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm

-Andy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query tuning help

2010-06-14 Thread Andy Colson

On 06/14/2010 05:41 AM, AI Rumman wrote:

Can any one please help me in tuning the query?

explain
select *
from (select * from crmentity where deleted = 0 and createdtime between
(now() - interval '6 month') and now() ) as crmentity
inner join (select * from activity where  activitytype = 'Emails' and
date_start between (now() - interval '6 month')  and now()) as activity
on crmentity.crmid=activity.activityid
inner join emaildetails on emaildetails.emailid = crmentity.crmid
inner join vantage_email_track on
vantage_email_track.mailid=emaildetails.emailid
left join seactivityrel on seactivityrel.activityid = emaildetails.emailid



Can you send us 'explain analyze' too?


->  Seq Scan on emaildetails  (cost=0.00..1686.95 rows=44595 width=186)
->  Seq Scan on vantage_email_track  (cost=0.00..1324.52 rows=88852 width=12)


do you have indexes on emaildetails(emailid) and  vantage_email_track(mailid)?
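
If not, something like this (a sketch, names taken from the query and plan):

CREATE INDEX emaildetails_emailid_idx ON emaildetails (emailid);
CREATE INDEX vantage_email_track_mailid_idx ON vantage_email_track (mailid);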

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Dead lock

2010-06-14 Thread Andy Colson

On 06/14/2010 06:50 AM, Elias Ghanem wrote:

Hi all,
I have 2 data bases trying to perform an update query at the same time
on the same table in a third database using dblink.
I'm getting a dead lock exception:
ERROR: deadlock detected
DETAIL: Process 27305 waits for ShareLock on transaction 55575; blocked
by process 27304.
Process 27304 waits for ShareLock on transaction 55576; blocked by
process 27305.
HINT: See server log for query details.
Actually the folowing function is installed on 2 dbs DB1 and DB2. This
function issues an update query on DB3.
When this function is running simultaneously on DB1 and DB2, it produces
a dead lock making one of the functions (in DB1 or DB2) stop with the
above exception:
Is it normal? Shouldn't postgres be able to handle such situations, for
ex: let one transaction wait until the other commits or rolls back, then
continue with the first transaction?
Is there a parameter that should be set in postgresql.conf to allow
handling of concurrent transaction...?

CREATE OR REPLACE FUNCTION TEST_DB_LINK(VARCHAR)
RETURNS VOID AS'
DECLARE
C INTEGER;
P ALIAS FOR $1;
DUMMY VARCHAR;
BEGIN
C:= 0;
LOOP
EXIT WHEN C > 15;
C:= C+1;
SELECT INTO DUMMY DBLINK_EXEC(''CONNECTION_STRING TO DB3'', ''UPDATE
IN_FICHE_PRODUIT SET VALIDE = 1'');
RAISE NOTICE ''%, %'', C,P;
END LOOP;
END;'
LANGUAGE 'plpgsql';

Thanks for your time.



I think PG is doing what you want.. if you think about it.  You start two transactions at the same 
time.  A transaction is defined as do this set of operations, all of which must succeed or 
fail atomicly.  One transaction cannot update the exact same row as another transaction 
because that would break the second transactions must succeed rule.
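
If you want the "wait until the other commits" behaviour, one option is to 
take an explicit lock up front, so the two sessions serialize instead of 
grabbing row locks in different orders.  Just a sketch of the statements you 
would want to run as one transaction on DB3 (how you push that through 
dblink is up to you):

BEGIN;
-- SHARE ROW EXCLUSIVE conflicts with itself, so the second session waits
LOCK TABLE in_fiche_produit IN SHARE ROW EXCLUSIVE MODE;
UPDATE in_fiche_produit SET valide = 1;
COMMIT;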


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Analysis Function

2010-06-10 Thread Andy Colson

On 06/10/2010 07:41 PM, David Jarvis wrote:

Hi,

I found a slow part of the query:

SELECT
*  date(extract(YEAR FROM m.taken)||'-1-1') d1,*
*  date(extract(YEAR FROM m.taken)||'-1-31') d2*
FROM
   climate.city c,
   climate.station s,
   climate.station_category sc,
   climate.measurement m
WHERE
c.id = 5148 AND ...

Date extraction is 3.2 seconds, but without is 1.5 seconds. The PL/pgSQL
code that actually runs (where p_month1, p_day1, and p_month2, p_day2
are integers):

*date(extract(YEAR FROM
m.taken)||''-'||p_month1||'-'||p_day1||''') d1,
 date(extract(YEAR FROM
m.taken)||''-'||p_month2||'-'||p_day2||''') d2
*
What is a better way to create those dates (without string
concatenation, I presume)?

Dave



I assume you are doing this in a loop?  Many Many Many times?  cuz:

andy=# select  date(extract(year from current_date) || '-1-1');
date

 2010-01-01
(1 row)

Time: 0.528 ms

Its pretty quick.  You say without its 1.5 seconds?  Thats all you change?  
Can we see the sql and 'explain analyze' for both?
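
For what it's worth, the same dates can be built with plain date arithmetic 
instead of string concatenation.  A sketch, with literals standing in for 
p_month1/p_day1 (here month 1, day 31):

SELECT (date_trunc('year', current_date)
        + (1 - 1)  * interval '1 month'
        + (31 - 1) * interval '1 day')::date;   -- Jan 31 of the current year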

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Andy Colson

On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:

Hello,

i have a set of unique data of about 150.000.000 rows. Regularly i
get a list of data, which contains many times more rows than the
already stored set. Often around 2.000.000.000 rows. Within these rows
are many duplicates and often the already stored data as well.
I want to store just every entry, which is not within the already stored
one. Also i do not want to store duplicates. Example:

Already stored set:
a,b,c

Given set:
a,b,a,c,d,a,c,d,b

Expected set after import:
a,b,c,d

I now looking for a faster way for the import. At the moment i import
the new data with copy into an table 'import'. then i remove the
duplicates and insert every row which is not already known. after that
import is truncated.

Is there a faster way? Should i just insert every row and ignore it, if
the unique constrain fails?

Here the simplified table-schema. in real life it's with partitions:
test=# \d urls
            Table "public.urls"
 Column |  Type   | Modifiers
--------+---------+--------------------------------------------------------
 url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
 url    | text    | not null
Indexes:
    "urls_url" UNIQUE, btree (url)
    "urls_url_id" btree (url_id)

Thanks for every hint or advice! :)

Greetings from Germany,
Torsten


I do this with a stored procedure.  I do not care about speed because my db is 
really small and I only insert a few records a month.  So I dont know how fast 
this is, but here is my func:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer) 
RETURNS character varying
AS $$
declare
tmp integer;
begin
insert into blood(adate, alevel) values(idate, ilevel);
return 'ok';
exception
when unique_violation then
select into tmp alevel from blood where adate = idate;
if tmp <> ilevel then
return idate || ' levels differ!';
else
return 'ok, already in table';
end if;
end; $$
LANGUAGE plpgsql;


Use it like, select * from addentry('2010-06-06 08:00:00', 130);

I do an extra check that if the date's match that the level's match too, but 
you wouldnt have to.  There is a unique index on adate.
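
For the urls table in the original post, the same duplicate check can also 
be done set-based in one statement (a sketch, assuming the staging table is 
still called import):

INSERT INTO urls (url)
SELECT DISTINCT i.url
  FROM import i
 WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = i.url);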

-Andy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] slow query performance

2010-06-03 Thread Andy Colson

On 6/3/2010 12:47 PM, Anj Adu wrote:

I cant seem to pinpoint why this query is slow . No full table scans
are being done. The hash join is taking maximum time. The table
dev4_act_action has only 3 rows.

box is a 2 cpu quad core intel 5430 with 32G RAM... Postgres 8.4.0
1G work_mem
20G effective_cache
random_page_cost=1
default_statistics_target=1000

The largest table in the inner query is dev4_act_dy_fact which is
partitioned into 3 partitions per month. Each partition has about 25
million rows.
The rest of the tables are very small (100- 1000 rows)

explain analyze
select ipconvert(srctest_num),CASE targetpt::character varying
 WHEN NULL::text THEN serv.targetsrv
 ELSE targetpt::character varying
 END AS targetsrv, sesstype,hits as cons,bytes, srcz.srcarea as
srcz, dstz.dstarea as dstz from
(
select srctest_num, targetpt,targetsrv_id, sesstype_id, sum(total) as
hits, sum(bin) + sum(bout) as bts, sourcearea_id, destinationarea_id
  from dev4_act_dy_fact a, dev4_act_action act where thedate between
'2010-05-22' and '2010-05-22'
  and a.action_id = act.action_id and action in ('rejected','sess_rejected')
  and guardid_id in (select guardid_id from dev4_act_guardid where
guardid like 'cust00%')
  and node_id=(select node_id from dev4_act_node where node='10.90.100.2')
  group by srctest_num,targetpt,targetsrv_id,sesstype_id,
sourcearea_id, destinationarea_id
   order by (sum(bin) + sum(bout)) desc
  limit 1000
  ) a left outer join dev4_act_dstarea dstz on a.destinationarea_id =
dstz.dstarea_id
  left outer join dev4_act_srcarea srcz on a.sourcearea_id = srcz.srcarea_id
  left outer join  dev4_act_targetsrv serv on a.targetsrv_id = serv.targetsrv_id
  left outer join dev4_sesstype proto on a.sesstype_id = proto.sesstype_id
  order by bytes desc





Wow, the word wrap on that makes it hard to read... can you paste it 
here and send us a link?


http://explain.depesz.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow Bulk Delete

2010-05-08 Thread Andy Colson

On 05/08/2010 06:39 AM, thilo wrote:

Hi all!

We moved from MySQL to Postgresql for some of our projects. So far
we're very impressed with the performance (especially INSERTs and
UPDATEs), except for a strange problem with the following bulk delete
query:

DELETE FROM table1 WHERE table2_id = ?

I went through these Wiki pages, trying to solve the problem:

http://wiki.postgresql.org/wiki/SlowQueryQuestions and
http://wiki.postgresql.org/wiki/Performance_Optimization

but unfortunately without much luck.

Our application is doing batch jobs. On every batch run, we must
delete approx. 1M rows in table1 and recreate these entries. The
inserts are very fast, but deletes are not. We cannot make updates,
because there's no identifying property in the objects of table1.

This is what EXPLAIN is telling me:

EXPLAIN ANALYZE DELETE FROM table1 WHERE table2_id = 11242939
  QUERY PLAN

  Index Scan using sr_index on table1  (cost=0.00..8.56 rows=4 width=6)
(actual time=0.111..0.154 rows=4 loops=1)
Index Cond: (table2_id = 11242939)
  Total runtime: 0.421 ms
(3 rows)

This seems to be very fast (using the index), but running this query
from JDBC takes up to 20ms each. For 1M rows this sum up to several
hours. When I have a look at pg_top psql uses most of the time for the
deletes. CPU usage is 100% (for the core used by postgresql). So it
seems that postgresql is doing some sequential scanning or constraint
checks.

This is the table structure:

id  bigint   (primary key)
table2_id   bigint   (foreign key constraint to table 2, *indexed*)
table3_id   bigint   (foreign key constraint to table 3, *indexed*)
some non-referenced text and boolean fields

My server settings (Potgresql 8.4.2):

shared_buffers = 1024MB
effective_cache_size = 2048MB
work_mem = 128MB
wal_buffers = 64MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

It would be very nice to give me a hint to solve the problem. It
drives me crazy ;-)

If you need more details please feel free to ask!

Thanks in advance for your help!

Kind regards

Thilo



I am going to guess the slow part is sending 1M different queries back and 
forth from client to server.  You could try batching them together:

DELETE FROM table1 WHERE table2_id in (11242939, 1,2,3,4,5, 42);

Also are you preparing the query?
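
A sketch of both ideas combined on the server side (the ids are just 
examples):

PREPARE del_by_table2(bigint[]) AS
    DELETE FROM table1 WHERE table2_id = ANY ($1);

EXECUTE del_by_table2(ARRAY[11242939, 11242940, 11242941]::bigint[]);

From JDBC the same thing can be done with a prepared statement plus an 
array (or a generated IN list), so you only pay one round trip per batch.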

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How check execution plan of a function

2010-04-09 Thread Andy Colson

On Fri Apr 9 2010 8:18 AM, Sabin Coanda wrote:

I have just a function returning a cursor based on a single complex query.
When I check the execution plan of that query it takes about 3 seconds. Just
when it is used inside the function it freezes.

This is the problem, and this is the reason I cannot imagine what is happening.
Also I tried to recreate the function as it was before, when it ran in 3
seconds, but I cannot make it run properly now.




a query, like: select stuff from aTable where akey = 5 can be 
planned/prepared differently than a function containing: select stuff 
from aTable where akey = $1.  I'm guessing this is the problem you are 
running into.  The planner has no information about $1, so cannot make 
good guesses.


I think you have two options:
1) dont use a function, just fire off the sql.
2) inside the function, create the query as a string, then execute it, like:

a := 'select junk from aTable where akey = 5';
EXECUTE a;

(I dont think that's the exact right syntax, but hopefully gets the idea 
across)
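
On 8.4 you can also keep it all inside the function with RETURN QUERY 
EXECUTE.  Roughly (a sketch; the table and parameter names are made up):

CREATE OR REPLACE FUNCTION get_stuff(p_key integer)
RETURNS SETOF atable AS $$
BEGIN
    -- build the statement as text so it is planned with the actual value
    RETURN QUERY EXECUTE 'select * from atable where akey = ' || p_key;
END;
$$ LANGUAGE plpgsql;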


-Andy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How check execution plan of a function

2010-04-08 Thread Andy Colson

On Wed Apr 7 2010 7:47 AM, Sabin Coanda wrote:

Hi there,

I have a function which returns setof record based on a specific query.
I try to check the execution plan of that query, so I write EXPLAIN ANALYZE
before my select, I call the function and I see the result which shows an
actual time about 5 seconds. But when I call my function after I remove
EXPLAIN ANALYZE it takes more than 300 seconds, and I cancel it.

What's happen, or how can I see the function execution plan to optimize it ?

TIA,
Sabin




I ran into the same problems, what I did was enable the logging in 
postgresql.conf.  I dont recall exactly what I enabled, but it was 
something like:


track_functions = pl

log_statement_stats = on
log_duration = on

Then in the serverlog you can see each statement, and how long it took. 
 Once I found a statement that was slow I used explain analyze on just 
it so I could optimize that one statement.


-Andy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Database size growing over time and leads to performance impact

2010-03-30 Thread Andy Colson

On 3/30/2010 6:17 AM, Gnanakumar wrote:

We're using pgpool-II version 2.0.1 for PostgreSQL connection management.

pgpool configurations are:
num_init_children = 450
child_life_time = 300
connection_life_time = 120
child_max_connections = 30

As you recommended, I ran ps -ax|grep postgres at almost a busy
transaction time and I can find idle entries:
[r...@newuser ~]# ps -ax|grep postgres
  2664 ?Ss 0:00 postgres: newuser mydb 192.168.0.200(43545) idle
  2783 ?Ss 0:00 postgres: newuser mydb 192.168.0.200(43585) idle
  2806 ?Ss 0:02 postgres: newuser mydb 192.168.0.200(43588) idle
  2807 ?Ss 0:01 postgres: newuser mydb 192.168.0.200(43589) idle
  2818 ?Ss 0:00 postgres: newuser mydb 192.168.0.200(43601) idle
  2819 ?Ss 0:00 postgres: newuser mydb 192.168.0.200(43602) idle
  2833 ?Ss 0:02 postgres: newuser mydb 192.168.0.200(43603) idle
  2856 ?Ss 0:03 postgres: newuser mydb 192.168.0.200(43614) idle

Based on pgpool documentation, and also as far as I know, even though the
application layer returns/closes the connection, pgpool will only handle
actual closing of connections based on the connection_life_time parameter
defined.  And when that times out, the connection goes to a wait for
connection request state.
Can you throw some light on this?  Is there any better way that we need to
re-configure our pgpool parameters?



Connections are ok.  Connection is different than transaction.  The 
output above looks good, that's what you want to see.  (If it had said 
idle in transaction that would be a problem).  I dont think you need 
to change anything.
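
If you ever want to double-check from SQL instead of ps, something like 
this works on 8.x (a sketch):

SELECT procpid, query_start, current_query
  FROM pg_stat_activity
 WHERE current_query = '<IDLE> in transaction';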


Hopefully just vacuuming more often will help.

-Andy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] REINDEXing database-wide daily

2010-03-30 Thread Andy Colson

On 3/30/2010 4:32 AM, Gnanakumar wrote:

Hi,

We're using PostgreSQL 8.2.

I have a question in connection to this question posted by me earlier:

http://archives.postgresql.org/pgsql-performance/2010-03/msg00343.php

In our application, DML operations (INSERT/UPDATE/DELETE) are heavily
performed in a day.

I also read about pg_autovacuum  REINDEX at:

http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html

http://www.postgresql.org/docs/8.2/static/sql-reindex.html

I do not want to run pg_autovacuum daemon on a busy hour.

In case, if I can afford to take my database offline at low-usage time
and perform REINDEX database-wide manually/linux cron, to boost up index
performance, what is the community answer/suggestion on the following:

1. Is it a good idea to perform this on a daily basis?

2. Any implications of doing this on a daily basis?

3. Is there a way to find out bloated indexes?

4. Any other maintenance command, like ANALYZE, that has to be executed
before/after REINDEX?

5. Is there a way to find out when REINDEX was last run on an
INDEX/TABLE/DATABASE?

NOTE: I've also seen from my past experience that REINDEX database-wide
greatly improves performance of the application.




I could be way off base here, so I hope others will confirm/deny this: 
I think the more often you run vacuum, the less you notice it.  If you 
wait for too long then vacuum will have to work harder and you'll notice 
a speed decrease.  But many small vacuums which dont have as much work 
to do, you wont notice.


It could be, and I'm guessing again, because your database grew from 3 
to 30 gig (if I recall the numbers right), REINDEX had lots of effect. 
But if vacuum can keep up with space reuse, REINDEX may not be needed. 
(maybe a few weeks or once a month).
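
On question 3, a crude but easy check is to watch index sizes over time; if 
an index keeps growing while the data does not, it is probably bloated.  A 
sketch:

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS index_size
  FROM pg_class
 WHERE relkind = 'i'
 ORDER BY pg_relation_size(oid) DESC
 LIMIT 20;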


-Andy



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

