Re: [PERFORM] Queries 15 times slower on 8.1 beta 2 than on 8.0

2005-09-22 Thread John Arbash Meinel
Jean-Pierre Pelletier wrote:
 Hi,
 
 I've got many queries running much slower on 8.1 beta2 than on 8.0.1
 Here is a simplified one that takes 484 ms on 8.1 and 32 ms on 8.0.1.
 
 select
   0
 from
   Content C
 
   left outer join Supplier S
   on  C.SupplierId = S.SupplierId
 
   left outer join Price P
   on C.PriceId = P.PriceId;
 
 Any ideas why it's slower?

You really have to post the results of EXPLAIN ANALYZE, not just EXPLAIN,
so that we can tell what the planner is expecting versus what really
happened.
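For example, using the query you posted:

EXPLAIN ANALYZE
select 0
from Content C
  left outer join Supplier S on C.SupplierId = S.SupplierId
  left outer join Price P on C.PriceId = P.PriceId;

Run that on both 8.0.1 and 8.1 beta2 and post both outputs.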

John
=:-

 
 Thanks
 Jean-Pierre Pelletier
 e-djuster
 




Re: [PERFORM] Queries 15 times slower on 8.1 beta 2 than on 8.0

2005-09-22 Thread John Arbash Meinel
Jean-Pierre Pelletier wrote:
 Here are the explain analyze:

What is the explain analyze if you use set enable_seqscan to off?

Also, can you post the output of:
\d supplier
\d price
\d content

Mostly I just want to see what the indexes are, in the case that you
don't want to show us your schema.

John
=:-




Re: [PERFORM] How many tables is too many tables?

2005-09-19 Thread John A Meinel
[EMAIL PROTECTED] wrote:
 I have a database of hundreds of millions of web links (between sites)
 in Postgres.  For each link, we record the url, the referer, and the
 most recent date the link exists.  I'm having some serious performance
 issues when it comes to writing new data into the database.

 One machine is simply not going to be able to scale with the quantities
 of links we hope to store information about and we want to move to some
 kind of cluster.  Because of the quantities of data, it seems to make
 sense to go for a cluster setup such that in a 4 machine cluster, each
 machine has a quarter of the data (is this Share nothing, or, Share
 everything?).  To that end, we figured a good first step was to
 partition the data on one machine into multiple tables defining the
 logic which would find the appropriate table given a piece of data.
 Then, we assumed, adding the logic to find the appropriate machine and
 database in our cluster would only be an incremental upgrade.

In a database app, you generally don't win by going to a cluster,
because you are almost always bound by your I/O. Which means that a
single machine, just with more disks, is going to outperform a group of
machines.

As Tom mentioned, your schema is not very good. So let's discuss what a
better schema would be, and also how you might be able to get decent
performance with a cluster.

First, 200 rows * 400,000 tables = 80M rows. Postgres can handle this in
a single table without too much difficulty. It all depends on the
selectivity of your indexes, etc.

I'm not sure how you are trying to normalize your data, but it sounds
like you would benefit from a url table, so that each entry can be a
simple integer rather than the full path, considering that you are
likely to have a bunch of repeated information.

This makes your main table something like 2 integers, plus the
interesting stuff (from url, to url, data).
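Something along these lines, just as a sketch (all names are made up):

CREATE TABLE url (
    url_id  SERIAL PRIMARY KEY,
    url     TEXT NOT NULL UNIQUE
);

CREATE TABLE link (
    from_url   INTEGER NOT NULL REFERENCES url,
    to_url     INTEGER NOT NULL REFERENCES url,
    last_seen  DATE NOT NULL,
    PRIMARY KEY (from_url, to_url)
);

That keeps the repeated text in one place and keeps the main table
narrow.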

If you find you are running into I/O problems, you probably could
use this layout to move your indexes off onto their own spindles, and
maybe separate the main table from the url tables.

What is your hardware? What are you trying to do that you don't think
will scale?

If you were SELECT bound, then maybe a cluster would help you, because
you could off-load the SELECTs onto slave machines, and leave your
primary machine available for INSERTs and replication.


...


 At this point, the primary performance bottleneck is in adding
 additional data to the database.  Our loader program (we load text
 files of link information) is currently getting about 40 rows a second,
 which is nowhere near the performance we need to be seeing.  In theory,
 we want to be able to re-write our entire archive of data on a
 1-2 month cycle, so this is a very heavy write application (though
 we're also constantly generating reports from the data, so it's not
 write only).

Are you VACUUMing enough? If you are rewriting all of the data, postgres
needs you to clean up afterwards. It is pessimistic, and leaves old rows
in their place.
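For example, something as simple as this from cron after each bulk
rewrite (table name is made up) can keep the table from bloating:

VACUUM ANALYZE links;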


 Is the total number of tables prohibitively affecting our write speed
 or is that an IO problem that can only be addressed by better drive
 partitioning (all data is on one drive, which I've already read is a
 problem)?  Is this approach to data partitioning one which makes any
 sense for performance, or should we move to a more normal distribution
 of links across fewer tables which house more rows each?

If all data is on a single drive, you are nowhere near needing a cluster
to improve your database. What you need is a 14-drive RAID array. It's
probably cheaper than 4x powerful machines, and will provide you with
much better performance. And put all of your tables back into one.

John
=:-


 Thanks in advance for your advice.

 -matt








Re: [PERFORM] RAID Stripe size

2005-09-19 Thread John A Meinel
bm\mbn wrote:
 Hi Everyone

 The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
 disks.

 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
 4 disks are in RAID10 and hold the Cluster itself.

 the DB will have two major tables 1 with 10 million rows and one with
 100 million rows.
 All the activities against these tables will be SELECT.

What type of SELECTs will you be doing? Mostly sequential reads of a
bunch of data, or indexed lookups of random pieces?


 Currently the stripe size is 8k. I read in many places this is a poor
 setting.

From what I've heard of RAID, if you are doing large sequential
transfers, larger stripe sizes (128k, 256k) generally perform better.
For postgres, though, when you are writing, having the stripe size be
around the same size as your page size (8k) could be advantageous, as
when postgres reads a page, it only reads a single stripe. So if it were
reading a series of pages, each one would come from a different disk.

I may be wrong about that, though.

John
=:-


 Am I right?





Re: [PERFORM] Battery Backed Cache for RAID

2005-09-14 Thread John A Meinel
Alvaro Herrera wrote:
 On Wed, Sep 14, 2005 at 11:28:43AM -0700, Jeffrey W. Baker wrote:

On Wed, 2005-09-14 at 11:25 -0700, Peter Darley wrote:

 I'm getting a new server for our database, and I have a quick question
about RAID controllers with a battery backed cache.  I understand that the
cache will allow the cache to be written out if the power fails to the box,
which allows it to report a write as committed safely when it's not actually
committed.

Actually the cache will just hold its contents while the power is out.
When the power is restored, the RAID controller will complete the writes
to disk.  If the battery does not last through the outage, the data is
lost.


 Just curious: how long are the batteries supposed to last?


The recent *cheap* version of a ramdisk had battery backup for 16 hours.
(Very expensive ramdisks actually have enough battery power to power a
small hard-drive to dump the contents into).

I'm guessing for a RAID controller, the time would be in the max 1 day
range.

John
=:-





Re: [PERFORM] shared buffers

2005-09-11 Thread John A Meinel
Martin Nickel wrote:
 Chris,
 Would you say that 30000 pages is a good maximum for a Postgres install?
 We're running 8.0.3 on 64-bit SUSE on a dual Opteron box with 4G and have
 shared_buffers set at 120000.  I've moved it up and down (it was 160000
 when I got here) without any measurable performance difference.

From what I've read on the mailing list, the sweet spot is usually
around 10k pages. 120k seems far too high.

I believe that the major fixes to the buffer manager are more in 8.1
rather than 8.0, so you probably are hitting some problems. (The biggest
problem was that there were places that required doing a complete scan
through shared memory looking for dirty pages, or some such.)


 The reason I ask is because I occasionally see large-ish queries take
 forever (like cancel-after-12-hours forever) and wondered if this could
 result from shared_buffers being too large.

There are lots of possibilities for why these take so long; perhaps you
could post them, and we can try to help.
For instance, if you have a foreign key reference from one table to
another, and don't have indexes on both sides, then deleting from the
referenced table will cause a sequential scan on the referring table
for *each* deleted row. (IIRC)
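For example, something like this on the referring table (names are made
up) lets the FK check use an index scan instead of a seq scan per
deleted row:

CREATE INDEX child_parent_id_idx ON child (parent_id);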

John
=:-


 Thanks for your (and anyone else's) help!
 Martin Nickel




Re: [PERFORM] Prepared statement not using index

2005-09-11 Thread John A Meinel
Guido Neitzer wrote:
 Hi.

 I have an interesting problem with the JDBC drivers. When I use a
 select like this:

 SELECT t0.aktiv, t0.id, t0.ist_teilnehmer, t0.nachname, t0.plz,
 t0.vorname FROM public.dga_dienstleister t0 WHERE t0.plz  like
 ?::varchar(256) ESCAPE '|' withBindings: 1:53111(plz)

 the existing index on the plz column is not used.

 When I run the same select with a concrete value, the index IS used.

 I use PostgreSQL 8.0.3 on Mac OS X and the JDBC driver 8.0-312 JDBC 3.

 After a lot of other things, I tried using a 7.4 driver and with  this,
 the index is used in both cases.

 Why can this happen? Is there a setting I might have not seen?
 Something I do wrong?

 cug

I've had this problem in the past. In my case, the issue was that the
column I was searching had a mixed blend of possible values. For
example, with 1M rows, the number 3 occurred 100 times, but the number
18 occurred 700,000 times.

So when I manually did a search for 3, it naturally realized that it
could use an index scan, because it had the statistics to say it was
very selective. If I manually did a search for 18, it switched to
sequential scan, because it was not very selective (both are the correct
plans).

But if you create a prepared statement, parameterized on this number,
postgres has no way of knowing ahead of time whether you will be asking
about 3 or 18. So when the query is prepared, it has to be pessimistic
and avoid worst-case behavior, which means it chooses to always use a
sequential scan.

The only way I got around this was with writing a plpgsql function which
used the EXECUTE syntax to dynamically re-plan part of the query.
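Roughly like this, as a sketch only (the function name is made up; the
table and column come from your query):

CREATE OR REPLACE FUNCTION find_by_plz(inplz varchar)
RETURNS SETOF dga_dienstleister AS '
DECLARE
    r dga_dienstleister%ROWTYPE;
BEGIN
    -- EXECUTE plans the query at call time, so the planner sees the
    -- actual value and can choose the index scan when it is selective.
    FOR r IN EXECUTE ''SELECT * FROM dga_dienstleister WHERE plz LIKE ''
                     || quote_literal(inplz) LOOP
        RETURN NEXT r;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;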

Hope this makes sense. This may or may not be your problem, without
knowing more about your setup. But the symptoms seem similar.

John
=:-




Re: [PERFORM] Advise about how to delete entries

2005-09-11 Thread John A Meinel
Kevin wrote:
 Arnau wrote:

 Hi all,

 
  COPY FROM a file with all the ID's to delete, into a temporary
 table, and  do a joined delete to your main table (thus, only one query).


   I already did this, but I don't have idea about how to do this join,
 could you give me a hint ;-) ?

 Thank you very much


 maybe something like this:

 DELETE FROM statistics_sasme s
 LEFT JOIN temp_table t ON (s.statistic_id = t.statistic_id)
 WHERE t.statistic_id IS NOT NULL


Why can't you do:
DELETE FROM statistics_sasme s JOIN temp_table t ON (s.statistic_id =
t.statistic_id);

Or possibly:

DELETE FROM statistics_sasme s
  WHERE s.statistic_id IN (SELECT t.statistic_id FROM temp_table t);

I'm not sure how delete exactly works with joins, but the IN form should
be approximately correct.
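Spelled out, the whole temp-table approach would look something like
this (the file path is made up; the column name comes from your query):

CREATE TEMP TABLE temp_table (statistic_id integer);
COPY temp_table FROM '/tmp/ids_to_delete.txt';
DELETE FROM statistics_sasme
 WHERE statistic_id IN (SELECT statistic_id FROM temp_table);
DROP TABLE temp_table;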

John
=:-





Re: [PERFORM] Caching by Postgres

2005-08-23 Thread John A Meinel
gokulnathbabu manoharan wrote:
 Hi all,

 I like to know the caching policies of Postgresql.
 What parameter in the postgresql.conf affects the
 cache size used by the Postgresql?  As far as I have
 searched my knowledge of the parameters are

In general, you don't. The OS handles caching based on file usage.
So if you are using the files, the OS should cache them. Just like it
does with any other program.


 1. shared_buffers - Sets the limit on the amount of
 shared memory used.  If I take this is as the cache
 size then my performance should increase with the
 increase in the size of shared_buffers.  But it seems
 it is not the case and my performance actually
 decreases with the increase in the shared_buffers.  I
 have a RAM size of 32 GB.  The table which I use more
 frequently has around 68 million rows.  Can I cache
 this entire table in RAM?

There is a portion of this which is used for caching. But I believe
before 8.1 there was code that went linearly through all of the
shared_buffers and checked for dirty/clean pages. So there was a
tradeoff that the bigger you make it, the longer that search goes. So
you got diminishing returns, generally around 10k shared buffers.
I think it is better in 8.1, but if the OS is going to cache it anyway
(since it does), then having a large Postgres cache is just wasting
memory, and not letting the OS cache as much.

So I'm guessing that with 8.1 there would be 2 sweet spots. Low
shared_buffers (<= 10k), and really high shared_buffers (like all of
available ram).
But because postgres has been tuned for the former I would stick with it
(I don't think shared_buffers can go above 2GB, but that might just be
work_mem/maintenance_work_mem).


 2. work_mem - It is the amount of memory used by an
 operation.  My guess is once the operation is complete
 this is freed and hence has nothing to do with the
 caching.

 3. effective_cache_size - The parameter used by the
 query planner and has nothing to do with the actual
 caching.

This is important from the planner's perspective. It can then
expect that the OS is doing its job and caching the tables, so index
scans are cheaper than they would otherwise be.
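For example, on your 32GB box, if most of that RAM ends up as OS file
cache, something along these lines in postgresql.conf would be in the
right ballpark (the exact number is only illustrative; it is measured in
8kB pages):

effective_cache_size = 3500000    # ~27GB of expected OS cache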

John
=:-


 So kindly help me in pointing me to the correct
 parameter to set.

 It will be great if you can point me to the docs that
 explains the implementation of caching in Postgresql
 which will help me in understanding things much
 clearly.

 Thanks in advance.
 Gokul.






Re: [PERFORM] extremly low memory usage

2005-08-21 Thread John A Meinel

Jeremiah Jahn wrote:

On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:


Ron wrote:


At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:


Well, since you can get a read of the RAID at 150MB/s, that means that 
it is actual I/O speed. It may not be cached in RAM. Perhaps you could 
try the same test, only using say 1G, which should be cached.



[EMAIL PROTECTED] pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m8.885s
user    0m0.299s
sys     0m6.998s
[EMAIL PROTECTED] pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m1.654s
user    0m0.232s
sys     0m1.415s



The write time seems about the same (but you only have 128MB of write 
cache), but your read jumped up to 620MB/s. So your drives do seem to be 
giving you 150MB/s.






...

I'm actually curious about PCI bus saturation at this point. Old 32-bit 
33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this 
is a higher performance system. But I'm really surprised that your write 
speed is that close to your read speed. (100MB/s write, 150MB/s read).



The raid array I have is currently set up to use a single channel. But I
have dual controllers In the array. And dual external slots on the card.
The machine is brand new and has pci-e backplane.  





Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them 
doing raw sequential IO like this should be capable of at least
~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s 


BTW I'm using Seagate Cheetah 15K.4's



Now, are the numbers that Ron is quoting in megabytes or megabits? I'm 
guessing he knows what he is talking about, and is doing megabytes. 
80MB/s sustained seems rather high for a hard-disk.


Though this page:
http://www.storagereview.com/articles/200411/20041116ST3146754LW_2.html

Does seem to agree with that statement. (Between 56 and 93MB/s)

And since U320 is a 320MB/s bus, it doesn't seem like anything there 
should be saturating. So why the low performance?




_IF_ the controller setup is high powered enough to keep that kind of IO 
rate up.  This will require a controller or controllers providing dual 
channel U320 bandwidth externally and quad channel U320 bandwidth 
internally.  IOW, it needs a controller or controllers talking 64b 
133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized 
IO buffer as well.


AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic 
MegaRAID controllers.  What I don't know is which exact one yours is, 
nor do I know if it (or any of the MegaRAID controllers) are high 
powered enough.



PERC4eDC-PCI Express, 128MB Cache, 2-External Channels


Do you know which card it is? Does it look like this one:
http://www.lsilogic.com/products/megaraid/megaraid_320_2e.html

Judging by the 320 speed, and 2 external controllers, that is my guess.
They at least claim a theoretical max of 2GB/s.

Which makes you wonder why reading from RAM is only able to get 
throughput of 600MB/s. Did you run it multiple times? On my windows 
system, I get just under 550MB/s for what should be cached, copying from 
/dev/zero to /dev/null I get 2.4GB/s (though that might be a no-op).


On a similar linux machine, I'm able to get 1200MB/s for a cached file. 
(And 3GB/s for a zero=null copy).


John
=:-




Talk to your HW supplier to make sure you have controllers adequate to 
your HD's.


...and yes, your average access time will be in the 5.5ms - 6ms range 
when doing a physical seek.
Even with RAID, you want to minimize seeks and maximize sequential IO 
when accessing them.

Best to not go to HD at all ;-)


Well, certainly, if you can get more into RAM, you're always better off. 
For writing, a battery-backed write cache helps; for reading, lots of 
system RAM.



I'm not really worried about the writing, it's the reading that needs
to be faster. 




Hope this helps,
Ron Peacetree



John
=:-






Re: [PERFORM] Performance pb vs SQLServer.

2005-08-18 Thread John A Meinel
Qingqing Zhou wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes

Interesting; do they use an overwriting storage manager like Oracle, or
a non-overwriting one like Postgres?



 They call this MVCC RLV(row level versioning). I think they use rollback
 segment like Oracle (a.k.a version store or tempdb in SQL Server).  Some
 details are explained in their white paper:Database concurrency and row
 level versioning in SQL Server 2005.

 Regards,
 Qingqing


I found the paper here:
http://www.microsoft.com/technet/prodtechnol/sql/2005/cncrrncy.mspx

And it does sound like they are doing it the Oracle way:

When a record in a table or index is updated, the new record is stamped
with the transaction sequence_number of the transaction that is doing
the update. The previous version of the record is stored in the version
store, and the new record contains a pointer to the old record in the
version store. Old records in the version store may contain pointers to
even older versions. All the old versions of a particular record are
chained in a linked list, and SQL Server may need to follow several
pointers in a list to reach the right version. Version records need to
be kept in the version store only as long as there are operations that
might require them.

John
=:-




Re: [PERFORM] extremly low memory usage

2005-08-18 Thread John Arbash Meinel
Jeremiah Jahn wrote:

here's an example standard query. I really have to make the first hit go
faster. The table is clustered on full_name as well. 'Smith%'
took 87 seconds on the first hit. I wonder if I set up my array wrong.
I remember seeing something about DMA access versus something else, and
chose DMA access. LVM maybe?
  

It would be nice if you would format your queries to be a little bit
easier to read before posting them.
However, I believe I am reading it correctly in saying that the index scan
on identity is not your slow point. In fact, as near as I can tell, it
only takes 52ms to complete.

The expensive parts are the 4915 lookups into the litigant_details (each
one takes approx 4ms for a total of ~20s).
And then you do it again on case_data (average 3ms each * 4906 loops =
~15s).

So there is no need for preloading your indexes on the identity table.
It is definitely not the bottleneck.

So a few design bits, which may help your database.
Why is actor_id a text field instead of a number?
You could try creating an index on litigant_details (actor_id,
count_ori) so that it can do just an index lookup, rather than an index
+ filter.
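For example (the index name is made up):

CREATE INDEX lit_details_actor_ori_idx
    ON litigant_details (actor_id, court_ori);
ANALYZE litigant_details;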

More importantly, though, the planner seems to think the join of
identity to litigant_details will only return 1 row, not 5000.
Do you regularly vacuum analyze your tables?
Just as a test, try running:
set enable_nestloop to off;
And then run EXPLAIN ANALYZE again, just to see if it is faster.

You probably need to increase some statistics targets, so that the
planner can design better plans.
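Something along these lines, using the columns that show up in the plan
below (the target value is just a guess):

ALTER TABLE identity ALTER COLUMN full_name SET STATISTICS 200;
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 200;
ANALYZE identity;
ANALYZE litigant_details;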

  ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
        ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
              ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
                    Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
                    ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
                          Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
                          Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
                    ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
                          Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
                          Filter: ('IL081025J'::text = (court_ori)::text)
              ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
                    Filter: ('IL081025J'::text = (id)::text)
        ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
              Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))


John
=:-





Re: [PERFORM] Data Selection Slow From VB 6.0

2005-08-17 Thread John A Meinel
Mahesh Shinde wrote:
 Hi
...

 To connect To the PostgreSQL I am using **PostgreSQL Win32 ODBC and
 OLEDB client drivers 1.0.0.2**
  

...

 Since in the database I have one major table, the Debtor table, which
 is the master table, having around 55 lac (5.5 million) records. I have
 set debtorId as a primary key with an index on it. I am developing a
 search screen to search for specific debtor info using this table.
  
 When I fire a query to search a debtor id, it takes around 5 seconds to
 return an answer for whether the entered debtor id is present in the
 database or not, using ODBC. Whereas when I EXPLAIN the query on the
 database:
  Index Scan using tbmstban_debtorid on tbmstbandetails  (cost=0.00..6.01
 rows=2 width=143)
Index Cond: ((debtorid)::text = '234'::text)

Are you checking this from the VB App? Or just going onto the server and
running psql? (I'm guessing there is some way to run a flat query using
VB. In which case you can just have the query run EXPLAIN ANALYZE, the
return value is just the text, one line after another.)

What I'm thinking is that it might be a locale/encoding issue.
What is the encoding of your database? And what is the default locale
and the locale that you are connecting as?

Can you give us the EXPLAIN ANALYZE output so that we can see if the
planner is doing what it thinks it is?

It certainly sounds like either it is always doing a sequential scan, or
something else is going on. 5 sec is a really long time for the type of
query you are doing.

Oh, and can you run the win32 psql client to see if it might be ODBC
which is causing the problem?

John
=:-


 
 Query for the search criteria is
  *select * from tbmstdebtordetails where debtorid ='234'*
  
  Whereas when I am using a like query to search for records starting with
 a debtor id having some characters, it takes around 10-15 sec to return a
 record set.
 The query is:
 *select * from tbmstdebtordetails where debtorid like '234%'*
  
 Explain output on the database
  Index Scan using tbmstban_debtorid on tbmstbandetails  (cost=0.00..6.01
 rows=2 width=143)
Index Cond: ((debtorid)::text = '234%'::text)
 Thanks & regards,
 Mahesh Shinde
 --
 Codec Communications (I) Pvt. Ltd.
 PUNE (INDIA)
 T # 91-20-24221460/70(Ext 43)
 Desk No. 25143
 Email – [EMAIL PROTECTED]






Re: [PERFORM] extremly low memory usage

2005-08-17 Thread John A Meinel
Jeremiah Jahn wrote:
 I just put together a system with 6GB of ram on a 14 disk raid 10 array.
 When I run my usual big painful queries, I get very little to no
 memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
 most of the time. the new devel box sits at around 250MB.

 I've switched to an 8.0 system on the new devel box, but the .conf
 really didn't change. Index usage is the same. Something seems wrong and
 I'm not sure why.


How big is your actual database on disk? And how much of it is actually
touched by your queries?

It seems that your tough queries might only be exercising a portion of
the database. If you really want to make memory usage increase, try
something like:
find . -type f -print0 | xargs -0 cat > /dev/null
Which should read all the files. After doing that, does the memory usage
increase?


 any thoughts,
 -jj-


 shared_buffers = 32768  # min 16, at least max_connections*2, 8KB each
 work_mem = 2097151  # min 64, size in KB

This seems awfully high. 2GB per sort? This might actually be flushing
some of your ram, since it would get allocated and filled, and then
freed when finished. Remember, depending on what you are doing, this
amount can get allocated more than once per query.
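If the large value is only needed for a few big reporting queries, it
might be safer to keep the global setting modest and raise it just for
those sessions (the number here is only illustrative; units are KB on
8.0):

SET work_mem = 65536;   -- 64MB for this session only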

 maintenance_work_mem = 819200   # min 1024, size in KB
 max_fsm_pages = 80000   # min max_fsm_relations*16, 6 bytes each
 checkpoint_segments = 30# in logfile segments, min 1, 16MB each
 effective_cache_size = 3600000   <- this is a little out of control, but 
 would it have any real effect?

It should just tell the planner that it is more likely to have buffers
in cache, so index scans are slightly cheaper than they would otherwise be.

 random_page_cost = 2# units are one sequential page fetch cost
 log_min_duration_statement = 10000 # -1 is disabled, in milliseconds.
 lc_messages = 'C'   # locale for system error message strings
 lc_monetary = 'C'   # locale for monetary formatting
 lc_numeric = 'C'# locale for number formatting
 lc_time = 'C'   # locale for time formatting


John
=:-




Re: [PERFORM] Need for speed

2005-08-16 Thread John A Meinel
Ulrich Wisser wrote:
 Hello,

 one of our services is click counting for on line advertising. We do
 this by importing Apache log files every five minutes. This results in a
 lot of insert and delete statements. At the same time our customers
 shall be able to do on line reporting.

What are you deleting? I can see having a lot of updates and inserts,
but I'm trying to figure out what the deletes would be.

Is it just that you completely refill the table based on the apache log,
rather than doing only appending?
Or are you deleting old rows?


 We have a box with
 Linux Fedora Core 3, Postgres 7.4.2
 Intel(R) Pentium(R) 4 CPU 2.40GHz
 2 scsi 76GB disks (15.000RPM, 2ms)

 I did put pg_xlog on another file system on other discs.

 Still when several users are on line the reporting gets very slow.
 Queries can take more then 2 min.

If it only gets slow when you have multiple clients it sounds like your
select speed is the issue, more than conflicting with your insert/deletes.


 I need some ideas how to improve performance in some orders of
 magnitude. I already thought of a box with the whole database on a ram
 disc. So really any idea is welcome.

How much ram do you have in the system? It sounds like you only have 1
CPU, so there is a lot you can do to make the box scale.

A dual Opteron (possibly a dual motherboard with dual core (but only
fill one for now)), with 16GB of ram, and an 8-drive RAID10 system would
perform quite a bit faster.

How big is your database on disk? Obviously it isn't very large if you
are thinking to hold everything in RAM (and only have 76GB of disk
storage to put it in anyway).

If your machine only has 512M, an easy solution would be to put in a
bunch more memory.

In general, your hardware is pretty low in overall specs. So if you are
willing to throw money at the problem, there is a lot you can do.

Alternatively, turn on statement logging, and then post the queries that
are slow. This mailing list is pretty good at fixing poor queries.

One thing you are probably hitting is a lot of sequential scans on the
main table.

If you are doing mostly inserting, make sure you are in a transaction,
and think about doing a COPY.
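For example, a loader that batches everything into one transaction and
uses COPY (table and file names are made up) is typically much faster
than autocommitted row-by-row INSERTs:

BEGIN;
COPY clicks (url, referer, seen_at) FROM '/var/log/import/clicks.copy';
COMMIT;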

There is a lot more that can be said, we just need to have more
information about what you want.

John
=:-


 Ulrich








Re: [PERFORM] PG8 Tuning

2005-08-16 Thread John A Meinel
Alvaro Herrera wrote:
 On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:


However, you are absolutely correct in that it's *relative* advice, not
absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
storage you'll probably be better off giving it everything and letting its
controller and cache handle disk allocation etc.   On the other hand, if
you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter
a case where a separate xlog disk did not benefit an OLTP application.


 I've been asked this a couple of times and I don't know the answer: what
 happens if you give XLog a single drive (unmirrored single spindle), and
 that drive dies?  So the question really is, should you be giving two
 disks to XLog?


I can propose a simple test. Create a test database. Run postgres,
insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
directory. Start postgres again, what does it do?

I suppose to simulate more of a failure mode, you could kill -9 the
postmaster (and all children processes) perhaps during an insert, and
then delete pg_xlog.

But I would like to hear from the postgres folks what they *expect*
would happen if you ever lost pg_xlog.

What about something like keeping pg_xlog on a ramdisk, and then
rsyncing it to a hard-disk every 5 minutes. If you die in the middle,
does it just restore back to the 5-minutes ago point, or does it get
more thoroughly messed up?
For some people, a 5-minute old restore would be okay, as long as you
still have transaction safety, so that you can figure out what needs to
be restored.

John
=:-





Re: [PERFORM] choosing RAID level for xlogs

2005-08-16 Thread John A Meinel
Anjan Dave wrote:
 Yes, that's true, though, I am a bit confused because the Clariion array
 document I am reading talks about how the write cache can eliminate the
 RAID5 Write Penalty for sequential and large IOs...resulting in better
 sequential write performance than RAID10.

 anjan


Well, if your stripe size is 128k, and you have N disks in the RAID (N
must be even and >= 4 for RAID10).

With RAID5 you have a stripe across N-1 disks, and 1 parity entry.
With RAID10 you have a stripe across N/2 disks, replicated on the second
set.

So if the average write size is 128k*N/2, then you will generally be
using all of the disks during a write, and you can expect a maximum
scale-up of about N/2 for RAID10.

If your average write size is 128k*(N-1) then you can again write an
entire stripe at a time and even the parity since you already know all
of the information you don't have to do any reading. So you can get a
maximum speed up of N-1.

If you are doing infrequent smallish writes, it can be buffered by the
write cache, and isn't disk limited at all. And the controller can write
it out when it feels like it. So it should be able to do more buffered
all-at-once writes.

If you are writing a little bit more often (such that the cache fills
up), depending on your write pattern, it is possible that all of the
stripes are already in the cache, so again there is little penalty for
the parity stripe.

I suppose the worst case is if you were writing lots of very small
chunks, all over the disk in random order. In which case each write
encounters a 2x read penalty for a smart controller, or a Nx read
penalty if you are going for more safety than speed. (You can read the
original value, and the parity, and re-compute the parity with the new
value (2x  read penalty), but if there is corruption it would not be
detected, so you might want to read all of the stripes in the block, and
recompute the parity with the new data (Nx read penalty)).

I think the issue for Postgres is that it writes 8k pages, which is
quite small relative to the stripe size. So you don't tend to build up
big buffers to write out the entire stripe at once.

So if you aren't filling up your write buffer, RAID5 can do quite well
with bulk loads.
I also don't know about the penalties for a read followed immediately by
a write. Since you will be writing to the same location, you know that
you have to wait for the disk to spin back to the same location. At 10k
rpm that is a 6ms wait time. For 7200rpm disks, it is 8.3ms.

Just to say that there are some specific extra penalties when you are
reading the location that you are going to write right away. Now a
really smart controller with lots of data to write could read the whole
circle on the disk, and then start writing out the entire circle, and
not have any spin delay. But you would have to know the size of the
circle, and that depends on what block you are on, and the heads
arrangement and everything else.
Though since hard-drives also have small caches in them, you could hide
some of the spin delay, but not a lot, since you have to leave the head
there until you are done writing, so while the current command would
finish quickly, the next command couldn't start until the first actually
finished.

Writing large buffers hides all of these seek/spin based latencies, so
you can get really good throughput. But a lot of DB action is small
buffers randomly distributed, so you really do need low seek time, of
which RAID10 is probably better than RAID5.

John
=:-




Re: [PERFORM] choosing RAID level for xlogs

2005-08-16 Thread John A Meinel
Anjan Dave wrote:
 Yes, that's true, though, I am a bit confused because the Clariion array
 document I am reading talks about how the write cache can eliminate the
 RAID5 Write Penalty for sequential and large IOs...resulting in better
 sequential write performance than RAID10.

 anjan


To give a shorter statement after my long one...
If you have enough cache that the controller can write out big chunks to
the disk at a time, you can get very good sequential RAID5 performance,
because the stripe size is large (so it can do a parallel write to all
disks).

But for small chunk writes, you suffer the penalty of the read before
write, and possible multi-disk read (depends on what is in cache).

RAID10 generally handles small writes better, and I would guess that
4 disks would perform almost identically to 6 disks, since you aren't
usually writing enough data to span multiple stripes.

If your battery-backed cache is big enough that you don't fill it, they
probably perform about the same (superfast) since the cache hides the
latency of the disks.

If you start filling up your cache, RAID5 probably can do better because
of the parallelization.

But small writes followed by an fsync do favor RAID10 over RAID5.

John
=:-




Re: [PERFORM] How many views is ok?

2005-08-14 Thread John Arbash Meinel
Petr Kavan wrote:

 I have database of company data, and some of them is table of
 information about employees. I need each employee to have access only
  to his own row. Postgres cannot do this with its system of privileges,
  because that can give privileges only on whole tables.

  One possibility is to create a view for each employee that chooses only
  his data and give the employee privileges to this view. But I am not sure
  if such a number of views has some performance drawbacks or
  even if postgres can support it (I expect it can). I would need several
  tables protected like this and it could result in, say, 1000 views at
  maximum.

  Because access to the DB will go through a PHP information system,
  another possibility to protect the data is to let the IS connect as more
  privileged than the user really is, but let it retrieve only data for
  that user.

  The view approach seems far clearer than this, but I'm not sure if
  postgres can handle it without problems.

We do a similar thing tying user to per-row permissions. We have 1 view
per table, and it works fine.
I would recommend that you do something similar. Basically, just make
the view:

CREATE VIEW just_me SECURITY DEFINER AS
SELECT * FROM user_table WHERE username=session_user;
REVOKE ALL ON user_table FROM PUBLIC;
GRANT SELECT ON just_me TO PUBLIC;

security definer, means that the 'just_me' view will be executed as the
user who created the function (superuser).
The REVOKE ALL (my syntax might be wrong) prevents users from querying
the user tables directly.
The 'session_user' makes the view use the name of the actual connected
user (because of security definer, plain 'user' is the superuser)
This should allow a user to see only their own row in the database.
(Whichever rows that have username matching the connected name).

Now, this only works if the php front end connects specifically as the
given user (our system is setup to do this).

If you don't do it this way, you'll always be stuck with the IS layer
doing the restriction. Even if you create a view per user, if your PHP
layer has the right to look at other tables/views, it doesn't really help.

Good luck,
John
=:-


 Thanks for any reply :-)

 ---
 Petr Kavan
 Database Development









Re: [PERFORM] Performance pb vs SQLServer.

2005-08-14 Thread John Arbash Meinel
Stéphane COEZ wrote:

Hi,

I have a perfomance issue :

I run PG (8.0.3) and SQLServer2000 on a Windows2000 Server (P4 1,5Ghz 512Mo)
I have a table (3200000 rows) and I run this single query :

select cod from mytable group by cod
I have an index on cod (char(4) - 88 different values)

PG = ~ 20 sec.
SQLServer =  8 sec


the explain is :

HashAggregate  (cost=64410.09..64410.09 rows=55 width=8)
  ->  Seq Scan on mytable  (cost=0.00..56325.27 rows=3233927 width=8)


if I switch to enable_hashagg = false (just for a try...)
the planner will choose my index :

Group  (cost=0.00..76514.01 rows=55 width=8)
  ->  Index Scan using myindex on mytable  (cost=0.00..68429.20 rows=3233927
width=8)

but performance will be comparable to previous test.

So with or without using Index I have the same result.
  


My guess is that this is part of a larger query. There isn't really much
you can do. If you want all 3.2M rows, then you have to wait for them to
be pulled in.

What you generally can do for performance, is to restructure things, so
that you *don't* have to touch all 3.2M rows.
If you are just trying to determine what the unique entries are for cod,
you probably are better off doing some normalization, and keeping a
separate table of cod values.

I'm guessing the reason your query is faster with SQLServer is because
of how postgres handles MVCC. Basically, it still has to fetch the main
page to determine if a row exists. While SQL server doesn't do MVCC, so
it can just look things up in the index.

You might also try a different query, something like:

SELECT DISTINCT cod FROM mytable GROUP BY cod ORDER BY cod;
(You may or may not want the ORDER BY or the GROUP BY; try the different
combinations.)
It might be possible to have the planner realize that all you want is
unique rows; just doing a group by doesn't give you that.

John
=:-


Thanks for help.
 
Stéphane COEZ





  







Re: [PERFORM] Performance pb vs SQLServer.

2005-08-14 Thread John Arbash Meinel
Steinar H. Gunderson wrote:

On Sun, Aug 14, 2005 at 07:27:38PM -0500, John Arbash Meinel wrote:
  

My guess is that this is part of a larger query. There isn't really much
you can do. If you want all 3.2M rows, then you have to wait for them to
be pulled in.



To me, it looks like he'll get 88 rows, not 3.2M. Surely we must be able to
do something better than a full sequential scan in this case?

test=# create table foo ( bar char(4) );
CREATE TABLE
test=# insert into foo values ('');
INSERT 24773320 1
test=# insert into foo values ('');
INSERT 24773321 1
test=# insert into foo values ('');
INSERT 24773322 1
test=# select * from foo group by bar;
 bar  
--
 
 
(2 rows)

I considered doing some odd magic with generate_series() and subqueries with
LIMIT 1, but it was a bit too weird in the end :-)

/* Steinar */
  

I think a plain GROUP BY is not smart enough to detect it doesn't need
all rows (since it is generally used because you want to get aggregate
values of other columns).
I think you would want something like SELECT DISTINCT, possibly with an
ORDER BY rather than a GROUP BY (which was my final suggestion).

John
=:-





[PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
I'm having an odd case where my system is locking such that if I insert
into a table during a transaction, and then start a new connection and
transaction, the second one blocks while trying to do a similar insert
until the first transaction is committed or rolled back.

The schema is rather complex (currently 157 tables, 200 views), and I
still haven't been able to create a small test case. Everything I've
tried so far just works.

The data is private, but the schema is open source, so I probably could
work with someone on it. When I look at the pg_locks table, I seem to be
blocked on:
SELECT * FROM pg_locks WHERE granted = false;
 relation | database | transaction |  pid  |   mode   | granted
--+--+-+---+--+-
  |  | 1525932 | 30175 | ShareLock| f
...

Which if I understand correctly, means that the current transaction is
intentionally blocking waiting for the other transaction to finish.

I'm currently running 8.0.3, but the database was first created under
7.4.? I confirmed this behavior on both systems.

Under what circumstances would this occur?

To try and outline the situation: there is a main object table, which is
the root object. It contains a group column which is used for access
rights. There is a groupref table, which keeps track of the group rights
for each user. (Each user has specific insert/update/select rights per
group.)

The select rights are enforced by views (the tables are not publicly
accessible, the views join against the groupref table to check for
select permission).
Insert and update rights are validated by BEFORE INSERT triggers.

Most tables references the object table. Basically it is OO, but doesn't
use the postgres inheritance (in our testing postgres inheritance didn't
scale well for deep inheritance, and wasn't able to enforce uniqueness
anyway.) The views present an OO appearance, and behind the scenes
direct table foreign keys maintain referential integrity.

I have checked using RAISE NOTICE and the BEFORE INSERT trigger gets all
the way to the RETURN statement before things hang, so I haven't figured
out what is actually hanging.

I have a bzip'd version of the schema and just enough data to be useful
available here:
http://www.arbash-meinel.com/extras/schema_and_data.sql.bz2

This is the commands to replicate the locking:

-- Connect as postgres

-- Required before any inserts, so that the TEMP env table is
-- created and filled out.
select mf_setup_env();

-- Begin a transaction and insert some data
BEGIN;
INSERT INTO object(vgroup,otype,oname) VALUES ('test',1,'test');

-- Start a new shell, and connect again and do exactly the same thing
-- as the above.
-- It should hang until you either do END/ROLLBACK in the first
-- connection.

Thanks for any help,
John
=:-




Re: [PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
Alvaro Herrera wrote:
 On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote:

I'm having an odd case where my system is locking such that if I insert
into a table during a transaction, if I start a new connection and
transaction, it blocks while trying to do a similar insert until the
first transaction is committed or rolled back.


 Are there foreign keys here?  I can duplicate the problem easily with
 them:

 -- session 1
 create table a (a serial primary key);
 create table b (a int references a);
 insert into a values (1);

 begin;
 insert into b values (1);


 -- session 2
 insert into b values (1);
 -- hangs


Actually, there are, but the insert is occurring into table 'a', not
table 'b'.
'a' refers to other tables, but those should not be modified.


 If I commit on session 1, session 2 is unlocked.

 This is a known problem, solved in 8.1.  A workaround for previous
 releases is to defer FK checks until commit:

 create table b (a int references a initially deferred);

I'll try one of the CVS entries and see if it happens there. Good to
hear there has been work done.

John
=:-







Re: [PERFORM] Mostly read performance

2005-08-11 Thread John A Meinel
Jeffrey Tenny wrote:
 I have a largely table-append-only application where most transactions
 are read-intensive and many are read-only.  The transactions may span
 many tables, and in some cases might need to pull 70 MB of data out of a
 couple of the larger tables.


 In 7.3, I don't seem to see any file system or other caching that helps
 with repeated reads of the 70MB of data.  Secondary fetches are pretty
 much as slow as the first fetch. (The 70MB in this example might take
 place via 2000 calls to a parameterized statement via JDBC).

 Were there changes after 7.3 w.r.t. caching of data? I read this list
 and see people saying that 8.0 will use the native file system cache to
 good effect.  Is this true? Is it supposed to work with 7.3?  Is there
 something I need to do to get postgresql to take advatage of large ram
 systems?

 Thanks for any advice.


Well, first off, the general recommendation is probably that 7.3 is
really old, and you should try to upgrade to at least 7.4, though 8.0 is
recommended.

The bigger questions: How much RAM do you have? How busy is your system?

8.0 doesn't really do anything to make the system cache the data.
What kernel are you using?

Also, if your tables are small enough, and your RAM is big enough, you
might already have everything cached.

One way to flush the caches, is to allocate a bunch of memory, and then
scan through it. Or maybe mmap a really big file, and access every byte.
But if your kernel is smart enough, it could certainly deallocate pages
after you stopped accessing them, so I can't say for sure that you can
flush the memory cache. Usually, I believe these methods are sufficient.

John
=:-




Re: [PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
Alvaro Herrera wrote:
 On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote:


...


 This is a known problem, solved in 8.1.  A workaround for previous
 releases is to defer FK checks until commit:

So I don't know exactly what the fix was, but I just tested, and my
problem is indeed fixed with the latest CVS head. It no longer blocks.


 create table b (a int references a initially deferred);


John
=:-




Re: [PERFORM] Speedier count(*)

2005-08-10 Thread John A Meinel
Dan Harris wrote:
 I have a web page for my customers that shows them count of records  and
 some min/max date ranges in each table of a database, as this is  how we
 bill them for service.  They can log in and check the counts  at any
 time.  I'd like for the counts to be as fresh as possible by  keeping
 this dynamic, but I will use a periodic 'snapshot'/cron job  if that is
 the only option to speed this up.   I have thought about  using the
 table statistics, but the estimate error is probably  unacceptable
 because of the billing purposes.

 For some reason, the SQL Server we migrated the app from can return
 count(*) in a split second on multi-million row tables, even though  it
 is a MUCH slower box hardware-wise, but it's now taking many  seconds to
 run. I have read in the archives the problems MVCC brings  into the
 count(*) dilemma forcing Pg to run a seq scan to get  counts.  Does
 SQLServer not use MVCC or have they found another  approach for arriving
 at this number?  Compounding all the min/max  and counts from other
 tables and all those queries take about a  minute to run. The tables
 will contain anywhere from 1 million to 40  million rows.

I believe SQL Server doesn't use MVCC in the same way. At the very
least, it stores some row information in the index, so it can get some
info from just an index, without having to go to the actual page (MVCC
requires a main page visit to determine visibility.)

Depending on how much it impacts performance, you can create an
INSERT/UPDATE trigger so that whenever a new entry is added, it
automatically updates a statistics table. It would be maintained as you
go, rather than periodically like a cron job.

I would go Cron if things can be slightly out of date (like 1 hour at
least), and you need updates & inserts to not be slowed down.
Otherwise I think the trigger is nicer, since it doesn't do redundant
work, and means everything stays up-to-date.
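A rough sketch of the trigger approach (all names are made up):

CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);

CREATE OR REPLACE FUNCTION bump_row_count() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = TG_RELNAME;
    ELSIF TG_OP = ''DELETE'' THEN
        UPDATE row_counts SET n = n - 1 WHERE table_name = TG_RELNAME;
    END IF;
    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER billing_count_trig
    AFTER INSERT OR DELETE ON billing_detail
    FOR EACH ROW EXECUTE PROCEDURE bump_row_count();

Then the web page just reads row_counts instead of doing count(*).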



 Also, I am using select ... group by ... order by .. limit 1 to get
 the min/max since I have already been bit by the issue of min() max()
 being slower.


 -Dan

John
=:-




Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure even if
RAID1 is on top of RAID0.



Well, you tell me if I stated incorrectly.  There are two raid  
enclosures with 7 drives in each.  Each is on its own bus on a dual- 
channel controller.  Each box has a stripe across its drives and the  
enclosures are mirrors of each other.  I understand the controller  
could be a single point of failure, but I'm not sure I understand  your 
concern about the RAID structure itself.


In this configuration, if you have a drive fail on both controllers, the 
entire RAID dies. Let's label them A1-7 and B1-7. Because you stripe within 
a set, if a single one of A dies and a single one of B dies, you have 
lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and then 
stripe above that. Since you are using 2 7-disk enclosures, I'm not sure 
how you can do it well, since it is not an even number of disks. Though 
if you are using software RAID, there should be no problem.


The difference is that in this scenario, *all* of the A drives can die, 
and you haven't lost any data. The only thing you can't lose is a 
matched pair (eg losing both A1 and B1 will cause complete data loss)


I believe the correct notation for this last form is RAID 1 + 0 (RAID10) 
since you have a set of RAID1 drives, with a RAID0 on-top of them.






How many users are connected when your update / delete queries are
hanging? Have you done an analyze verbose on those queries?



Most of the traffic is from programs we run to do analysis of the  data 
and managing changes.  At the time I noticed it this morning,  there 
were 10 connections open to the database.  That rarely goes  above 20 
concurrent.  As I said in my other response, I believe that  the log 
will only contain the query at the point the query finishes,  so if it 
never finishes...




Have you made changes to the postgresql.conf? kernel.vm settings? IO
scheduler?



I set shmmax appropriately for my shared_buffers setting, but that's  
the only kernel tweak.




If you're not doing so already, you may consider running sar  (iostat) to
monitor when the hanging occurs if their is a memory / IO bottleneck
somewhere.



I will try that.  Thanks



When you discover that an update is hanging, can you get into the 
database, and see what locks currently exist? (SELECT * FROM pg_locks)


That might help you figure out what is being locked and possibly 
preventing your updates.


It is also possible that your UPDATE query is trying to do something 
funny (someone just recently was talking about an UPDATE that wanted to 
do a hash join against 12M rows). Which probably meant that it had to 
spill to disk, where a merge join would have worked better.


John
=:-




Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 9, 2005, at 3:51 PM, John A Meinel wrote:


Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure  even if
RAID1 is on top of RAID0.

Well, you tell me if I stated incorrectly.  There are two raid   
enclosures with 7 drives in each.  Each is on its own bus on a  dual- 
channel controller.  Each box has a stripe across its drives  and 
the  enclosures are mirrors of each other.  I understand the  
controller  could be a single point of failure, but I'm not sure I  
understand  your concern about the RAID structure itself.




In this configuration, if you have a drive fail on both  controllers, 
the entire RAID dies. Lets label them A1-7, B1-7,  because you stripe 
within a set, if a single one of A dies, and a  single one of B dies, 
you have lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and  
then stripe above that. Since you are using 2 7-disk enclosures,  I'm 
not sure how you can do it well, since it is not an even number  of 
disks. Though if you are using software RAID, there should be no  
problem.


The difference is that in this scenario, *all* of the A drives can  
die, and you haven't lost any data. The only thing you can't lose  is 
a matched pair (eg losing both A1 and B1 will cause complete  data loss)


I believe the correct notation for this last form is RAID 1 + 0  
(RAID10) since you have a set of RAID1 drives, with a RAID0 on-top  of 
them.




I have read up on the difference now. I don't understand why it's a  
single point of failure.  Technically any array could be a single  
point depending on your level of abstraction.   In retrospect, I  
probably should have gone 8 drives in each and used RAID 10 instead  for 
the better fault-tolerance,  but it's online now and will require  some 
planning to see if I want to reconfigure that in the future.  I  wish 
HP's engineer would have promoted that method instead of 0+1..


I wouldn't say that it is a single point of failure, but I *can* say 
that it is much more likely to fail. (2 drives rather than on average n 
drives)


If your devices will hold 8 drives, you could simply do 1 8-drive, and 
one 6-drive. And then do RAID1 with pairs, and RAID0 across the 
resultant 7 RAID1 sets.


I'm really surprised that someone promoted RAID 0+1 over RAID10. I think 
I've heard that there is a possible slight performance improvement, but 
really the failure mode makes it a poor tradeoff.


John
=:-



-Dan





Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 Okay,

 Here is the status of the SA updates and a question:

 Michael got SA changed to pass an array of tokens to the proc so right
 there we gained a ton of performance due to connections and transactions
 being grouped into one per email instead of one per token.

 Now I am working on making the proc even faster.  Since we have all of
 the tokens coming in as an array, it should be possible to get this down
 to just a couple of queries.

 I have the proc using IN and NOT IN statements to update everything at
 once from a temp table, but it progressively gets slower because the
 temp table is growing between vacuums.  At this point it's slightly
 slower than the old update or else insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.


 What I really want to do is have the token array available as a record
 so that I can query against it, but not have it take up the resources of
 a real table.  If I could copy from an array into a record then I can
 even get rid of the loop.  Anyone have any thoughts on how to do this?


My one question here: are the inspam_count and inham_count *always* the
same for all tokens? I would have thought each token has its own count.
Anyway, there are a few lines I would change:


 CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
  intokenary BYTEA[],
  inspam_count INTEGER,
  inham_count INTEGER,
  inatime INTEGER)
 RETURNS VOID AS '
 DECLARE
   _token BYTEA;
 BEGIN


-- create the table at the start of the procedure
CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- You might also add primary key if you are going to be adding
-- *lots* of entries, but it sounds like you are going to have
-- less than 1 page, so it doesn't matter

   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
   LOOP
 _token := intokenary[i];
 INSERT INTO bayes_token_tmp VALUES (_token);
   END LOOP;

   UPDATE
 bayes_token
   SET
 spam_count = greatest_int(spam_count + inspam_count, 0),
 ham_count = greatest_int(ham_count + inham_count , 0),
 atime = greatest_int(atime, 1000)
   WHERE
 id = inuserid
   AND

--(token) IN (SELECT intoken FROM bayes_token_tmp);
  EXISTS (SELECT token FROM bayes_token_tmp
   WHERE intoken=token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token) There are a few possibilities, but to me
-- as your bayes_token table becomes big, this will start
-- to be the slow point

-- Rather than doing 2 NOT IN queries, it *might* be faster to do
   DELETE FROM bayes_token_tmp
WHERE NOT EXISTS (SELECT token FROM bayes_token
   WHERE token=intoken);



   UPDATE
 bayes_vars
   SET

-- token_count = token_count + (SELECT count(intoken) FROM
-- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
   token_count = token_count + (SELECT count(intoken)
  FROM bayes_token_tmp)

-- You don't need the where NOT IN, since we already removed those rows

 newest_token_age = greatest_int(newest_token_age, inatime),
 oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE
 id = inuserid;

   INSERT INTO
 bayes_token
 SELECT
   inuserid,
   intoken,
   inspam_count,
   inham_count,
   inatime
 FROM
   bayes_token_tmp
 WHERE
   (inspam_count > 0 OR inham_count > 0)

-- AND
--   (intoken) NOT IN (SELECT token FROM bayes_token);

-- You don't need either of those lines, again because we already
-- filtered

--   delete from bayes_token_tmp;
--   And rather than deleting all of the entries just
 DROP TABLE bayes_token_tmp;


   RETURN;
 END;
 ' LANGUAGE 'plpgsql';

 CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
  LANGUAGE SQL;

 CREATE OR REPLACE FUNCTION least_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
  LANGUAGE SQL;


So to clarify, here is my finished function:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
 intokenary BYTEA[],
 inspam_count INTEGER,
 inham_count INTEGER,
 inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
_token := 

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Tom Lane wrote:
 Matthew Schumacher [EMAIL PROTECTED] writes:

  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
_token := intokenary[i];
INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;


  UPDATE
bayes_token
  SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, 1000)
  WHERE
id = inuserid
  AND
(token) IN (SELECT intoken FROM bayes_token_tmp);


 I don't really see why you think that this path is going to lead to
 better performance than where you were before.  Manipulation of the
 temp table is never going to be free, and IN (sub-select) is always
 inherently not fast, and NOT IN (sub-select) is always inherently
 awful.  Throwing a pile of simple queries at the problem is not
 necessarily the wrong way ... especially when you are doing it in
 plpgsql, because you've already eliminated the overhead of network
 round trips and repeated planning of the queries.

So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?

Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?

What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).

I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 10-100200, 20-200200, ... with 658 rows
missing).

I found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).

What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.

Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.
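
For reference, the four query shapes I timed were the following (the same ids
and object_t tables as in the plans pasted at the end of this message):

SELECT id FROM ids WHERE id IN (SELECT id FROM object_t);
SELECT id FROM ids WHERE EXISTS
    (SELECT id FROM object_t o WHERE o.id = ids.id);
SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t);
SELECT id FROM ids WHERE NOT EXISTS
    (SELECT id FROM object_t o WHERE o.id = ids.id);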

Is this generally true, or did I just get lucky on my data?

John
=:-




   regards, tom lane



mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM 
object_t);
   QUERY PLAN
-
 Nested Loop IN Join  (cost=0.00..9851.68 rows=2140 width=4) (actual 
time=0.085..183.889 rows=1351 loops=1)
   ->  Seq Scan on ids  (cost=0.00..31.40 rows=2140 width=4) (actual 
time=0.014..24.032 rows=2009 loops=1)
   ->  Index Scan using object_t_pkey on object_t  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.071..0.071 rows=1 loops=2009)
 Index Cond: ("outer".id = object_t.id)
 Total runtime: 184.823 ms
(5 rows)

Time: 186.931 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual 
time=0.086..165.053 rows=1351 loops=1)
   Filter: (subplan)
   SubPlan
 ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.025..0.025 rows=1 loops=2009)
   Index Cond: (id = $0)
 Total runtime: 165.995 ms
(6 rows)

Time: 167.795 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM 
object_t);
QUERY PLAN
---
 Seq Scan on ids  (cost=36410.51..36447.26 rows=1070 width=4) (actual 
time=4168.247..4172.080 rows=658 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
 ->  Seq Scan on object_t  (cost=0.00..34381.81 rows=811481 width=4) 
(actual time=0.044..2464.296 rows=811481 loops=1)
 Total runtime: 4210.784 ms
(5 rows)

Time: 4212.276 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual 

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 Matthew Schumacher wrote:

Tom Lane wrote:



I don't really see why you think that this path is going to lead to
better performance than where you were before.  Manipulation of the
temp table is never going to be free, and IN (sub-select) is always
inherently not fast, and NOT IN (sub-select) is always inherently
awful.  Throwing a pile of simple queries at the problem is not
necessarily the wrong way ... especially when you are doing it in
plpgsql, because you've already eliminated the overhead of network
round trips and repeated planning of the queries.

 regards, tom lane


The reason why I think this may be faster is because I would avoid
running an update on data that needs to be inserted which saves
searching through the table for a matching token.

Perhaps I should do the insert first, then drop those tokens from the
temp table, then do my updates in a loop.

I'll have to do some benchmarking...

schu


 Tom, I think you're right, whenever I do a NOT IN it does a full table
 scan against bayes_token and since that table is going to get very big
 doing the simple query in a loop that uses an index seems a bit faster.

 John, thanks for your help, it was worth a try, but it looks like the
 looping is just faster.

 Here is what I have so far in case anyone else has ideas before I
 abandon it:

Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using EXECUTE which worked, but it made
the function horribly slow. So I don't recommend it.
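
For the curious, the EXECUTE version looked roughly like this (the names here
are just placeholders); since every statement that touches the temp table is
re-planned on each call, it ended up much slower:

CREATE OR REPLACE FUNCTION demo_with_execute(val INTEGER)
RETURNS VOID AS '
BEGIN
  -- dynamic SQL keeps the temp table''s OID out of the saved plan
  EXECUTE ''CREATE TEMP TABLE demo_tmp (v integer)'';
  EXECUTE ''INSERT INTO demo_tmp VALUES ('' || val::text || '')'';
  -- everything else that touches demo_tmp has to go through EXECUTE too
  EXECUTE ''DROP TABLE demo_tmp'';
  RETURN;
END;
' LANGUAGE 'plpgsql';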

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:-



 CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
  intokenary BYTEA[],
  inspam_count INTEGER,
  inham_count INTEGER,
  inatime INTEGER)
 RETURNS VOID AS '
 DECLARE
   _token BYTEA;
 BEGIN

   UPDATE
 bayes_token
   SET
 spam_count = greatest_int(spam_count + inspam_count, 0),
 ham_count = greatest_int(ham_count + inham_count , 0),
 atime = greatest_int(atime, inatime)
   WHERE
 id = inuserid
   AND
 (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));

   UPDATE
 bayes_vars
   SET
 token_count = token_count + (
   SELECT
 count(bayes_token_tmp)
   FROM
 bayes_token_tmp(intokenary)
   WHERE
 bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
 newest_token_age = greatest_int(newest_token_age, inatime),
 oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE
 id = inuserid;

   INSERT INTO
 bayes_token
 SELECT
   inuserid,
   bayes_token_tmp,
   inspam_count,
   inham_count,
   inatime
 FROM
   bayes_token_tmp(intokenary)
 WHERE
   (inspam_count > 0 OR inham_count > 0)
 AND
   (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

   RETURN;
 END;
 ' LANGUAGE 'plpgsql';

 CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
 SETOF bytea AS
 '
 BEGIN
   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
   LOOP
 return next intokenary[i];
   END LOOP;
   RETURN;
 end
 '
 language 'plpgsql';

 CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
  LANGUAGE SQL;

 CREATE OR REPLACE FUNCTION least_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
  LANGUAGE SQL;







Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 John A Meinel wrote:


Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using EXECUTE which worked, but it made
the function horribly slow. So I don't recommend it.

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:-


 It is the final function.  It doesn't need a loop because of the
 bayes_token_tmp function I added.  The array is passed to it and it
 returns a record set so I can work off of it like it's a table.  So the
 function works the same way it did before, but instead of using SELECT
 intoken from TEMPTABLE, you use SELECT bayes_token_tmp from
 bayes_token_tmp(intokenary).

 I think this is more efficient than the create table overhead,
 especially because the incoming record set won't be too big.

 Thanks,

 schu



Well, I would at least recommend that you replace the
WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token)
with a
WHERE NOT EXISTS (SELECT token FROM bayes_token WHERE
token=bayes_token_tmp)

You might try experimenting with the differences, but on my system the
NOT IN has to do a full sequential scan on bayes_token and load all
entries into a list, while NOT EXISTS can do effectively a nested loop.

The nested loop requires that there is an index on bayes_token(token),
but I'm pretty sure there is anyway.
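
Spelled out, the INSERT at the end of your function would then look something
like this (keeping the same single-column token check):

   INSERT INTO
     bayes_token
     SELECT
       inuserid,
       bayes_token_tmp,
       inspam_count,
       inham_count,
       inatime
     FROM
       bayes_token_tmp(intokenary)
     WHERE
       (inspam_count > 0 OR inham_count > 0)
     AND
       NOT EXISTS (SELECT token FROM bayes_token
                    WHERE token = bayes_token_tmp);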

Again, in my testing, it was a difference of 4200ms versus 180ms. (800k
rows in my big table, 2k in the temp one)

John
=:-





Re: [PERFORM] Is There A Windows Version of Performance Tuning Documents?

2005-08-03 Thread John A Meinel
Lane Van Ingen wrote:
 I have in my possession some performance tuning documents authored by Bruce
 Momjian, Josh Berkus, and others. They give good information on utilities to
 use (like ipcs, sar, vmstat, etc) to evaluate disk, memory, etc. performance
 on Unix-based systems.

 Problem is, I have applications running on Windows 2003, and have worked
 mostly on Unix before. Was wondering if anyone knows where there might be a
 Windows performance document that tells what to use / where to look in
 Windows for some of this data. I am thinking that I may not seeing what I
 need
 in perfmon or the Windows task manager.

 Want to answer questions like:
   How much memory is being used for disk buffer cache?
   How to I lock shared memory for PostgreSQL (if possible at all)?
   How to determine if SWAP (esp. page-in) activity is hurting me?
   Does Windows use a 'unified buffer cache' or not?
   How do I determine how much space is required to do most of my sorts in
 RAM?


I don't know of any specific documentation. I would mention the
TaskManager as the first place I would look (Ctrl+Shift+Esc, or right
click on the task bar).
You can customize the columns that it shows in the process view, so you
can get an idea if something is paging, how much I/O it is using, etc.

I'm sure there are other better tools, but this one is pretty easy to
get to, and shows quite a bit.

John
=:-





Re: [PERFORM] Planner incorrectly choosing seq scan over index scan

2005-08-01 Thread John Arbash Meinel
Meetesh Karia wrote:
 Hi all,
 
 We're using 8.0.3 and we're seeing a problem where the planner is
 choosing a seq scan and hash join over an index scan.  If I set
 enable_hashjoin to off, then I get the plan I'm expecting and the query
 runs a lot faster.  I've also tried lowering the random page cost (even
 to 1) but the planner still chooses to use the hash join.
 
 Does anyone have any thoughts/suggestions?  I saw that there was a
 thread recently in which the planner wasn't correctly estimating the
 cost for queries using LIMIT.  Is it possible that something similar is
 happening here (perhaps because of the sort) and that the patch Tom
 proposed would fix it?
 
 Thanks.  Here are the various queries and plans:
 
 Normal settings

...

 QUERY PLAN
 Sort  (cost=13430.57..13439.24 rows=3467 width=48) (actual
 time=1390.000..1390.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Merge Join  (cost=9912.07..13226.72 rows=3467 width=48) (actual
 time=1344.000..1375.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..95.000
 rows=50034 loops=1)

This is where the planner is messing up, and mis-estimating the
selectivity. It is expecting to get 280k rows, but only needs to get 50k.
I assume lte_user is the bigger table, and that candidates617004 has
some subset.

Have lte_user and candidates617004 been recently ANALYZEd? All estimates,
except for the expected number of rows from lte_user seem to be okay.

Is user_id the primary key for lte_user?
I'm trying to figure out how you can get 50k rows, by searching a
primary key, against a 3.5k rows. Is user_id only part of the primary
key for lte_user?

Can you give us the output of:
\d lte_user
\d candidates617004

So that we have the description of the tables, and what indexes you have
defined?

Also, if you could describe the table layouts, that would help.
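
In the meantime, it is probably worth re-running ANALYZE and checking what
the planner has recorded for those columns; roughly (column names taken from
your query):

ANALYZE lte_user;
ANALYZE candidates617004;
SELECT tablename, attname, n_distinct, correlation
  FROM pg_stats
 WHERE tablename IN ('lte_user', 'candidates617004')
   AND attname IN ('user_id', 'sourceid', 'targetid');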

John
=:-


 -  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
 time=1156.000..1156.000 rows=3467 loops=1)
   Sort Key: c.sourceid
   -  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
 (actual time=1125.000..1156.000 rows=3467 loops=1)
 Hash Cond: (outer.targetid = inner.user_id)
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
 rows=3467 loops=1)
 -  Hash  (cost=8011.95..8011.95 rows=279395
 width=16) (actual time=1125.000..1125.000 rows=0 loops=1)
   -  Seq Scan on lte_user t 
 (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000
 rows=279395 loops=1)
 Total runtime: 1406.000 ms
 
 enable_hashjoin disabled
 
 QUERY PLAN
 Sort  (cost=14355.37..14364.03 rows=3467 width=48) (actual
 time=391.000..391.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Nested Loop  (cost=271.52..14151.51 rows=3467 width=48) (actual
 time=203.000..359.000 rows=3467 loops=1)
 -  Merge Join  (cost=271.52..3490.83 rows=3467 width=40)
 (actual time=203.000..218.000 rows=3467 loops=1)
   Merge Cond: (outer.user_id = inner.sourceid)
   -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000
 rows=50034 loops=1)
   -  Sort  (cost=271.52..280.19 rows=3467 width=32) (actual
 time=15.000..30.000 rows=3467 loops=1)
 Sort Key: c.sourceid
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
 rows=3467 loops=1)
 -  Index Scan using lte_user_pkey on lte_user t 
 (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1
 loops=3467)
   Index Cond: (outer.targetid = t.user_id)
 Total runtime: 406.000 ms
 
 random_page_cost set to 1.5
 --
 QUERY PLAN
 Sort  (cost=12702.62..12711.29 rows=3467 width=48) (actual
 time=1407.000..1407.000 rows=3467 loops=1)
   Sort Key: c.sourceid, c.targetid
   -  Merge Join  (cost=9912.07..12498.77 rows=3467 width=48) (actual
 time=1391.000..1407.000 rows=3467 loops=1)
 Merge Cond: (outer.user_id = inner.sourceid)
 -  Index Scan using lte_user_pkey on lte_user s 
 (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000
 rows=50034 loops=1)
 -  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
 time=1188.000..1188.000 rows=3467 loops=1)
   Sort Key: c.sourceid
   -  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
 (actual time=1157.000..1188.000 rows=3467 loops=1)
 Hash Cond: (outer.targetid = inner.user_id)
 -  Seq Scan on candidates617004 c 
 (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000
 rows=3467 loops=1)
  

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-31 Thread John Arbash Meinel
John Arbash Meinel wrote:

Matthew Schumacher wrote:

  

All it's doing is trying the update before the insert to get around the
problem of not knowing which is needed.  With only 2-3 of the queries
implemented I'm already back to running about the same speed as the
original SA proc that is going to ship with SA 3.1.0.

All of the queries are using indexes so at this point I'm pretty
convinced that the biggest problem is the sheer number of queries
required to run this proc 200 times for each email (once for each token).

I don't see anything that could be done to make this much faster on the
postgres end, it's looking like the solution is going to involve cutting
down the number of queries some how.

One thing that is still very puzzling to me is why this runs so much
slower when I put the data.sql in a transaction.  Obviously transactions
are acting different when you call a proc a zillion times vs an insert
query.
 



Well, I played with adding a COMMIT;BEGIN; statement to your exact test
every 1000 lines. And this is what I got:
  

Just for reference, I also tested this on my old server, which is a dual
Celeron 450 with 256M ram. FC4 and Postgres 8.0.3
Unmodified:
real54m15.557s
user0m24.328s
sys 0m14.200s

With Transactions every 1000 selects, and vacuum every 5000:
real8m36.528s
user0m16.585s
sys 0m12.569s

With Transactions every 1000 selects, and vacuum every 1:
real7m50.748s
user0m16.183s
sys 0m12.489s

On this machine vacuum is more expensive, since it doesn't have as much ram.

Anyway, on this machine, I see approx 7x improvement. Which I think is
probably going to satisfy your spamassassin needs.
John
=:-

PS Looking forward to having a spamassassin that can utilize my
favorite db. Right now, I'm not using a db backend because it wasn't
worth setting up mysql.

Unmodified:
real17m53.587s
user0m6.204s
sys 0m3.556s

With BEGIN/COMMIT:
real1m53.466s
user0m5.203s
sys 0m3.211s

So I see the potential for improvement almost 10 fold by switching to
transactions. I played with the perl script (and re-implemented it in
python), and for the same data as the perl script, using COPY instead of
INSERT INTO means 5s instead of 33s.

I also played around with adding VACUUM ANALYZE every 10 COMMITS, which
brings the speed to:

real1m41.258s
user0m5.394s
sys 0m3.212s

And doing VACUUM ANALYZE every 5 COMMITS makes it:
real1m46.403s
user0m5.597s
sys 0m3.244s

I'm assuming the slowdown is because of the extra time spent vacuuming.
Overall performance might still be improving, since you wouldn't
actually be inserting all 100k rows at once.
  

...

This is all run on Ubuntu, with postgres 7.4.7, and a completely
unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with
3GB of RAM).

John
=:-
  






Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread John Arbash Meinel
Matthew Schumacher wrote:

All it's doing is trying the update before the insert to get around the
problem of not knowing which is needed.  With only 2-3 of the queries
implemented I'm already back to running about the same speed as the
original SA proc that is going to ship with SA 3.1.0.

All of the queries are using indexes so at this point I'm pretty
convinced that the biggest problem is the sheer number of queries
required to run this proc 200 times for each email (once for each token).

I don't see anything that could be done to make this much faster on the
postgres end, it's looking like the solution is going to involve cutting
down the number of queries some how.

One thing that is still very puzzling to me is why this runs so much
slower when I put the data.sql in a transaction.  Obviously transactions
are acting different when you call a proc a zillion times vs an insert
query.
  

Well, I played with adding a COMMIT;BEGIN; statement to your exact test
every 1000 lines. And this is what I got:

Unmodified:
real17m53.587s
user0m6.204s
sys 0m3.556s

With BEGIN/COMMIT:
real1m53.466s
user0m5.203s
sys 0m3.211s

So I see the potential for improvement almost 10 fold by switching to
transactions. I played with the perl script (and re-implemented it in
python), and for the same data as the perl script, using COPY instead of
INSERT INTO means 5s instead of 33s.

I also played around with adding VACUUM ANALYZE every 10 COMMITS, which
brings the speed to:

real1m41.258s
user0m5.394s
sys 0m3.212s

And doing VACUUM ANALYZE every 5 COMMITS makes it:
real1m46.403s
user0m5.597s
sys 0m3.244s

I'm assuming the slowdown is because of the extra time spent vacuuming.
Overall performance might still be improving, since you wouldn't
actually be inserting all 100k rows at once.


Just to complete the reference, the perl version runs as:
10:44:02 -- START
10:44:35 -- AFTER TEMP LOAD : loaded 120596 records
10:44:39 -- AFTER bayes_token INSERT : inserted 49359 new records into
bayes_token
10:44:41 -- AFTER bayes_vars UPDATE : updated 1 records
10:46:42 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

My python version runs as:
00:22:54 -- START
00:23:00 -- AFTER TEMP LOAD : loaded 120596 records
00:23:03 -- AFTER bayes_token INSERT : inserted 49359 new records into
bayes_token
00:23:06 -- AFTER bayes_vars UPDATE : updated 1 records
00:25:04 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

The python is effectively just a port of the perl code (with many lines
virtually unchanged), and really the only performance difference is that
the initial data load is much faster with a COPY.
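
For what it's worth, the load step in the python port is just a COPY into a
scratch table, roughly shaped like this (the table name, column list, and
file path here are placeholders, not the real schema):

COPY bayes_temp (intoken, inspam_count, inham_count, inatime)
  FROM '/tmp/bayes_tokens.dat';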

This is all run on Ubuntu, with postgres 7.4.7, and a completely
unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with
3GB of RAM).

John
=:-

Anyway, if anyone else has any ideas I'm all ears, but at this point
it's looking like raw query speed is needed for this app and while I
don't care for mysql as a database, it does have the speed going for it.

schu


  






Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-30 Thread John Arbash Meinel
Matthew Schumacher wrote:

Tom Lane wrote:

  

I looked into this a bit.  It seems that the problem when you wrap the
entire insertion series into one transaction is associated with the fact
that the test does so many successive updates of the single row in
bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
49383 dead versions of the one row.)  This is bad enough when it's in
separate transactions, but when it's in one transaction, none of those
dead row versions can be marked fully dead yet --- so for every update
of the row, the unique-key check has to visit every dead version to make
sure it's dead in the context of the current transaction.  This makes
the process O(N^2) in the number of updates per transaction.  Which is
bad enough if you just want to do one transaction per message, but it's
intolerable if you try to wrap the whole bulk-load scenario into one
transaction.

I'm not sure that we can do anything to make this a lot smarter, but
in any case, the real problem is to not do quite so many updates of
bayes_vars.

How constrained are you as to the format of the SQL generated by
SpamAssassin?  In particular, could you convert the commands generated
for a single message into a single statement?  I experimented with
passing all the tokens for a given message as a single bytea array,
as in the attached, and got almost a factor of 4 runtime reduction
on your test case.

BTW, it's possible that this is all just a startup-transient problem:
once the database has been reasonably well populated, one would expect
new tokens to be added infrequently, and so the number of updates to
bayes_vars ought to drop off.

  regards, tom lane




The spamassassins bayes code calls the _put_token method in the storage
module a loop.  This means that the storage module isn't called once per
message, but once per token.
  

Well, putting everything into a transaction per email might make your
pain go away.
If you saw the email I just sent, I modified your data.sql file to add a
COMMIT;BEGIN every 1000 selects, and I saw a performance jump from 18
minutes down to less than 2 minutes. Heck, on my machine, the advanced
perl version takes more than 2 minutes to run. It is actually slower
than the data.sql with commit statements.

I'll look into modifying it so that the bayes code passes a hash of
tokens to the storage module where they can loop or in the case of the
pgsql module pass an array of tokens to a procedure where we loop and
use temp tables to make this much more efficient.
  

Well, you could do that. Or you could just have the bayes code issue
BEGIN; when it starts processing an email, and a COMMIT; when it
finishes. From my testing, you will see an enormous speed improvement.
(And you might consider including a fairly frequent VACUUM ANALYZE)
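
In other words, the stream the bayes code sends down the connection would
look something like this (the put_token call and its arguments are just
placeholders for whatever the storage module actually issues per token):

BEGIN;
SELECT put_token(1, decode('deadbeef', 'hex'), 1, 0, 1116310696);
SELECT put_token(1, decode('cafebabe', 'hex'), 1, 0, 1116310696);
COMMIT;
-- and every so often:
VACUUM ANALYZE bayes_token;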

I don't have much time this weekend to toss at this, but will be looking
at it on Monday.
  

Good luck,
John
=:-

Thanks,

schu


  






Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-07-29 Thread John Arbash Meinel
Josh Berkus wrote:

Dennis,

  

  EXCEPTION WHEN unique_violation THEN



I seem to remember that catching an exception in a PL/pgSQL procedure was a 
large performance cost.   It'd be better to do UPDATE ... IF NOT FOUND.

  

Actually, he was doing an implicit UPDATE IF NOT FOUND in that he was doing:

UPDATE

IF found THEN return;

INSERT
EXCEPT
...

So really, the exception should never be triggered.
John
=:-





Re: [PERFORM] Two queries are better than one?

2005-07-28 Thread John A Meinel
Karim Nassar wrote:
 I ran into a situation today maintaining someone else's code where the
 sum time running 2 queries seems to be faster than 1. The original code
 was split into two queries. I thought about joining them, but
 considering the intelligence of my predecessor, I wanted to test it.

 The question is, which technique is really faster? Is there some hidden
 setup cost I don't see with explain analyze?

Yes, the time it takes your user code to parse the result, and create
the new query. :)

It does seem like you are taking an extra 0.1ms for the combined query,
but that means you don't have another round trip to the database. So
that would mean one less context switch, and you don't need to know what
the cli_code is before you can get the cli_name.

I would guess the overhead is the time for postgres to parse out the
text, place another index query, and then combine the rows. It seems
like this shouldn't take 0.1ms, but then again, that isn't very long.

Also, did you run it *lots* of times to make sure that this isn't just
noise?

John
=:-


 Postgres 7.4.7, Redhat AES 3

 Each query individually:

 test= explain analyze
 test- select * from order  WHERE ord_batch='343B' AND ord_id='12-645';
  QUERY PLAN
 
  Index Scan using order_pkey on order  (cost=0.00..6.02 rows=1 width=486) 
 (actual time=0.063..0.066 rows=1 loops=1)
Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = '12-645'::bpchar))
  Total runtime: 0.172 ms
 (3 rows)


 test= explain analyze
 test- select cli_name from client where cli_code='1837';
  QUERY PLAN
 -
  Index Scan using client_pkey on client  (cost=0.00..5.98 rows=2 width=39) 
 (actual time=0.043..0.047 rows=1 loops=1)
Index Cond: (cli_code = '1837'::bpchar)
  Total runtime: 0.112 ms
 (3 rows)

 Joined:

 test= explain analyze
 test-SELECT cli_name,order.*
 test-   FROM order
 test-   JOIN client ON (ord_client = cli_code)
 test-  WHERE ord_batch='343B' AND ord_id='12-645';
 QUERY PLAN
 --
  Nested Loop  (cost=0.00..12.00 rows=2 width=525) (actual time=0.120..0.128 
 rows=1 loops=1)
-  Index Scan using order_pkey on order  (cost=0.00..6.02 rows=1 
 width=486) (actual time=0.064..0.066 rows=1 loops=1)
  Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = 
 '12-645'::bpchar))
-  Index Scan using client_pkey on client  (cost=0.00..5.98 rows=1 
 width=51) (actual time=0.023..0.026 rows=1 loops=1)
  Index Cond: (outer.ord_client = client.cli_code)
  Total runtime: 0.328 ms
 (6 rows)







[PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel
I saw a review of a relatively inexpensive RAM disk over at 
anandtech.com, the Gigabyte i-RAM

http://www.anandtech.com/storage/showdoc.aspx?i=2480

Basically, it is a PCI card, which takes standard DDR RAM, and has a 
SATA port on it, so that to the system, it looks like a normal SATA drive.


The card costs about $100-150, and you fill it with your own ram, so for 
a 4GB (max size) disk, it costs around $500. Looking for solid state 
storage devices, the cheapest I found was around $5k for 2GB.


Gigabyte claims that the battery backup can last up to 16h, which seems 
decent, if not really long (the $5k solution has a built-in harddrive so 
that if the power goes out, it uses the battery power to copy the 
ramdisk onto the harddrive for more permanent storage).


Anyway, would something like this be reasonable as a drive for storing 
pg_xlog? With 4GB you could have as many as 256 checkpoint segments.


I'm a little leary as it is definitely a version 1.0 product (it is 
still using an FPGA as the controller, so they were obviously pushing to 
get the card into production).


But it seems like this might be a decent way to improve insert 
performance, without setting fsync=false.


Probably it should see some serious testing (as in power spikes/pulled 
plugs, etc). I know the article made some claim that if you actually 
pull out the card it goes into high consumption mode which is somehow 
greater than if you leave it in the slot with the power off. Which to me 
seems like a lot of bull, and really means the 16h is only under 
best-case circumstances. But even 1-2h is sufficient to handle a simple 
power outage.


And if you had a UPS with detection of power failure, you could always 
sync the ramdisk to a local partition before the power goes out. Though 
you could do that with a normal in-memory ramdisk (tmpfs) without having 
to buy the card. Though it does give you up-to an extra 4GB of ram, for 
machines which have already maxed out their slots.


Anyway, I thought I would mention it to the list, to see if anyone else 
has heard of it, or has any thoughts on the matter. I'm sure there are 
some people who are using more expensive ram disks, maybe they have some 
ideas about what this device is missing. (other than costing about 
1/10th the price)


John
=:-





Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Dan Harris wrote:
I am working on a process that will be inserting tens of millions of rows 
and need this to be as quick as possible.


The catch is that for each row I could potentially insert, I need to  
look and see if the relationship is already there  to prevent  multiple 
entries.  Currently I am doing a SELECT before doing the  INSERT, but I 
recognize the speed penalty in doing to operations.  I  wonder if there 
is some way I can say insert this record, only if it  doesn't exist 
already.  To see if it exists, I would need to compare  3 fields 
instead of just enforcing a primary key.


Even if this could be a small increase per record, even a few percent  
faster compounded over the whole load could be a significant reduction.


Thanks for any ideas you might have.

-Dan



You could insert all of your data into a temporary table, and then do:

INSERT INTO final_table SELECT * FROM temp_table WHERE NOT EXISTS 
(SELECT 1 FROM final_table WHERE final_table.id = temp_table.id
   AND final_table.path = temp_table.path AND final_table.y = temp_table.y);


Or you could load it into the temporary table, and then:
DELETE FROM temp_table WHERE EXISTS (SELECT FROM final_table WHERE id...);

And then do a plain INSERT INTO.

I can't say what the specific performance increases would be, but 
temp_table could certainly be an actual TEMP table (meaning it only 
exists during the connection), and you could easily do a COPY into that 
table to load it up quickly, without having to check any constraints.
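
The load step itself would just be something like this (the column list is
only an example, matching the 3-field comparison you described):

CREATE TEMP TABLE temp_table (id integer, path text, y integer);
COPY temp_table FROM '/tmp/new_rows.dat';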


Just a thought,
John
=:-





Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel

Luke Lonergan wrote:

Yup - interesting and very niche product - it seems like its only obvious
application is for the Postgresql WAL problem :-)


Well, you could do it for any journaled system (XFS, JFS, ext3, reiserfs).

But yes, it seems specifically designed for a battery backed journal. 
Though the article reviews it for very different purposes.


Though it was a Windows review, and I don't know of any way to make NTFS 
use a separate device for a journal. (Though I expect it is possible 
somehow).


John
=:-




The real differentiator is the battery backup part.  Otherwise, the
filesystem caching is more effective, so put the RAM on the motherboard.

- Luke








Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel

Alex Turner wrote:

Also seems pretty silly to put it on a regular SATA connection, when
all that can manage is 150MB/sec.  If you made it connection directly
to 66/64-bit PCI then it could actualy _use_ the speed of the RAM, not
to mention PCI-X.

Alex Turner
NetEconomist



Well, the whole point is to have it look like a normal SATA drive, even 
to the point that you can boot off of it, without having to load a 
single driver.


Now, you could offer that you could recreate a SATA controller on the 
card, with a SATA bios, etc. And then you could get the increased speed, 
and still have bootable functionality.


But it is a version 1.0 of a product, and I'm sure they tried to make it 
as cheap as possible (and within their own capabilities.)


John
=:-





Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Matthew Nuzum wrote:

On 7/26/05, Dan Harris [EMAIL PROTECTED] wrote:


I am working on a process that will be inserting tens of millions of rows
and need this to be as quick as possible.

The catch is that for each row I could potentially insert, I need to
look and see if the relationship is already there  to prevent
multiple entries.  Currently I am doing a SELECT before doing the
INSERT, but I recognize the speed penalty in doing two operations.  I
wonder if there is some way I can say insert this record, only if it
doesn't exist already.  To see if it exists, I would need to compare
3 fields instead of just enforcing a primary key.



I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewriting it, first into C# then into
Python I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of ram. ;-)

To limit the ram, I wrote a version of the python code that uses gdbm
files instead of Dictionaries. This requires a newer version of Python
(to allow a gdbm db to work just like a dictionary) but makes life
easier in case someone is using my software on a lower end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.



Did you ever try the temp table approach? You could:

-- COPY all records into temp_table, with ip_id left NULL
-- Get any entries which already exist
UPDATE temp_table SET ip_id =
(SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE EXISTS (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add);
-- Create new entries
INSERT INTO ipaddress(add) SELECT add FROM temp_table
  WHERE ip_id IS NULL;
-- Update the rest
UPDATE temp_table SET ip_id =
(SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE ip_id IS NULL AND
EXISTS (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add);

This would let the database do all of the updating work in bulk on its 
side, rather than you pulling all the data out and doing it locally.


An alternative would be something like:

CREATE TEMP TABLE new_ids (address text, ip_id int);
-- COPY all potentially new addresses into that table
-- Delete all entries which already exist
DELETE FROM new_ids WHERE EXISTS
(SELECT ip_id FROM ipaddresses
  WHERE add=new_ids.address);
-- Now create the new entries
INSERT INTO ipaddresses(add) SELECT address FROM new_ids;

-- At this point you are guaranteed to have all addresses existing in
-- the database

If you then insert your full data into the final table, only leave the 
ip_id column as null. Then if you have a partial index where ip_id is 
NULL, you could use the command:


UPDATE final_table SET ip_id =
(SELECT ip_id FROM ipaddresses WHERE add=final_table.add)
WHERE ip_id IS NULL;

You could also do this in a temporary table, before bulk inserting into 
the final table.


I don't know what you have tried, but I know that for Dan, he easily has 
 36M rows. So I don't think he wants to pull that locally and create a 
in-memory hash just to insert 100 rows or so.


Also, for your situation, if you do keep a local cache, you could 
certainly save the cache between runs, and use a temp table to determine 
what new ids you need to add to it. Then you wouldn't have to pull the 
complete set each time. You just pull new values for entries you haven't 
added yet.


John
=:-




Re: [PERFORM] COPY insert performance

2005-07-25 Thread John A Meinel
Chris Isaacson wrote:
 I need COPY via libpqxx to insert millions of rows into two tables.  One
 table has roughly half as many rows and requires half the storage.  In
 production, the largest table will grow by ~30M rows/day.  To test the
 COPY performance I split my transactions into 10,000 rows.  I insert
 roughly 5000 rows into table A for every 10,000 rows into table B.

 Table A has one unique index:

 order_main_pk UNIQUE, btree (cl_ord_id)

 Table B has 1 unique index and 2 non-unique indexes:

 order_transition_pk UNIQUE, btree (collating_seq)
 order_transition_ak2 btree (orig_cl_ord_id)
 order_transition_ak3 btree (exec_id)

Do you have any foreign key references?
If you are creating a table for the first time (or loading a large
fraction of the data), it is common to drop the indexes and foreign keys
first, and then insert/copy, and then re-create them afterwards.
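
For example, for your second table that might look something like the
following (I am guessing the table name order_transition from the index
names, so treat the names and the path as placeholders):

DROP INDEX order_transition_ak2;
DROP INDEX order_transition_ak3;
COPY order_transition FROM '/path/to/batch.dat';
CREATE INDEX order_transition_ak2 ON order_transition (orig_cl_ord_id);
CREATE INDEX order_transition_ak3 ON order_transition (exec_id);
ANALYZE order_transition;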

Is InnoDB the backend with referential integrity, and true transaction
support? I believe the default backend does not support either (so it is
cheating to give you speed, which may be just fine for your needs,
especially since you are willing to run fsync=false).

I think moving pg_xlog to a dedicated drive (set of drives) could help
your performance. As well as increasing checkpoint_segments.

I don't know if you gain much by changing the bg_writer settings, if you
are streaming everything in at once, you probably want to have it
written out right away. My understanding is that bg_writer settings are
for the case where you have mixed read and writes going on at the same
time, and you want to make sure that the reads have time to execute (ie
the writes are not saturating your IO).

Also, is any of this tested under load? Having a separate process issue
queries while you are loading in data. Traditionally MySQL is faster
with a single process inserting/querying for data, but once you have
multiple processes hitting it at the same time, it's performance
degrades much faster than postgres.

You also seem to be giving MySQL 512M of ram to work with, while only
giving 2M/200M to postgres. (re)creating indexes uses
maintenance_work_mem, but updating indexes could easily use work_mem.
You may be RAM starved.

John
=:-



 My testing environment is as follows:
 -Postgresql 8.0.1
 -libpqxx 2.5.0
 -Linux 2.6.11.4-21.7-smp x86_64
 -Dual Opteron 246
 -System disk (postgres data resides on this SCSI disk) -  Seagate
 (ST373453LC) - 15K, 73 GB
 (http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,549,00.html)
 -2nd logical disk - 10K, 36GB IBM SCSI (IC35L036UCDY10-0) - WAL reside
 on this disk
 -NO RAID

 *PostgreSQL*
 Here are the results of copying in 10M rows as fast as possible:
 (10K/transaction)
 Total Time:1129.556 s
 Rows/sec: 9899.922
 Transaction > 1.2s   225
 Transaction > 1.5s    77
 Transaction > 2.0s     4
 Max Transaction   2.325s

 **MySQL**
 **I ran a similar test with MySQL 4.1.10a (InnoDB) which produced these
 results: (I used MySQL's INSERT INTO x VALUES
 (1,2,3)(4,5,6)(...,...,...) syntax) (10K/transaction)
 Total Time: 860.000 s
 Rows/sec:11627.91
 Transaction > 1.2s     0
 Transaction > 1.5s     0
 Transaction > 2.0s     0
 Max Transaction   1.175s

 Considering the configurations shown below, can anyone offer advice to
 close the 15% gap and the much worse variability I'm experiencing.  Thanks

 My *postgresql.conf* has the following non-default values:
 # -
 # PostgreSQL configuration file
 # -
 listen_addresses = '*' # what IP interface(s) to listen on;
 max_connections = 100
 #---
 # RESOURCE USAGE (except WAL)
 #---
 shared_buffers = 65536  # min 16, at least max_connections*2, 8KB each
 work_mem = 2048   # min 64, size in KB
 maintenance_work_mem = 204800 # min 1024, size in KB
 max_fsm_pages = 225  # min max_fsm_relations*16, 6 bytes each
 bgwriter_delay = 200  # 10-1 milliseconds between rounds
 bgwriter_percent = 10  # 0-100% of dirty buffers in each round
 bgwriter_maxpages = 1000 # 0-1000 buffers max per round
 #---
 # WRITE AHEAD LOG
 #---
 fsync = false   # turns forced synchronization on or off
 wal_buffers = 64  # min 4, 8KB each
 checkpoint_segments = 40 # in logfile segments, min 1, 16MB each
 checkpoint_timeout = 600 # range 30-3600, in seconds
 #---
 # QUERY TUNING
 #---
 effective_cache_size = 65536 # typically 8KB each
 random_page_cost = 2  # units are one sequential page fetch cost
 #---
 # ERROR 

Re: [PERFORM] Vacuum Full Analyze taking so long

2005-07-25 Thread John A Meinel
Tomeh, Husam wrote:

 Nothing was running except the job. The server did not look stressed out
 looking at top and vmstat. We have seen slower query performance when
 performing load tests, so I run the re-index on all application indexes
 and then issue a full vacuum. I ran the same thing on a staging server
 and it took less than 12 hours. Is there a possibility the DB pages are
 corrupted? Is there a command to verify that? (In Oracle, there's a
 dbverify command that checks for corruption on the data files level).

 The other question I have: what would be the proper approach to rebuild
 indexes? I re-index and then run vacuum/analyze. Should I not use the
 re-index approach, and instead, drop the indexes, vacuum the tables, and
 then create the indexes, then run analyze on tables and indexes??

I *think* if you are planning on dropping the indexes anyway, just drop
them, VACUUM ANALYZE, and then recreate them, I don't think you have to
re-analyze after you have recreated them.
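
In other words, something along these lines (table and index names here are
made up):

DROP INDEX app_table_col_idx;
VACUUM ANALYZE app_table;
CREATE INDEX app_table_col_idx ON app_table (some_col);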

John
=:-


 Thanks,







Re: [PERFORM] Optimizer seems to be way off, why?

2005-07-20 Thread John A Meinel
Dirk Lutzebäck wrote:
 Richard Huxton wrote:
 
 Dirk Lutzebäck wrote:

 Hi,

 I do not under stand the following explain output (pgsql 8.0.3):

 explain analyze
 select b.e from b, d
 where b.r=516081780 and b.c=513652057 and b.e=d.e;

 QUERY PLAN
 

 Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
 time=0.213..2926.845 rows=324503 loops=1)
   -  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
 width=4) (actual time=0.104..17.418 rows=3293 loops=1)
 Index Cond: (r = 516081780::oid)
 Filter: (c = 513652057::oid)
   -  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
 width=4) (actual time=0.009..0.380 rows=99 loops=3293)
 Index Cond: (outer.e = d.e)
 Total runtime: 3638.783 ms
 (7 rows)

 Why is the rows estimate for b_index and the nested loop 1? It is
 actually 3293 and 324503.



 I'm guessing (and that's all it is) that b.r and b.c have a higher
 correlation than the planner is expecting. That is, it expects the
 b.c=... to reduce the number of matching rows much more than it is.

 Try a query just on WHERE b.r=516081780 and see if it gets the
 estimate right for that.

 If it's a common query, it might be worth an index on (r,c)

 -- 
   Richard Huxton
   Archonet Ltd

 
 Thanks Richard, dropping the join for b.c now gives better estimates (it
 also uses a different index now) although not accurate (off by factor
 10). This query is embedded in a larger query which now got a 1000 times
 speed up (!) because I can drop b.c because it is redundant.

Well, part of the problem is that the poorly estimated row is not 'b.e'
but 'b.r', it expects to only find one row that matches, and instead
finds 3293 rows.

Now, that *could* be because it mis-estimates the selectivity of b.r and b.c.

It actually estimated the join with d approximately correctly. (It
thought that for each row it would find 140, and it averaged 99).

 
 Though, why can't the planner see this correlation? I think somebody
 said the planner does not know about multiple column correlations, does it?

The planner does not maintain cross-column statistics, so you are
correct. I believe it assumes distributions are independent. So that if
r=R is 10% selective, and c=C is 20% selective, the total
selectivity of r=R AND c=C is 2%. I could be wrong on this, but I
think it is approximately correct.

Now if you created the index on b(r,c), then it would have a much better
idea of how selective that would be. At the very least, it could index
on (r,c) rather than indexing on (r) and filtering by (c).

Also, if you have very skewed data (where you have 1 value 100k times,
and 50 values only 10 times each), the planner can overestimate the low
values, and underestimate the high one. (It uses random sampling, so it
kind of depends where the entries are.)

Have you tried increasing the statistics on b.r and/or b.c? Do you have
an index on b.c or just b.r?
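
If not, both are cheap to try; something like this (the statistics target
and index name are just examples):

ALTER TABLE b ALTER COLUMN r SET STATISTICS 100;
ALTER TABLE b ALTER COLUMN c SET STATISTICS 100;
ANALYZE b;
CREATE INDEX b_r_c_idx ON b (r, c);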

To see what the planner thinks, you might try:

EXPLAIN ANALYZE
select count(*) from b where r=516081780;

That would tell you how selective the planner thinks the r= condition is.
 
 Regards,
 
 Dirk
 
John
=:-





Re: [PERFORM] Looking for tips

2005-07-19 Thread John A Meinel

Oliver Crosby wrote:

Hi,
I'm running Postgres 7.4.6 on a dedicated server with about 1.5gigs of ram.
Running scripts locally, it takes about 1.5x longer than mysql, and the
load on the server is only about 21%.
I upped the sort_mem to 8192 (kB), and shared_buffers and
effective_cache_size to 65536 (512MB), but neither the timing nor the
server load have changed at all. FYI, I'm going to be working on data
sets in the order of GB.

I think I've gone about as far as I can with google.. can anybody give
me some advice on how to improve the raw performance before I start
looking at code changes?

Thanks in advance.


First, try to post in plain-text rather than html, it is easier to read. :)

Second, if you can determine what queries are running slow, post the
result of EXPLAIN ANALYZE on them, and we can try to help you tune
them/postgres to better effect.

Just a blanket question like this is hard to answer. Your new
shared_buffers are probably *way* too high. They should be at most
around 10% of ram. Since this is a dedicated server effective_cache_size
should be probably ~75% of ram, or close to 1.2GB.

There are quite a few things that you can tweak, so the more information
you can give, the more we can help.

For instance, if you are loading a lot of data into a table, if
possible, you want to use COPY not INSERT.
If you have a lot of indexes and are loading a significant portion, it
is sometimes faster to drop the indexes, COPY the data in, and then
rebuild the indexes.

For tables with a lot of inserts/updates, you need to watch out for
foreign key constraints. (Generally, you will want an index on both
sides of the foreign key. One is required, the other is recommended for
faster update/deletes).
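
A minimal sketch of what that looks like, with made-up parent/child tables
(the referenced column has to be unique or a primary key already; the index
on the referencing column is the one that is easy to forget):

CREATE TABLE parent (id integer PRIMARY KEY);
CREATE TABLE child (
    id        integer PRIMARY KEY,
    parent_id integer REFERENCES parent (id)
);
CREATE INDEX child_parent_id_idx ON child (parent_id);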

John
=:-





Re: [PERFORM] slow joining very large table to smaller ones

2005-07-15 Thread John A Meinel
Dan Harris wrote:

 On Jul 14, 2005, at 10:12 PM, John A Meinel wrote:


 My biggest question is why the planner thinks the Nested Loop would be
 so expensive.
 Have you tuned any of the parameters? It seems like something is  out of
 whack. (cpu_tuple_cost, random_page_cost, etc...)


 here's some of my postgresql.conf.  Feel free to blast me if I did
 something idiotic here.

 shared_buffers = 5
 effective_cache_size = 1348000
 random_page_cost = 3
 work_mem = 512000

Unless you are the only person connecting to this database, your
work_mem is very high. And if you haven't modified maintenance_work_mem
it is probably very low. work_mem might be causing postgres to think it
can fit all of a merge into ram, making it faster, I can't say for sure.

 max_fsm_pages = 8

This seems high, but it depends how many updates/deletes you get
in-between vacuums. It may not be too excessive. VACUUM [FULL] VERBOSE
replies with how many free pages are left, if you didn't use that
already for tuning. Though it should be tuned based on a steady state
situation. Not a one time test.

 log_min_duration_statement = 6
 fsync = true ( not sure if I'm daring enough to run without this )
 wal_buffers = 1000
 checkpoint_segments = 64
 checkpoint_timeout = 3000


These seem fine to me.

Can you include the output of EXPLAIN SELECT both with and without SET
join_collapse_limit? Since your tables have grown, I can't compare the
estimated number of rows, and costs very well.

EXPLAIN without ANALYZE is fine, since I am wondering what the planner
is thinking things cost.

John
=:-


 # FOR PG_AUTOVACUUM --#
 stats_command_string = true
 stats_row_level = true





Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:
 I'm trying to improve the speed of this query:

 explain select recordtext from eventactivity inner join ( select
 incidentid from k_r where id = 94 ) a using ( incidentid ) inner join  (
 select incidentid from k_b where id = 107 ) b using ( incidentid );

You might try giving it a little bit more freedom with:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107
-- AND k_r.incidentid = k_b.incidentid
;

I'm pretty sure that would give identical results, just let the planner
have a little bit more freedom about how it does it.
Also the last line is commented out, because I think it is redundant.

You might also try:
EXPLAIN ANALYZE
SELECT recordtext
  FROM eventactivity JOIN k_r USING (incidentid)
  JOIN k_b USING (incidentid)
 WHERE k_r.id = 94
   AND k_b.id = 107
;

Also, if possible give us the EXPLAIN ANALYZE so that we know if the
planner is making accurate estimates. (You might send an EXPLAIN while
waiting for the EXPLAIN ANALYZE to finish)

You can also try disabling merge joins, and see how that changes things.

   QUERY PLAN
 
 --
 Merge Join  (cost=2747.29..4249364.96 rows=11968693 width=35)
Merge Cond: ((outer.incidentid)::text = inner.?column2?)
-  Merge Join  (cost=1349.56..4230052.73 rows=4413563 width=117)
  Merge Cond: ((outer.incidentid)::text = inner.?column2?)
  -  Index Scan using eventactivity1 on eventactivity
 (cost=0.00..4051200.28 rows=44519781 width=49)
  -  Sort  (cost=1349.56..1350.85 rows=517 width=68)
Sort Key: (k_b.incidentid)::text
-  Index Scan using k_b_idx on k_b   (cost=0.00..1326.26
 rows=517 width=68)
  Index Cond: (id = 107)
-  Sort  (cost=1397.73..1399.09 rows=542 width=68)
  Sort Key: (k_r.incidentid)::text
  -  Index Scan using k_r_idx on k_r  (cost=0.00..1373.12
 rows=542 width=68)
Index Cond: (id = 94)
 (13 rows)


 There are many millions of rows in eventactivity.  There are a few
 ten-thousand rows in k_r and k_b.  There is an index on 'incidentid'  in
 all three tables.  There should only be less than 100 rows matched  in
 k_r and k_b total.  That part on its own is very very fast.  But,  it
 should have those 100 or so incidentids extracted in under a  second and
 then go into eventactivity AFTER doing that.  At least,  that's my
 intention to make this fast.

Well, postgres is estimating around 500 rows each, is that way off? Try
just doing:
EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

And see if postgres estimates the number of rows properly.

I assume you have recently VACUUM ANALYZEd, which means you might need
to update the statistics target (ALTER TABLE k_b ALTER COLUMN
incidentid SET STATISTICS 100). The default is IIRC 10; it ranges from
1-1000, and higher is more accurate, but makes ANALYZE slower.
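
Concretely, that would be something like (the 100 is just the example
target from above, not a tuned value):

ALTER TABLE k_b ALTER COLUMN incidentid SET STATISTICS 100;
ALTER TABLE k_r ALTER COLUMN incidentid SET STATISTICS 100;
ANALYZE k_b;
ANALYZE k_r;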


 Right now, it looks like pg is trying to sort the entire  eventactivity
 table for the merge join which is taking several  minutes to do.  Can I
 rephrase this so that it does the searching  through k_r and k_b FIRST
 and then go into eventactivity using the  index on incidentid?  It seems
 like that shouldn't be too hard to  make fast but my SQL query skills
 are only average.

To me, it looks like it is doing an index scan (on k_b.id) through k_b
first, sorting the results by incidentid, then merge joining that with
eventactivity.

I'm guessing you actually want it to merge k_b and k_r to get extra
selectivity before joining against eventactivity.
I think my alternate forms would let postgres realize this. But if not,
you could try:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
 JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
WHERE k_r.id = 94 AND k_b.id = 107) AS sub
USING (incidentid);

I don't know how selective your keys are, but one of these queries
should probably structure it better for the planner. It depends a lot on
how selective your query is.
If you have 100M rows, the above query looks like it expects k_r to
restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
should be a seq scan (> 10% of the rows => seq scan). But if you are
saying the selectivity is mis-estimated it could be different.

John
=:-

 Thanks
 -Dan







Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:

 On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:

...
Did you try doing this to see how good the planners selectivity
estimates are?

 Well, postgres is estimating around 500 rows each, is that way off?  Try
 just doing:
 EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
 EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

These should be fast queries.

John
=:-




Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:

 On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:



 You might try giving it a little bit more freedom with:

 EXPLAIN ANALYZE
 SELECT recordtext FROM eventactivity, k_r, k_b
  WHERE eventactivity.incidentid = k_r.incidentid
AND eventactivity.incidentid = k_b.incidentid
AND k_r.id = 94
AND k_b.id = 107
 -- AND k_r.incidentid = k_b.incidentid
 ;

 I'm pretty sure that would give identical results, just let the  planner
 have a little bit more freedom about how it does it.
 Also the last line is commented out, because I think it is redundant.


 Ok, I tried this one.  My ssh keeps getting cut off by a router
 somewhere between me and the server due to inactivity timeouts, so  all
 I know is that both the select and explain analyze are taking  over an
 hour to run.  Here's the explain select for that one, since  that's the
 best I can get.

 explain select recordtext from eventactivity,k_r,k_b where
 eventactivity.incidentid = k_r.incidentid and  eventactivity.incidentid
 = k_b.incidentid and k_r.id = 94 and k_b.id  = 107;
   QUERY PLAN
 
 --
 Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
Merge Cond: ((outer.incidentid)::text = inner.?column2?)
-  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
  Merge Cond: ((outer.incidentid)::text = inner.?column2?)
  -  Index Scan using eventactivity1 on eventactivity
 (cost=0.00..4186753.16 rows=46029271 width=49)
  -  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
Sort Key: (k_b.incidentid)::text
-  Index Scan using k_b_idx on k_b   (cost=0.00..4668.31
 rows=1821 width=68)
  Index Cond: (id = 107)
-  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
  Sort Key: (k_r.incidentid)::text
  -  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52
 rows=1879 width=68)
Index Cond: (id = 94)
 (13 rows)


If anything, the estimations have gotten worse. As now it thinks there
will be 1800 rows returned each, whereas you were thinking it would be
more around 100.

Since you didn't say, you did VACUUM ANALYZE recently, right?



...


 You can also try disabling merge joins, and see how that changes  things.


 Are there any negative sideaffects of doing this?

If the planner is estimating things correctly, you want to give it the
most flexibility of plans to pick from, because sometimes a merge join
is faster (postgres doesn't pick things because it wants to go slower).
The only reason for the disable flags is that sometimes the planner
doesn't estimate correctly. Usually disabling a method is not the final
solution, but a way to try out different methods, and see what happens
to the results.

Using: SET enable_mergejoin TO off;
You can disable it just for the current session (not for the entire
database). Which is the recommended way if you have a query that
postgres is messing up on. (Usually it is correct elsewhere).
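
To be concrete, the experiment (using the query from earlier in the thread)
would be something like this in one psql session:

SET enable_mergejoin TO off;
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107;
RESET enable_mergejoin;   -- back to the default for the rest of the session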



 Well, postgres is estimating around 500 rows each, is that way off?  Try
 just doing:
 EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
 EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

Once again, do this and post the results. We might just need to tweak
your settings so that it estimates the number of rows correctly, and we
don't need to do anything else.


 And see if postgres estimates the number of rows properly.

 I assume you have recently VACUUM ANALYZEd, which means you might need
 to update the statistics target (ALTER TABLE k_b ALTER COLUMN
 incidentid SET STATISTICS 100) default is IIRC 10, ranges from 1-1000,
 higher is more accurate, but makes ANALYZE slower.



...

 EXPLAIN ANALYZE
 SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
 WHERE k_r.id = 94 AND k_b.id = 107) AS sub
 USING (incidentid);


 This one looks like the same plan as the others:

 explain select recordtext from eventactivity join ( select incidentid
 from k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id =  107
 ) a  using (incidentid );

Well, the planner is powerful enough to flatten nested selects. To make
it less intelligent you can do:
SET join_collapse_limit 1;
or
SET join_collapse_limit 0;
Which should tell postgres to not try and get tricky with your query.
Again, *usually* the planner knows better than you do. So again just do
it to see what you get.

The problem is that if you are only using EXPLAIN SELECT, you will
probably get something which *looks* worse. Because if it looked better,
the planner would have used it. That is why you really need the EXPLAIN
ANALYZE, so that you can see where the planner is incorrect in its
estimates.


   QUERY PLAN

Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Tom Lane wrote:
 John A Meinel [EMAIL PROTECTED] writes:

What I don't understand is that the planner is actually estimating that
joining against the new table is going to *increase* the number of
returned rows.


 It evidently thinks that incidentid in the k_r table is pretty
 nonunique.  We really need to look at the statistics data to
 see what's going on.

   regards, tom lane


Okay, sure. What about doing this, then:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT DISTINCT incidentid FROM k_r JOIN k_b USING (incidentid)
 WHERE k_r.id = ?? AND k_b.id = ??) AS sub
 USING (incidentid)
;

Since I assume that eventactivity is the only table with recordtext,
and that you don't get any columns from k_r and k_b, meaning it would be
pointless to get duplicate incidentids.

I may be misunderstanding what the query is trying to do, but depending
on what is in k_r and k_b, is it possible to use a UNIQUE INDEX rather
than just an index on incidentid?

There is also the possibility of
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r WHERE k_r.id = ??
 UNION SELECT incidentid FROM k_b WHERE k_b.id = ??) AS sub
 USING (incidentid)
;

But both of these would mean that you don't actually want columns from
k_r or k_b, just a unique list of incident ids.

But first, I agree, we should make sure the pg_stats values are reasonable.

John
=:-





Re: [PERFORM] lots of updates on small table

2005-07-14 Thread John A Meinel
Alison Winters wrote:
 Hi,


Our application requires a number of processes to select and update rows
from a very small (10 rows) Postgres table on a regular and frequent
basis.  These processes often run for weeks at a time, but over the
space of a few days we find that updates start getting painfully slow.
We are running a full vacuum/analyze and reindex on the table every day,

Full vacuum, eh?  I wonder if what you really need is very frequent
non-full vacuum.  Say, once in 15 minutes (exact rate depending on dead
tuple rate.)


 Is there a difference between vacuum and vacuum full?  Currently we have
 a cron job going every hour that does:

 VACUUM FULL VERBOSE ANALYZE plc_fldio
 REINDEX TABLE plc_fldio

VACUUM FULL exclusively locks the table (so that nothing else can
happen) and then compacts it as much as it can.
You almost definitely want to only VACUUM every 15min, maybe VACUUM FULL
1/day.

VACUUM FULL is more for when you haven't been VACUUMing often enough. Or
have major changes to your table.
Basically VACUUM marks rows as empty and available for reuse, VACUUM
FULL removes empty space (but requires a full lock, because it is moving
rows around).

If anything, I would estimate that VACUUM FULL would be hurting your
performance. But it may happen fast enough not to matter.
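
As a rough sketch (the commands are the ones from your output above; the
schedule itself is just a starting point to tune):

-- every 15 minutes or so, e.g. from cron through psql:
VACUUM ANALYZE plc_fldio;

-- at most once a day, in a quiet period, if you keep it at all:
VACUUM FULL VERBOSE ANALYZE plc_fldio;
REINDEX TABLE plc_fldio;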


 The most recent output was this:

 INFO:  --Relation public.plc_fldio--
 INFO:  Pages 1221: Changed 3, reaped 256, Empty 0, New 0; Tup 108137: Vac 
 4176, Keep/VTL 108133/108133, UnUsed 19, MinLen 84, MaxLen 84; Re-using: 
 Free/Avail. Space 445176/371836; EndEmpty/Avail. Pages 0/256.
 CPU 0.04s/0.14u sec elapsed 0.18 sec.
 INFO:  Index plcpage_idx: Pages 315; Tuples 108137: Deleted 4176.
 CPU 0.03s/0.04u sec elapsed 0.14 sec.
 INFO:  Rel plc_fldio: Pages: 1221 -- 1221; Tuple(s) moved: 0.
 CPU 0.03s/0.04u sec elapsed 0.36 sec.
 INFO:  Analyzing public.plc_fldio
 VACUUM
 REINDEX

 We'll up it to every 15 minutes, but i don't know if that'll help
 because even with the current vacuuming the updates are still getting
 slower and slower over the course of several days.  What really puzzles
 me is why restarting the processes fixes it.  Does PostgreSQL keep some
 kind of backlog of transactions all for one database connection?  Isn't
 it normal to have processes that keep a single database connection open
 for days at a time?

I believe that certain locks are grabbed per session. Or at least there
is some command that you can run, which you don't want to run in a
maintained connection. (It might be VACUUM FULL, I don't remember which
one it is).

But the fact that your application works at all seems to indicate that
it isn't holding on to any locks.

I know VACUUM cannot clean up any rows that are still visible to an open
transaction; I don't know if this extends to open connections or not.


 Regarding the question another poster asked: all the transactions are
 very short.  The table is essentially a database replacement for a
 shared memory segment - it contains a few rows of byte values that are
 constantly updated byte-at-a-time to communicate data between different
 industrial control processes.

 Thanks for the thoughts everyone,

 Alison


Is it possible to have some sort of timer that would recognize it has
been connected for too long, drop the database connection, and
reconnect? I don't know that it would solve anything, but it would be
something you could try.

John
=:-





Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:

 On Jul 14, 2005, at 7:15 PM, John A Meinel wrote:



 Is the distribution of your rows uneven? Meaning do you have more rows
 with a later id than an earlier one?


 There are definitely some id's that will have many times more than  the
 others.  If I group and count them, the top 10 are fairly  dominant in
 the table.

That usually skews the estimates, since the estimate is more of an
average (unless the statistics target is higher).




 Hmm.. How to do it permanently? Well you could always issue set
 join_collapse_limit to 1; select * from 
 But obviously that isn't what you prefer. :)

 I think there are things you can do to make merge join more expensive
 than a nested loop, but I'm not sure what they are.


 Maybe someone else has some ideas to encourage this behavior for  future
 work?  Setting it on a per-connection basis is doable, but  would add
 some burden to us in code.

My biggest question is why the planner thinks the Nested Loop would be
so expensive.
Have you tuned any of the parameters? It seems like something is out of
whack. (cpu_tuple_cost, random_page_cost, etc...)



 What I really don't understand is that the estimates dropped as well.
 The actual number of estimated rows drops to 3k instead of 1M.
 The real question is why does the planner think it will be so  expensive?


 select count(*) from k_b join k_r using (incidentid) where k_b.id=107
 and k_r.id=94;
 count
 ---
373



 Well, this says that they are indeed much more selective.
 Each one has > 1k rows, but together you end up with only 400.


 Is this a bad thing?  Is this not selective enough to make it much
 faster?

Yes, being more selective is what makes it faster. But the planner
doesn't seem to notice it properly.


 Overall, I'm much happier now after seeing the new plan come about,  if
 I can find a way to make that join_collapse behavior permanent, I  can
 certainly live with these numbers.


I'm sure there are pieces to tune, but I've reached my limits of
parameters to tweak :)

 Thanks again for your continued efforts.

 -Dan


John
=:-




Re: [PERFORM] join and query planner

2005-07-11 Thread John A Meinel
Dario Pudlo wrote:
 (first of all, sorry for my English)
 Hi.
- Does left join restrict the order in which the planner must join
 tables? I've read about join, but i'm not sure about left join...
- If so: Can I avoid this behavior? I mean, make the planner resolve the
 query, using statistics (uniqueness, data distribution) rather than join
 order.

   My query looks like:
   SELECT ...
   FROM a, b,
   LEFT JOIN c ON (c.key = a.key)
   LEFT JOIN d on (d.key=a.key)
   WHERE (a.key = b.key)  AND (b.column = 100)

   b.column has a lot better selectivity, but planner insist on resolve
 first c.key = a.key.

   Of course, I could rewrite something like:
   SELECT ...
   FROM
(SELECT ...
 FROM a,b
 LEFT JOIN d on (d.key=a.key)
 WHERE (b.column = 100)
 )
 as aa
   LEFT JOIN c ON (c.key = aa.key)

   but this is query is constructed by an application with a multicolumn
 filter. It's dynamic.
   It means that a user could choose to look for c.column = 1000. And
 also, combinations of filters.

   So, I need the planner to choose the best plan...

Probably forcing the other join earlier could help:
SELECT ...
  FROM a JOIN b ON (a.key = b.key)
  LEFT JOIN c ON (c.key = a.key)
...

I think the problem is that postgresql can't reorder explicit JOIN syntax
very easily. But you can make the JOIN happen earlier.

John
=:-

 I've already change statistics, I clustered tables with cluster, ran vacuum
 analyze, changed work_mem, shared_buffers...

 Greetings. TIA.








Re: [PERFORM] Sorting on longer key is faster ?

2005-07-11 Thread John A Meinel
Chris Travers wrote:
 John A Meinel wrote:

 jobapply wrote:


 The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER
 BY x..

 How can that be possible?

 Btw: x and x||t are same ordered

 phoeniks= explain analyze SELECT * FROM test WHERE i20 ORDER BY x
 || t;
QUERY PLAN




 What types are x and t, I have the feeling x || t is actually a
 boolean, so it is only a True/False sort, while ORDER BY x has to do
 some sort of string comparison (which might actually be a locale
 dependent comparison, and strcoll can be very slow on some locales)



 Am I reading this that wrong?  I would think that x || t would mean
 concatenate x  and t.

Sorry, I think you are right. I was getting my operators mixed up.

 This is interesting.  I never thought of writing a multicolumn sort this
 way

I'm also surprised that the sort is faster with a merge operation. Are
you using UNICODE as the database format? I'm just wondering if it is
doing something funny like casting it to an easier to sort type.


 Best Wishes,
 Chris Travers
 Metatron Technology Consulting

PS Don't forget to Reply All so that your messages go back to the list.




Re: [PERFORM] Sorting on longer key is faster ?

2005-07-11 Thread John A Meinel
jobapply wrote:
 The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER BY x..

 How can that be possible?

 Btw: x and x||t are same ordered

 phoeniks= explain analyze SELECT * FROM test WHERE i20 ORDER BY x || t;
 QUERY PLAN

I also thought of another possibility. Are there a lot of similar
entries in X? Meaning that the same value is repeated over and over? It
is possible that the sort code has a weakness when sorting equal values.

For instance, if it was doing a Hash aggregation, you would have the
same hash repeated. (It isn't; I'm just mentioning a case where it might
affect something).

If it is creating a tree representation, it might cause some sort of
pathological worst-case behavior, where all entries keep adding to the
same side of the tree, rather than being more balanced.

I don't know the internals of postgresql sorting, but just some ideas.

John
=:-





Re: [PERFORM] CURSOR slowes down a WHERE clause 100 times?

2005-07-06 Thread John A Meinel
Niccolo Rigacci wrote:

Hi to all,

I have a performace problem with the following query:

  BEGIN;
  DECLARE mycursor BINARY CURSOR FOR
SELECT
  toponimo,
  wpt
  FROM wpt_comuni_view
WHERE (
  wpt &&
  setSRID('BOX3D(4.83 36, 20.16 47.5)'::BOX3D, 4326)
);
  FETCH ALL IN mycursor;
  END;

I get the results in about 108 seconds (8060 rows).

If I issue the SELECT alone (without the CURSOR) I get the
same results in less than 1 second.

The wpt_comuni_view is a VIEW of a 3 tables JOIN, and the wpt
field is a PostGIS geometry column. The && is the PostGIS
overlaps operator.

If I CURSOR SELECT from a temp table instead of the JOIN VIEW the
query time 1 second.

If I omit the WHERE clause the CURSOR fetches results in 1
second.

Can the CURSOR on JOIN affects so heavly the WHERE clause? I
suspect that - with the CURSOR - a sequential scan is performed
on the entire data set for each fetched record...

Any idea?


What does it say if you do EXPLAIN ANALYZE SELECT... both with and
without the cursor?
It may not say much for the cursor, but I think you can explain analyze
the fetch statements.

It is my understanding that Cursors generally favor using a
slow-startup style plan, which usually means using an index, because it
expects that you won't actually want all of the data. A seqscan is not
always slower, especially if you need to go through most of the data.

Without an explain analyze it's hard to say what the planner is thinking
and doing.

This is the definition of the VIEW:

  CREATE VIEW wpt_comuni_view AS
SELECT istat_wpt.oid, istat_wpt.id, istat_wpt.toponimo,
  istat_comuni.residenti, istat_wpt.wpt
  FROM istat_comuni
  JOIN istat_comuni2wpt
USING (idprovincia, idcomune)
  JOIN istat_wpt
ON (idwpt = id);

Thank you for any hint.



You might also try comparing your CURSOR to a prepared statement. There
are a few rare cases where preparing is worse than issuing the query
directly, depending on your data layout.
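
For instance, using the query from the start of this thread (same PostGIS
expression), a prepared-statement comparison would look something like:

PREPARE wpt_q AS
  SELECT toponimo, wpt
    FROM wpt_comuni_view
   WHERE wpt && setSRID('BOX3D(4.83 36, 20.16 47.5)'::BOX3D, 4326);

EXPLAIN ANALYZE EXECUTE wpt_q;  -- needs a version that can EXPLAIN an EXECUTE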

John
=:-






Re: [PERFORM] planner picking more expensive plan

2005-07-01 Thread John A Meinel

Sam Mason wrote:


Hi,

I've just been referred here after a conversation on IRC and everybody
seemed to think I've stumbled upon some strangeness.

The planner (in PG version 8.0.2) is choosing what it thinks is a more
expensive plan.  I've got a table of animals (about 3M rows) and their
movements (about 16M rows), and I'm trying to execute this query:

 SELECT a.birthlocnid, m.locnid
 FROM animals a
   LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0)
 LIMIT 10;




Why are you using LIMIT without having an ORDER BY?
What are actually trying to get out of this query? Is it just trying to
determine where the 'home' locations are?
It just seems like this query isn't very useful. As it doesn't restrict
by animal id, and it just gets 10 randomly selected animals where
m.mtypeid=0.
And why a LEFT JOIN instead of a normal join?
Anyway, the general constraints you are applying seem kind of confusing.
What happens if you change the plan to:

 SELECT a.birthlocnid, m.locnid
 FROM animals a
   LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0)
 ORDER BY a.animalid LIMIT 10;


I would guess that this would help the planner realize it should try to
use an index, since it can realize that it wants only a few rows by
a.animalid in order.
Though I also recognize that you aren't returning a.animalid so you
don't really know which animals you are returning.

I get the feeling you are trying to ask something like do animals stay
at their birth location, or at least how are animals moving around. I
don't know what m.typeid = 0 means, but I'm guessing it is something
like where their home is.

Anyway, I would say you need to put a little bit more restriction in, so
the planner can figure out how to get only 10 rows.

John
=:-


If I have work_mem set to something small (1000) it uses this plan:

   QUERY PLAN

Limit  (cost=0.00..202.52 rows=10 width=8) (actual time=0.221..0.600 rows=10 
loops=1)
  -  Merge Left Join  (cost=0.00..6628.30 rows=3302780 width=8) (actual 
time=0.211..0.576 rows=10 loops=1)
Merge Cond: (outer.animalid = inner.animalid)
-  Index Scan using animals_pkey on animals a  (cost=0.00..10198983.91 
rows=3302780 width=8) (actual time=0.112..0.276 rows=10 loops=1)
-  Index Scan using movement_animal on movements m  
(cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.088..0.235 rows=10 
loops=1)
  Filter: (mtypeid = 0)
Total runtime: 0.413 ms

But if I increase work_mem to 1 it uses this plan:

   QUERY PLAN

Limit  (cost=565969.42..566141.09 rows=10 width=8) (actual 
time=27769.047..27769.246 rows=10 loops=1)
  -  Merge Right Join  (cost=565969.42..57264070.77 rows=3302780 width=8) 
(actual time=27769.043..27769.228 rows=10 loops=1)
Merge Cond: (outer.animalid = inner.animalid)
-  Index Scan using movement_animal on movements m  
(cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.022..0.154 rows=10 
loops=1)
  Filter: (mtypeid = 0)
-  Sort  (cost=565969.42..574226.37 rows=3302780 width=8) (actual 
time=27768.991..27769.001 rows=10 loops=1)
  Sort Key: a.animalid
  -  Seq Scan on animals a  (cost=0.00..77086.80 rows=3302780 
width=8) (actual time=0.039..5620.651 rows=3303418 loops=1)
Total runtime: 27851.097 ms


I've tried playing with the statistics as people suggested on IRC but to
no effect.  There was some discussion about why it would be doing this,
but nothing obvious came out of it.

SHOW ALL output is at the end of this mail but it should be pretty
standard apart from:

 shared_buffers = 1
 work_mem = 8192
 max_connections = 100
 effective_cache_size = 1

Hope that's enough information to be useful.

Thanks.

  Sam







Re: [PERFORM] Planner constants for RAM resident databases

2005-07-01 Thread John A Meinel

Emil Briggs wrote:


I'm working with an application where the database is entirely resident in RAM
(the server is a quad opteron with 16GBytes of memory). It's a web
application and handles a high volume of queries. The planner seems to be
generating poor  plans for some of our queries which I can fix by raising
cpu_tuple_cost. I have seen some other comments in the archives saying that
this is a bad idea  but is that necessarily the case when the database is
entirely resident in RAM?

Emil





Generally, the key knob to twiddle when everything fits in RAM is
random_page_cost. If you truly have everything in RAM you could set it
almost to 1. 1 means that it costs exactly the same to go randomly
through the data as it does to go sequentially. I would guess that even
in RAM it is faster to go sequential (since you still have to page and
deal with L1/L2/L3 cache, etc). But the default random_page_cost of 4 is
probably too high for you.
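
A quick per-session experiment might be (the numbers are starting points to
try, not recommendations):

SET random_page_cost = 1.5;          -- closer to 1 when everything is cached
SET effective_cache_size = 1572864;  -- ~12GB, expressed in 8kB pages
-- then re-run EXPLAIN ANALYZE on one of the problem queries and compare plans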

John
=:-





Re: [PERFORM] Planner constants for RAM resident databases

2005-07-01 Thread John A Meinel

Emil Briggs wrote:


I just mentioned random_page_cost, but you should also tune
effective_cache_size, since that is effectively most of your RAM. It
depends what else is going on in the system, but setting it as high as
say 12-14GB is probably reasonable if it is a dedicated machine. With
random_page_cost 1.5-2, and higher effective_cache_size, you should be
doing pretty well.
John
=:-




I tried playing around with these and they had no effect. It seems the only
thing that makes a difference is cpu_tuple_cost.




I'm surprised. I know cpu_tuple_cost can effect it as well, but usually
the recommended way to get indexed scans is the above two parameters.

When you do explain analyze of a query that you have difficulties
with, how are the planner's estimates. Are the estimated number of rows
about equal to the actual number of rows?
If the planner is mis-estimating, there is a whole different set of
tuning to do to help it estimate correctly.

John
=:-

PS Use reply-all so that your comments go to the list.




Re: [PERFORM] Too slow querying a table of 15 million records

2005-06-28 Thread John A Meinel

Tobias Brox wrote:


[EMAIL PROTECTED] - Tue at 08:33:58PM +0200]



I use FreeBSD 4.11 with PostGreSQL 7.3.8.



(...)



database= explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;




I haven't looked through all your email yet, but this phenomena have been up
at the list a couple of times.  Try replacing now() - interval '24 hours'
with a fixed time stamp, and see if it helps.

pg7 will plan the query without knowledge of what now() - interval '24
hours' will compute to.  This should be fixed in pg8.




The grandparent was a mailing list double send. Notice the date is 1
week ago. It has already been answered (though your answer is still
correct).

John
=:-





Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin




Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
   WHERE (id+1) NOT IN (SELECT id FROM t)
   ORDER BY id LIMIT 1;

John
=:-





Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how



you



can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.




it does, but it is still faster than generate_series(), which requires
both a seqscan and a materialization of the function.




I don't know how big you want to scale to.




big. :)

merlin




See my follow up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:-





Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

John A Meinel wrote:


Merlin Moncure wrote:


I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin




Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
   WHERE (id+1) NOT IN (SELECT id FROM t)
   ORDER BY id LIMIT 1;

John
=:-


Well, I was able to improve it to use appropriate index scans.
Here is the query:

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

I created a test table which has 90k randomly inserted rows. And this is
what EXPLAIN ANALYZE says:

  QUERY PLAN


Limit  (cost=0.00..12.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 
loops=1)
  -  Index Scan using id_test_pkey on id_test t1  (cost=0.00..544423.27 
rows=45000 width=4) (actual time=0.000..0.000 rows=1 loops=1)
Filter: (NOT (subplan))
SubPlan
  -  Index Scan using id_test_pkey on id_test t2  (cost=0.00..6.01 
rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=15)
Index Cond: (id = ($0 + 1))
Total runtime: 0.000 ms
(7 rows)

The only thing I have is a primary key index on id_test(id);

John
=:-





Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


John Meinel wrote:



See my follow up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:-




 Confirmed.  Hats off to you, the above is some really wicked querying.
IIRC I posted the same question several months ago with no response and
 had given up on it.  I think your solution (smallest X+1 not in X) is a
good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

Merlin



Just be aware that as your table fills its holes, this query gets
slower and slower.
I've been doing some testing. And it starts at 0.00 when the first entry
is something like 3, but when you start getting to 16k it starts taking
more like 200 ms.

So it kind of depends how your table fills (and empties I suppose).

The earlier query was slower overall (since it took 460ms to read in the
whole table).
I filled up the table such that 63713 is the first empty space, and it
takes 969ms to run.
So actually if your table is mostly full, the first form is better.

But if you are going to have 100k rows, with basically random
distribution of empties, then the NOT EXISTS works quite well.

Just be aware of the tradeoff. I'm pretty sure the WHERE NOT EXISTS will
always use a looping structure, and go through the index in order.

John
=:-





Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


On Tue, Jun 28, 2005 at 12:02:09 -0400,
 Merlin Moncure [EMAIL PROTECTED] wrote:



Confirmed.  Hats off to you, the above some really wicked querying.
IIRC I posted the same question several months ago with no response



and



had given up on it.  I think your solution (smallest X+1 not in X) is



a



good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;



You need to rework this to check to see if row '1' is missing. The
above returns the start of the first gap after the first row that
isn't missing.




Correct.

In fact, I left out a detail in my original request in that I had a
starting value (easily supplied with where clause)...so what I was
really looking for was a query which started at a supplied value and
looped forwards looking for an empty slot.  John's supplied query is a
drop in replacement for a plpgsql routine which does exactly this.

The main problem with the generate_series approach is that there is no
convenient way to determine a supplied upper bound.  Also, in some
corner cases of my problem domain the performance was not good.

Merlin



Actually, if you already have a lower bound, then you can change it to:

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE t1.id > id_min
AND NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

This would actually really help performance if you have a large table
and then empty entries start late.

On my system, where the first entry is 64k, doing where id  6
speeds it up back to 80ms instead of 1000ms.
John
=:-





Re: [PERFORM] max_connections / shared_buffers / effective_cache_size

2005-06-24 Thread John A Meinel

Puddle wrote:


Hello, I'm a Sun Solaris sys admin for a start-up
company.  I've got the UNIX background, but now I'm
having to learn PostgreSQL to support it on our
servers :)

Server Background:

Solaris 10 x86
PostgreSQL 8.0.3
Dell PowerEdge 2650 w/4gb ram.
This is running JBoss/Apache as well (I KNOW the bad
juju of running it all on one box, but it's all we
have currently for this project). I'm dedicating 1gb
for PostgreSQL alone.

So, far I LOVE it compared to MySQL it's solid.

The only things I'm kind of confused about (and I've
been searching for answers on lot of good perf docs,
but not too clear to me) are the following:

1.) shared_buffers I see lot of reference to making
this the size of available ram (for the DB).  However,
I also read to make it the size of pgdata directory.

I notice when I load postgres each daemon is using the
amount of shared memory (shared_buffers).  Our current
dataset (pgdata) is 85mb in size.  So, I'm curious
should this size reflect the pgdata or the 'actual'
memory given?

I currently have this at 128mb



You generally want shared_buffers to be no more than 10% of available
ram. Postgres expects the OS to do its own caching. 128M/4G = 3% seems
reasonable to me. I would certainly never set it to 100% of ram.


2.) effective_cache_size - from what I read this is
the 'total' allowed memory for postgresql to use
correct? So, if I am willing to allow 1GB of memory
should I make this 1GB?



This is the effective amount of caching between the actual postgres
buffers, and the OS buffers. If you are dedicating this machine to
postgres, I would set it to something like 3.5G. If it is a mixed
machine, then you have to think about it.

This does not change how postgres uses RAM, it changes how postgres
estimates whether an Index scan will be cheaper than a Sequential scan,
based on the likelihood that the data you want will already be cached in
Ram.

If your dataset is only 85MB, and you don't think it will grow, you
really don't have to worry about this much. You have a very small database.


3.) max_connections, been trying to figure 'how' to
determine this #.  I've read this is buffer_size+500k
per a connection.

ie.  128mb(buffer) + 500kb = 128.5mb per connection?



Max connections is just how many concurrent connections you want to
allow. If you can get away with lower, do so.  Mostly this is to prevent
connections * work_mem from getting bigger than your real working memory and
causing you to swap.


I was curious about 'sort_mem' I can't find reference
of it in the 8.0.3 documentation, has it been removed?



sort_mem changed to work_mem in 8.0, same thing with vacuum_mem -
maintenance_work_mem.


work_mem and max_stack_depth set to 4096
maintenance_work_mem set to 64mb



Depends how much space you want to give per connection. 4M is pretty
small for a machine with 4G of RAM, but if your DB is only 85M it might
be plenty.
work_mem is how much memory a sort/hash/etc will use before it spills to
disk. So look at your queries. If you tend to sort most of your 85M db
in a single query, you might want to make it a little bit more. But if
all of your queries are very selective, 4M could be plenty.

I would make maintenance_work_mem more like 512M. It is only used for
CREATE INDEX, VACUUM, etc. Things that are not generally done by more
than one process at a time. And it's nice for them to have plenty of
room to run fast.
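
Pulling that together, a starting point in postgresql.conf for this box might
look something like the following (8.0 units: the first two are 8kB pages, the
last two are kB; these are guesses to refine for your workload, not gospel):

shared_buffers = 16384          # ~128MB
effective_cache_size = 262144   # ~2GB, leaving room for JBoss/Apache
work_mem = 8192                 # 8MB per sort/hash step
maintenance_work_mem = 524288   # 512MB for VACUUM / CREATE INDEX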


Thanks for any help on this.  I'm sure bombardment of
newbies gets old :)

-William



Good luck,
John
=:-





Re: [PERFORM] Speed with offset clause

2005-06-24 Thread John A Meinel

Yves Vindevogel wrote:


Hi again all,

My queries are now optimised. They all use the indexes like they should.
However, there's still a slight problem when I issue the offset clause.

We have a table that contains 600.000 records
We display them by 25 in the webpage.
So, when I want the last page, which is: 600k / 25 = page 24000 - 1 = 
23999, I issue the offset of 23999 * 25
This takes a long time to run, about 5-10 seconds, whereas offsets below 
100 take less than a second.


Can I speed this up ?


Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*

Postgres has the optimization that it will plan a query, and once it 
reaches the limit, it can stop even though there is more data available.
The problem you are having is that it has to go through offset rows 
first, before it can apply the limit.
If you can, (as mentioned in the other post), try to refine your index 
so that you can reverse it for the second half of the data.


This is probably tricky, as you may not know how many rows you have (or 
the amount might be changing).


A potentially better thing, is if you have an index you are using, you 
could use a subselect so that the only portion that needs to have 60k 
rows is a single column.


Maybe an example:
Instead of saying:

SELECT * FROM table1, table2 WHERE table1.id = table2.id ORDER BY 
table1.date OFFSET x LIMIT 25;


You could do:

SELECT * FROM
   (SELECT id FROM table1 ORDER BY date OFFSET x LIMIT 25) as subselect
   JOIN table1 ON subselect.id = table1.id
   , table2
   WHERE table1.id = table2.id;

That means that the culling process is done on only a few rows of one 
table, and the rest of the real merging work is done on only a few rows.


It really depends on you query, though, as what rows you are sorting on 
has a big influence on how well this will work.


John
=:-






Re: [PERFORM] Limit clause not using index

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


Hi,

I have a very simple query on a big table. When I issue a limit 
and/or offset clause, the query is not using the index.

Can anyone explain me this ?


You didn't give enough information. What does your index look like that 
you are expecting it to use?

Generally, you want to have matching columns. So you would want
CREATE INDEX blah ON tblprintjobs(loginuser, desceventdate, desceventtime);

Next, you should post EXPLAIN ANALYZE instead of regular explain, so we 
can have an idea if the planner is actually making correct estimations.


John
=:-



rvponp=# explain select * from tblprintjobs order by loginuser, 
desceventdate, desceventtime offset 25 limit 25 ;

QUERY PLAN
--- 


Limit (cost=349860.62..349860.68 rows=25 width=206)
- Sort (cost=349860.56..351416.15 rows=622236 width=206)
Sort Key: loginuser, desceventdate, desceventtime
- Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206)
(4 rows)

rvponp=# explain select * from tblprintjobs order by loginuser, 
desceventdate, desceventtime ;

QUERY PLAN
- 


Sort (cost=349860.56..351416.15 rows=622236 width=206)
Sort Key: loginuser, desceventdate, desceventtime
- Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206)
(3 rows)

Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*








Re: [PERFORM] Do Views execute underlying query everytime ??

2005-06-21 Thread John A Meinel

Amit V Shah wrote:


After I sent out this email, I found this article from google

http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html

Looks like we can control as to when the views refresh... I am still kind of
confused, and would appreciate help !!

The create/drop table does sound a solution that can work, but the thing is
I want to get manual intervention out, and besides, my work flow is very
complex so this might not be an option for me :-(

Thanks,
Amit



Just to make it clear, a view is not the same as a materialized view.
A view is just a set of rules to the planner so that it can simplify
interactions with the database. A materialized view is a query which has
been saved into a table.

To set it up properly, really depends on what your needs are.

  1. How much time can elapse between an update to the system, and an
 update to the materialized views?
  2. How many updates / (sec, min, hour, month) do you expect. Is
 insert performance critical, or secondary.

For instance, if you get a lot of updates, but you can have a 1 hour lag
between the time a new row is inserted and the view is updated, you can
just create a cron job that runs every hour to regenerate the
materialized view.

If you don't get many updates, but you need them to show up right away,
then you can add triggers to the affected tables, such that
inserting/updating to a specific table causes an update to the
materialized view.
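
A minimal sketch of the cron-style refresh, with invented table and column
names (adjust to your schema):

-- one-time setup: snapshot the expensive query into an ordinary table
CREATE TABLE matview_summary AS
  SELECT some_key, count(*) AS n
    FROM big_table
   GROUP BY some_key;

-- something cron can call, e.g.: psql -c "SELECT refresh_matview_summary()" mydb
CREATE OR REPLACE FUNCTION refresh_matview_summary() RETURNS void AS '
  DELETE FROM matview_summary;
  INSERT INTO matview_summary
    SELECT some_key, count(*)
      FROM big_table
     GROUP BY some_key;
' LANGUAGE sql;

Using DELETE rather than TRUNCATE means the refresh doesn't take an exclusive
lock, so readers keep seeing the old contents until the transaction commits.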

There are quite a few potential tradeoffs. Rather than doing a
materialized view, you could just improve your filters. If you are doing
a query to show people the results, you generally have some sort of
upper bound on how much data you can display. Humans don't like reading
more than 100 or 1000 rows. So create your normal query, and just tack
on a LIMIT 100 at the end. If you structure your query properly, and
have appropriate indexes, you should be able to make the LIMIT count,
and allow you to save a lot of overhead of generating rows that you
don't use.

I would probably start by posting the queries you are currently using,
along with an EXPLAIN ANALYZE, and a description of what you actually
need from the query. Then this list can be quite helpful in
restructuring your query to make it faster.

John
=:-





Re: [PERFORM] Limit clause not using index

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


rvponp=# explain analyze select * from tblPrintjobs order by
loginuser, desceventdate, desceventtime ;
QUERY PLAN


Sort (cost=345699.06..347256.49 rows=622972 width=203) (actual
time=259438.952..268885.586 rows=622972 loops=1)
Sort Key: loginuser, desceventdate, desceventtime
- Seq Scan on tblprintjobs (cost=0.00..25596.72 rows=622972
width=203) (actual time=21.155..8713.810 rows=622972 loops=1)
Total runtime: 271583.422 ms
(4 rows)



Can you post it with the limit? I realize the query takes a long time,
but that is the more important query to look at.

Also, just as a test, if you can, try dropping most of the indexes
except for the important one. It might be that the planner is having a
hard time because there are too many permutations to try.
I believe if you drop the indexes inside a transaction, they will still
be there for other queries, and if you rollback instead of commit, you
won't lose anything.

BEGIN;
DROP INDEX ...
EXPLAIN ANALYZE SELECT *...
ROLLBACK;

John
=:-





Re: [PERFORM] slow growing table

2005-06-21 Thread John A Meinel

Jone C wrote:


On second thought... Does a VACUUM FULL help? If so, you might want to
increase your FSM settings.




Thank you for the reply, sorry for delay I was on holiday.

 I tried that; it had no effect. I benchmarked 2x before, performed
VACUUM FULL on the table in question post inserts, then benchmarked 2x
after. Same results...

Should I try your suggestion on deleting the indexes? This table needs
to be accessible for reads at all times however though...

thank you kindly




I believe dropping an index inside a transaction is only visible to that
transaction. (Can someone back me up on this?)
Which means if you did:

BEGIN;
DROP INDEX index in question;
CREATE INDEX same index ON same stuff;
COMMIT;

The only problem is that if you are using a unique or primary key index,
a foreign key which is referencing that index would have to be dropped
and re-created as well. So you could have a pretty major cascade effect.

A better thing to do if your table only has one (or at least only a few)
indexes, would be to CLUSTER, which is effectively a VACUUM FULL + a
REINDEX (plus sorting the rows so that they are in index order). It
holds a full lock on the table, and takes a while, but when you are
done, things are cleaned up quite a bit.
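
For example (8.0 syntax; the index and table names are placeholders, since
the table isn't named in this thread):

CLUSTER my_table_pkey ON my_table;
ANALYZE my_table;   -- CLUSTER does not refresh the planner statistics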

You might also try just a REINDEX on the indexes in question, but this
also holds a full lock on the table. (My DROP + CREATE might as well;
I'm not really sure, I just think of it as a way to recreate the index
without losing it for other transactions.)

John
=:-





Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


Hi,

I have another question regarding indexes.

I have a table with a lot of indexes on it. Those are needed to 
perform my searches.

Once a day, a bunch of records is inserted in my table.

Say, my table has 1.000.000 records and I add 10.000 records (1% new)
What would be faster.

1) Dropping my indexes and recreating them after the inserts
2) Just inserting it and have PG manage the indexes

Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*



I'm guessing for 1% new that (2) would be faster.
John
=:-






Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


And, after let's say a week, would that index still be optimal or
would it be a good idea to drop it in the weekend and recreate it.



It depends a little bit on the postgres version you are using. If you
are only ever adding to the table, and you are not updating it or
deleting from it, I think the index is always optimal.
Once you start deleting from it there are a few cases where older
versions would not properly re-use the empty entries, requiring a
REINDEX. (Deleting low numbers and always adding high numbers was one of
the cases)

However, I believe that as long as you vacuum often enough, so that the
system knows where the unused entries are, you don't ever have to drop
and re-create the index.

John
=:-





Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


I only add records, and most of the values are random
Except the columns for dates, 


I doubt that you would need to recreate indexes. That really only needs
to be done in pathological cases, most of which have been fixed in the
latest postgres.

If you are only inserting (never updating or deleting), the index can
never bloat, since you are only adding new stuff.
(You cannot get dead items to bloat your index if you never delete
anything.)

John
=:-





Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread John Arbash Meinel
Alex Stapleton wrote:


 On 20 Jun 2005, at 15:59, Jacques Caron wrote:

...

 ANALYZE is not a very expensive operation, however VACUUM can
 definitely be a big strain and take a long time on big tables,
 depending on your setup. I've found that partitioning tables (at  the
 application level) can be quite helpful if you manage to keep  each
 partition to a reasonable size (under or close to available  memory),
 especially if the partitioning scheme is somehow time- related. YMMV.

 Jacques.


 That's not currently an option as it would require a pretty large
 amount of work to implement. I think we will have to keep that in
 mind though.

Remember, you can fake it with a low-level set of tables, and then wrap
them into a UNION ALL view.
So you get something like:

CREATE VIEW orig_table AS
SELECT * FROM table_2005_04
UNION ALL SELECT * FROM table_2005_05
UNION ALL SELECT * FROM table_2005_06
...
;

Then at least your individual operations are fast. As you insert, you
can create a rule that on insert into orig_table do instead ... insert
into table_2005_07 (or whatever the current table is).
It takes a little bit of maintenance on the DB admin's part, since every
month they have to create a new table, and then update all of the views
and triggers. But it is pretty straightforward.
If you are doing append-only inserting, then you have the nice feature
that only the last table is ever modified, which means that the older
tables don't really need to be vacuumed or analyzed.
And even if you have to have each table modified as you go, you still
can break up a VACUUM into only doing one of the sub tables at a time.

I don't know your db schema, but I thought I would mention that true
partitioning isn't implemented yet, you can still get something very
similar with views, triggers and rules.
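
The rule part might look roughly like this (column names are invented, since
I don't know the schema):

CREATE RULE orig_table_insert AS
  ON INSERT TO orig_table
  DO INSTEAD
  INSERT INTO table_2005_07 (id, logtime, payload)
  VALUES (NEW.id, NEW.logtime, NEW.payload);

-- and maintenance can then be done one partition at a time:
VACUUM ANALYZE table_2005_07;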

John
=:-





Re: [PERFORM] How does the transaction buffer work?

2005-06-16 Thread John A Meinel

Veikko Mäkinen wrote:


Hey,

How does Postgres (8.0.x) buffer changes to a database within a 
transaction? I need to insert/update more than a thousand rows (maybe 
even more than 1 rows, ~100 bytes/row) in a table but the changes 
must not be visible to other users/transactions before every row is 
updated. One way of doing this that I thought of was start a 
transaction, delete everything and then just dump new data in (copy 
perhaps). The old data would be usable to other transactions until I 
commit my insert. This would be the fastest way, but how much memory 
would this use? Will this cause performance issues on a heavily loaded 
server with too little memory even to begin with :)


Postgres does indeed keep track of who can see what. Such that changes 
won't be seen until a final commit.

If you are trying to insert bulk data, definitely consider COPY.

But UPDATE should also be invisible until the commit. So if you are only 
changing data, there really isn't any reason to do a DELETE and INSERT. 
Especially since you'll have problems with foreign keys at the DELETE stage.


John
=:-



-veikko








Re: [PERFORM] Index ot being used

2005-06-13 Thread John A Meinel

Kevin Grittner wrote:


It sure would be nice if the optimizer would consider that it had the
leeway to add any column which was restricted to a single value to any
point in the ORDER BY clause.  Without that, the application programmer
has to know what indexes are on the table, rather than being able to
just worry about the set of data they want.  Obviously, if a column can
have only one value in the result set, adding to any point in the ORDER
BY can't change anything but performance.  That sure sounds like
something which should fall within the scope of an optimizer.

It really should be a DBA function to add or drop indexes to tune the
performance of queries, without requiring application programmers to
modify the queries for every DBA adjustment.  (When you have a database
with over 350 tables and thousands of queries, you really begin to
appreciate the importance of this.)



I agree that having a smarter optimizer, which can recognize when an
index can be used for ORDER BY would be useful.

I don't know if there are specific reasons why not, other than just not
being implemented yet. It might be tricky to get it correct (for
instance, how do you know which columns can be added, which ones will be
constant.) Perhaps you could just potentially add the WHERE items if they
have an equality constraint with a constant. But I'm guessing there are
more cases than that where the optimization could be performed.
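
To make that concrete, the kind of rewrite being discussed looks like this
(hypothetical table t with an index on (a, b)):

-- the query the application writes:
SELECT * FROM t WHERE a = 42 ORDER BY b LIMIT 10;
-- what the planner currently needs in order to walk the (a, b) index:
SELECT * FROM t WHERE a = 42 ORDER BY a, b LIMIT 10;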

Also, the more options you give the planner, the longer it takes on
average to plan any single query. Yes, it is beneficial for this use
case, but does that balance out slowing down all the other queries by a
tiny bit?

I'm guessing the optimization wasn't as important as some of the others
that have been done, so it hasn't been implemented yet.

John
=:-





Re: [PERFORM] PostgreSQL using the wrong Index

2005-06-13 Thread John A Meinel

Alex Stapleton wrote:


Oh, we are running 7.4.2 btw. And our random_page_cost = 1


Which is only correct if your entire db fits into memory. Also, try
updating to a later 7.4 version if at all possible.


On 13 Jun 2005, at 14:02, Alex Stapleton wrote:


We have two index's like so

l1_historical=# \d N_intra_time_idx
    Index "N_intra_time_idx"
 Column |            Type
--------+-----------------------------
 time   | timestamp without time zone
btree


Just so you are aware, writing this as: We have an index on
N_intra(time) and one on N_Intra(symbol, time) is a lot more succinct.



l1_historical=# \d N_intra_pkey
    Index "N_intra_pkey"
 Column |            Type
--------+-----------------------------
 symbol | text
 time   | timestamp without time zone
unique btree (primary key)

and on queries like this

select * from N_intra where symbol='SOMETHING WHICH DOESNT EXIST'
order by time desc limit 1;

PostgreSQL takes a very long time to complete, as it effectively
scans the entire table, backwards. And the table is huge, about 450
million rows. (btw, there are no triggers or any other exciting
things like that on our tables in this db.)

but on things where the symbol does exist in the table, it's more  or
less fine, and nice and fast.



What happens if you do:
SELECT * FROM N_intra WHERE symbol='doesnt exist' ORDER BY symbol,
time DESC LIMIT 1;

Yes, symbol is constant, but it frequently helps the planner realize it
can use an index scan if you include all terms in the index in the ORDER
BY clause.



Whilst the option the planner has taken might be faster most of the
time, the worst case scenario is unacceptable for obvious reasons.
I've googled for trying to force the use of a specific index, but
can't find anything relevant. Does anyone have any suggestions on
getting it to use an index which hopefully will have better worst
case performance?



Try the above first. You could also create a new index on symbol
   CREATE INDEX N_intra_symbol_idx ON N_intra(symbol);

Then the WHERE clause should use the symbol index, which means it can
know quickly that an entry doesn't exist. I'm not sure how many entries
you have per symbol, though, so this might cause problems in the ORDER
BY time portion.

I'm guessing what you really want is to just do the ORDER BY symbol, time.

John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] faster search

2005-06-10 Thread John A Meinel
Clark Slater wrote:
 Hi-
 
 Would someone please enlighten me as
 to why I'm not seeing a faster execution
 time on the simple scenario below?
 
 there are 412,485 rows in the table and the
 query matches on 132,528 rows, taking
 almost a minute to execute.  vacuum
 analyze was just run.

Well, if you are matching 130k out of 400k rows, then a sequential scan
is certainly prefered to an index scan. And then you have to sort those
130k rows by partnumber. This *might* be spilling to disk depending on
what your workmem/sortmem is set to.

I would also say that what you would really want is some way to get the
whole thing from an index. And I think the way to do that is:

CREATE INDEX test_partnum_listid_typeid_idx ON
test(partnumber, productlistid, typeid);

VACUUM ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test
WHERE productlistid=3 AND typeid=9
ORDER BY partnumber, productlistid, typeid
LIMIT 15
;

The trick is that you have to match the order by exactly with the index,
so the planner realizes it can do an indexed lookup to get the information.

You could also just create an index on partnumber, and see how that
affects your original query. I think the planner could use an index
lookup on partnumber to get the ordering correct. But it will have to do
filtering after the fact based on productlistid and typeid.
With my extended index, I think the planner can be smarter and lookup
all 3 by the index.

 
 Thanks!
 Clark

Good luck,
John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Postgresql on an AMD64 machine

2005-06-07 Thread John A Meinel
Neil Conway wrote:
 Tom Arthurs wrote:
 
 Yes, shared buffers in postgres are not used for caching
 
 
 Shared buffers in Postgres _are_ used for caching, they just form a
 secondary cache on top of the kernel's IO cache. Postgres does IO
 through the filesystem, which is then cached by the kernel. Increasing
 shared_buffers means that less memory is available for the kernel to
 cache IO -- increasing shared_buffers has been shown to be a net
 performance loss beyond a certain point. Still, there is value in
 shared_buffers as it means we can avoid a read() system call for hot
 pages. We can also do better buffer replacement in the PG shared buffer
 than the kernel can do (e.g. treating IO caused by VACUUM specially).
 

As I recall, one of the performance problems with a large shared_buffers
is that there are some commands which require looking at *all* of the
shared buffer space. So the larger it gets, the longer those functions take.

 My biggest challenge with solaris/sparc is trying to reduce context
 switching.
 
 
 It would be interesting to see if this is improved with current sources,
 as Tom's bufmgr rewrite should have hopefully have reduced this problem.
 

These might be what was fixed with Tom's rewrite. I don't really know.

John
=:-

 -Neil
 
 ---(end of broadcast)---
 TIP 4: Don't 'kill -9' the postmaster



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)

2005-06-06 Thread John A Meinel
Casey Allen Shobe wrote:
 On Wednesday 01 June 2005 20:19, Casey Allen Shobe wrote:
 
...
 Long-term, whenever we hit the I/O limit again, it looks like we really don't 
 have much of a solution except to throw more hardware (mainly lots of disks 
 in RAID0's) at the problem. :(  Fortunately, with the above two changes I/O 
 usage on the PG data disk is a quarter of what it was, so theoretically we 
 should be able to quadruple the number of users on current hardware.
 

Be very careful in this situation. If any disks in a RAID0 fails, the
entire raid is lost. You *really* want a RAID10. It takes more drives,
but then if anything dies you don't lose everything.

If you are running RAID0 and you *really* want performance, and aren't
concerned about safety (at all), you could also set fsync=false. That
should also speed things up. But you are really risking corruption/data
loss on your system.

 Our plan forward is to increase the number of disks in the two redundant mail 
 servers, so that each has a single ultra320 disk for O/S and pg_xlog, and a 
 3-disk RAID0 for the data.  This should triple our current capacity.

I don't know if you can do it, but it would be nice to see this be 1
RAID1 for OS, 1 RAID10 for pg_xlog, and another RAID10 for data. That is
the recommended performance layout. It takes quite a few drives (minimum
of 10). But it means your data is safe, and your performance should be
very good.

 
 The general opinion of the way dspam uses the database among people I've 
 talked to on #postgresql is not very good, but of course the dspam folk blame 
 PostgreSQL and say to use MySQL if you want reasonable performance.  Makes it 
 real fun to be a DSpam+PostgreSQL user when limits are reached, since 
 everyone denies responsibility.  Fortunately, PostgreSQL people are pretty 
 helpful even if they think the client software sucks. :)
 

I can't say how dspam uses the database. But they certainly could make
assumptions about how certain actions are done by the db, which are not
quite true with postgres. (For instance MySQL can use an index to return
information, because Postgres supports transactions, it cannot, because
even though a row is in the index, it may not be visible to the current
transaction.)

They also might be doing stuff like select max(row) instead of select
row ORDER BY row DESC LIMIT 1. In postgres the former will be a
sequential scan, the latter will be an index scan. Though I wonder about
select max(row) ORDER BY row DESC LIMIT 1. To me, that should still
return the right answer, but I'm not sure.
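
Something like this, with made-up names (the first form is a sequential
scan today, the second can use an index on token_id):

SELECT max(token_id) FROM tokens;
SELECT token_id FROM tokens ORDER BY token_id DESC LIMIT 1;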

 Cheers,

Good luck,
John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)

2005-06-06 Thread John A Meinel
Michael Stone wrote:
 On Mon, Jun 06, 2005 at 10:08:23AM -0500, John A Meinel wrote:
 
 I don't know if you can do it, but it would be nice to see this be 1
 RAID1 for OS, 1 RAID10 for pg_xlog, 
 
 
 That's probably overkill--it's a relatively small sequential-write
 partition with really small writes; I don't see how pg_xlog would
 benefit from raid10 as opposed to raid1.
 Mike Stone
 

pg_xlog benefits from being super fast. Because it has to be fully
synced before the rest of the data can be committed. Yes they are small,
but if you can make it fast, you eliminate that overhead. It also
benefits from having its own spindle, because you eliminate the seek
time. (Since it is always appending)

Anyway, my point is that pg_xlog isn't necessarily tiny. Many people
seem to set it as high as 100-200, and each one is 16MB.

But one other thing to consider is to make pg_xlog on a battery backed
ramdisk. Because it really *can* use the extra speed. I can't say that a
ramdisk is more cost effective than faster db disks. But if you aren't
using many checkpoint_segments, it seems like you could get a 1GB
ramdisk, and probably have a pretty good performance boost. (I have not
tested this personally, though).

Since he is using the default settings (mostly) for dspam, he could
probably get away with something like a 256MB ramdisk.

The only prices I could find with a few minutes of googling were:
http://www.cenatek.com/store/category.cfm?Category=15
Which is $1.6k for 2GB.

But there is also a product that is being developed, which claims $60
for the PCI card, you supply the memory. It has 4 DDR slots
http://www.engadget.com/entry/1234000227045399/
And you can get a 128MB SDRAM ECC module for around $22
http://www.newegg.com/Product/Product.asp?Item=N82E16820998004
So that would put the total cost of a 512MB battery backed ramdisk at
$60 + 4*22 = $150.

That certainly seems less than what you would pay for the same speed in
hard-drives.
Unfortunately the Giga-byte iRam seems to just be in the demo stage. But
if they aren't lying in the press releases, it would certainly be
something to keep an eye on.

John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Postgresql and Software RAID/LVM

2005-06-06 Thread John A Meinel
Marty Scholes wrote:
 Has anyone ran Postgres with software RAID or LVM on a production box?
 What have been your experience?
 
 Yes, we have run for a couple years Pg with software LVM (mirroring)
 against two hardware RAID5 arrays.  We host a production Sun box that
 runs 24/7.
 
 My experience:
 * Software RAID (other than mirroring) is a disaster waiting to happen.
  If the metadata for the RAID set gives out for any reason (CMOS
 scrambles, card dies, power spike, etc.) then you are hosed beyond
 belief.  In most cases it is almost impossible to recover.  With
 mirroring, however, you can always boot and operate on a single mirror,
 pretending that no LVM/RAID is underway.  In other words, each mirror is
 a fully functional copy of the data which will operate your server.

Isn't the meta-data giving out actually more of a problem in a hardware
RAID situation? I mean, if the card you are using dies, you can't just
drop in any other card; you generally need the same or a compatible
controller to read the array.
With software raid, because the meta-data is on the drives, you can pull
it out of that machine, and put it into any machine that has a
controller which can read the drives, and a similar kernel, and you are
back up and running.
 
 * Hardware RAID5 is a terrific way to boost performance via write
 caching and spreading I/O across multiple spindles.  Each of our
 external arrays operates 14 drives (12 data, 1 parity and 1 hot spare).
  While RAID5 protects against single spindle failure, it will not hedge
 against multiple failures in a short time period, SCSI contoller
 failure, SCSI cable problems or even wholesale failure of the RAID
 controller.  All of these things happen in a 24/7 operation.  Using
 software RAID1 against the hardware RAID5 arrays hedges against any
 single failure.

No, it hedges against *more* than one failure. But you can also do a
RAID1 over a RAID5 in software. But if you are honestly willing to
create a full RAID1, just create a RAID1 over RAID0. The performance is
much better. And since you have a full RAID1, as long as both drives of
a pairing don't give out, you can lose half of your drives.

If you want the space, but you feel that RAID5 isn't redundant enough,
go to RAID6, which uses 2 parity locations, each with a different method
of storing parity, so not only is it more redundant, you have a better
chance of finding problems.

 
 * Software mirroring gives you tremendous ability to change the system
 while it is running, by taking offline the mirror you wish to change and
 then synchronizing it after the change.


That certainly is a nice ability. But remember that LVM also has the
idea of snapshoting a running system. I don't know the exact details,
just that there is a way to have some processes see the filesystem as it
existed at an exact point in time. Which is also a great way to handle
backups.

 On a fully operational production server, we have:
 * restriped the RAID5 array
 * replaced all RAID5 media with higher capacity drives
 * upgraded RAID5 controller
 * moved all data from an old RAID5 array to a newer one
 * replaced host SCSI controller
 * uncabled and physically moved storage to a different part of data center
 
 Again, all of this has taken place (over the years) while our machine
 was fully operational.
 
So you are saying that you were able to replace the RAID controller
without turning off the machine? I realize there do exist
hot-swappable PCI cards, but I think you are overstating what you mean
by fully operational. For instance, it's not like you can access your
data while it is being physically moved.

I do think you had some nice hardware. But I know you can do all of this
in software as well. It is usually a price/performance tradeoff. You
spend quite a bit to get a hardware RAID card that can keep up with a
modern CPU. I know we have an FC raid box at work which has a full 512MB
of cache on it, but it wasn't that much cheaper than buying a dedicated
server.

John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] postgresql-8.0.1 performance tuning

2005-05-31 Thread John A Meinel

Martin Fandel wrote:


Hi @ all,

i'm trying to tune my postgresql-db but i don't know if the values are
right
set.

I use the following environment for the postgres-db:

# Hardware 
cpu: 2x P4 3Ghz
ram: 1024MB DDR 266Mhz

partitions:
/dev/sda3  23G  9,6G   13G  44% /
/dev/sda1  11G  156M  9,9G   2% /var
/dev/sdb1  69G   13G   57G  19% /var/lib/pgsql

/dev/sda is in raid 1  (2x 35GB / 1upm / sca)
/dev/sdb is in raid 10 (4x 35GB / 1upm / sca)
# /Hardware 


You probably want to put the pg_xlog directory onto /dev/sda rather than
having it in /dev/sdb. Having it separate from the data usually boosts
performance a lot. I believe you can just mv it to a different
directory, and then recreate it as a symlink. (Stop the database first :)



# Config 
/etc/sysctl.conf:
kernel.shmall = 786432000
kernel.shmmax = 786432000


Not really sure about these two.


/etc/fstab:
/dev/sdb1 /var/lib/pgsql reiserfs
acl,user_xattr,noatime,data=writeback 1 2


Seems decent.


/var/lib/pgsql/data/postgresql.conf
superuser_reserved_connections  = 2
shared_buffers  = 3000
work_mem= 131072
maintenance_work_mem= 131072


These both seem pretty large. But it depends on how many concurrent
connections doing sorting/hashing/etc you expect. If you are only
expecting 1 connection, these are probably fine. Otherwise with 1GB of
RAM I would probably make work_mem more like 4096/8192.
Remember, running out of work_mem means postgres will spill to disk,
slowing that query. Running out of RAM causes the system to swap, making
everything slow.


max_stack_depth = 2048
max_fsm_pages   = 2
max_fsm_relations   = 1000
max_files_per_process   = 1000
vacuum_cost_delay   = 10
vacuum_cost_page_hit= 1
vacuum_cost_page_miss   = 10
vacuum_cost_page_dirty  = 20
vacuum_cost_limit   = 200
bgwriter_delay  = 200
bgwriter_percent= 1
bgwriter_maxpages   = 100
fsync   = true
wal_sync_method = fsync
wal_buffers = 64
commit_delay= 0
commit_siblings = 5
checkpoint_segments = 256
checkpoint_timeout  = 900
checkpoint_warning  = 30
effective_cache_size= 1
random_page_cost= 4
cpu_tuple_cost  = 0.01
cpu_index_tuple_cost= 0.001
cpu_operator_cost   = 0.0025
geqo= true
geqo_threshold  = 12
geqo_effort = 5
geqo_pool_size  = 0
geqo_generations= 0
geqo_selection_bias = 2.0
deadlock_timeout= 1000
max_locks_per_transaction   = 64
# /Config 

# Transactions 
we have about 115-300 transactions/min in about 65 tables.
# /Transactions 

I'm really new at using postgres. So i need some experience to set this
parameters in the postgresql- and the system-config. I can't find standard
calculations for this. :/ The postgresql-documentation doesn't help me to
set the best values for this.

The database must be high-availble. I configured rsync to sync the
complete
/var/lib/pgsql-directory to my hot-standby. On the hotstandby i will
make the
dumps of the database to improve the performance of the master-db.


I don't think an rsync of a live data directory is completely safe.
You should probably look more into Slony.
http://slony.info

It is a single-master asynchronous replication system. I believe it is
pretty easy to setup, and does what you really want.


In my tests the synchronization works fine. I synchronised the hole
directory
and restarted the database of the hotstandby. While restarting,
postgresql turned
back the old (not archived) wals and the database of my hotstandby was
consistent. Is this solution recommended? Or must i use archived wal's
with
real system-snapshots?

best regards,

Martin Fandel


John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow queries, possibly disk io

2005-05-26 Thread John Arbash Meinel
Josh Close wrote:

I have some queries that have significan't slowed down in the last
couple days. It's gone from 10 seconds to over 2 mins.

The cpu has never gone over 35% in the servers lifetime, but the load
average is over 8.0 right now. I'm assuming this is probably due to
disk io.

I need some help setting up postgres so that it doesn't need to go to
disk. I think the shared_buffers and effective_cache_size values are
the one's I need to look at.

Would setting shmmax and shmall to 90% or so of available mem and
putting a lot for postgres be helpful?


Setting shared buffers above something like 10-30% of memory is
counterproductive.

Effective cache size says this:
Sets the planner's assumption about the effective size of the disk
cache (that is, the portion of the kernel's disk cache that will be
used for PostgreSQL data files).

Does that mean the total available ram? Or what's left over from 
shared_buffers?

I've tried different things and not much has been working. Is there a
good way to ensure that most of the tables accessed in postgres will
be cached in mem and not have to go to disk?

If I'm joining a lot of tables, should the sort_mem be set high also?
Do shared_buffers, effective_cache_size, and sort_mem all use
different mem? Or are they seperate?



Increasing sort_mem can help with various activities, but increasing it
too much can cause you to swap, which kills performance. The caution is
that you will likely use at least 1 sort_mem per connection, and can
likely use more than one if the query is complicated.
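
If only a few big queries need it, you can also raise it just for those
sessions rather than globally (names and values here are only an example;
from 8.0 on the setting is called work_mem):

SET sort_mem = 65536;          -- in KB
SELECT * FROM big_table ORDER BY some_column;
RESET sort_mem;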

effective_cache_size changes how Postgres plans queries, but given the
same query plan, it doesn't change performance at all.

I've looked for information and haven't found any useful pages about this.

Any help would be greatly appreciated.

Thanks.

-Josh



John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Inner join on two OR conditions dont use index

2005-05-25 Thread John A Meinel

Jocelyn Turcotte wrote:


Hi all
I don't know if this is normal, but if it is, I would like to know why and
how I could do it another way, other than using unions.




The only thing that *might* work is if you used an index on both keys.
So if you did:

CREATE INDEX rt_edge_start_end_node ON rt_edge(start_node_id,end_node_id);

The reason is that in an OR construct, you have to check both for being true. 
So in the general case where you don't know the correlation between the columns, you have 
to check all of the entries, because even if you know the status of one side of the OR, 
you don't know the other.

Another possibility I considered was an expression index on something like
(start_node_id OR end_node_id), but that isn't valid SQL for integer
columns, so the composite index above (or the rewrite below) is probably
the better bet. I'm not sure how smart the planner can be, though.
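
If the planner won't use an index for the OR itself, the fallback (which I
know you were hoping to avoid) is the UNION form, which can use one
single-column index per branch, assuming you have an index on each column
(123 is just a placeholder value):

SELECT * FROM rt_edge WHERE start_node_id = 123
UNION
SELECT * FROM rt_edge WHERE end_node_id = 123;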

John
=:-




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Select performance vs. mssql

2005-05-24 Thread John A Meinel

Michael Stone wrote:


On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote:


Pretty much.  There has been discussion about allowing index-only
access to frozen tables, i.e. archive partitions.  But it all sort
of hinges on someone implementing it and testing 



Is there any way to expose the planner estimate? For some purposes it's
enough to just give a rough ballpark (e.g., a google-esque results 1-10
of approximately 1000) so a user knows whether its worth even
starting to page through.

Mike Stone

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Well, you could always do:

EXPLAIN SELECT ...

And then parse out the rows= in the first line.
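
The first line looks something like this (made-up table and numbers):

 Seq Scan on archive  (cost=0.00..1540.00 rows=100000 width=32)

and rows=100000 is the planner's estimate, which is all you need for a
google-esque approximate count.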

John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Can anyone explain this: duplicate dbs.

2005-05-24 Thread John A Meinel

SpaceBallOne wrote:


Wondering if someone could explain a pecularity for me:

We have a database which takes 1000ms to perform a certain query on.

If I pg_dump that database then create a new database (e.g. tempdb)
and upload the dump file (thus making a duplicate) then the same query
only takes 190ms !!
Vacuum, vacuum analyse, and vacuum full analyse does not seem to have
an impact on these times.

Can anyone explain why this may be occurring and how I might be able
to keep the original database running at the same speed as tempdb?

Thanks in advance,

Dave.


What version of postgres?

There are a few possibilities. If you are having a lot of updates to the
table, you can get index bloat. And vacuum doesn't fix indexes. You have
to REINDEX to do that. Though REINDEX has the same lock that VACUUM
FULL has, so you need to be a little careful with it.

Probably better is to do CLUSTER, as it does a REINDEX and a sort, so
your table ends up nicer when you are done.

Also, older versions of postgres had a worse time with index bloat. One
thing that caused a lot of problems is a table that you insert into over
time, so that all the values are incrementing. If you are deleting older
entries, that area won't be re-used because they fall at the back end. I
believe newer versions have been fixed.

By the way, I think doing:

CREATE DATABASE tempdb WITH TEMPLATE = originaldb;

Is a much faster way of doing dump and load. I *think* it would recreate
indexes, etc. If it just does a copy it may not show the dump/restore
improvement.

John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Can anyone explain this: duplicate dbs.

2005-05-24 Thread John A Meinel

SpaceBallOne wrote:


What version of postgres?



8.0.2 ... but I think I've seen this before on 7.3 ...


There are a few possibilities. If you are having a lot of updates to the
table, you can get index bloat. And vacuum doesn't fix indexes. You have
to REINDEX to do that. Though REINDEX has the same lock that VACUUM
FULL has, so you need to be a little careful with it.




Probably better is to do CLUSTER, as it does a REINDEX and a sort, so
your table ends up nicer when you are done.



Thanks, will try those next time this problem crops up (i just deleted
/ recreated the database to speed things for its users in the office
... probably should have held off to see if I could find a solution
first!).

Yes, the database / table-in-question does have a lot of updates,
deletes, and new rows (relatively speaking for a small business).

Would CLUSTER / REINDEX still have an effect if our queries were done
via sequential scan? This is a old database (as in built by me when i
was just starting to learn unix / postgres) so the database design is
pretty horrible (little normalisation, no indexes).


Well, my first recommendation is to put in some indexes. :) They are
relatively easy to setup and can drastically improve select performance.

What version of postgres are you using?
What does it say at the end of VACUUM FULL ANALYZE VERBOSE? That
should tell you how many free pages were reclaimed and how big your free
space map should be.

If you only did 1 VACUUM FULL, you might try another, as it sounds like
your tables aren't properly filled. I'm pretty sure vacuum only removes
empty pages/marks locations for the free space map so they can be
re-used, while vacuum full will move entries around to create free pages.

It sounds like it didn't do it properly.

But even so, CLUSTER is still your friend, as it allows you to presort
the rows in your tables.
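
Something like this (names are made up; in 8.0 the syntax is CLUSTER
indexname ON tablename):

CREATE INDEX orders_customer_idx ON orders (customer_id);
CLUSTER orders_customer_idx ON orders;
ANALYZE orders;

After the first time, a plain "CLUSTER orders;" will re-sort the table on
the same index.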



Have taken Chris's advice onboard too and setup cron to do a vacuumdb
hourly instead of my weekly vacuum.

Cheers,

Dave.



John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] seqential vs random io

2005-05-23 Thread John A Meinel
David Parker wrote:
 I just got a question from one our QA guys who is configuring a RAID 10
 disk that is destined to hold a postgresql database. The disk
 configuration procedure is asking him if he wants to optimize for
 sequential or random access. My first thought is that random is what we
 would want, but then I started wondering if it's not that simple, and my
 knowledge of stuff at the hardware level is, well, limited.
  
 If it were your QA guy, what would you tell him?
 
 - DAP

Random. Sequential access is always pretty fast; it is random access that hurts.

The only time I would say sequential is if you were planning on
streaming large files (like iso images) with low load.

But for a DB, even a sequential scan will probably not be that much data.

At least, that's my 2c.

John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Optimize complex join to use where condition before

2005-05-13 Thread John A Meinel
Greg Stark wrote:
Sebastian Hennebrueder [EMAIL PROTECTED] writes:

User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
...
Nested Loop  (cost=1349.13..1435.29 rows=1 width=2541) (actual
time=1640.000..3687.000 rows=62 loops=1)
  Join Filter: (inner.fid = outer.faufgaben_id)
  -  Index Scan using idx_tauf_mit_mitid on taufgaben_mitarbeiter am
(cost=0.00..80.65 rows=35 width=4) (actual time=0.000..0.000 rows=765
loops=1)

Is it really Mozilla Thunderbird that's causing this new craptastic mangling
of plans in people's mails? I was assuming it was some new idea of how to mess
up people's mail coming out of Exchange or Lotus or some other such corporate
messaging software that only handled SMTP mail as an afterthought. This is,
uh, disappointing.
Are you talking about the quotes, or just the fact that it is wrapped?
I don't know where the quotes came from, but in Thunderbird if you are
writing in text mode (not html) it defaults to wrapping the text at
something like 78 characters. That includes copy/paste text.
If you want it to *not* wrap, it turns out that Paste as quotation
will not wrap, but then you have to remove the "> " from the beginning
of every line.
In html mode, it also defaults to wrapping, but if you switch to
PREFORMAT text first, it doesn't wrap.
At least, those are the tricks that I've found. Safest bet is to just
use an attachment, though.
John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Optimize complex join to use where condition before

2005-05-13 Thread John Arbash Meinel
Sebastian Hennebrueder wrote:

 I found a solution to improve my query. I do not know why but the
 statistics for all columns had been 0.
 I changed this to 10 for index columns, to 20 for all foreign key
 columns, and later to 100 for the foreign key columns.
 I set the random page cost to 2
 and now the query runs as expected.

 Many thanks to all of the posts in my and in other threads which
 helped a lot.

 Sebastian


I think 0 = use default. But still, changing to 20 and 100 probably
fixes your problems.

John
=:-



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread John A Meinel
Marc Mamin wrote:
Hello,
I'm not an expert, but I'll give some suggestions.
I'd like to tune Postgres for large data import (using Copy from).
I believe that COPY FROM file is supposed to be faster than COPY FROM
STDIN, but file must be available to the backend process. If you can
do it, you should think about it, as it eliminates the communication
between the client and the backend.
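
For example (made-up names; the path is as the *server* sees it, and the
file must be readable by the backend process):

COPY bulk_data FROM '/var/load/bulk_data.txt';

If you can't put the file on the server, psql's \copy does the same thing
from the client side, but it goes through COPY FROM STDIN:

\copy bulk_data from 'bulk_data.txt'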
here are a few steps already done:

1) use 3 different disks for:
-1: source data
-2: index tablespaces
-3: data tablespaces

Make sure pg_xlog is on it's own filesystem. It contains the
write-ahead-log, and putting it by itself keeps the number of seeks
down. If you are constrained, I think pg_xlog is more important than
moving the index tablespaces.

2) define all foreign keys as initially deferred
3) tune some parameters:

max_connections =20
shared_buffers =3
work_mem = 8192
maintenance_work_mem = 32768
checkpoint_segments = 12
(I also modified the kernel accordingly)
Don't forget to increase your free space map if you are going to be
doing deletes frequently.

4) runs VACUUM regulary
The server runs RedHat and has 1GB RAM
In the production (which may run on a better server), I plan to:
- import a few millions rows per day,
- keep up to ca 100 millions rows in the db
- delete older data

I've seen a few postings on hash/btree indexes, which say that hash indexes do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performances whole using
hash indexes as well ?
I doubt it.
How does Postgres handle concurrent copy from on: same table / different
tables ?
I think it is better with different tables. If using the same table, and
there are indexes, it has to grab a lock for updating the index, which
causes contention between 2 processes writing to the same table.
I'd be glad on any further suggestion on how to further increase my
performances.
Since you are deleting data often, and copying often, I might recommend
using a partition scheme with a view to bind everything together. That
way you can just drop the old table rather than doing a delete. I don't
know how this would affect foreign key references.
But basically you can create a new table, and do a copy without having
any indexes, then build the indexes, analyze, update the view.
And when deleting you can update the view, and drop the old table.
Something like this:
CREATE TABLE table_2005_05_11 (blah);
COPY table_2005_05_11 FROM ... ;
CREATE INDEX table_2005_05_11_idx ON table_2005_05_11(blah);
CREATE OR REPLACE VIEW all_data AS
SELECT * FROM table_2005_05_10
UNION ALL SELECT * FROM table_2005_05_11;
VACUUM ANALYZE table_2005_05_11;
...
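
And the drop side would look something like (again just a sketch, using
the same made-up names):

CREATE OR REPLACE VIEW all_data AS
SELECT * FROM table_2005_05_11
UNION ALL SELECT * FROM table_2005_05_12;

DROP TABLE table_2005_05_10;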
John
=:-

Marc





signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread John A Meinel
Alex Turner wrote:
Ok - my common sense alarm is going off here...
There are only 6.446 billion people worldwide.  100 Billion page views
would require every person in the world to view 18 pages of yahoo
every day.  Not very likely.
http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actualy on the internet.
That means each and every person on the internet has to view 100 pages
per day of yahoo.
pretty unlikely IMHO.  I for one don't even use Yahoo ;)
100 million page views per day suggests that 1 in 100 people on the
internet each viewed 10 pages of a site.  Thats a pretty high
percentage if you ask me.
In general I think your point is valid. Just remember that it probably
also matters how you count page views. Because technically images are a
separate page (and this thread did discuss serving up images). So if
there are 20 graphics on a specific page, that is 20 server hits just
for that one page.
I could easily see an image heavy site getting 100 hits / page. Which
starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G.
I still think 100G views on a single website is a lot, but 100M is
certainly possible.
John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Recommendations for set statistics

2005-05-12 Thread John A Meinel
Sebastian Hennebrueder wrote:
Hello,
I could not find any recommendations for the level of set statistics and
what a specific level does actually mean.
What is the difference between 1, 50 and 100? What is recommended for a
table or column?
Default I believe is 10. The higher the number, the more statistics are
kept, with a maximum of 1000.
The default is a little bit low for columns used in foreign keys, though
frequently it is okay.
When problems start, try setting them to 100 or 200. Higher is more
accurate, but takes longer to compute, *and* takes longer when planning
the optimal query method. It can be worth it, though.
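
For example (table and column names are made up):

ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
ANALYZE orders;

The new target only takes effect at the next ANALYZE (or VACUUM ANALYZE).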
John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join

2005-05-11 Thread John A Meinel
Edin Kadribasic wrote:
Hi,
I have a query that is giving the optimizer (and me) great headache. When
its in the good mood the optimizer chooses Hash Left Join and the query
executes in 13ms or so, but sometimes (more and more often) it chooses
Nested Loop Left Join and the execution time goes up to 2-30sec.
The query:
SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a LEFT OUTER JOIN (
SELECT ol.tid, ds.orid FROM axp_dayschedule ds JOIN axp_order_line ol ON
ol.olid = ds.olid JOIN axp_order o ON ds.orid = o.orid WHERE o.status = 100
AND ds.day between '2005-05-12' and '2005-05-12' AND ds.used = '1' ) b ON
(a.tid = b.tid) WHERE b.tid IS NULL AND a.sid = 16072;
Unfortunately, because Hash Join doesn't report the number of rows
(rows=0 always), it's hard to tell how good the estimator is. But I
*can* say that the NestLoop estimation is way off.
Good plan:
=
Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266
rows=1 loops=1)
   -  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual
time=34.237..34.237 rows=0 loops=1)
 Hash Cond: (outer.tid = inner.tid)
 Filter: (inner.tid IS NULL)
 -  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
time=0.277..0.512 rows=6 loops=1)
   Index Cond: (sid = 16072)
 -  Hash  (cost=9.07..9.07 rows=1 width=4) (actual
time=32.777..32.777 rows=0 loops=1)
   -  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.208..31.563 rows=284 loops=1)
 -  Nested Loop  (cost=0.00..6.05 rows=1 width=4)
(actual time=0.178..20.684 rows=552 loops=1)
   -  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.036..3.973 rows=610 loops=1)
 Index Cond: ((day = '2005-05-12'::date)
AND (day = '2005-05-12'::date))
 Filter: (used = B'1'::bit)
   -  Index Scan using axp_order_orid_key on
axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013
rows=1 loops=610)
 Index Cond: (outer.orid = o.orid)
 Filter: (status = 100)
 -  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=552)
   Index Cond: (ol.olid = outer.olid)
 Total runtime: 34.581 ms
Bad plan (same query different values):
===
 Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual
time=11969.281..11969.285 rows=1 loops=1)
   -  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual
time=25.730..11967.180 rows=338 loops=1)
See here, it thinks it will only have to do 1 nestloop, which would be
quite fast, but it has to do 338.
 Join Filter: (outer.tid = inner.tid)
 Filter: (inner.tid IS NULL)
 -  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.027..2.980 rows=471 loops=1)
   Index Cond: (sid = 16092)
 -  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.088..24.350 rows=285 loops=471)
Same thing here.
   -  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual
time=0.067..15.649 rows=317 loops=471)
And here.
 -  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.015..3.557 rows=606 loops=471)
This estimate is way off too, but it is off in both plans.
   Index Cond: ((day = '2005-05-13'::date) AND
(day = '2005-05-13'::date))
   Filter: (used = B'1'::bit)
 -  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=285426)
This is probably what is killing you. It is doing a single lookup 285k
times. The above plan only does it 552 times.
   Index Cond: (ol.olid = outer.olid)
   -  Index Scan using axp_order_orid_key on axp_order o
(cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1
loops=149307)
 Index Cond: (outer.orid = o.orid)
 Filter: (status = 100)
 Total runtime: 11969.443 ms
Please note that sometimes when I get bad plan in the logfile, I just
re-run the query and the optimizer chooses the more efficient one. Sometime
it does not.
Your work_mem is quite high relative to your total RAM; hopefully you
don't have many allowed concurrent connections. But that is a side point.
I assume the tables are freshly VACUUM ANALYZEd. Have you tried altering
the statistics for the columns? One to look at is
axp_dayschedule(day). That one seems to be consistently incorrect.
Perhaps because a between with the same 

Re: [PERFORM] full outer performance problem

2005-05-10 Thread John A Meinel
Kim Bisgaard wrote:
Hi,
I'm having problems with the query optimizer and FULL OUTER JOIN on
PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
I might be naive, but I think that it should be possible?
I have two BIG tables (virtually identical) with 3 NOT NULL columns
Station_id, TimeObs, Temp_, with indexes on (Station_id, TimeObs)
and valid ANALYSE (set statistics=100). I want to join the two tables
with a FULL OUTER JOIN.
When I specify the query as:
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get the correct results, BUT LOUSY performance, and the following explain:
 Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual 
time=66146.815..119005.381 rows=1 loops=1)
   Filter: (COALESCE(outer.timeobs, inner.timeobs) = '2004-01-01 
00:00:00'::timestamp without time zone)
   -  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual 
time=8644.449..110836.038 rows=109826 loops=1)
Well, the estimate here is quite a bit off. It thinks you will be
getting 1349 (which is probably why it picked a nested loop plan), but
then it is getting 109826 rows.
I'm guessing it is misunderstanding the selectivity of the timeobs column.
 Hash Cond: (outer.station_id = inner.station_id)
 -  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 
width=16) (actual time=0.024..104548.515 rows=6956994 loops=1)
 -  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 
rows=0 loops=1)
   -  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 
width=4) (actual time=0.105..0.108 rows=1 loops=1)
 Index Cond: ((wmo_id = 6065) AND ('2004-01-01 
00:00:00'::timestamp without time zone = startdate) AND ('2004-01-01 
00:00:00'::timestamp without time zone = enddate))
   -  Index Scan using temp_max_60min_idx on temp_max_60min b  
(cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826)
 Index Cond: ((outer.station_id = b.station_id) AND (outer.timeobs 
= b.timeobs))
 Total runtime: 119005.499 ms
(11 rows)
I think the bigger problem is that a full outer join says grab all rows
from both tables, even where there is no match on the other side.
What about this query:
SELECT temp_max_60min,temp_dry_at_2m
  FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
   LEFT JOIN temp_max_60min b USING (station_id, timeobs))
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
After that, you should probably have a multi-column index on
(station_id, timeobs), which lets postgres use just that index for the
lookup, rather than using an index and then a filter. (Looking at your
next query you might already have that index).
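
If you don't have it yet, it would just be (the index name is made up):

CREATE INDEX temp_dry_at_2m_sid_tobs_idx
    ON temp_dry_at_2m (station_id, timeobs);

(and the same on temp_max_60min), though from the plans it looks like
temp_dry_at_2m_idx and temp_max_60min_idx may already be exactly that.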
If I change the query to (and thus negates the full outer join):
This is the same query, I think you messed up your copy and paste.
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065
and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get wrong results (In the case where one of the records is missing in
one of the tables), BUT GOOD performance, and this query plan:
 Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 
rows=1 loops=1)
   -  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual 
time=65.517..65.526 rows=1 loops=1)
 -  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 
width=4) (actual time=0.022..0.026 rows=1 loops=1)
   Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp 
without time zone = startdate) AND ('2004-01-01 00:00:00'::timestamp without time 
zone = enddate))
 -  Index Scan using temp_max_60min_idx on temp_max_60min b  
(cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1)
   Index Cond: ((outer.station_id = b.station_id) AND (b.timeobs 
= '2004-01-01 00:00:00'::timestamp without time zone))
   -  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  
(cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1)
 Index Cond: ((outer.station_id = a.station_id) AND (a.timeobs = 
'2004-01-01 00:00:00'::timestamp without time zone))
 Total runtime: 79.340 ms
(9 rows)
If further info like EXPLAIN VERBOSE is useful please say so and I will
provide it.
Thanks in advance!
Kim Bisgaard.
I still feel like you will have a problem with an outer join in this
circumstance, because it will have to scan all of both tables.
I think what you are wanting is give me everything where station_id =
X, and there is a row in either a or b.
I think my LEFT JOIN example does that, but I also think there would be
a 
