Re: [PERFORM] Tuning PostgreSQL

2003-07-22 Thread Vincent van Leeuwen
On 2003-07-22 09:04:42 +0200, Alexander Priem wrote:
 Hi all,
 
 Vincent, you said that using RAID1, you don't have real redundancy. But
 RAID1 is mirroring, right? So if one of the two disks should fail, there
 should be no data lost, right?
 

Right. But the proposal was a single disk for WAL, without redundancy, and I
argued that wasn't really safe. RAID1 by itself is extremely safe, possibly
even the safest RAID type there is.

 I have been thinking some more. 18Gb drives are cheaper than 36 or 72Gb
 drives. I don't know if I can get the money for this, but how would the
 following setup sound?
 
 Two 18Gb (15.000rpm) disks in RAID1 array for Operating System + WAL.
 Four 18Gb (15.000rpm) disks in RAID5 array for data.
 

Our own testing has shown that a 6-disk RAID-10 array is faster than what you
describe. Of course, this is very much dependent on how many INSERTs/UPDATEs
you generate (which tax your WAL more), so your mileage may vary.

 For the same amount of money, I could also get:
 
 Two 36Gb (10.000rpm) disks in RAID1 array for Operating System + WAL.
 Five/Six 36Gb (10.000rpm) disks in RAID5 array for data.
 

It is said that a higher RPM is particularly useful for a WAL disk. So you
might consider using two 18GB 15K rpm drives for a RAID-1 WAL disk (+OS and
swap), and using 36GB 10K rpm disks in a RAID-5 array if you need that
diskspace.

 Which would be the best of the above? The one with four 15k-rpm disks or the
 one with five/six 10k-rpm disks?
 Would these configs be better than all disks in one huge RAID5 array? There
 are so many possible configs with RAID...
 

15K rpm disks are significantly faster than 10K rpm disks. If your only
concern is performance, buy 15K rpm disks. If you want more diskspace for your
money, fall back to larger 10K rpm disks.

I personally think separate WAL disks are vastly overrated, since they haven't
shown a big performance gain in our own tests. But as I have said, this is
extremely dependent on the type of load you generate, so only your own tests
can tell you what you should do in this respect.

About RAID types: the fastest RAID type by far is RAID-10. However, this will
cost you a lot of useable diskspace, so it isn't for everyone. You need at
least 4 disks for a RAID-10 array. RAID-5 is a nice compromise if you want as
much useable diskspace as possible and still want to be redundant. RAID-1 is
very useful for small (2-disk) arrays.

If you have the time and are settled on buying 6 disks, I'd test the following
scenarios:
- 6-disk RAID-10 array (should perform best)
- 4-disk RAID-10 array containing data, 2-disk RAID-1 array for WAL, OS, etc
- 4-disk RAID-5 array containing data, 2-disk RAID-1 array for WAL, OS, etc
- 6-disk RAID-5 array (will probably perform worst)
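
A quick, repeatable way to compare those layouts is pgbench from contrib/.
A rough sketch (the scale factor and client count are only illustrative;
tune them to approximate your own INSERT/UPDATE mix):

    createdb bench
    pgbench -i -s 50 bench        # initialize a ~5M row accounts table
    pgbench -c 16 -t 500 bench    # 16 concurrent clients, 500 transactions each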


Hope this helps.

Vincent van Leeuwen
Media Design - http://www.mediadesign.nl/



Re: [PERFORM] Tuning PostgreSQL

2003-07-22 Thread Ron Johnson
On Mon, 2003-07-21 at 04:33, Shridhar Daithankar wrote:
 Hi Alexander ,
 
 On 21 Jul 2003 at 11:23, Alexander Priem wrote:
[snip]
  I use ext3 filesystem, which probably is not the best performer, is it?
 
 No. You also need to check ext2, reiser and XFS. There is no agreement between 
 users as in what works best. You need to benchmark and decide.

According to Jeremy Allison of Samba: "They used ext3, which is one
of the slowest filesystems on Linux," Allison said. "In a real
comparative test, you would use XFS."
http://www.linuxworld.com/story/32673.htm

-- 
Ron Johnson, Jr.     Home: [EMAIL PROTECTED]
Jefferson, LA  USA

"I'm not a vegetarian because I love animals, I'm a vegetarian
 because I hate vegetables!"
    -- unknown





Re: [PERFORM] Tuning PostgreSQL

2003-07-22 Thread Alexander Priem
Wow, I never imagined how many different RAID configurations one could think
of   :)

After reading lots of material, forums and of course this mailing list, I
think I am going for a RAID5 configuration of 6 disks (18Gb, 15,000 rpm
each); one of those six disks will be a 'hot spare'. I will just put the OS,
the WAL and the data on one volume. RAID10 is way too expensive   :)

If I understand correctly, this will give great read-performance, but less
write-performance. But since this server will be equipped with an embedded
RAID controller featuring 128Mb of battery-backed cache, I figure that this
controller will negate that (at least somewhat). I will need to find out
whether this cache can be configured so that it will ONLY cache WRITES, not
READS.

Also because of this battery-backed cache controller, I will go for the ext2
file system, mounted with 'noatime'. I will use a UPS, so I don't think I
need the journaling of ext3. XFS is not natively supported by RedHat and I
will go for the easy way here   :)
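
For reference, noatime is just a mount option; a hypothetical /etc/fstab
line (device and mount point are placeholders) would look like:

    /dev/sda3   /var/lib/pgsql   ext2   defaults,noatime   1 2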

1 Gb of RAM should be enough, I think. That is about the only point that
almost everyone agrees on   :)   Do you think ECC is very important? The
server I have in mind does not support it. Another one does, but it is about
1,000 euros more expensive   :(

One CPU should also be enough.

As for postgresql.conf settings, I think I will start with the following:

max_connections = 128
superuser_reserved_connections = 1
shared_buffers = 8192
max_fsm_relations = 1000
max_fsm_pages = 10
wal_buffers = 32
sort_mem = 2048
vacuum_mem = 32768
effective_cache_size = 28672 (this one I'm not sure about, maybe this one
needs to be higher)
random_page_cost = 2
geqo_threshold = 20
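
(For reference: both shared_buffers and effective_cache_size are counted in
8 KB pages, so the settings above work out to:

    shared_buffers       =  8192 pages * 8 KB =  64 MB
    effective_cache_size = 28672 pages * 8 KB = 224 MB

If most of the 1 Gb of RAM ends up as OS disk cache, effective_cache_size
could indeed go quite a bit higher.)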

This pretty much sums it up. What do you think about this config? It may not
be the fastest, but a server like this will cost about 4750 euros, and that
includes an Intel Xeon 2.4GHz CPU and a redundant power supply, WITHOUT the
UPS. Seems very reasonable to me...

Kind regards,
Alexander Priem.




Re: [PERFORM] Tuning PostgreSQL

2003-07-22 Thread Vivek Khera
 AP == Alexander Priem [EMAIL PROTECTED] writes:

AP Hmmm. I keep changing my mind about this. My Db would be mostly
AP 'selecting', but there would also be pretty much inserting and
AP updating done. But most of the work would be selects. So would
AP this config be OK?

I'm about to order a new server.  I haven't decided exactly how many
disks I will get, but my plan is to get an 8-disk RAID10 with 15k RPM
drives.  I don't need the volume, just the speed and number of
spindles, so I'm buying the smallest drives that meet my speed needs,
probably 18Gb each (sheesh! I remember getting my first 5Mb disk for
my 8088 PC in college and thinking that was too much space).

My mix is nearly even read/write, but probably a little biased towards
the reading.

This machine is replacing a 5-disk box that was switched from RAID5 to
4-disk RAID10 for data plus one system disk in January (what a pain
that was to re-index, but that's another story).  The switch from
RAID5 to RAID10 made an enormous improvement in performance.  The
speedup wasn't from recreating the database:  It was restored from a
file-level backup so the actual files were not compacted or secretly
improved in any way, other than my occasional reindexing.

So I think your 6-disk RAID10 will be good.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: [EMAIL PROTECTED]   Rockville, MD   +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/



Re: [PERFORM] Dual Xeon + HW RAID question

2003-07-22 Thread Bruce Momjian
Mindaugas Riauba wrote:
    I missed your orig. post, but AFAIK multiprocessing kernels will handle
    HT CPUs as 2 CPUs each. Thus, our dual Xeon 2.4 is recognized as 4
    Xeon 2.4 CPUs.

    This way, I don't think HT would improve any single query (afaik no
    postgres process uses more than one cpu), but overall multi-query
    performance has to improve.

   When you use hyperthreading, each virtual cpu runs at 70% of a full CPU,
   so hyperthreading could be slower than non-hyperthreading.  On a fully
   loaded dual cpu system, you are looking at 2.8 cpu's (0.70 * 4), while
   if it isn't loaded, you are looking at slowing down if you are only
   using 1 or 2 cpu's.

  Virtual cpus are not running at 70% of real cpus :). A slowdown will
  happen if the scheduler runs 2 processes on the same real cpu. And I read
  that there are patches for the Linux kernel to fix that. Sooner rather
  than later they will appear in Linus' kernel.

Right, I simplified it.  The big deal is whether the OS favors the
second real CPU over one of the virtual CPUs on the same die --- by
default, it doesn't.  Even if it did work perfectly, you are talking
about going from 1 to 1.4 or 2 to 2.8, which doesn't seem like much.

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Tuning PostgreSQL

2003-07-22 Thread Bruno Wolff III
On Tue, Jul 22, 2003 at 11:40:35 +0200,
  Vincent van Leeuwen [EMAIL PROTECTED] wrote:
 
 About RAID types: the fastest RAID type by far is RAID-10. However, this will
 cost you a lot of useable diskspace, so it isn't for everyone. You need at
 least 4 disks for a RAID-10 array. RAID-5 is a nice compromise if you want as
 much useable diskspace as possible and still want to be redundant. RAID-1 is
 very useful for small (2-disk) arrays.

Note that while raid 10 requires 4 disks, you get the space of 2 disks.
This is the same ratio as for raid 1.



Re: [PERFORM] Dual Xeon + HW RAID question

2003-07-22 Thread SZUCS Gábor
"By default" -- do you mean there is a way to tell Linux to favor the second
real cpu over the HT one? How?

G.
--- cut here ---
- Original Message - 
From: Bruce Momjian [EMAIL PROTECTED]
Sent: Tuesday, July 22, 2003 6:26 PM
Subject: Re: [PERFORM] Dual Xeon + HW RAID question


 Right, I simplified it.  The big deal is whether the OS favors the
 second real CPU over one of the virtual CPUs on the same die --- by
 default, it doesn't.  Even if it did work perfectly, you are talking
 about going from 1 to 1.4 or 2 to 2.8, which doesn't seem like much.




Re: [PERFORM] Dual Xeon + HW RAID question

2003-07-22 Thread Bruce Momjian
SZUCS Gábor wrote:
 "By default" -- do you mean there is a way to tell Linux to favor the second
 real cpu over the HT one? How?

Right now there is no way the kernel can tell which virtual cpus are on
which physical cpu, and that is the problem.  Once there is a way,
hyperthreading will be more useful, but even then, it doesn't double
your CPU throughput, it just increases it by 40%.


  Right, I simplified it.  The big deal is whether the OS favors the
  second real CPU over one of the virtual CPUs on the same die --- by
  default, it doesn't.  Even if it did work perfectly, you are talking
  about going from 1 to 1.4 or 2 to 2.8, which doesn't seem like much.
 
 

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Wrong plan or what ?

2003-07-22 Thread Josh Berkus
Gaetano,

                                  QUERY PLAN
  Hash Join  (cost=265.64..32000.76 rows=40612 width=263)
             (actual time=11074.21..11134.28 rows=10 loops=1)
    Hash Cond: ("outer".id_user = "inner".id_user)
    ->  Seq Scan on user_logs ul  (cost=0.00..24932.65 rows=1258965 width=48)
        (actual time=0.02..8530.21 rows=1258966 loops=1)

OK, here's your problem

The planner thinks that you're going to get 40612 rows out of the final join, 
not 10.   If the row estimate were correct, then the Seq Scan would be a 
reasonable plan.   But it's not.   Here are some steps you can take to clear 
things up for the planner:

1) Make sure you've VACUUM ANALYZED
2) Adjust the following postgresql.conf statistics:
a) effective_cache_size: increase to 70% of available (not used by other 
processes) RAM.
b) random_page_cost: decrease, maybe to 2.
c) default_statistics_target: try increasing to 100
(warning: this will significantly increase the time required to do 
ANALYZE)

Then test again!
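
If you'd rather not raise the statistics target globally, a per-column
sketch (using the id_user column from your plan above) gets the same effect
with less ANALYZE overhead:

    ALTER TABLE user_logs ALTER COLUMN id_user SET STATISTICS 100;
    ANALYZE user_logs;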


-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Tuning FreeBSD and PostgreSQL

2003-07-22 Thread Vivek Khera
 BM == Bruce Momjian [EMAIL PROTECTED] writes:

BM I know Linux has pageable shared memory, and you can resize the maximum
BM in a running kernel, so it seems they must have abandoned the linkage
BM between shared page tables and the kernel.  This looks interesting:

Thanks for the info.  You can resize it in FreeBSD as well, using the
sysctl command to set the various kern.ipc.shm* values.
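
For example (values are only illustrative; shmmax is in bytes, shmall in
pages):

    sysctl -w kern.ipc.shmmax=134217728   # largest single segment: 128 MB
    sysctl -w kern.ipc.shmall=32768       # total shared memory, in pages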

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: [EMAIL PROTECTED]   Rockville, MD   +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/



Re: [PERFORM] Dual Xeon + HW RAID question

2003-07-22 Thread Bruce Momjian
Jord Tanner wrote:
 On Tue, 2003-07-22 at 10:39, Bruce Momjian wrote:
  But CPU affinity isn't related to hyperthreading, as far as I know. 
  CPU affinity tries to keep processes on the same cpu in case there is
  still valuable info in the cpu cache.
  
 
 It is true that CPU affinity is designed to prevent the dump of valuable
 CPU cache. My thought is that if you are trying to prevent CPU
 contention, you could use CPU affinity to prevent 2 postmaster processes
 from running simultaneously on the same die. Am I out to lunch here?
 I've not worked with CPU affinity before, so I'm not familiar with the
 intimate details.

I guess you could, but it is the backends that use the CPU.  I don't
think manually specifying affinity will work for most applications.

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Dual Xeon + HW RAID question

2003-07-22 Thread Jord Tanner
On Tue, 2003-07-22 at 11:50, Bruce Momjian wrote:
 Jord Tanner wrote:
  On Tue, 2003-07-22 at 10:39, Bruce Momjian wrote:
   But CPU affinity isn't related to hyperthreading, as far as I know. 
   CPU affinity tries to keep processes on the same cpu in case there is
   still valuable info in the cpu cache.
   
  
  It is true that CPU affinity is designed to prevent the dump of valuable
  CPU cache. My thought is that if you are trying to prevent CPU
  contention, you could use CPU affinity to prevent 2 postmaster processes
  from running simultaneously on the same die. Am I out to lunch here?
  I've not worked with CPU affinity before, so I'm not familiar with the
  intimate details.
 
 I guess you could, but it is the backends that use the CPU.  I don't
 think manually specifying affinity will work for most applications.

This is beating a dead horse, but I'll take one more kick at it.

CPU affinity is defined by a bit mask, so multiple processors can be
selected. It is also inherited by child processes, so assigning CPU 0
and CPU 2 (which I assume would be on different dies in a dual processor
hyper-threading system) to the parent postmaster should prevent CPU
contention with respect to the postgres backend. 

I would be very interested to see if any advantage could be gained by a
combination of multiple HT processors and cpu affinity over multiple
non-HT processors. Yet Another Performance Testing To Do (YAPTTD)!
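
For anyone who wants to try, a rough sketch with the schedutils taskset tool
on a 2.5/2.6 kernel (mask 0x5 = CPU 0 + CPU 2; the pidfile path is just an
example):

    # bind the postmaster to CPUs 0 and 2; backends inherit the mask on fork
    taskset -p 0x5 `head -1 /var/lib/pgsql/data/postmaster.pid`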

-- 
Jord Tanner [EMAIL PROTECTED]




Re: [PERFORM] Wrong plan or what ?

2003-07-22 Thread Josh Berkus
Gaetano,

 SELECT * from user_logs where id_user in (
  10943,   10942,   10934,   10927,   10910,  10909
 );
 [SNIPPED]

 Why does the planner (or the executor, I don't know which) not follow
 the same strategy?

It is, actually, according to the query plan.   

Can you post the EXPLAIN ANALYZE for the above query?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Wrong plan or what ?

2003-07-22 Thread Mendola Gaetano
Josh Berkus [EMAIL PROTECTED]
 Gaetano,

  SELECT * from user_logs where id_user in (
   10943,   10942,   10934,   10927,   10910,  10909
  );
  [SNIPPED]

  Why does the planner (or the executor, I don't know which) not follow
  the same strategy?

 It is, actually, according to the query plan.

 Can you post the EXPLAIN ANALYZE for the above query?

Index Scan using idx_user_user_logs, idx_user_user_logs, idx_user_user_logs,
idx_user_user_logs, idx_user_user_logs, idx_user_user_logs on user_logs
(cost=0.00..5454.21 rows=2498 width=48) (actual time=0.09..0.28 rows=10
loops=1)
   Index Cond: ((id_user = 10943) OR (id_user = 10942) OR (id_user = 10934)
OR (id_user = 10927) OR (id_user = 10910) OR (id_user = 10909))
 Total runtime: 0.41 msec
(3 rows)


Thank you
Gaetano


PS: if I execute the query I obtain 10 rows instead of the 3 that EXPLAIN
ANALYZE says.






Re: [PERFORM] Wrong plan or what ?

2003-07-22 Thread Mendola Gaetano
Josh Berkus [EMAIL PROTECTED]
 Gaetano,

                                   QUERY PLAN
   Hash Join  (cost=265.64..32000.76 rows=40612 width=263)
              (actual time=11074.21..11134.28 rows=10 loops=1)
     Hash Cond: ("outer".id_user = "inner".id_user)
     ->  Seq Scan on user_logs ul  (cost=0.00..24932.65 rows=1258965 width=48)
         (actual time=0.02..8530.21 rows=1258966 loops=1)

 OK, here's your problem

 The planner thinks that you're going to get 40612 rows out of the final
 join, not 10.  If the row estimate were correct, then the Seq Scan would
 be a reasonable plan.  But it's not.  Here are some steps you can take to
 clear things up for the planner:

 1) Make sure you've VACUUM ANALYZED
 2) Adjust the following postgresql.conf statistics:
 a) effective_cache_size: increase to 70% of available (not used by other
    processes) RAM.
 b) random_page_cost: decrease, maybe to 2.
 c) default_statistics_target: try increasing to 100
    (warning: this will significantly increase the time required to do
    ANALYZE)

 Then test again!

No improvement at all. I pushed default_statistics_target to 1000,
but the rows expected are still 40612 :-(
Of course I restarted the postmaster and ran VACUUM ANALYZE on the DB.


Thank you
Gaetano


Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Rod Taylor
On Tue, 2003-07-22 at 20:34, Castle, Lindsay wrote:
 Hi all,
 
 I'm working on a project that has a data set of approximately 6 million rows
 with about 12,000 different elements; each element has 7 columns of data.

Are these 7 columns the same for each element?




Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Rod Taylor
Ok.. Unless I'm missing something, the data will be static (or near
static).  It also sounds as if the structure is common for elements, so
you probably only want 2 tables.

One with 6 million rows and any row information.  The other with 6
million * 12000 rows with the element data linking to the row
information line with an identifier, and have an 'element type' (I
assume there are 12000 types of elements -- or something of that
nature).

Unique constraint on (row_identifier, element_type)

The speed you achieve will be based on what indexes you create.

If you spend most of your time with one or a few (5% or less of the
structure) element types, create a partial index for those element types
only, and a partial index for all of the others.

If you have a standard mathematical operation on num1, num2, etc. you
may want to make use of functional indexes to index the result of the
calculation.

Be sure to create the tables WITHOUT OIDS, be prepared for the data load
to take a while, and CLUSTER the table based on your most commonly used
index (once they've been set up).
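
A sketch of that layout (all names and types are hypothetical; note that 7.3
only indexes function calls, so the index-on-a-calculation idea needs a
wrapper function there, while 7.4 allows arbitrary expression indexes):

    CREATE TABLE element_data (
        row_id       integer NOT NULL,
        element_type integer NOT NULL,
        num1         numeric,
        num2         numeric,
        UNIQUE (row_id, element_type)
    ) WITHOUT OIDS;

    -- partial index covering only the handful of hot element types
    CREATE INDEX element_data_hot ON element_data (element_type)
        WHERE element_type IN (1, 2, 3);

    -- after loading, cluster on the most commonly used index (7.3 syntax)
    CLUSTER element_data_hot ON element_data;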

To help with speed, we would need to see EXPLAIN ANALYZE results and the
query being performed.

On Tue, 2003-07-22 at 21:00, Castle, Lindsay wrote:
 All rows have the same structure; the data itself will be different for each
 row. The structure is something like this:
 
   element
   date
   num1
   num2
   num3
   num4
   units
 
 Thanks,
 
 
 Lindsay Castle
 EDS Australia
 Midrange & Distributed Tools
 Infrastructure Tools AP
 Ph:   +61 (0)8 8464 7101
 Fax:  +61 (0)8 8464 2135
 
 
 -Original Message-
 From: Rod Taylor [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, 23 July 2003 10:24 AM
 To: Castle, Lindsay
 Cc: Postgresql Performance
 Subject: Re: One table or many tables for data set
 
 
 On Tue, 2003-07-22 at 20:34, Castle, Lindsay wrote:
  Hi all,
  
  I'm working on a project that has a data set of approximately 6 million
  rows with about 12,000 different elements; each element has 7 columns of
  data.
 
 Are these 7 columns the same for each element?
 




Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Castle, Lindsay
Apologies, let me clear this up a bit (hopefully) :-)

The data structure looks like this:
element
date
num1
num2
num3
num4
units

There are approx 12,000 distinct elements for a total of about 6 million
rows of data.

The scanning technology I want to use may need a different number of rows
and different columns depending on the scan formula;
eg scan1 may need num1, num2 and num3 from the last 200 rows for
element x
   scan2 may need num1, units from the last 10 rows for element y

I can either do the scans and calculate what I need within SQL, or drag the
data out and process it outside of SQL; my preference is to go inside SQL as
I've assumed that would be faster and less development work.

If I went with the many tables design I would not expect to need to join
between tables; there is no relationship between the different elements that
I need to cater for.

Cheers,

Linz


Castle, Lindsay wrote and snipped:
 I'm working on a project that has a data set of approximately 6 million
 rows with about 12,000 different elements; each element has 7 columns of
 data.
 
 I'm wondering what would be faster from a scanning perspective (SELECT
 statements with some calculations) for this type of setup;
   one table for all the data
   one table for each data element (12,000 tables)
   one table per subset of elements (eg all elements that start with
   "a" in a table)
 

I, for one, am having difficulty understanding exactly what your data 
looks like, so it's hard to give advice. Maybe some concrete examples of 
what you are calling rows, elements, and columns would help.

Does each of 6 million rows have 12000 elements, each with 7 columns? Or 
do you mean that out of 6 million rows, there are 12000 distinct kinds 
of elements?

 Can I do anything with Indexing to help with performance?  I suspect for
 the majority of scans I will need to evaluate an outcome based on 4 or 5
 of the 7 columns of data.
 
 

Again, this isn't clear to me -- but maybe I'm just being dense ;-)
Does this mean you expect 4 or 5 items in your WHERE clause?



Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Castle, Lindsay
Thanks Rod 

My explanations will be better next time. :-)


-Original Message-
From: Rod Taylor [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 23 July 2003 11:41 AM
To: Castle, Lindsay
Cc: Postgresql Performance
Subject: Re: One table or many tables for data set


On Tue, 2003-07-22 at 21:50, Rod Taylor wrote:
 Ok.. Unless I'm missing something, the data will be static (or near
 static).  It also sounds as if the structure is common for elements, so
 you probably only want 2 tables.

I misunderstood. Do what Joe suggested.



Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Rod Taylor
On Tue, 2003-07-22 at 21:50, Rod Taylor wrote:
 Ok.. Unless I'm missing something, the data will be static (or near
 static).  It also sounds as if the structure is common for elements, so
 you probably only want 2 tables.

I misunderstood. Do what Joe suggested.




[PERFORM] One table or many tables for data set

2003-07-22 Thread Castle, Lindsay
Hi all,

I'm working on a project that has a data set of approximately 6 million rows
with about 12,000 different elements; each element has 7 columns of data.

I'm wondering what would be faster from a scanning perspective (SELECT
statements with some calculations) for this type of set up;
one table for all the data
one table for each data element (12,000 tables)
one table per subset of elements (eg all elements that start with
"a" in a table)

The data is static once it's in the database; only new records are added on a
regular basis.

I'd like to run quite a few different formulated scans in the longer term so
having efficient scans is a high priority.

Can I do anything with Indexing to help with performance?  I suspect for the
majority of scans I will need to evaluate an outcome based on 4 or 5 of the
7 columns of data.

Thanks in advance :-)

Linz



Re: [PERFORM] One table or many tables for data set

2003-07-22 Thread Joe Conway
Castle, Lindsay wrote:
 The data structure looks like this:
 element
 date
 num1
 num2
 num3
 num4
 units

 There are approx 12,000 distinct elements for a total of about 6 million
 rows of data.

Ahh, that helps! So are the elements evenly distributed, i.e. are there 
approx 500 rows of each element? If so, it should be plenty quick to put 
all the data in one table with an index on element (and maybe a 
multicolumn key, depending on other factors).

 The scanning technology I want to use may need a different number of rows
 and different columns depending on the scan formula;
 eg scan1 may need num1, num2 and num3 from the last 200 rows for
 element x
    scan2 may need num1, units from the last 10 rows for element y

When you say "last X rows", do you mean sorted by date? If so, you 
might want that index to be on (element, date). Then do:

SELECT num1, num2, num3 FROM mytable
WHERE element = 'an_element' ORDER BY date DESC LIMIT 20;

Replace num1, num2, num3 with whatever columns you want, and set LIMIT to
the number of rows you want.
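
The matching index (using the hypothetical table name from the query above)
would be something like:

    CREATE INDEX mytable_element_date ON mytable (element, date);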

HTH,

Joe
