Re: [PERFORM] SELECT LIMIT 1 VIEW Performance Issue

2005-09-26 Thread K C Lau

At 20:17 05/09/23, K C Lau wrote:

At 19:15 05/09/23, Simon Riggs wrote:

select distinct on (PlayerID) PlayerID,AtDate from Player a
where PlayerID='0' order by PlayerId, AtDate Desc;

Does that work for you?

Best Regards, Simon Riggs


esdt=> explain analyze select distinct on (PlayerID) PlayerID,AtDate from
Player a where PlayerID='0' order by PlayerId Desc, AtDate Desc;
 Unique  (cost=0.00..1327.44 rows=2 width=23) (actual time=0.027..8.438
rows=1 loops=1)
   ->  Index Scan Backward using pk_player on player
a  (cost=0.00..1323.05 rows=1756 width=23) (actual time=0.022..4.950
rows=1743 loops=1)
 Index Cond: ((playerid)::text = '0'::text)
 Total runtime: 8.499 ms

That is the fastest of all queries looping the 1743 rows.
I do get the desired result by adding LIMIT 1:

esdt=> explain analyze select distinct on (PlayerID) PlayerID,AtDate from
Player a where PlayerID='0' order by PlayerId Desc, AtDate Desc LIMIT 1;

 Limit  (cost=0.00..663.72 rows=1 width=23) (actual time=0.032..0.033
rows=1 loops=1)
   ->  Unique  (cost=0.00..1327.44 rows=2 width=23) (actual
time=0.028..0.028 rows=1 loops=1)
 ->  Index Scan Backward using pk_player on player
a  (cost=0.00..1323.05 rows=1756 width=23) (actual time=0.022..0.022
rows=1 loops=1)
   Index Cond: ((playerid)::text = '0'::text)
 Total runtime: 0.094 ms

However, when I use that within a function in a view, it is slow again:

esdt=> create or replace function player_max_atdate (varchar(32)) returns
varchar(32) as $$
esdt$>  select distinct on (PlayerID) AtDate from player where PlayerID=
$1 order by PlayerID desc, AtDate desc limit 1;
esdt$> $$ language sql immutable;
CREATE FUNCTION
esdt=> create or replace view VCurPlayer3 as select * from Player where
AtDate = player_max_atdate(PlayerID);

CREATE VIEW
esdt=> explain analyze select PlayerID,AtDate from VCurPlayer3 where
PlayerID='0';

 Index Scan using pk_player on player  (cost=0.00..1331.83 rows=9
width=23) (actual time=76.660..76.664 rows=1 loops=1)
   Index Cond: ((playerid)::text = '0'::text)
   Filter: ((atdate)::text = (player_max_atdate(playerid))::text)
 Total runtime: 76.716 ms

Why wouldn't the function get the row as quickly as the direct sql does?


Results from the following query suggest that the explain analyze output
above only tells half the story, and that the function is in fact called
1743 times:


esdt=> create or replace view VCurPlayer3 as select distinct on (PlayerID)
* from Player a where OID = (select distinct on (PlayerID) OID from Player
b where b.PlayerID = a.PlayerID and b.AtDate =
player_max_atdate(b.PlayerID) order by PlayerID desc, AtDate desc limit 1)
order by PlayerId Desc, AtDate desc;

CREATE VIEW
esdt=> explain analyze select PlayerID,AtDate from VCurPlayer3 where
PlayerID='0';
 Subquery Scan vcurplayer3  (cost=0.00..1715846.91 rows=1 width=68)
(actual time=0.640..119.124 rows=1 loops=1)
   ->  Unique  (cost=0.00..1715846.90 rows=1 width=776) (actual
time=0.633..119.112 rows=1 loops=1)
 ->  Index Scan Backward using pk_player on player
a  (cost=0.00..1715846.88 rows=9 width=776) (actual time=0.628..119.104
rows=1 loops=1)
   Index Cond: ((playerid)::text = '0'::text)
   Filter: (oid = (subplan))
   SubPlan
 ->  Limit  (cost=0.00..976.38 rows=1 width=27) (actual
time=0.057..0.058 rows=1 loops=1743)
   ->  Unique  (cost=0.00..976.38 rows=1 width=27)
(actual time=0.052..0.052 rows=1 loops=1743)
 ->  Index Scan Backward using pk_player on
player b  (cost=0.00..976.36 rows=6 width=27) (actual time=0.047..0.047
rows=1 loops=1743)
   Index Cond: ((playerid)::text = ($0)::text)
   Filter: ((atdate)::text =
(player_max_atdate(playerid))::text)
 Total runtime: 119.357 ms

It would also explain the very long time taken by the pl/pgsql function I 
posted a bit earlier.


So I guess it all comes back to the basic question:

For the query: select distinct on (PlayerID) * from Player a where
PlayerID='0' order by PlayerId Desc, AtDate Desc;
can the optimizer recognise that the query is selecting by the primary
key (PlayerID,AtDate), so it can skip the remaining rows for that PlayerID,
as if LIMIT 1 were implied?
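
For what it's worth, the same LIMIT 1 lookup can also be inlined as a
correlated subquery instead of going through the SQL function; whether the
planner then does any better than the function-based view is really the
same question. A sketch only (the view name VCurPlayer3a is made up; the
table and primary key are as above):

create or replace view VCurPlayer3a as
  select a.*
    from Player a
   where a.AtDate = (select b.AtDate
                       from Player b
                      where b.PlayerID = a.PlayerID
                      order by b.PlayerID desc, b.AtDate desc
                      limit 1);

The subselect gets the same fast backward index scan with LIMIT 1 shown
earlier, but it is still evaluated once per candidate row of Player.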


Best regards, KC.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Advice on RAID card

2005-09-26 Thread Ron Peacetree
While I understand being economical, at some point one crosses the line
to being penny wise and pound foolish.

How much is the data on this server going to be worth?
How much much will it cost you to recover or restore it (assuming that
is even possible if you lose it)?

If your data is worth nothing or the cost to recover or restore it is
negligible, then you don't need (nor should want) a DB server.  You'll
get higher performance at less cost via a number of other methods.

OTOH, if you data _does_ have value by any of the above metrics,
then it is worth it to pay attention to reliable, safe, fast, physical IO.

Battery backed HD caches of appropriate size are usually well worth
the $, as they pay for themselves (and then some) with the first data
loss they prevent.

RAID 5 means you are _always_ only 2 HDs from data loss, and 1 HD
from a serious performance hit.  Part of the trade-off with using SATA
HDs that cost 1/3-1/4 as much as their U320 15Krpm brethren is that such
circumstances are +FAR+ more likely with SATA HDs.  

If you are not going to use RAID 10 because of cost issues, then
spend the $ to get the biggest battery backed cache you can afford
and justify as being cheaper than what the proper RAID 6 or RAID 10
setup would cost you.  Even if you are going to use SW RAID and the
controller will just be a JBOD controller.

On the general subject of costs...  

At this writing, SODIMM RAM costs ~$100 (US) per GB.  Standard
DIMMs cost ~$75 per GB unless you buy 4GB ones, in which case
they cost ~$100 per GB.

The sweet spot in SATA HD pricing is ~$160 for 320GB at 7200rpm
(don't buy the 36GB or 74GB WD Raptors, they are no longer worth
it).  If you are careful you can get SATA HD's with 16MB rather than
8MB buffers for that price.  Each such HD will give you ~50MB/s of
raw Average Sustained Transfer Rate.

Decent x86 compatible CPUs are available for ~$200-$400 apiece.
Rarely will a commodity HW DB server need a more high end CPU.

Some of the above numbers rate to either fall to 1/2 cost or 2x in value
for the dollar within the next 6-9 months, and all of them will within the
next 18 months.  And so will RAID controller costs.

Your salary will hopefully not degrade at that rate, and it is unlikely that
your value for the dollar will increase at that rate.  Nor is it likely that
data worth putting on a DB server will do so.

Figure out what your required performance and reliability for the next 18
months is going to be, and buy the stuff from the above list that will
sustain that.  No matter what.

Anything less rates _highly_ to end up costing you and your organization
more money within the next 18months than you will save in initial
acquisition cost.

Ron

 



-Original Message-
From: PFC [EMAIL PROTECTED]
Sent: Sep 24, 2005 12:27 PM
Subject: Re: [PERFORM] Advice on RAID card


 It looks like a rebranded low end Adaptec 64MB PCI-X - SATA RAID card.
 Looks like the 64MB buffer is not upgradable.
 Looks like it's SATA, not SATA II

Yeah, that's exactly what it is. I can get one for 150 Euro, the Areca is
at least 600. This is for a budget server so while it would be nice to
have all the high-tech stuff, it's not the point. My question was rather,
is it one of the crap RAID5 cards which are actually SLOWER than plain IDE
disks, or is it decent, even though low-end (and cheap), and worth it
compared to software RAID5?

 Assuming you are not building 1U boxes, get one of the full height
 cards and order it with the maximum size buffer you can afford.
 The cards take 1 SODIMM, so that will be a max of 1GB or 2GB
 depending on whether 2GB SODIMMs are available to you yet.

It's for a budget dev server which should have RAID5 for reliability, but
not necessarily stellar performance (and price). I asked about this card
because I can get one at a good price.

Thanks for taking the time to answer.


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)

2005-09-26 Thread Gurpreet Aulakh
Thanks for your help Tom.

While testing 8.1, I found that simple joins take longer in 8.1 than 8.0.
For example the sub query

SELECT doc.doc_documentid FROM document AS doc LEFT JOIN folder_document ON
doc.doc_documentid = folder_document.doc_documentId LEFT JOIN document as
root ON doc.doc_internalRootXref = root.doc_documentId

is actually slower on 8.1 than 8.0.

However, the full query that I will be running is much faster. In my
evaluation I found the same pattern. That simple joins were slower but
complex joins were faster.

Overall though, 8.1 is faster and we will probably be moving to it when it's
officially released.

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: September 23, 2005 2:13 PM
To: Gurpreet Aulakh
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)


Gurpreet Aulakh [EMAIL PROTECTED] writes:
 After further investigation I have found that the reason why the query is
 slower on 8.0.3 is that the hash and hash joins are slower on the 8.0.3.
 So the question comes down to : Why are hash and hash joins slower?

I looked into this a bit and determined that the problem seems to have
been introduced here:

2002-12-30 10:21  tgl

* src/: backend/executor/nodeHash.c,
backend/executor/nodeHashjoin.c, backend/optimizer/path/costsize.c,
include/executor/nodeHash.h: Better solution to integer overflow
problem in hash batch-number computation: reduce the bucket number
mod nbatch.  This changes the association between original bucket
numbers and batches, but that doesn't matter.  Minor other cleanups
in hashjoin code to help centralize decisions.

(which means it's present in 7.4 as well as 8.0).  The code now
groups tuples into hash batches according to
(hashvalue % totalbuckets) % nbatch
When a tuple that is not in the first batch is reloaded, it is placed
into a bucket according to
(hashvalue % nbuckets)
This means that if totalbuckets, nbatch, and nbuckets have a common
factor F, the buckets won't be evenly used; in fact, only one in every F
buckets will be used at all, the rest remaining empty.  The ones that
are used accordingly will contain about F times more tuples than
intended.  The slowdown comes from having to compare these extra tuples
against the outer-relation tuples.
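
To see the skew concretely, the two modulo steps can be simulated with
made-up sizes that share a factor; the numbers below are purely
illustrative (h stands in for the hash value, and generate_series needs
8.0 or later):

-- Hypothetical sizes: totalbuckets = 12, nbatch = 4, nbuckets = 6,
-- which share the common factor F = 2.
-- Take the tuples written to batch 1 and see which buckets they land
-- in when they are reloaded:
SELECT h % 6    AS bucket,   -- bucket chosen at reload time
       count(*) AS tuples
FROM generate_series(0, 9999) AS g(h)
WHERE (h % 12) % 4 = 1       -- batch assignment at write time
GROUP BY 1
ORDER BY 1;

Only buckets 1, 3, and 5 ever receive tuples; buckets 0, 2, and 4 stay
empty, so the buckets that are used hold about F = 2 times more tuples
than intended.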

7.3 uses a different algorithm for grouping tuples that avoids this
problem, but it has performance issues of its own (in particular, to
avoid integer overflow we have to limit the number of batches we can
have).  So just reverting this patch doesn't seem very attractive.

The problem no longer exists in 8.1 because of rewrites undertaken for
another purpose, so I'm sort of tempted to do nothing.  To fix this in
the back branches we'd have to develop new code that won't ever go into
CVS tip and thus will never get beta-tested.  The risk of breaking
things seems higher than I'd like.

If we did want to fix it, my first idea is to increment nbatch looking
for a value that has no common factor with nbuckets.

regards, tom lane




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Announce
Is there a performance benefit to using int2 (instead of int4) in cases
where i know i will be well within its numeric range? I want to conserve
storage space and gain speed anywhere i can, but i know some apps simply end
up casting 2byte data to 4byte (like Java int/short).

These int2 values will be used in primary and foreign key fields and I know
that i must explicitly use tick marks (ex: where int2_column = '12') in
order to make use of indexes, but my question is IS IT WORTH IT?  IS THERE
ANY REAL GAIN FOR DOING THIS?

A simple scenario would be:

Songs
---
song_id serial pkey
genre   int2   fkey
title   varchar
...


Genres
---
genreid int2   pkey
name  varchar
description varchar



I KNOW that I am not going to have anywhere near 32,000+ different genres in
my genre table so why use int4?  Would that squeeze a few more milliseconds
of performance out of a LARGE song table query with a genre lookup?


Thanks,

-Aaron


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] [GENERAL] Index use in BETWEEN statement...

2005-09-26 Thread Cristian Prieto

mydb=# explain analyze select locid from geoip_block where
'216.230.158.50'::inet between start_block and end_block;
  QUERY PLAN

--
 Seq Scan on geoip_block  (cost=0.00..78033.96 rows=230141 width=8) (actual
time=13015.538..13508.708 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND
('216.230.158.50'::inet <= end_block))
 Total runtime: 13508.905 ms
(3 rows)

mydb=# alter table geoip_block add constraint pkey_geoip_block primary key
(start_block, end_block);
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index
pkey_geoip_block for table geoip_block
ALTER TABLE

mydb=# vacuum analyze geoip_block; 

mydb=# explain analyze select locid from geoip_block where
'216.230.158.50'::inet between start_block and end_block;
  QUERY PLAN

---
 Seq Scan on geoip_block  (cost=0.00..101121.01 rows=308324 width=8) (actual
time=12128.190..12631.550 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND
('216.230.158.50'::inet <= end_block))
 Total runtime: 12631.679 ms
(3 rows)

mydb=#


As you see it still using a sequential scan in the table and ignores the
index, any other suggestion?

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Sean Davis
Sent: Lunes, 26 de Septiembre de 2005 10:24 a.m.
To: Cristian Prieto; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Index use in BETWEEN statement...

On 9/26/05 11:26 AM, Cristian Prieto [EMAIL PROTECTED] wrote:

 
 Hello pals, I have the following table in Postgresql 8.0.1
 
 Mydb# \d geoip_block
        Table "public.geoip_block"
     Column    |  Type  | Modifiers
  -------------+--------+-----------
   locid       | bigint |
   start_block | inet   |
   end_block   | inet   |
 
 mydb# explain analyze select locid from geoip_block where
 '216.230.158.50'::inet between start_block and end_block;
 QUERY PLAN


 ---
 Seq Scan on geoip_block  (cost=0.00..142772.86 rows=709688 width=8)
(actual
 time=14045.384..14706.927 rows=1 loops=1)
  Filter: (('216.230.158.50'::inet >= start_block) AND
 ('216.230.158.50'::inet <= end_block))
 Total runtime: 14707.038 ms
 
 Ok, now I decided to create a index to speed a little the query
 
 Mydb# create index idx_ipblocks on geoip_block(start_block, end_block);
 CREATE INDEX
 
 clickad=# explain analyze select locid from geoip_block where
 '216.230.158.50'::inet between start_block and end_block;
 QUERY PLAN


 --
 Seq Scan on geoip_block  (cost=0.00..78033.96 rows=230141 width=8) (actual
 time=12107.919..12610.199 rows=1 loops=1)
  Filter: (('216.230.158.50'::inet >= start_block) AND
 ('216.230.158.50'::inet <= end_block))
 Total runtime: 12610.329 ms
 (3 rows)
 
 I guess the planner is doing a sequential scan in the table, why not use
the
 compound index? Do you have any idea in how to speed up this query?

Did you vacuum analyze the table after creating the index?

Sean


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)

2005-09-26 Thread Tom Lane
Gurpreet Aulakh [EMAIL PROTECTED] writes:
 While testing 8.1, I found that simple joins take longer in 8.1 than 8.0.
 For example the sub query
 SELECT doc.doc_documentid FROM document AS doc LEFT JOIN folder_document ON
 doc.doc_documentid = folder_document.doc_documentId LEFT JOIN document as
 root ON doc.doc_internalRootXref = root.doc_documentId
 is actually slower on 8.1 than 8.0.

With no more detail than that, this report is utterly unhelpful.  Let's
see the table schemas and the EXPLAIN ANALYZE results in both cases.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] [GENERAL] Index use in BETWEEN statement...

2005-09-26 Thread Tom Lane
Cristian Prieto [EMAIL PROTECTED] writes:
 mydb=# explain analyze select locid from geoip_block where
 '216.230.158.50'::inet between start_block and end_block;

 As you see it still using a sequential scan in the table and ignores the
 index, any other suggestion?

That two-column index is entirely useless for this query; in fact btree
indexes of any sort are pretty useless.  You really need some sort of
multidimensional index type like rtree or gist.  There was discussion
just a week or three ago of how to optimize searches for intervals
overlapping a specified point, which is identical to your problem.
Can't remember if the question was about timestamp intervals or plain
intervals, but try checking the list archives.
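
A different kind of workaround, which assumes the address blocks never
overlap (something this thread doesn't establish), is to let the existing
btree index on (start_block, end_block) find the nearest start_block and
then verify end_block separately; a sketch:

SELECT locid
FROM (
    SELECT locid, end_block
    FROM geoip_block
    WHERE start_block <= '216.230.158.50'::inet
    ORDER BY start_block DESC
    LIMIT 1
) AS candidate
WHERE end_block >= '216.230.158.50'::inet;

The inner query can be satisfied by a single backward index scan that stops
at the first matching row, and the outer check rejects addresses that fall
in a gap between blocks.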

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Alvaro Herrera
On Mon, Sep 26, 2005 at 12:54:05PM -0500, Announce wrote:
 Is there an performance benefit to using int2 (instead of int4) in cases
 where i know i will be well within its numeric range? I want to conserve
 storage space and gain speed anywhere i can, but i know some apps simply end
 up casting 2byte data to 4byte (like Java int/short).
 
 These int2 values will be used in primary and foreign key fields and I know
 that i must explicitly use tick marks (ex: where int2_column = '12') in
 order to make use of indexes, but my question is IS IT WORTH IT?  IS THERE
 ANY REAL GAIN FOR DOING THIS?

 An simple scenario would be:
 
 Songs
 ---
 song_id serial pkey
 genre   int2   fkey
 title   varchar

Not in this case, because the varchar column that follows the int2
column needs 4-byte alignment, so after the int2 column there must be 2
bytes of padding.

If you had two consecutive int2 fields you would save some space.
Or int2/bool/bool (bool has 1-byte alignment), etc.

This assumes you are in a typical x86 environment ... in other
environments the situation may be different.
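
A minimal sketch of that column ordering, based on the poster's tables; the
rating column is invented here purely to show two int2 fields sitting next
to each other in front of the varchar:

CREATE TABLE genres (
    genreid     int2 PRIMARY KEY,
    name        varchar,
    description varchar
);

CREATE TABLE songs (
    song_id serial PRIMARY KEY,
    genre   int2 REFERENCES genres (genreid),
    rating  int2,   -- hypothetical column, only here to pair with genre
    title   varchar
);

With genre and rating adjacent, the two 2-byte values fill one 4-byte
aligned slot, so no padding is needed before title.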

-- 
Alvaro Herrera   Valdivia, Chile   ICBM: S 39º 49' 17.7, W 73º 14' 26.8
"I am going to finish off all the humans / the humans I will finish off /
I am going to finish them all / all the humans I will finish off" (Bender)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Neil Conway
On Mon, 2005-26-09 at 12:54 -0500, Announce wrote:
 Is there an performance benefit to using int2 (instead of int4) in cases
 where i know i will be well within its numeric range?

int2 uses slightly less storage space (2 bytes rather than 4). Depending
on alignment and padding requirements, as well as the other columns in
the table, that may translate into requiring fewer disk pages and
therefore slightly better performance and lower storage requirements.

-Neil



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Chris Browne
[EMAIL PROTECTED] (Announce) writes:
 I KNOW that I am not going to have anywhere near 32,000+ different
 genres in my genre table so why use int4?  Would that squeeze a few
 more milliseconds of performance out of a LARGE song table query
 with a genre lookup?

By the way, I see a lot of queries on tables NOT optimized in this
fashion that run in less than a millisecond, so it would seem
remarkable to me if there were milliseconds to be squeezed out in the
first place...
-- 
output = reverse(moc.enworbbc @ enworbbc)
http://www.ntlug.org/~cbbrowne/sap.html
Why do we drive on parkways and park on driveways?

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Chris Browne
[EMAIL PROTECTED] (Announce) writes:
 I KNOW that I am not going to have anywhere near 32,000+ different
 genres in my genre table so why use int4?  Would that squeeze a few
 more milliseconds of performance out of a LARGE song table query
 with a genre lookup?

If the field is immaterial in terms of the size of the table, then it
won't help materially.

If you were going to index on it, however, THAT would make it
significant for indices involving the genre column.  Fitting more
tuples into each page is a big help, and this would help.

I doubt it'll be material, but I'd think it a good thing to apply what
restrictions to your data types that you can, a priori, so I'd be
inclined to use int2 for this...
-- 
let name=cbbrowne and tld=cbbrowne.com in String.concat @ [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
Rules of  the Evil Overlord  #136. If I  build a bomb, I  will simply
remember which wire to cut if  it has to be deactivated and make every
wire red. http://www.eviloverlord.com/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Query seem to slow if table have more than 200 million rows

2005-09-26 Thread Qingqing Zhou

Ahmad Fajar [EMAIL PROTECTED] wrote

 Select ids, keywords from dict where keywords='blabla' ('blabla' is a 
 single
 word);

 The table have 200 million rows, I have index the keywords field. On the
 first time my query seem to slow to get the result, about 15-60 sec to get
 the result. But if I repeat the query I will get fast result. My question 
 is
 why on the first time the query seem very slow.

 Table structure is quite simple:

 Ids bigint, keywords varchar(150), weight varchar(1), dpos int.


The first slowness is obviously caused by disk IOs. The second time is 
faster because all data pages it requires are already in buffer pool. 200 
million rows is not a problem for btree index, even if your client tool 
appends some spaces to your keywords at your insertion time, the ideal btree 
is 5 to 6 layers high at most. Can you show the iostats of index from your 
statistics view? 
http://www.postgresql.org/docs/8.0/static/monitoring-stats.html#MONITORING-STATS-VIEWS
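
For example, something along these lines against pg_statio_user_indexes
(assuming the table is really named dict, as in the query quoted above):

SELECT indexrelname,
       idx_blks_read,   -- index blocks that had to be read from disk
       idx_blks_hit     -- index blocks found already in the buffer pool
FROM pg_statio_user_indexes
WHERE relname = 'dict';

A large idx_blks_read relative to idx_blks_hit for the keywords index would
confirm that the first run is paying for physical reads.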

Regards,
Qingqing



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] int2 vs int4 in Postgres

2005-09-26 Thread Tom Lane
Chris Browne [EMAIL PROTECTED] writes:
 If the field is immaterial in terms of the size of the table, then it
 won't help materially.
 If you were going to index on it, however, THAT would make it
 significant for indices involving the genre column.  Fitting more
 tuples into each page is a big help, and this would help.

For a multicolumn index it might help to replace int4 by int2.  For a
single-column index, alignment constraints on the index entries will
prevent you from saving anything :-(

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-26 Thread Tom Lane
Ron Peacetree [EMAIL PROTECTED] writes:
Let's start by assuming that an element is <= in size to a cache line and a
 node fits into L1 DCache.  [ much else snipped ] 

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache, and that you are working with sort keys
that you can efficiently pack into cache lines.  And that you know the
relative access speeds of the caches and memory so that you can schedule
transfers, and that the hardware lets you get at that transfer timing.
And that the number of distinct key values isn't very large.

I don't see much prospect that anything we can actually use in a
portable fashion is going to emerge from this line of thought.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-26 Thread Ron Peacetree
From: Dann Corbit [EMAIL PROTECTED]
Sent: Sep 26, 2005 5:13 PM
To: Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org, 
   pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

I think that the btrees are going to be O(n*log(n)) in construction of
the indexes in disk access unless you memory map them [which means you
would need stupendous memory volume] and so I cannot say that I really
understand your idea yet.

Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
don't require O(nlgn) HD accesses.  These shouldn't either.

Let's start by assuming that an element is <= in size to a cache line and a
node fits into L1 DCache.  To make the discussion more concrete, I'll use a
64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or
KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
 <= 1000 different values.  More if we can fit more than 1 element into
each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix
to the sorted list of Key values in the node.  This can be done in O(lgn) using
Binary Search or O(lglgn) using a variation of Interpolation Search.  
If the Key value exists, we append this RID to the list of RIDs having the
same Key:
  If the RAM buffer of this list of RIDs is full we append it and the current
  RID to the HD list of these RIDs.
Else we insert this new Key value into its proper place in the sorted list
of Key values in the node and start a new list of RIDs for this Key value.

We allocate room for a CPU write buffer so we can schedule RAM writes to
the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with
RID lists for each Key value contains the sort order for the whole data set.

Notice that almost all of the random data access is occurring within the CPU
rather than in RAM or HD, and that we are accessing RAM or HD only when
absolutely needed.

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
In the given example CPU, we will be able to fit at least 1000 elements per
node and 2^20/2^16= up to 16 such nodes in this CPU.  We use a node's
worth of space as a RAM write buffer, so we end up with room for 15 such
nodes in this CPU.  This is enough for a 2 level index to at least 15,000
distinct Key value lists.

All of the traditional tricks for splitting a Btree node and redistributing
elements within them during insertion or splitting for maximum node
utilization can be used here.

The most general case: There are too many nodes to fit within the CPU
cache(s).  The root node now points to a maximum of at least 1000 nodes
since each element in the root node points to another node.  A full 2 level
index is now enough to point to at least 10^6 distinct Key value lists, and
3 levels will index more distinct Key values than is possible in our 1TB, 
500M record example.

We can use some sort of node use prediction algorithm like LFU to decide
which node should be moved out of CPU when we have to replace one of
the nodes in the CPU.  The nodes in RAM or on HD can be arranged to
maximize streaming IO behavior and minimize random access IO
behavior.

As you can see, both the RAM and HD IO are as minimized as possible,
and what such IO there is has been optimized for streaming behavior.

 
Can you draw a picture of it for me?  (I am dyslexic and understand things
far better when I can visualize it).

Not much for pictures.  Hopefully the explanation helps?

Ron

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-26 Thread Ron Peacetree
SECOND ATTEMPT AT POST.  Web mailer appears to have
eaten first one.  I apologize in advance if anyone gets two
versions of this post.
=r

From: Tom Lane [EMAIL PROTECTED]
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort? 

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache,

NO.  I used exact values only as examples.  Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless.  For instance,
Hennessy and Patterson 3ed points out that 64B cache lines are
optimally performing for caches between 16KB and 256KB.  The same
source as well as sources specifically on CPU memory hierarchy
design points out that we are not likely to see L1 caches larger than
256KB in the forseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No apriori exact or known values are required for the method to work.


and that you are working with sort keys that you can efficiently pack
into cache lines.

Not pack.  map.  n items can not take on more than n values.  n
values can be represented in lgn bits.  Less efficient mappings can
also work.  Either way I demonstrated that we have plenty of space in
a likely and common cache line size.  Creating a mapping function
to represent m values in lgm bits is a well known hack, and if we keep
track of minimum and maximum values for fields during insert and
delete operations, we can even create mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)


And that you know the relative access speeds of the caches and
memory so that you can schedule transfers,

Again, no.  I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples.  I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides.  I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct.  The stated model does that very well.

Please don't take my word for it.  Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself.  Part of the
reason I published the model was so that others could examine it.
 

and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such.


And that the number of distinct key values isn't very large.

Quite the opposite in fact.  I went out of my way to show that the
method still works well even if every Key is distinct.  It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct.  This is just a CPU cache
and more IO friendly Btree, not some magical and unheard of
technique.  It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques. not trying to start
a revolution.


I'm trying very hard NOT to waste anyone's time around here.
Including my own
Ron 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster