Re: [PERFORM] SELECT LIMIT 1 VIEW Performance Issue
At 20:17 05/09/23, K C Lau wrote:

At 19:15 05/09/23, Simon Riggs wrote:

select distinct on (PlayerID) PlayerID,AtDate from Player a
 where PlayerID='0' order by PlayerId, AtDate Desc;

Does that work for you?

Best Regards, Simon Riggs

esdt=> explain analyze select distinct on (PlayerID) PlayerID,AtDate
 from Player a where PlayerID='0' order by PlayerId Desc, AtDate Desc;

 Unique  (cost=0.00..1327.44 rows=2 width=23) (actual time=0.027..8.438 rows=1 loops=1)
   ->  Index Scan Backward using pk_player on player a  (cost=0.00..1323.05 rows=1756 width=23) (actual time=0.022..4.950 rows=1743 loops=1)
         Index Cond: ((playerid)::text = '0'::text)
 Total runtime: 8.499 ms

That is the fastest of all queries looping the 1743 rows. I do get the desired result by adding LIMIT 1:

esdt=> explain analyze select distinct on (PlayerID) PlayerID,AtDate
 from Player a where PlayerID='0' order by PlayerId Desc, AtDate Desc LIMIT 1;

 Limit  (cost=0.00..663.72 rows=1 width=23) (actual time=0.032..0.033 rows=1 loops=1)
   ->  Unique  (cost=0.00..1327.44 rows=2 width=23) (actual time=0.028..0.028 rows=1 loops=1)
         ->  Index Scan Backward using pk_player on player a  (cost=0.00..1323.05 rows=1756 width=23) (actual time=0.022..0.022 rows=1 loops=1)
               Index Cond: ((playerid)::text = '0'::text)
 Total runtime: 0.094 ms

However, when I use that within a function in a view, it is slow again:

esdt=> create or replace function player_max_atdate (varchar(32)) returns varchar(32) as $$
esdt$>  select distinct on (PlayerID) AtDate from player where PlayerID= $1
esdt$>   order by PlayerID desc, AtDate desc limit 1;
esdt$> $$ language sql immutable;
CREATE FUNCTION
esdt=> create or replace view VCurPlayer3 as select * from Player where AtDate = player_max_atdate(PlayerID);
CREATE VIEW
esdt=> explain analyze select PlayerID,AtDate from VCurPlayer3 where PlayerID='0';

 Index Scan using pk_player on player  (cost=0.00..1331.83 rows=9 width=23) (actual time=76.660..76.664 rows=1 loops=1)
   Index Cond: ((playerid)::text = '0'::text)
   Filter: ((atdate)::text = (player_max_atdate(playerid))::text)
 Total runtime: 76.716 ms

Why wouldn't the function get the row as quickly as the direct sql does?

Results from the following query suggest that the explain analyze output above only tells half the story, and that the function is in fact called 1743 times:

esdt=> create or replace view VCurPlayer3 as select distinct on (PlayerID) * from Player a
 where OID = (select distinct on (PlayerID) OID from Player b
  where b.PlayerID = a.PlayerID and b.AtDate = player_max_atdate(b.PlayerID)
  order by PlayerID desc, AtDate desc limit 1)
 order by PlayerId Desc, AtDate desc;
CREATE VIEW
esdt=> explain analyze select PlayerID,AtDate from VCurPlayer3 where PlayerID='0';

 Subquery Scan vcurplayer3  (cost=0.00..1715846.91 rows=1 width=68) (actual time=0.640..119.124 rows=1 loops=1)
   ->  Unique  (cost=0.00..1715846.90 rows=1 width=776) (actual time=0.633..119.112 rows=1 loops=1)
         ->  Index Scan Backward using pk_player on player a  (cost=0.00..1715846.88 rows=9 width=776) (actual time=0.628..119.104 rows=1 loops=1)
               Index Cond: ((playerid)::text = '0'::text)
               Filter: (oid = (subplan))
               SubPlan
                 ->  Limit  (cost=0.00..976.38 rows=1 width=27) (actual time=0.057..0.058 rows=1 loops=1743)
                       ->  Unique  (cost=0.00..976.38 rows=1 width=27) (actual time=0.052..0.052 rows=1 loops=1743)
                             ->  Index Scan Backward using pk_player on player b  (cost=0.00..976.36 rows=6 width=27) (actual time=0.047..0.047 rows=1 loops=1743)
                                   Index Cond: ((playerid)::text = ($0)::text)
                                   Filter: ((atdate)::text = (player_max_atdate(playerid))::text)
 Total runtime: 119.357 ms

It would also explain the very long time taken by the pl/pgsql function I posted a bit earlier. So I guess it all comes back to the basic question: for the query

select distinct on (PlayerID) * from Player a where PlayerID='0' order by PlayerId Desc, AtDate Desc;

can the optimizer recognise the fact that the query is selecting by the primary key (PlayerID,AtDate), so it can skip the remaining rows for that PlayerID, as if LIMIT 1 were implied?
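The question at the end suggests one alternative formulation the thread does not benchmark directly: pushing the per-player LIMIT 1 lookup into a correlated subquery, so the planner can answer it with a single backward probe of the (PlayerID, AtDate) primary key index. This is a hedged sketch of my own, not something tested in the thread:

```sql
-- Current row for one player; the inner ORDER BY ... LIMIT 1 matches the
-- thread's fast query shape and can stop after one index scan step.
SELECT p.PlayerID, p.AtDate
FROM Player p
WHERE p.PlayerID = '0'
  AND p.AtDate = (SELECT b.AtDate
                  FROM Player b
                  WHERE b.PlayerID = p.PlayerID
                  ORDER BY b.PlayerID DESC, b.AtDate DESC
                  LIMIT 1);
```

Whether the planner actually executes the subquery only once per distinct PlayerID would still need to be verified with EXPLAIN ANALYZE, as above.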
Best regards, KC. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Advice on RAID card
While I understand being economical, at some point one crosses the line to being penny wise and pound foolish. How much is the data on this server going to be worth? How much will it cost you to recover or restore it (assuming that is even possible if you lose it)?

If your data is worth nothing, or the cost to recover or restore it is negligible, then you don't need (nor should want) a DB server. You'll get higher performance at less cost via a number of other methods. OTOH, if your data _does_ have value by any of the above metrics, then it is worth it to pay attention to reliable, safe, fast, physical IO.

Battery backed HD caches of appropriate size are usually well worth the $, as they pay for themselves (and then some) with the first data loss they prevent. RAID 5 means you are _always_ only 2 HDs from data loss, and 1 HD from a serious performance hit. Part of the trade-off with using SATA HDs that cost 1/3-1/4 their U320 15Krpm brethren is that such circumstances are +FAR+ more likely with SATA HDs. If you are not going to use RAID 10 because of cost issues, then spend the $ to get the biggest battery backed cache you can afford and justify as being cheaper than what the proper RAID 6 or RAID 10 setup would cost you. Even if you are going to use SW RAID and the controller will just be a JBOD controller.

On the general subject of costs... At this writing, SODIMM RAM costs ~$100 (US) per GB. Standard DIMMs cost ~$75 per GB unless you buy 4GB ones, in which case they cost ~$100 per GB. The sweet spot in SATA HD pricing is ~$160 for 320GB at 7200rpm (don't buy the 36GB or 74GB WD Raptors; they are no longer worth it). If you are careful you can get SATA HDs with 16MB rather than 8MB buffers for that price. Each such HD will give you ~50MB/s of raw Average Sustained Transfer Rate. Decent x86 compatible CPUs are available for ~$200-$400 apiece. Rarely will a commodity HW DB server need a more high end CPU.
Some of the above numbers rate to either fall to 1/2 cost or 2x in value for the dollar within the next 6-9 months, and all of them will within the next 18 months. And so will RAID controller costs. Your salary will hopefully not degrade at that rate, and it is unlikely that your value for the dollar will increase at that rate. Nor is it likely that data worth putting on a DB server will do so. Figure out what your required performance and reliability for the next 18 months is going to be, and buy the stuff from the above list that will sustain that. No matter what. Anything less rates _highly_ to end up costing you and your organization more money within the next 18 months than you will save in initial acquisition cost.

Ron

-Original Message-
From: PFC [EMAIL PROTECTED]
Sent: Sep 24, 2005 12:27 PM
Subject: Re: [PERFORM] Advice on RAID card

It looks like a rebranded low end Adaptec 64MB PCI-X - SATA RAID card. Looks like the 64MB buffer is not upgradable. Looks like it's SATA, not SATA II.

Yeah, that's exactly what it is. I can get one for 150 Euro, the Areca is at least 600. This is for a budget server, so while it would be nice to have all the high-tech stuff, it's not the point. My question was rather: is it one of the crap RAID5 cards which are actually SLOWER than plain IDE disks, or is it decent, even though low-end (and cheap), and worth it compared to software RAID5?

Assuming you are not building 1U boxes, get one of the full height cards and order it with the maximum size buffer you can afford. The cards take 1 SODIMM, so that will be a max of 1GB or 2GB depending on whether 2GB SODIMMs are available to you yet.

It's for a budget dev server which should have RAID5 for reliability, but not necessarily stellar performance (and price). I asked about this card because I can get one at a good price.

Thanks for taking the time to answer.
Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)
Thanks for your help Tom.

While testing 8.1, I found that simple joins take longer in 8.1 than 8.0. For example the sub query

SELECT doc.doc_documentid FROM document AS doc
 LEFT JOIN folder_document ON doc.doc_documentid = folder_document.doc_documentId
 LEFT JOIN document as root ON doc.doc_internalRootXref = root.doc_documentId

is actually slower on 8.1 than 8.0. However, the full query that I will be running is much faster. In my evaluation I found the same pattern: simple joins were slower, but complex joins were faster. Overall though, 8.1 is faster and we will probably be moving to it when it's officially released.

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: September 23, 2005 2:13 PM
To: Gurpreet Aulakh
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)

Gurpreet Aulakh [EMAIL PROTECTED] writes:

After further investigation I have found that the reason why the query is slower on 8.0.3 is that the hash and hash joins are slower on 8.0.3. So the question comes down to: why are hash and hash joins slower?

I looked into this a bit and determined that the problem seems to have been introduced here:

2002-12-30 10:21 tgl
 * src/: backend/executor/nodeHash.c, backend/executor/nodeHashjoin.c, backend/optimizer/path/costsize.c, include/executor/nodeHash.h: Better solution to integer overflow problem in hash batch-number computation: reduce the bucket number mod nbatch. This changes the association between original bucket numbers and batches, but that doesn't matter. Minor other cleanups in hashjoin code to help centralize decisions.

(which means it's present in 7.4 as well as 8.0).
The code now groups tuples into hash batches according to

	(hashvalue % totalbuckets) % nbatch

When a tuple that is not in the first batch is reloaded, it is placed into a bucket according to

	(hashvalue % nbuckets)

This means that if totalbuckets, nbatch, and nbuckets have a common factor F, the buckets won't be evenly used; in fact, only one in every F buckets will be used at all, the rest remaining empty. The ones that are used accordingly will contain about F times more tuples than intended. The slowdown comes from having to compare these extra tuples against the outer-relation tuples.

7.3 uses a different algorithm for grouping tuples that avoids this problem, but it has performance issues of its own (in particular, to avoid integer overflow we have to limit the number of batches we can have). So just reverting this patch doesn't seem very attractive. The problem no longer exists in 8.1 because of rewrites undertaken for another purpose, so I'm sort of tempted to do nothing. To fix this in the back branches we'd have to develop new code that won't ever go into CVS tip and thus will never get beta-tested. The risk of breaking things seems higher than I'd like.

If we did want to fix it, my first idea is to increment nbatch looking for a value that has no common factor with nbuckets.

			regards, tom lane
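The common-factor skew Tom describes can be checked with plain arithmetic. The following is an illustrative query of my own (not the executor's actual code), using hypothetical values totalbuckets = 40, nbatch = 4, nbuckets = 10, which share the common factor F = 2:

```sql
-- For each batch ((h % 40) % 4), count how many of the 10 reload buckets
-- (h % 10) ever receive a tuple. With common factor F = 2, only every
-- second bucket is used.
SELECT (h % 40) % 4            AS batch,
       count(DISTINCT h % 10)  AS buckets_used
FROM generate_series(0, 9999) AS g(h)
GROUP BY 1
ORDER BY 1;
-- each batch ends up using only 5 of the 10 buckets
```

Those 5 used buckets consequently hold roughly twice (F times) as many tuples as intended, which is exactly the extra-comparison cost described above.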
[PERFORM] int2 vs int4 in Postgres
Is there a performance benefit to using int2 (instead of int4) in cases where I know I will be well within its numeric range? I want to conserve storage space and gain speed anywhere I can, but I know some apps simply end up casting 2-byte data to 4-byte (like Java int/short).

These int2 values will be used in primary and foreign key fields, and I know that I must explicitly use tick marks (ex: where int2_column = '12') in order to make use of indexes, but my question is: IS IT WORTH IT? IS THERE ANY REAL GAIN FOR DOING THIS?

A simple scenario would be:

Songs
-----
song_id  serial   pkey
genre    int2     fkey
title    varchar
...

Genres
------
genreid      int2     pkey
name         varchar
description  varchar

I KNOW that I am not going to have anywhere near 32,000+ different genres in my genre table, so why use int4? Would that squeeze a few more milliseconds of performance out of a LARGE song table query with a genre lookup?

Thanks,
-Aaron
Re: [PERFORM] [GENERAL] Index use in BETWEEN statement...
mydb=# explain analyze select locid from geoip_block where '216.230.158.50'::inet between start_block and end_block;

 Seq Scan on geoip_block  (cost=0.00..78033.96 rows=230141 width=8) (actual time=13015.538..13508.708 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND ('216.230.158.50'::inet <= end_block))
 Total runtime: 13508.905 ms
(3 rows)

mydb=# alter table geoip_block add constraint pkey_geoip_block primary key (start_block, end_block);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index pkey_geoip_block for table geoip_block
ALTER TABLE

mydb=# vacuum analyze geoip_block;

mydb=# explain analyze select locid from geoip_block where '216.230.158.50'::inet between start_block and end_block;

 Seq Scan on geoip_block  (cost=0.00..101121.01 rows=308324 width=8) (actual time=12128.190..12631.550 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND ('216.230.158.50'::inet <= end_block))
 Total runtime: 12631.679 ms
(3 rows)

As you see, it is still using a sequential scan on the table and ignores the index. Any other suggestion?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sean Davis
Sent: Monday, 26 September 2005, 10:24 a.m.
To: Cristian Prieto; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Index use in BETWEEN statement...
On 9/26/05 11:26 AM, Cristian Prieto [EMAIL PROTECTED] wrote:

Hello pals, I have the following table in Postgresql 8.0.1

Mydb# \d geoip_block
 Table public.geoip_block
   Column    |  Type  | Modifiers
-------------+--------+-----------
 locid       | bigint |
 start_block | inet   |
 end_block   | inet   |

mydb# explain analyze select locid from geoip_block where '216.230.158.50'::inet between start_block and end_block;

 Seq Scan on geoip_block  (cost=0.00..142772.86 rows=709688 width=8) (actual time=14045.384..14706.927 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND ('216.230.158.50'::inet <= end_block))
 Total runtime: 14707.038 ms

Ok, now I decided to create an index to speed the query up a little:

Mydb# create index idx_ipblocks on geoip_block(start_block, end_block);
CREATE INDEX

clickad=# explain analyze select locid from geoip_block where '216.230.158.50'::inet between start_block and end_block;

 Seq Scan on geoip_block  (cost=0.00..78033.96 rows=230141 width=8) (actual time=12107.919..12610.199 rows=1 loops=1)
   Filter: (('216.230.158.50'::inet >= start_block) AND ('216.230.158.50'::inet <= end_block))
 Total runtime: 12610.329 ms
(3 rows)

I guess the planner is doing a sequential scan on the table; why not use the compound index? Do you have any idea how to speed up this query?

Did you vacuum analyze the table after creating the index?

Sean
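A workaround that does let a plain btree index serve this kind of point-in-range lookup (my own sketch, not from the thread; it assumes the IP blocks do not overlap) is to let the index find the closest start_block at or below the address, then verify the end_block:

```sql
-- Probe the btree index for the nearest block starting at or below the
-- address, then confirm the address actually falls inside that block.
SELECT locid
FROM (
    SELECT locid, end_block
    FROM geoip_block
    WHERE start_block <= '216.230.158.50'::inet
    ORDER BY start_block DESC
    LIMIT 1
) AS candidate
WHERE end_block >= '216.230.158.50'::inet;
```

With the index on (start_block, end_block) this should touch only a few index pages instead of scanning the whole table, since the LIMIT stops the backward scan after the first matching row.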
Re: [PERFORM] Query slower on 8.0.3 (Windows) vs 7.3 (cygwin)
Gurpreet Aulakh [EMAIL PROTECTED] writes:

While testing 8.1, I found that simple joins take longer in 8.1 than 8.0. For example the sub query SELECT doc.doc_documentid FROM document AS doc LEFT JOIN folder_document ON doc.doc_documentid = folder_document.doc_documentId LEFT JOIN document as root ON doc.doc_internalRootXref = root.doc_documentId is actually slower on 8.1 than 8.0.

With no more detail than that, this report is utterly unhelpful. Let's see the table schemas and the EXPLAIN ANALYZE results in both cases.

			regards, tom lane
Re: [PERFORM] [GENERAL] Index use in BETWEEN statement...
Cristian Prieto [EMAIL PROTECTED] writes:

mydb=# explain analyze select locid from geoip_block where '216.230.158.50'::inet between start_block and end_block;

As you see, it is still using a sequential scan on the table and ignores the index; any other suggestion?

That two-column index is entirely useless for this query; in fact, btree indexes of any sort are pretty useless. You really need some sort of multidimensional index type like rtree or gist. There was discussion just a week or three ago of how to optimize searches for intervals overlapping a specified point, which is identical to your problem. Can't remember if the question was about timestamp intervals or plain intervals, but try checking the list archives.

			regards, tom lane
Re: [PERFORM] int2 vs int4 in Postgres
On Mon, Sep 26, 2005 at 12:54:05PM -0500, Announce wrote:

Is there a performance benefit to using int2 (instead of int4) in cases where I know I will be well within its numeric range? I want to conserve storage space and gain speed anywhere I can, but I know some apps simply end up casting 2-byte data to 4-byte (like Java int/short). These int2 values will be used in primary and foreign key fields and I know that I must explicitly use tick marks (ex: where int2_column = '12') in order to make use of indexes, but my question is IS IT WORTH IT? IS THERE ANY REAL GAIN FOR DOING THIS? A simple scenario would be:

Songs
-----
song_id  serial   pkey
genre    int2     fkey
title    varchar

Not in this case, because the varchar column that follows the int2 column needs 4-byte alignment, so after the int2 column there must be 2 bytes of padding. If you had two consecutive int2 fields you would save some space. Or int2/bool/bool (bool has 1-byte alignment), etc. This assumes you are in a typical x86 environment... in other environments the situation may be different.

-- 
Alvaro Herrera, Valdivia, Chile   ICBM: S 39º 49' 17.7", W 73º 14' 26.8"
"Voy a acabar con todos los humanos / con los humanos yo acabaré / voy a acabar con todos / con todos los humanos acabaré" (Bender)
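Alvaro's padding point suggests a layout-level fix: group the narrow fixed-width columns together so their sizes add up to the 4-byte alignment boundary before any variable-width column. A hedged sketch (the `weight` column is an invented example, not from the thread):

```sql
-- Two adjacent int2 columns occupy the same 4 aligned bytes that one
-- int2 plus its alignment padding would have used before the varchar.
CREATE TABLE songs (
    song_id serial PRIMARY KEY,
    genre   int2,
    weight  int2,   -- hypothetical second 2-byte column; pairs with genre
    title   varchar
);
```

The same trick works with bool columns (1-byte alignment), as in the int2/bool/bool ordering mentioned above.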
Re: [PERFORM] int2 vs int4 in Postgres
On Mon, 2005-26-09 at 12:54 -0500, Announce wrote:

Is there a performance benefit to using int2 (instead of int4) in cases where I know I will be well within its numeric range?

int2 uses slightly less storage space (2 bytes rather than 4). Depending on alignment and padding requirements, as well as the other columns in the table, that may translate into requiring fewer disk pages and therefore slightly better performance and lower storage requirements.

-Neil
Re: [PERFORM] int2 vs int4 in Postgres
[EMAIL PROTECTED] (Announce) writes:

I KNOW that I am not going to have anywhere near 32,000+ different genres in my genre table so why use int4? Would that squeeze a few more milliseconds of performance out of a LARGE song table query with a genre lookup?

By the way, I see a lot of queries on tables NOT optimized in this fashion that run in less than a millisecond, so it would seem remarkable to me if there were milliseconds to be squeezed out in the first place...

-- 
output = reverse(moc.enworbbc @ enworbbc)
http://www.ntlug.org/~cbbrowne/sap.html
"Why do we drive on parkways and park on driveways?"
Re: [PERFORM] int2 vs int4 in Postgres
[EMAIL PROTECTED] (Announce) writes:

I KNOW that I am not going to have anywhere near 32,000+ different genres in my genre table so why use int4? Would that squeeze a few more milliseconds of performance out of a LARGE song table query with a genre lookup?

If the field is immaterial in terms of the size of the table, then it won't help materially. If you were going to index on it, however, THAT would make it significant for indices involving the genre column. Fitting more tuples into each page is a big help, and this would help. I doubt it'll be material, but I'd think it a good thing to apply what restrictions to your data types you can, a priori, so I'd be inclined to use int2 for this...

-- 
let name=cbbrowne and tld=cbbrowne.com in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
Rules of the Evil Overlord #136. "If I build a bomb, I will simply remember which wire to cut if it has to be deactivated and make every wire red." http://www.eviloverlord.com/
Re: [PERFORM] Query seem to slow if table have more than 200 million rows
Ahmad Fajar [EMAIL PROTECTED] wrote:

Select ids, keywords from dict where keywords='blabla' ('blabla' is a single word). The table has 200 million rows, and I have indexed the keywords field. The first time, my query seems slow to get the result, about 15-60 sec; but if I repeat the query I get a fast result. My question is why the query seems very slow the first time. Table structure is quite simple: ids bigint, keywords varchar(150), weight varchar(1), dpos int.

The first slowness is obviously caused by disk IOs. The second time is faster because all the data pages it requires are already in the buffer pool. 200 million rows is not a problem for a btree index; even if your client tool appends some spaces to your keywords at insertion time, the ideal btree is 5 to 6 layers high at most. Can you show the I/O stats of the index from your statistics views?

http://www.postgresql.org/docs/8.0/static/monitoring-stats.html#MONITORING-STATS-VIEWS

Regards, Qingqing
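The statistics Qingqing asks for can be pulled from the standard statistics views documented at the link above (the table name `dict` follows the thread's example; adjust to the actual schema, and note the statistics collector must be enabled):

```sql
-- Blocks read from disk vs. found in the buffer cache for each index on
-- the dict table; a high idx_blks_read count explains the slow first run.
SELECT indexrelname, idx_blks_read, idx_blks_hit
FROM pg_statio_user_indexes
WHERE relname = 'dict';
```

Running this before and after the slow first query shows how much of the 15-60 seconds went to physical index reads rather than cache hits.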
Re: [PERFORM] int2 vs int4 in Postgres
Chris Browne [EMAIL PROTECTED] writes:

If the field is immaterial in terms of the size of the table, then it won't help materially. If you were going to index on it, however, THAT would make it significant for indices involving the genre column. Fitting more tuples into each page is a big help, and this would help.

For a multicolumn index it might help to replace int4 by int2. For a single-column index, alignment constraints on the index entries will prevent you from saving anything :-(

			regards, tom lane
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron Peacetree [EMAIL PROTECTED] writes:

Let's start by assuming that an element is <= in size to a cache line and a node fits into L1 DCache. [ much else snipped ]

So far, you've blithely assumed that you know the size of a cache line, the sizes of L1 and L2 cache, and that you are working with sort keys that you can efficiently pack into cache lines. And that you know the relative access speeds of the caches and memory so that you can schedule transfers, and that the hardware lets you get at that transfer timing. And that the number of distinct key values isn't very large. I don't see much prospect that anything we can actually use in a portable fashion is going to emerge from this line of thought.

			regards, tom lane
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Dann Corbit [EMAIL PROTECTED]
Sent: Sep 26, 2005 5:13 PM
To: Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

I think that the btrees are going to be O(n*log(n)) in construction of the indexes in disk access unless you memory map them [which means you would need stupendous memory volume] and so I cannot say that I really understand your idea yet.

Traditional algorithms for the construction of Btree variants (B, B+, B*, ...) don't require O(nlgn) HD accesses. These shouldn't either.

Let's start by assuming that an element is <= in size to a cache line and a node fits into L1 DCache. To make the discussion more concrete, I'll use a 64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's >= 1000 different values; more if we can fit more than 1 element into each cache line). As we scan the data set coming in from HD, we compare the Key or KeyPrefix to the sorted list of Key values in the node. This can be done in O(lgn) using Binary Search or O(lglgn) using a variation of Interpolation Search. If the Key value exists, we append this RID to the list of RIDs having the same Key: if the RAM buffer of this list of RIDs is full, we append it and the current RID to the HD list of these RIDs. Else we insert this new key value into its proper place in the sorted list of Key values in the node and start a new RID list for this value.

We allocate room for a CPU write buffer so we can schedule RAM writes to the RAM lists of RIDs so as to minimize their randomness. When we are finished scanning the data set from HD, the sorted node with RID lists for each Key value contains the sort order for the whole data set.
Notice that almost all of the random data access is occurring within the CPU rather than in RAM or HD, and that we are accessing RAM or HD only when absolutely needed.

Next simplest case: multiple nodes, but they all fit in the CPU cache(s). In the given example CPU, we will be able to fit at least 1000 elements per node and 2^20/2^16 = up to 16 such nodes in this CPU. We use a node's worth of space as a RAM write buffer, so we end up with room for 15 such nodes in this CPU. This is enough for a 2 level index to at least 15,000 distinct Key value lists. All of the traditional tricks for splitting a Btree node and redistributing elements within them during insertion, or splitting for maximum node utilization, can be used here.

The most general case: there are too many nodes to fit within the CPU cache(s). The root node now points to a maximum of at least 1000 nodes, since each element in the root node points to another node. A full 2 level index is now enough to point to at least 10^6 distinct Key value lists, and 3 levels will index more distinct Key values than is possible in our 1TB, 500M record example. We can use some sort of node use prediction algorithm like LFU to decide which node should be moved out of the CPU when we have to replace one of the nodes in the CPU. The nodes in RAM or on HD can be arranged to maximize streaming IO behavior and minimize random access IO behavior.

As you can see, both the RAM and HD IO are as minimized as possible, and what such IO there is has been optimized for streaming behavior.

Can you draw a picture of it for me? (I am dyslexic and understand things far better when I can visualize them.)

Not much for pictures. Hopefully the explanation helps?

Ron
Re: [HACKERS] [PERFORM] A Better External Sort?
SECOND ATTEMPT AT POST. Web mailer appears to have eaten the first one. I apologize in advance if anyone gets two versions of this post. =r

From: Tom Lane [EMAIL PROTECTED]
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

So far, you've blithely assumed that you know the size of a cache line, the sizes of L1 and L2 cache,

NO. I used exact values only as examples. Realistic examples drawn from an extensive survey of past, present, and what I could find out about future systems; but only examples nonetheless. For instance, Hennessy and Patterson 3ed points out that 64B cache lines are optimally performing for caches between 16KB and 256KB. The same source, as well as sources specifically on CPU memory hierarchy design, points out that we are not likely to see L1 caches larger than 256KB in the foreseeable future.

The important point was the idea of an efficient Key, rather than Record, sort using a CPU cache friendly data structure with provably good space and IO characteristics based on a reasonable model of current and likely future single box computer architecture (although it would be fairly easy to extend it to include the effects of networking). No a priori exact or known values are required for the method to work.

and that you are working with sort keys that you can efficiently pack into cache lines.

Not pack: map. n items can not take on more than n values. n values can be represented in lgn bits. Less efficient mappings can also work. Either way I demonstrated that we have plenty of space in a likely and common cache line size. Creating a mapping function to represent m values in lgm bits is a well known hack, and if we keep track of minimum and maximum values for fields during insert and delete operations, we can even create mapping functions fairly easily. (IIRC, Oracle does keep track of minimum and maximum field values.)
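The "n values in lgn bits" arithmetic Ron relies on is easy to check. A hedged illustration with my own numbers (not from the post):

```sql
-- 1000 distinct key values need only ceil(log2(1000)) = 10 bits each,
-- so a 64-byte (512-bit) cache line could hold ~51 such mapped keys.
SELECT ceiling(log(2, 1000.0))                  AS bits_per_key,
       floor(512 / ceiling(log(2, 1000.0)))     AS keys_per_64B_line;
```

This is the sense in which mapped keys, unlike packed full-width keys, leave "plenty of space" in a common cache line size.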
And that you know the relative access speeds of the caches and memory so that you can schedule transfers,

Again, no. I created a reasonable model of a computer system that holds remarkably well over a _very_ wide range of examples. I don't need the numbers to be exactly right to justify my approach to this problem or understand why other approaches may have downsides. I just have to get the relative performance of the system components, and the relative performance gap between them, reasonably correct. The stated model does that very well. Please don't take my word for it. Go grab some random box: laptop, desktop, unix server, etc., and try it for yourself. Part of the reason I published the model was so that others could examine it.

and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such thing.

And that the number of distinct key values isn't very large.

Quite the opposite, in fact. I went out of my way to show that the method still works well even if every Key is distinct. It is _more efficient_ when the number of distinct keys is small compared to the number of data items, but it works as well as any other Btree would when all n of the Keys are distinct. This is just a CPU cache and more IO friendly Btree, not some magical and unheard of technique. It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer systems architecture and coming up with a slight twist on how to use already well known and characterized techniques, not trying to start a revolution. I'm trying very hard NOT to waste anyone's time around here. Including my own.

Ron