Re: [HACKERS] Bad estimate on LIKE matching
On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
> On this table, I do a query like:
>
>   SELECT * FROM path WHERE path LIKE 'f:/userdirs/s/super_73/%'
>
> The estimate for this query is completely off, which I believe is the
> cause of a very bad choice of query plan when it's used in a big join
> (creating nestloops that end up taking 15+ minutes to complete).
> Explain analyze gives:
>
>   QUERY PLAN
>   ---
>   Index Scan using path_name_idx on path (cost=0.00..3.24 rows=1 width=74)
>     (actual time=0.035..0.442 rows=214 loops=1)
>     Index Cond: ((path >= 'f:/userdirs/s/super'::text) AND (path < 'f:/userdirs/s/supes'::text))
>     Filter: (path ~~ 'f:/userdirs/s/super_73%'::text)
>
> No matter what I search on (when it's very selective), the estimate is
> always 1 row, whereas the actual value is at least a couple of hundred.
> If I try with say f:/us, the difference is 377,759 estimated vs 562,459
> returned, which is percentage-wise a lot less, but...
>
> I have tried upping the statistics target up to 1000, with no change.
> Any way to teach the planner about this?

In a recent thread on -perform, I opined that this case could best be
solved by using dynamic random block sampling at plan time, followed by
a direct evaluation of the LIKE against the sample. This would yield a
more precise selectivity and lead to a better plan. So it can be
improved for the next release.

Best Regards, Simon Riggs

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] equivalence class not working?
On Mon, 2006-01-16 at 19:03 -0500, uwcssa wrote:
> Fine. The rest of the documentation says:
>
>   For now, the test only uses restriction clauses (those in
>   restrictinfo_list). --Nels, Dec '92

However, I understand that as being overridden by the follow-up, which is:

  XXX as of 7.1, equivalence class info *is* available. Consider
  improving this code as foreseen by Nels.

All readers are invited to solve the problem. Currently we add only
implied equality conditions, so enhancing the optimizer to cope with
inequalities seems possible.

Best Regards, Simon Riggs
Re: [HACKERS] Bad estimate on LIKE matching
> > I have tried upping the statistics target up to 1000, with no change.
> > Any way to teach the planner about this?
>
> In a recent thread on -perform, I opined that this case could best be
> solved by using dynamic random block sampling at plan time, followed by
> a direct evaluation of the LIKE against the sample. This would yield a
> more precise selectivity and lead to a better plan. So it can be
> improved for the next release.

I was kinda hoping for something I could use in 8.1 :-) Even if it's an
ugly solution for now. (My current workaround of writing the matches to
a temp table and then joining to the temp table produces a reasonable
plan, but I'd like something slightly less ugly than that if possible.)

//Magnus
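For readers following the thread, the temp-table workaround Magnus describes could look roughly like the sketch below. The `files` table and `path_id` column are made up for illustration; only `path` and the LIKE pattern come from the thread. Materializing the matches first and running ANALYZE gives the planner an exact row count to base the join plan on:

```sql
-- Sketch of the workaround (illustrative names): materialize the LIKE
-- matches, gather statistics, then join against the small result.
CREATE TEMP TABLE matching_paths AS
    SELECT id, path
    FROM path
    WHERE path LIKE 'f:/userdirs/s/super_73/%';

ANALYZE matching_paths;   -- planner now knows the real row count

SELECT f.*
FROM files f                                  -- hypothetical joined table
JOIN matching_paths mp ON mp.id = f.path_id;  -- hypothetical FK column
```

The ANALYZE step is what makes this work: the join is then planned against a known-small input instead of the misestimated LIKE.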
Re: [HACKERS] Surrogate keys (Was: enums)
Jim C. Nasby <jnasby at pervasive.com> writes:
> a) the optimizer does a really poor job on multi-column index
>    statistics

So it should be fixed? And there are a *lot* of singular, natural keys.

> b) If each parent record will have many children, the space savings
>    from using a surrogate key can be quite large

Not such a common case.

> c) depending on how you view things, putting actual keys all over the
>    place is denormalized

How come? Never!

> Generally, I just use surrogate keys for everything unless performance
> dictates something else.

What I am proposing is the reverse: use natural keys for everything
unless performance dictates something else. In support of my PoV:
http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1
Re: [HACKERS] Surrogate keys (Was: enums)
Greg Stark <gsstark at mit.edu> writes:
> I hate knee-jerk reactions too, but just think of all the pain of
> people dealing with databases where they used Social Security numbers
> for primary keys. I would never use an attribute that represents some
> real-world datum as a primary key any more.

I am not familiar with the situation.

> In my experience there are very few occasions where I want a real
> non-sequence-generated primary key. I've never regretted having a
> sequence-generated primary key, and I've certainly had occasions to
> regret not having one.

http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1
Re: [HACKERS] enums
Andrew Dunstan <andrew at dunslane.net> writes:
> If people would like to play, I have created a little kit to help in
> creating first class enum types in a few seconds.

Isn't what we actually want possreps?
Re: [HACKERS] Surrogate keys (Was: enums)
On Wed, Jan 18, 2006 at 01:08:53PM +0000, Leandro Guimarães Faria Corcete DUTRA wrote:
> Jim C. Nasby <jnasby at pervasive.com> writes:
> > Generally, I just use surrogate keys for everything unless
> > performance dictates something else.
>
> What I am proposing is the reverse: use natural keys for everything
> unless performance dictates something else. In support of my PoV:
> http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1

Interesting. However, in my experience very few things have natural
keys. There is no combination of attributes for people, phone calls or
even real events that makes a useful natural key.

You don't say what the primary key on your events table was, but I can
see one possibility: (place, datetime). A unique constraint on this
won't prevent overlapping events. Sure, it'll get rid of the obvious
duplicates, but it won't solve the problem. It also fails the criterion
that keys be stable, since you can move events. You do need a constraint
on that table, but a unique constraint isn't it.

While I agree with your statement that it's the abuse of these keys
that's the problem, I find people are far too likely to see natural keys
where none exist.

BTW, the way I deal with people mixing up surrogate keys from different
tables is by (usually by chance) having the sequences for different
tables start at wildly different points. By starting one counter at a
million and the other at one, the chances that you'll mix them up are
reduced. On some systems I can even identify the table a key comes from
just by looking at the number, because I know only one table has keys in
the 30,000 range.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
  Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
  a tool for doing 5% of the work and then sitting around waiting for
  someone else to do the other 95% so you can sue them.
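The "wildly different starting points" trick Martijn mentions can be set up explicitly rather than left to chance. A minimal sketch, with made-up table names (nothing here comes from a real schema in the thread):

```sql
-- Start each table's key sequence in a distinct numeric band, so a key
-- from one table is unlikely to be a valid key in the other.
CREATE SEQUENCE customer_id_seq START 1;
CREATE SEQUENCE invoice_id_seq  START 1000000;

CREATE TABLE customer (
    customer_id integer PRIMARY KEY DEFAULT nextval('customer_id_seq'),
    name        text NOT NULL
);

CREATE TABLE invoice (
    invoice_id  integer PRIMARY KEY DEFAULT nextval('invoice_id_seq'),
    customer_id integer NOT NULL REFERENCES customer
);
```

With this layout, accidentally joining an invoice_id against customer.customer_id simply finds no rows instead of silently matching the wrong record.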
Re: [HACKERS] enums
Leandro Guimarães Faria Corcete DUTRA wrote:
> Andrew Dunstan <andrew at dunslane.net> writes:
> > If people would like to play, I have created a little kit to help in
> > creating first class enum types in a few seconds.
>
> Isn't what we actually want possreps?

You appear to be responding to mail from months ago. Please catch up
before replying, so we don't rehash old discussions.

As previously discussed, I intend to do first class enums for the next
release of Postgres, if I get enough time. Enumkit was just a very small
step along the research road, although it is useful in itself, which is
why I released it.

cheers

andrew
Re: [HACKERS] Surrogate keys (Was: enums)
On Jan 18, 2006, at 22:08, Leandro Guimarães Faria Corcete DUTRA wrote:
> Jim C. Nasby <jnasby at pervasive.com> writes:
> > a) the optimizer does a really poor job on multi-column index
> >    statistics
>
> So it should be fixed?

Of course! Patches welcome!

Michael Glaesemann
grzm myrealbox com
[HACKERS] Unique constraints for non-btree indexes
Hi,

Currently, due to the way unique constraints are tied to btree, there is
no way to allow GiST indexes to do the same thing. The thing I'm
specifically interested in is an index where you insert ranges
(start, end) and, if unique, the index will complain if they overlap. As
a side-effect, this may make progress toward the goal of deferrable
unique indexes.

Part of the solution is to remove the layering violation from the btree
code; it really shouldn't be accessing the heap directly. What I'm
proposing is to move the bulk of _bt_check_unique into a new function
(say check_unique_index) in the general index machinery and have the
b-tree code do just:

  check_unique_index(ctid of inserting tuple,
                     ctid of possibly conflicting tuple)

The point being that GiST indexes could use exactly the same function to
check for duplicates. The function would return InvalidTransactionId if
there's no conflict, or an actual transaction id to wait on, just like
the btree code does now.

It would require some changes to the GiST code, since a lot more of the
index may need to be checked for duplicates. I suppose in the general
case, since a key can appear in multiple places, the concurrency issues
could be difficult. I suppose you would insert your key first, then
check for duplicates, thus ensuring that at least one of the two
conflicting transactions will see it.

Now, one side-effect is that you could build deferrable unique
constraints on top of this by having the check function always return
InvalidTransactionId but storing the conflicts for later checking. But I
first want to know if there are any real issues with the above.

Any thoughts?
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
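To make the desired semantics concrete: the constraint such a GiST "unique" index would enforce is that the query below never returns a row. The schema is invented for illustration; today this check can only be run after the fact, whereas the proposal would have the index reject the conflicting insert:

```sql
-- Illustrative schema: a table of time ranges that must not overlap.
CREATE TABLE reservation (
    id     serial PRIMARY KEY,
    t_from timestamp NOT NULL,
    t_to   timestamp NOT NULL
);

-- The invariant the proposed index would maintain: this query should
-- always return zero rows.  OVERLAPS tests range intersection.
SELECT r1.id, r2.id
FROM reservation r1
JOIN reservation r2 ON r1.id < r2.id
WHERE (r1.t_from, r1.t_to) OVERLAPS (r2.t_from, r2.t_to);
```

A plain unique btree index on (t_from, t_to) cannot express this, since two distinct key values can still denote overlapping ranges.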
Re: [HACKERS] Unique constraints for non-btree indexes
I thought gistinsert had checkUnique; it was just ifdef'd out because
there was no code to enforce it... and as such, during bootstrap it was
marked as amcanunique = false. Would it be that hard to enable it?

On 1/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Currently due to the way unique constraints are tied to btree there is
> no way to allow GiST indexes to do the same thing. The thing I'm
> specifically interested in is an index where you insert ranges
> (start, end) and if unique, the index will complain if they overlap.
> [...]
Re: [HACKERS] Surrogate keys (Was: enums)
Leandro Guimarães Faria Corcete DUTRA [EMAIL PROTECTED] writes:
> Greg Stark <gsstark at mit.edu> writes:
> > I hate knee-jerk reactions too, but just think of all the pain of
> > people dealing with databases where they used Social Security numbers
> > for primary keys. I would never use an attribute that represents some
> > real-world datum as a primary key any more.
>
> I am not familiar with the situation.

The US gov't handed out unique numbers to every worker for their old age
pension program. Many early database designers thought that made a
wonderful natural primary key. It turns out that:

a) Not everyone has a social insurance number: when their business
   expanded to include foreign nationals, these databases had to make up
   fake social insurance numbers.

b) Occasionally people's social insurance numbers change, either because
   they got it wrong in the first place or because of identity theft
   later on. Even handling the change isn't good enough, because the old
   records don't disappear; the person essentially has *two* social
   insurance numbers.

c) For security reasons it turns out to be a bad idea to be passing
   around social insurance numbers in the first place. So these database
   designers had a major problem adapting when people started refusing
   to give them social insurance numbers, or complaining when their
   application leaked their social insurance number.

In short, what seemed like the clearest possible example of a natural
primary key became a great example of how hard it is to deal with
changing business requirements when you've tied your database design to
the old rules. Using natural primary keys makes an iron-clad design
assumption that the business rules surrounding that datum will never
change. And the one thing constant in business is that business rules
change.

In the past I've used username as a primary key for a users table; what
could be safer? Later we had to create a sequence-generated userid
column because some data partners couldn't handle a text column without
corrupting it. And of course one day the question arose whether we could
handle someone wanting to change their username. Then another day we
were asked whether we could have two different people with the same
username if they belonged to separate branded subsites.

-- greg
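The users-table evolution Greg describes can be sketched in DDL. All names here are invented for illustration; the point is that the natural key is demoted to a plain unique column while a generated id takes over as the value other tables reference:

```sql
-- After the migration: username is still constrained, but no longer
-- the value foreign keys depend on.
CREATE TABLE users (
    userid   serial PRIMARY KEY,       -- added later for the data partners
    username text NOT NULL UNIQUE      -- the original "natural" key
);

-- Later rule changes touch only the one constraint, e.g. allowing the
-- same username on different branded subsites (hypothetical column):
-- ALTER TABLE users DROP CONSTRAINT users_username_key;
-- ALTER TABLE users ADD UNIQUE (subsite, username);
```

Renaming a user is then a one-row UPDATE on users; with username as the primary key, it would cascade through every referencing table.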
Re: [HACKERS] Unique constraints for non-btree indexes
On Wed, Jan 18, 2006 at 09:15:04AM -0500, Jonah H. Harris wrote:
> I thought gistinsert had checkUnique; it was just ifdef'd out because
> there was no code to enforce it... and as such, during bootstrap it was
> marked as amcanunique = false. Would it be that hard to enable it?

Well, it has the argument to gistinsert, but it is commented out and
there is no other reference to unique anywhere in the GiST code. Once
the support infrastructure is there we can talk about enabling it.

At the very least we need to decide how to indicate what "unique" is.
For example: saying the two ranges (1,3) and (2,4) cannot co-exist in
the same index is not really what most people would consider the
behaviour of a unique index. Indeed, for any particular data type there
may be multiple ways of defining a conflict. For 2-D objects it may
refer to having no objects overlap, but it could also refer to no
overlaps in the X or Y axes.

I guess what you're talking about is a constrained index, of which a
unique index is just a particular type. I suppose the actual constraint
would be one of the operators defined for the operator class (since
whatever the test is, it needs to be indexable). Although some would
obviously be more useful than others...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Re: [HACKERS] Unique constraints for non-btree indexes
I think I understand what you're saying, just that I don't think the
btree index has anything to do with it. The extensibility is there for
indexes to handle uniqueness in any way they choose. If you wanted to
add a common unique-index checking function for GiST, I'd just add it to
GiST. It just seems to me like the access methods should keep the
handling internal to themselves. On the chance that I'm not
understanding what you're saying, sorry.

On 1/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Well, it has the argument to gistinsert, but it is commented out and
> there is no other reference to unique anywhere in the GiST code. Once
> the support infrastructure is there we can talk about enabling it.
> [...]
Re: [HACKERS] debug_query_string and multiple statements
Yep, I couldn't find a better way to do it when I added
debug_query_string long ago. Unless we go to a lot of work to parse the
string, we could end up with something worse than we have now.

---

Neil Conway wrote:
> While reviewing Joachim Wieland's patch to add a pg_cursors system
> view, I noticed that the patch assumes that debug_query_string contains
> the portion of the submitted query string that corresponds to the SQL
> statement we are currently executing. That is incorrect:
> debug_query_string contains the *entire* verbatim query string sent by
> the client. So if the client submits the query string "SELECT 1; SELECT
> 2;", debug_query_string will contain exactly that string. (psql
> actually splits queries like the above into two separate FE/BE messages
> -- to see what I'm referring to, use libpq directly, or start up a copy
> of the standalone backend.)
>
> This makes debug_query_string the wrong thing to use for the pg_cursors
> and pg_prepared_statements views, but it affects other parts of the
> system as well: for example, given PQexec(conn, "SELECT 1; SELECT
> 2/0;") and log_min_error_statement = 'error', the postmaster will log:
>
>   ERROR: division by zero
>   STATEMENT: SELECT 1; SELECT 2/0;
>
> which seems misleading, and is inconsistent with the documentation's
> description of this configuration parameter. Admittedly this isn't an
> enormous problem, but I think the current behavior isn't ideal.
>
> Unfortunately I don't see an easy way to fix this. It might be possible
> to extract a semicolon-separated list of query strings from the parser
> or lexer, but that would likely have the effect of munging comments and
> whitespace from the literal string submitted by the client, which seems
> the wrong thing to do for logging purposes. An alternative might be to
> do a preliminary scan to look for semicolon-delimited query strings,
> and then pass each of those strings into raw_parser() separately, but
> that seems quite a lot of work (and perhaps a significant runtime cost)
> to fix what is at worst a minor UI wrinkle.
>
> Thoughts?
>
> -Neil

-- 
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Bad estimate on LIKE matching
Simon Riggs [EMAIL PROTECTED] writes:
> On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
> > Any way to teach the planner about this?
>
> In a recent thread on -perform, I opined that this case could best be
> solved by using dynamic random block sampling at plan time followed by
> a direct evaluation of the LIKE against the sample. This would yield a
> more precise selectivity and lead to the better plan. So it can be
> improved for the next release.

I find it exceedingly improbable that we'll ever install any such thing.
On-the-fly sampling of enough rows to get a useful estimate would
increase planning time by orders of magnitude --- and most of the time
the extra effort would be unhelpful.

In the particular case exhibited by Magnus, it is *really* unlikely that
any such method would do better than we are doing now. He was concerned
because the planner failed to tell the difference between selectivities
of about 1e-4 and 1e-6. On-the-fly sampling will do better only if it
manages to find some of those rows, which it is unlikely to do with a
sample size less than 1e5 or so rows. With larger tables the problem
gets rapidly worse.

regards, tom lane
Re: [HACKERS] Surrogate keys (Was: enums)
On Wed, Jan 18, 2006 at 01:08:53PM +0000, Leandro Guimarães Faria Corcete DUTRA wrote:
> > b) If each parent record will have many children, the space savings
> >    from using a surrogate key can be quite large
>
> Not such a common case.

Hmmm... many blog entries per user... many blog comments per entry...
many POs per customer... many line items per PO... etc., etc. I would
argue that one-many relationships are far more common than one-one, and
it's very common for an integer ID to be a more compact representation
than a real key.

> > c) depending on how you view things, putting actual keys all over
> >    the place is denormalized
>
> How come? Never!

Huh? One of the tenets of normalization is that you don't repeat data.
You don't use customer name in your PO table, because it's asking for
problems; what if a customer changes names (as just one example)?

> > Generally, I just use surrogate keys for everything unless
> > performance dictates something else.
>
> What I am proposing is the reverse: use natural keys for everything
> unless performance dictates something else. In support of my PoV:
> http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1

Read the bottom of it:

  "I am not saying that you should avoid autonumber surrogate keys like
  an SCO executive. The danger is not in their use but in their abuse.
  The events_id column in the events table didn't give us any trouble
  until we began to rely on it as the sole key for the table. The
  accounting application gave us problems because we were using the ID
  as the entire handle for the records. That crossed the line from use
  to misuse, and we suffered for it."

To paraphrase, the issue isn't that surrogate keys were used for RI; the
issue is that proper keys were not set up to begin with. Does it make
sense to have a customer table where customer_name isn't unique? Almost
certainly not. But that's just one possible constraint you might put on
that table. To put words in Josh's mouth, the issue isn't with using a
surrogate key, it's with not thinking about what constraints you should
be placing on your data. Take a look at cbk's comment; he does a great
job of summing the issue up.

-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf         cell: 512-569-9461
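Jim's point can be shown in two lines of DDL: the surrogate and the natural key are not mutually exclusive. A minimal sketch with illustrative names:

```sql
-- Surrogate key for joins and RI, natural candidate key still enforced.
CREATE TABLE customer (
    customer_id   serial PRIMARY KEY,    -- surrogate, referenced by FKs
    customer_name text NOT NULL UNIQUE   -- real-world key, kept honest
);
```

The "abuse" the blog post describes is dropping the second line, leaving the id as the only thing distinguishing two rows.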
[HACKERS] log_min_messages and debug levels
Hi,

On my machine (Win XP) I was trying to start the server (8.1.1) with
log_min_messages set to debug5 (just to see the messages :) but even
though the service starts, I cannot use psql nor pgAdmin; I receive an
error of "server closed the connection unexpectedly".

postgres=# select version();
                            version
--
 PostgreSQL 8.1.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC)
 3.4.2 (mingw-special)
(1 fila)

Sorry, my postgres is in Spanish, but maybe you can recognize the
message... ;)

C:\Archivos de programa\PostgreSQL\8.1\bin>psql -U postgres pruebas
psql: el servidor ha cerrado la conexión inesperadamente, probablemente
porque terminó de manera anormal antes o durante el procesamiento de la
petición.
(i.e. the server closed the connection unexpectedly, probably because it
terminated abnormally before or while processing the request)

Is this expected on Windows platforms?

-- 
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)
[HACKERS] No heap lookups on index
Allow me a brief introduction. I work for a company that contracts
intelligence analysis software to the government. We are currently
developing a product which uses PostgreSQL at its core. Due to the
licensing of the product and the integration with Perl, this is our
first choice in database solutions.

We are, however, currently stuck. We are storing millions of rows and
require very high query performance. We have spent the last several
months tweaking, list lurking and researching all the various tweaks and
performance enhancements, and have come to the conclusion that our
biggest slowdown is validating the index rows which match our selection
criteria against the heap values. In general cases the time required for
this is very small, but in our extreme use cases we are finding it slows
our queries by an unacceptable amount. We would like to resolve this
issue.

In that endeavor we have done some feasibility analysis (either to write
a patch ourselves or attempt to commission an expert to do so), starting
with the archives for this list. We found several posts discussing the
issue, and it seems that the complexity of storing the tuple visibility
information inside of the index rows is prohibitive for simple indexes.

I have used SQL Server in the past and have noticed that bookmark
lookups are avoided because they force the query executor to actually
fetch the data page off of disk, rather than return the values that
exist in the index. I have compared times against the PostgreSQL
installation and SQL Server to verify that SQL Server queries avoiding
bookmark lookups come back at roughly the same speed as Postgres queries
accessing clustered tables using the index the table is clustered on.

Since I am sure everyone is tired of the intro by now, I'll get to the
questions:

Do commercial databases implement MVCC in a way that allows an efficient
implementation of index lookups that can avoid heap lookups?

Is there any way to modify PostgreSQL to allow index lookups without
heap validation that doesn't involve re-writing the MVCC implementation
of keeping dead rows on the live table?

Is the additional overhead of keeping full tuple visibility information
inside of the index so odious to the Postgres community as to prevent a
patch with this solution from being applied back to the head? Maybe as
an optional-use feature?

We would prefer this solution for our needs over the bitmap of heap
pages listed in the TODO list, because we want to ensure optimal query
times regardless of the state of the cache, and because we are concerned
with performance in the face of concurrent updates at the page level.

Thanks for any thoughts on this. I know this is a perennial topic, but
we are seriously considering contributing either code or money to the
solution of this problem.

David Scott
Applied Technical Systems, Inc.
Re: [HACKERS] No heap lookups on index
David Scott [EMAIL PROTECTED] writes:
> Is the additional overhead of keeping full tuple visibility information
> inside of the index so odious to the Postgres community as to prevent a
> patch with this solution from being applied back to the head?

This has been discussed and rejected before (multiple times). If you
want it considered, you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a net
performance win is not good enough to justify the large amount of
development effort that would be required.

What sort of problems are you dealing with exactly? There has been some
discussion of changes that would improve certain scenarios. For
instance, it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.

regards, tom lane
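For anyone wanting to observe the behaviour being debated, a toy setup (table and index names invented) makes it easy to see that an index scan returning only indexed columns still visits the heap, because visibility lives only in the heap tuples:

```sql
-- Toy table: the index on k contains every value the query returns,
-- yet each index hit must still be checked against the heap tuple for
-- MVCC visibility, so heap pages are read as well as index pages.
CREATE TABLE big AS
    SELECT generate_series(1, 100000) AS k;
CREATE INDEX big_k_idx ON big (k);
ANALYZE big;

EXPLAIN SELECT k FROM big WHERE k BETWEEN 1 AND 1000;
```

The plan shows an index scan on big_k_idx, but timing it against a cold cache (or watching block-level statistics) shows the heap fetches that the original poster is trying to avoid.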
Re: [HACKERS] Bad estimate on LIKE matching
On Wed, 2006-01-18 at 10:37 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote: Any way to teach the planner about this? In a recent thread on -perform, I opined that this case could best be solved by using dynamic random block sampling at plan time followed by a direct evaluation of the LIKE against the sample. This would yield a more precise selectivity and lead to the better plan. So it can be improved for the next release. I find it exceedingly improbable that we'll ever install any such thing. On-the-fly sampling of enough rows to get a useful estimate would increase planning time by orders of magnitude --- and most of the time the extra effort would be unhelpful. In the particular case exhibited by Magnus, it is *really* unlikely that any such method would do better than we are doing now. He was concerned because the planner failed to tell the difference between selectivities of about 1e-4 and 1e-6. On-the-fly sampling will do better only if it manages to find some of those rows, which it is unlikely to do with a sample size less than 1e5 or so rows. With larger tables the problem gets rapidly worse. Your reply seems too strong; I wish to discuss further improvements, not fight. My willingness to do this is inspired by the years of excellent work that you and others have already contributed. I am attempting to provide a solution to the general problem. My way of doing this is to draw on my experience, just as I would draw upon any other body of knowledge such as academic papers or experimental results. My thinking is that perhaps Teradata, Oracle and DB2 were right to implement dynamic sampling for queries. Many things done elsewhere are wasted filigree, yet some are appropriate ideas that we are free to use. Accuracy need not be our goal, but a not-higher-than selectivity might allow us to avoid the worst case behaviour displayed here. 
On Wed, 2006-01-11 at 09:07 +, Simon Riggs wrote: On Tue, 2006-01-10 at 22:40 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I meant use the same sampling approach as I was proposing for ANALYZE, but do this at plan time for the query. That way we can apply the function directly to the sampled rows and estimate selectivity. I think this is so unlikely to be a win as to not even be worth spending any time discussing. The extra planning time across all queries will vastly outweigh the occasional improvement in plan choice for some queries. Extra planning time would be bad, so clearly we wouldn't do this when we already have relevant ANALYZE statistics. I would suggest we do this only when all of these are true:
- when accessing more than one table, so the selectivity could affect a join result
- when we have either no ANALYZE statistics, or ANALYZE statistics are not relevant to estimating selectivity, e.g. LIKE
- when access against the single table in question cannot find an index to use from other RestrictInfo predicates
I imagined that this would also be controlled by a GUC, dynamic_sampling which would be set to zero by default, and give a measure of sample size to use. (Or just a bool enable_sampling = off (default)). This is mentioned now because the plan under consideration in this thread would be improved by this action. It also isn't a huge amount of code to get it to work. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
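As a toy illustration of the proposal (the in-memory page layout, the function names, and the "not-higher-than" bound below are invented for illustration, not working planner code), sampling a handful of blocks and evaluating the LIKE directly against them might look like:

```python
import random

def sample_selectivity(pages, predicate, sample_pages=30, seed=0):
    """Estimate a predicate's selectivity from a random sample of pages."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pages)), min(sample_pages, len(pages)))
    seen = matched = 0
    for p in chosen:
        for row in pages[p]:
            seen += 1
            if predicate(row):
                matched += 1
    if matched == 0:
        # "not-higher-than" bound: report just under one hit per sample
        # instead of claiming a selectivity of zero
        return 1.0 / (seen + 1)
    return matched / seen

# toy heap: 1000 pages of 50 path strings each
pages = [["f:/userdirs/s/user_%d_%d/file" % (p, i) for i in range(50)]
         for p in range(1000)]
like = lambda row: row.startswith("f:/userdirs/s/user_7")
est = sample_selectivity(pages, like)
```

Even when the sample finds no matching rows, the estimate stays strictly above zero, which is enough to avoid the rows=1 nestloop trap discussed in this thread.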
Re: [HACKERS] No heap lookups on index
David, You can find some of this discussion in Much Ado About COUNT(*). Related to that discussion, I had written a patch which added visibility information to the indexes. If you're interested in the patch and/or consulting, contact me offline. -Jonah On 1/18/06, Tom Lane [EMAIL PROTECTED] wrote: David Scott [EMAIL PROTECTED] writes: Is the additional overhead of keeping full tuple visibility information inside of the index so odious to the Postgres community as to prevent a patch with this solution from being applied back to the head? This has been discussed and rejected before (multiple times). If you want it considered you'll have to present stronger arguments than have so far been made. The current consensus is that the probability of a net performance win is not good enough to justify the large amount of development effort that would be required. What sort of problems are you dealing with exactly? There has been some discussion of changes that would improve certain scenarios. For instance it might be plausible to do joins using index information and only go back to the heap for entries that appear to pass the join test. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Unique constraints for non-btree indexes
Martijn van Oosterhout kleptog@svana.org writes: check_unique_index( ctid of inserting tuple, ctid of possibly conflicting tuple) I agree it's pretty ugly to have the index AM directly poking into the heap, but adding a level of subroutine doesn't really make that a whole lot nicer :-(. In any case, you've underestimated the amount of coupling here: if the conflicting tuple is dead, _bt_check_unique also wants to know just how dead it is, so it can possibly set LP_DELETE on the old index entry. Now, one side-effect is that you could build deferrable unique constraints on top of this by having the check function always return InvalidTransactionId but storing the conflicts for later checking. I think this is not as easy as all that; consider race conditions against VACUUM for instance (the tuples might not be there anymore when you want to check the conflict). Also, we really do want to go back and set LP_DELETE if the conflict tuple is sufficiently dead. Having that not happen is unappetizing, because you could end up repeating the check a large number of times over successive updates. N updates will take O(N^2) time. My own thoughts about deferred unique checks have been along the lines of storing the possibly-conflicting key value when the initial check notes a problem, and then repeating the index search at commit. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
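Tom's commit-time recheck can be modelled in miniature. Everything below — the class, its names, and the in-memory "index" — is an invented sketch of the control flow, not backend code: a possible conflict is only remembered at insert time, and the search is repeated at commit, by which point an intervening delete may have resolved it.

```python
class DeferredUniqueIndex:
    """Toy model of deferred unique checking: remember possibly-conflicting
    keys at insert time, repeat the index search at commit."""
    def __init__(self):
        self.entries = {}      # key -> list of tuple ids
        self.pending = set()   # keys whose uniqueness must be rechecked

    def insert(self, key, tid):
        if key in self.entries:
            self.pending.add(key)   # possible conflict: defer the decision
        self.entries.setdefault(key, []).append(tid)

    def delete(self, key, tid):
        self.entries.get(key, []).remove(tid)

    def commit(self):
        # the deferred recheck: repeat the search for each remembered key
        for key in self.pending:
            if len(self.entries.get(key, [])) > 1:
                raise ValueError("duplicate key %r" % key)
        self.pending.clear()

idx = DeferredUniqueIndex()
idx.insert("a", 1)
idx.insert("a", 2)   # no error yet: the check is deferred
idx.delete("a", 1)   # the old version goes away before commit
idx.commit()         # recheck finds only one entry, so the commit succeeds
```

The VACUUM race Tom mentions is visible even in this toy: what matters at commit is what the recheck finds then, not what was seen at insert time.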
Re: [HACKERS] Unique constraints for non-btree indexes
Martijn van Oosterhout kleptog@svana.org writes: I guess what you're talking about is a constrained index, of which a unique index is just a particular type. I suppose the actual constraint would be one of the operators defined for the operator class (since whatever the test is, it needs to be indexable). Although some would obviously be more useful than others... I think the generalization that would be appropriate for GIST is that a unique index guarantees there are no two entries x, y such that x ~ y, where ~ is some boolean operator nominated by the opclass. We'd probably have to insist that ~ is commutative (x ~ y iff y ~ x). Concurrent insertion into a unique GIST index seems a bit nasty. In btree we can identify a unique page to lock for any given key value to ensure that no one else is concurrently inserting a conflicting key, thus usually allowing concurrent insertions of different keys. But I don't see how you do that for an arbitrary ~ operator. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Unique constraints for non-btree indexes
On Wed, Jan 18, 2006 at 04:10:16PM -0500, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: check_unique_index( ctid of inserting tuple, ctid of possibly conflicting tuple) I agree it's pretty ugly to have the index AM directly poking into the heap, but adding a level of subroutine doesn't really make that a whole lot nicer :-(. Well, the rationale is that in theory the same logic would be applied for GiST indexes, so it'd be nice to do it in one place rather than repeat it for each index AM. In any case, you've underestimated the amount of coupling here: if the conflicting tuple is dead, _bt_check_unique also wants to know just how dead it is, so it can possibly set LP_DELETE on the old index entry. Hmm, ok. There's more info that the index AM would like, but the same info would be required for both GiST and b-tree, no? (assuming GiST has the same delete optimisation) My own thoughts about deferred unique checks have been along the lines of storing the possibly-conflicting key value when the initial check notes a problem, and then repeating the index search at commit. Well, I didn't want to exclude that possibility. Racing with VACUUM is a tricky one. If we keep an array of ctid in memory, we need to know if VACUUM removes one of them. OTOH, the recheck will then return either a blank or a tuple which definitely doesn't match. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Unique constraints for non-btree indexes
On Wed, Jan 18, 2006 at 04:18:10PM -0500, Tom Lane wrote: I think the generalization that would be appropriate for GIST is that a unique index guarantees there are no two entries x, y such that x ~ y, where ~ is some boolean operator nominated by the opclass. We'd probably have to insist that ~ is commutative (x ~ y iff y ~ x). Commutative, that's the criterion I was looking for. To be sensible for this purpose, the operator has to be commutative (the commutator is itself). This works for b-tree by including = and excluding < and >. Similarly for GiST indexes, contains no, overlaps yes. That's a fairly easy test. Concurrent insertion into a unique GIST index seems a bit nasty. In btree we can identify a unique page to lock for any given key value to ensure that no one else is concurrently inserting a conflicting key, thus usually allowing concurrent insertions of different keys. But I don't see how you do that for an arbitrary ~ operator. Well, the best I could come up with was to just do the insert for value X and then do a full index scan for X across the given constraint operator (~). Any matches would need to go through the check_unique_index function as defined earlier. The only issue is that we can't really do easy optimisation. In the case of deferred constraints you can't even remember where in the tree you were because new keys could be added anywhere, so you always have to start from the top. The issue I get is deadlocks:
1. Process A inserts value X
2. Process B inserts value Y (where X ~ Y is true)
3. Process A begins scan, finds Y and waits for B
4. Process B begins scan, finds X and waits for A
Oops. The only way I can think of solving that is by marking the entries tentative until the scan is complete and provide a way of resolving conflicts between two tentative entries. Requires more thinking. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
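Martijn's "fairly easy test" — the operator must be its own commutator — can be spot-checked mechanically. A hedged Python sketch (the operator definitions are simplified stand-ins for the real opclass members, with intervals modelled as (lo, hi) pairs):

```python
from itertools import product

def is_self_commutator(op, samples):
    """Spot-check whether op(x, y) == op(y, x) for all sample pairs --
    the property required of the 'unique' operator ~."""
    return all(op(x, y) == op(y, x) for x, y in product(samples, repeat=2))

nums = [1, 2, 3]
eq = lambda a, b: a == b          # b-tree =: qualifies
lt = lambda a, b: a < b           # b-tree <: does not

# intervals as (lo, hi): overlaps is commutative, contains is not
ivals = [(0, 3), (1, 2), (5, 6)]
overlaps = lambda a, b: a[0] <= b[1] and b[0] <= a[1]
contains = lambda a, b: a[0] <= b[0] and b[1] <= a[1]
```

Running the check reproduces the classification in the mail: `=` and overlaps pass, `<` and contains fail.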
Re: [HACKERS] No heap lookups on index
On Wed, 2006-01-18 at 12:14 -0800, David Scott wrote: Is the additional overhead of keeping full tuple visibility information inside of the index so odious to the Postgres community as to prevent a patch with this solution from being applied back to the head? Maybe as an optional use feature? You might want to consider the thought of organised heaps as an alternative to index improvements. That way there is no heap to avoid visiting because the index is also the main data structure.
- Teradata provides hash or value-ordered tables
- Oracle offers index organised tables
- DB2 offers multi-dimensional clustering
- Tandem offered value-ordered tables
This would offer performance, but would be one of the largest patches seen in recent times. You may find some co-backers. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Unique constraints for non-btree indexes
Martijn van Oosterhout kleptog@svana.org writes: On Wed, Jan 18, 2006 at 04:18:10PM -0500, Tom Lane wrote: In btree we can identify a unique page to lock for any given key value to ensure that no one else is concurrently inserting a conflicting key, thus usually allowing concurrent insertions of different keys. But I don't see how you do that for an arbitrary ~ operator. The issue I get is deadlocks: Right, the deadlock risk is exactly the reason you need some secret sauce or other. Btree's page-level lock ensures that two insertions of conflicting keys can't overlap (even if they ultimately get stored on different pages). That's not the only way to fix this but it's a pretty good way. BTW, the deadlock risk also applies to deferred uniqueness checks. Again, in btree it's possible to avoid this if you do a fresh indexscan (and take a lock on the first scanned page while you do that). If you try to do it without consulting the index then you need some other way to break ties. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
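The "other way to break ties" Tom mentions can be as simple as a rule that both backends compute identically — for instance, lowest transaction id wins. This is an illustrative rule only, not what PostgreSQL's btree actually does; the point is just that a deterministic, symmetric decision prevents the wait cycle from Martijn's deadlock example:

```python
def winner(a_xid, b_xid):
    """Both backends evaluate the same rule, so they agree on who
    proceeds; neither ends up waiting on the other in a cycle."""
    return min(a_xid, b_xid)

# Process A (xid 17) and process B (xid 23) each find the other's
# tentative entry during their post-insert scan; both conclude A wins,
# so B waits for A (or aborts) and no A<->B wait cycle can form.
outcome = winner(17, 23)
```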
Re: [HACKERS] No heap lookups on index
On Wed, Jan 18, 2006 at 12:14:12PM -0800, David Scott wrote: Do commercial databases implement MVCC in a way that allows an efficient implementation of index lookups that can avoid heap lookups? Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. Every update/delete means two separate writes to disk, one for the base table and one for the undo log (well, there's also the redo log, equivalent to our WAL). Though writes to undo can (and presumably are) grouped together, so they should normally be a lot more efficient than the updates to the base table unless you're updating data in table order. Of course there are downsides to our MVCC as well; the cost of index scans is just one. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] No heap lookups on index
On Wed, Jan 18, 2006 at 04:02:45PM -0500, Jonah H. Harris wrote: David, You can find some of this discussion in Much Ado About COUNT(*). Related to that discussion, I had written a patch which added visibility information to the indexes. If you're interested in the patch and/or consulting, contact me offline. Does the patch change all indexes across the board? Do you have any performance numbers? I suspect that in some situations storing visibility info in the index would be a big win; if that's the case it would be very good if there was an option that allowed it. Perhaps this could be done using a different index access method... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
FW: [HACKERS] Surrogate keys (Was: enums)
Ooops, fat-finger'd -hackers... -Original Message- Adding -hackers back to the list. From: Leandro Guimarães Faria Corcete Dutra On Mon, 2006-01-16 at 12:52 -0600, Jim C. Nasby wrote: On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote: For UPDATEs and INSERTs, the proper primary key also needs to be checked, but keys are used for more than just checking uniqueness: they're also often used in JOINs. Joining against a single integer I'd think quite a different proposition (I'd think faster in terms of performance) than joining against, say, a text column or a composite key. How different is that? Comparing two ints is much, much faster than comparing two text fields. For a small number of comparisons, it doesn't matter. When you're joining tables together, it's a different story. a) the optimizer does a really poor job on multi-column index statistics Then it should eventually be fixed? It's on the to-do, but it's not an easy nut to crack. b) If each parent record will have many children, the space savings from using a surrogate key can be quite large Only where the surrogate is significantly smaller than the natural? #define significant Here's a real-life example: the primary table for stats.distributed.net has about 120M rows. One field in that table (participant_id) links back to the participant table; it's an int. If instead we used participant_name and that averaged 8 characters in length, that would grow the main table by 1GB (8 chars takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The machine that stats runs on has 4G of memory, so cutting 1G of wasted space out of that table helps quite a bit. (In actuality, there isn't participant_name... participants are identified by email address (not a great idea, but I wasn't around when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalized email out of that main table things got substantially faster. 
That was a number of years ago, so the table was probably 15-25% of its current size, but it still made a huge difference.) c) depending on how you view things, putting actual keys all over the place is denormalized How come? See my other reply... :) Generally, I just use surrogate keys for everything unless performance dictates something else. Shouldn't it be the other way round, for the user's sake? Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even make the views updatable. But here are two other things to consider: In many cases you can't define a single field as a unique key. So you end up with having to add many extra keys to all your join clauses. Not very friendly, and prone to error. Not every language has equal support for text comparisons (and in my experience, almost all real keys are mostly text). -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
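Jim's 1GB figure is easy to reproduce as back-of-envelope arithmetic. The function below just restates his stated assumptions (a 4-byte int surrogate versus an 8-character text key carrying a 4-byte varlena header) and is not a real on-disk size calculation — alignment, padding, and index size are ignored:

```python
def key_overhead_gb(rows, natural_bytes, varlena_header=4, surrogate_bytes=4):
    """Extra heap space from storing a text key in every child row
    instead of a 4-byte int surrogate, in GiB."""
    per_row = (natural_bytes + varlena_header) - surrogate_bytes
    return rows * per_row / 1024**3

# 120M rows, 8-character participant_name vs int participant_id:
extra = key_overhead_gb(120_000_000, 8)   # roughly 0.9 GiB, i.e. "about 1GB"
```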
Re: [HACKERS] pgxs/windows
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: dllwrap doesn't seem to get given LDFLAGS, and maybe doesn't honor it either. I wouldn't expect it to handle everything that might appear in LDFLAGS, but maybe it ought to be given the -L items from LDFLAGS (compare the way we copy just those items into SHLIB_LINK). What's confusing me at the moment is that it seems to work for Magnus. Are you maybe working from different source trees? I believe we changed around the handling of these switches recently. Maybe in his copy, the -L items from LDFLAGS are making it into the dllwrap command via SHLIB_LINK. I am working against 8.1 from the installer - he is working against a local mingw install. Also, he might be working from a later toolset - I have gcc 3.2.4 while gcc 3.4.2 is the latest mingw release - some other tools might also be mildly out of date. Could this be related to the fact that pre-8.2 makefiles were not space-safe? I am unsure how pgxs worked on Win32 without being space-safe. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] No heap lookups on index
Simon Riggs [EMAIL PROTECTED] writes: You might want to consider the thought of organised heaps as an alternative thought to index improvements. That way there is no heap to avoid visiting because the index is also the main data structure. This would offer performance, but would be one of the largest patches seen in recent times. You may find some co-backers. Either way it would be a pretty monstrous patch :-( ... in this case because of the amount of code that knows about the properties of heap storage, and in what David is thinking about because of the implications of trying to keep multiple copies of tuple state up-to-date. We'd probably end up with a cleaner system structure if we tried to create an API separating out the knowledge of heap structure, but the amount of work needed seems out of proportion to the benefit. It might be possible to compromise though. Imagine an index that contains only the upper levels of a search tree --- links to what would be the leaf level point into the associated heap. In this design the heap is still a heap in the sense that you can seqscan it without any awareness of the index structure. What you can't do is insert tuples or move them around without the index AM's say-so. RelationGetBufferForTuple would become an index AM call, but otherwise I think the impact on existing code wouldn't be large. There are some limitations. For instance I don't think that the index AM could control the order of items within a heap page, because of the need for TIDs to be persistent; so within-page searches would still be kinda slow. But it's interesting to think about. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: FW: [HACKERS] Surrogate keys (Was: enums)
Comparing two ints is much, much faster than comparing two text fields. For a small number of comparisons, it doesn't matter. When you're joining tables together, it's a different story. That is where data independence would come in handy... like a better enum, with possreps and hidden implementation. Forgive me my ignorance, but are ints inherently faster to compare than strings, or is it just an implementation detail? Ideally, if this is so a fully data-independent system would create a hash behind the user's back in order to get performance. The CPU can do an integer comparison with one instruction; it can't do that with a text string. (Well, theoretically if the string was 3 or 4 bytes exactly (depending on null termination) you could compare just as fast, but I'm pretty certain that no compiler is that fancy.) Here's a real-life example: the primary table for stats.distributed.net has about 120M rows. One field in that table (participant_id) links back to the participant table; it's an int. If instead we used participant_name and that averaged 8 characters in length, that would grow the main table by 1GB (8 chars takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The machine that stats runs on has 4G of memory, so cutting 1G of wasted space out of that table helps quite a bit. OK, hardly a typical example. As I think I left clear, my problem is not using surrogate keys, but using them by default, or even exclusively. No? It's certainly not uncommon to have tables with 100M+ rows. And keep in mind that this applies to every row of every table that has foreign keys. I'd bet it's actually common to save 1G or more with surrogate keys in moderately sized databases. Of course, you do have to be intelligent here, too. The only key defined on the table in my example is participant_id, project_id, date; there is no surrogate key because there's no real reason to have one. (In actuality, there isn't participant_name... 
participants are identified by email address (not a great idea, but I wasn't around when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalized email out of that main table things got substantially faster. That was a number of years ago, so the table was probably 15-25% of its current size, but it still made a huge difference.) This isn't normalisation at all, as far as I understand it. It is just I don't have the rules of normalization memorized enough to know what form this breaks, but I'm 99% certain it breaks at least one of them. Look at it this way: if someone wants to change their email address, best case scenario is that you have cascading RI setup and it updates thousands of rows in that table. Worst case scenario, you just de-linked a whole bunch of data. But with a surrogate key, all you have to do is update one row in one table and you're done. that we don't have data independence... so you had to expose an implementation detail? Expose to what? The application? First, this is a pretty minor thing to expose; second, if it's that big a concern you can completely hide it by using a view. But the reality is, dealing with a numeric ID can be a heck of a lot easier than an email address. Look at URLs that embed one versus the other for a good example. Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even make the views updatable. But here are two other things to consider: These views, in heavy querying environments, can be prohibitive. Normalize 'til it hurts; denormalize 'til it works. Yes, the added overhead of rules for updates/inserts/deletes could start to add up in performance-critical code. But if performance is that critical you're far more likely to run into other bottlenecks first. And worst-case, you abstract behind a stored procedure that just has the right queries hard-coded. 
As for select-only views you'll have a hard time showing any meaningful performance penalty. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
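The cost asymmetry Jim describes can be made concrete with a toy count of the byte comparisons a memcmp-style equality test performs — a deliberate simplification that ignores collation, alignment, and length words. An int equality is a single word compare; a text equality walks bytes until it finds a difference:

```python
def text_cmp_ops(a, b):
    """Count byte comparisons before an equality verdict is reached."""
    ops = 0
    for x, y in zip(a.encode(), b.encode()):
        ops += 1
        if x != y:
            return ops, False
    return ops + 1, len(a) == len(b)   # +1 for the final length check

# email-style keys share long common prefixes, so many bytes are
# examined before two different keys can be told apart
ops, eq = text_cmp_ops("user1000@example.com", "user1001@example.com")
```

Here eight byte comparisons are needed where the surrogate int key would settle the question in one, which is the per-comparison cost that multiplies across a large join.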
Re: [HACKERS] Surrogate keys (Was: enums)
Martijn, Interesting. However, in my experience very few things have natural keys. There is no combination of attributes for people, phone calls or even real events that make useful natural keys. I certainly hope that I never have to pick up one of your projects. A table without a natural key is a data management disaster. Without a key, it's not data, it's garbage. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Surrogate keys (Was: enums)
-Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Josh Berkus Sent: Wednesday, January 18, 2006 3:59 PM To: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Surrogate keys (Was: enums) Martjin, Interesting. However, in my experience very few things have natural keys. There are no combination of attributes for people, phone calls or even real events that make useful natural keys. I certainly hope that I never have to pick up one of your projects. A table without a natural key is a data management disaster. Without a key, it's not data, it's garbage. I have a different opinion. The data should absolutely never use a natural key as a primary key. The data should use something like a sequence for the primary key. Examples: SSN -- believe it or not, SSN's sometimes change. First, Middle, Last names -- Not really unique Street Address -- More than one person can live there. They can move. Basically, every physical attribute or logical attribute is a terrible choice for a primary key. They won't cause problems very often, it's true. But when they do cause problems, it is a terrible doozie of a problem. Now, on the other hand, if we are talking about INDEXES here, that's a horse of a different color. Lots of natural attributes and combinations of natural attributes make excellent candidates for keys. Such things as SSN, names, addresses, phone numbers, etc. Therefore, I am guessing the two posters upstream in this thread that I am responding to were therefore talking about different subjects altogether. One was talking about using natural attributes for indexes, which is a superior idea that I agree with. The other was talking about never using natural attributes for keys, which I also agree with. Therefore, I am guessing that everyone is in complete agreement, but it is a nomenclature thing. Just a guess. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Surrogate keys (Was: enums)
-Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Dann Corbit Sent: Wednesday, January 18, 2006 4:04 PM To: josh@agliodbs.com; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Surrogate keys (Was: enums) -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Josh Berkus Sent: Wednesday, January 18, 2006 3:59 PM To: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Surrogate keys (Was: enums) Martjin, Interesting. However, in my experience very few things have natural keys. There are no combination of attributes for people, phone calls or even real events that make useful natural keys. I certainly hope that I never have to pick up one of your projects. A table without a natural key is a data management disaster. Without a key, it's not data, it's garbage. I have a different opinion. The data should absolutely never use a natural key as a primary key. The data should use something like a sequence for the primary key. Examples: SSN -- believe it or not, SSN's sometimes change. First, Middle, Last names -- Not really unique Street Address -- More than one person can live there. They can move. Basically, every physical attribute or logical attribute is a terrible choice for a primary key. They won't cause problems very often, it's true. But when they do cause problems, it is a terrible doozie of a problem. Now, on the other hand, if we are talking about INDEXES here, that's a horse of a different color. Lots of natural attributes and combinations of natural attributes make excellent candidates for keys. Make that: combinations of natural attributes make excellent candidates for indexes. See. I even messed it up, when I was trying to highlight the distinction. Of course, we can probably just chalk that up to dumb as a box of hammers. Such things as SSN, names, addresses, phone numbers, etc. 
Therefore, I am guessing the two posters upstream in this thread that I am responding to were therefore talking about different subjects altogether. One was talking about using natural attributes for indexes, which is a superior idea that I agree with. The other was talking about never using natural attributes for keys, which I also agree with. Therefore, I am guessing that everyone is in complete agreement, but it is a nomenclature thing. Just a guess. ---(end of broadcast)--- TIP 6: explain analyze is your friend ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] No heap lookups on index
On Wed, 2006-01-18 at 18:27 -0500, Tom Lane wrote: Imagine an index that contains only the upper levels of a search tree --- links to what would be the leaf level point into the associated heap. In this design the heap is still a heap in the sense that you can seqscan it without any awareness of the index structure. What you can't do is insert tuples or move them around without the index AM's say-so. RelationGetBufferForTuple would become an index AM call, but otherwise I think the impact on existing code wouldn't be large. Eureka! I had been thinking of a block level index which sounds almost the same thing (as opposed to the row level indexes we have now). We only need to index the row with the lowest value on any page so the main index would get 100 times smaller. The main part of the index would not need to be written to except when a block overflows. I had imagined an ordering within a block to allow fast uniqueness checks, but it would be pretty fast either way. Merge joins with the same index become block-level joins without sorts. We would just do an individual block sort before merging, so no need for very large sort-merges. Even if the block level indexes differ, we only need to sort one of the tables. Hopefully we could avoid trying to support GIST-heaps? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
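Simon's block-level index can be sketched as a sparse index over a value-ordered heap: keep only each page's lowest key, binary-search those, then search within the single candidate page. A toy in-memory model (all names invented; it assumes the organised-heap property that each page holds a contiguous key range):

```python
from bisect import bisect_right

class SparseIndex:
    """One index entry per page (its lowest key), as in Simon's sketch."""
    def __init__(self, pages):
        self.pages = pages                   # value-ordered heap: each page
        self.lows = [min(p) for p in pages]  # holds a contiguous key range

    def lookup(self, key):
        # rightmost page whose lowest key is <= key
        i = bisect_right(self.lows, key) - 1
        if i < 0:
            return None
        # within-page search: a scan (or per-page sort) is cheap
        return key if key in self.pages[i] else None

keys = list(range(0, 1000, 3))                        # 334 rows
pages = [keys[i:i + 50] for i in range(0, len(keys), 50)]
idx = SparseIndex(pages)                              # 7 entries, not 334
```

The index carries one entry per 50-row page, which is the "100 times smaller" effect: only page splits or overflows touch the index, not ordinary within-page inserts.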
Re: [HACKERS] pgxs/windows
Bruce Momjian wrote: Could this be related to the fact that pre-8.2 makefiles were not space-safe? I am unsure how pgxs worked on Win32 without being space-safe. I don't see how. In fact, pgxs seems to use short form paths anyway. Example (from previous email): dllwrap -o rainbow.dll --def rainbow.def rainbow.o c:/PROGRA~1/POSTGR~1/8.1/lib/pgxs/src/MAKEFI~1/../../src/utils/dllinit.o -Lc:/PROGRA~1/POSTGR~1/8.1/bin -lpostgres No spaces there. The problem is it says bin instead of lib before -lpostgres. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: FW: [HACKERS] Surrogate keys (Was: enums)
On Wed, 2006-01-18 at 17:22 -0600, Jim C. Nasby wrote: Forgive me my ignorance, but are ints inherently faster to compare than strings, or is it just an implementation detail? Ideally, if this is so a fully data-independent system would create a hash behind the user's back in order to get performance. The CPU can do an integer comparison with one instruction. OK. Again, data independence should be the goal here. OK, hardly a typical example. As I think I left clear, my problem is not using surrogate keys, but using them by default, or even exclusively. No? It's certainly not uncommon to have tables with 100M+ rows. No, but neither are they *that* common. Certainly, lots of databases have a few of them. But then, they have dozens, hundreds, thousands of much smaller tables. And keep in mind that this applies to every row of every table that has foreign keys. I'd bet it's actually common to save 1G or more with surrogate keys in moderately sized databases. Only if you have quite a few children, because otherwise, in the main tables, the surrogate keys add a field, an index and a sequence to an otherwise smaller table and index. Of course, you do have to be intelligent here, too. The only key defined on the table in my example is participant_id, project_id, date; there is no surrogate key because there's no real reason to have one. Quite. (In actuality, there isn't participant_name... participants are identified by email address (not a great idea, but I wasn't around when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalized email out of that main table things got substantially faster. That was a number of years ago, so the table was probably 15-25% of its current size, but it still made a huge difference.) This isn't normalisation at all, as far as I understand it. 
It is just that I don't have the rules of normalization memorized well enough to know what form this breaks, but I'm 99% certain it breaks at least one of them. No, never. Normalisation is about eliminating redundancy and, therefore, update anomalies: making all the tables dependent on only the keys, and the whole keys, by projecting relations to eliminate entity mixups. What you mention is actually exposing an implementation detail, namely an integer that serves as a hash of the key. Look at it this way: if someone wants to change their email address, the best case scenario is that you have cascading RI set up and it updates thousands of rows in that table. Worst case scenario, you just de-linked a whole bunch of data. But with a surrogate key, all you have to do is update one row in one table and you're done. OK, if you have lots of linked data. But most tables are really dead ends. That we don't have data independence... so you had to expose an implementation detail? Expose to what? The application? First, this is a pretty minor thing to expose; second, if it's that big a concern you can completely hide it by using a view. As someone said, you end up with ids everywhere, and no user-understandable data at all... But the reality is, dealing with a numeric ID can be a heck of a lot easier than an email address. Look at URLs that embed one versus the other for a good example. Again, implementation details... levels mixup. Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even make the views updatable. But here are two other things to consider: These views, in heavy querying environments, can be prohibitive. Normalize 'til it hurts; denormalize 'til it works. Lack of data implementation biting us again. Yes, the added overhead of rules for updates/inserts/deletes could start to add up in performance-critical code. But if performance is that critical you're far more likely to run into other bottlenecks first. 
And worst-case, you abstract behind a stored procedure that just has the right queries hard-coded. As for select-only views you'll have a hard time showing any meaningful performance penalty. Yet real user-defined data types could make it all much simpler. -- +55 (11) 5685 2219 xmpp:[EMAIL PROTECTED] +55 (11) 9406 7191 Yahoo!: lgcdutra +55 (11) 5686 9607 MSN: [EMAIL PROTECTED] +55 (11) 4390 5383 ICQ/AIM: 61287803 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
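None of the code below appears in the thread; it is a toy sketch (against SQLite, with invented table and column names) of the update-cost argument above: changing a natural key such as an email address cascades into every child row, while a surrogate key needs a single-row update.

```python
# Hypothetical schema illustrating natural-key vs surrogate-key update cost.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- natural key: email is the PK and is copied into every child row
    CREATE TABLE participant_nat (email TEXT PRIMARY KEY);
    CREATE TABLE activity_nat (
        email TEXT REFERENCES participant_nat(email) ON UPDATE CASCADE,
        ts    TEXT
    );
    -- surrogate key: children reference an int id; email lives in one place
    CREATE TABLE participant_sur (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
    CREATE TABLE activity_sur (
        participant_id INTEGER REFERENCES participant_sur(id),
        ts             TEXT
    );
""")
conn.execute("INSERT INTO participant_nat VALUES ('old@example.com')")
conn.execute("INSERT INTO participant_sur VALUES (1, 'old@example.com')")
for i in range(1000):
    conn.execute("INSERT INTO activity_nat VALUES ('old@example.com', ?)", (str(i),))
    conn.execute("INSERT INTO activity_sur VALUES (1, ?)", (str(i),))

# Natural key: the cascade rewrites all 1000 child rows.
conn.execute("UPDATE participant_nat SET email = 'new@example.com'")
n_cascaded = conn.execute(
    "SELECT count(*) FROM activity_nat WHERE email = 'new@example.com'").fetchone()[0]

# Surrogate key: exactly one row changes; the 1000 child rows are untouched.
cur = conn.execute("UPDATE participant_sur SET email = 'new@example.com' WHERE id = 1")
print(n_cascaded, cur.rowcount)  # 1000 1
```

The "worst case scenario, you just de-linked a whole bunch of data" from the message is what happens when the natural-key schema lacks the ON UPDATE CASCADE clause used here.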
Re: [HACKERS] No heap lookups on index
Simon Riggs [EMAIL PROTECTED] writes: Hopefully we could avoid trying to support GIST-heaps? Well, that would be an extra index AM that someone might or might not get around to writing someday. I was thinking that both btree and hash index AMs might be interesting for this, though. Hash in particular would adapt pretty trivially ... regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Surrogate keys (Was: enums)
On Jan 19, 2006, at 9:31, Leandro Guimarães Faria Corcete Dutra wrote: OK. Again, data independence should be the goal here. snip / Again, implementation details... levels mixup. snip / Lack of data implementation biting us again. snip / Yet real user-defined data types could make it all much simpler. Again, again, and again, patches welcome! PostgreSQL is an open-source project, and people contribute in a variety of ways, two of which include submitting code and sponsoring others to develop code. If you look at the todo list, there are *lots* of things people would like to see improved in PostgreSQL, but the pace at which PostgreSQL is improved and what is improved is driven in large part by what people are willing to do themselves or sponsor. If these are things you're interested in (and it certainly appears you are), why not contribute? Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] No heap lookups on index
Simon Riggs [EMAIL PROTECTED] writes: We only need to index the row with the lowest value on any page so the main index would get 100 times smaller. The main part of the index would not need to be written to except when a block overflows. BTW, the above is equivalent to saying that the leaf-level index pages aren't there: the downlink pointers on the level-1 index pages are pointers to heap pages, instead, and you're right that they effectively only index the lowest value per page (actually IIRC the highest value per page, but same difference). I think the 100x figure is overoptimistic though. There will be a lot fewer entries per leaf page because actual heap tuples will be a lot larger than index entries (typically at least). Hence, you need more level-1 entries and so the upper index levels are bigger than in a simple index. Another point is that the heap will be somewhat bloated compared to a simple heap because of containing more unused space. The traditional rule-of-thumb is that a btree index is only about 2/3rds full at steady state, and I suppose this would apply to a btree-organized heap too. Still, it seems like an idea worth investigating. Merge joins with the same index become block-level joins without sorts. We would just do an individual block sort before merging, so no need for very large sort-merges. Even if the block level indexes differ, we only need to sort one of the tables. I'd phrase that a little differently: an indexscan on such an index would normally deliver unordered output, but you could demand ordered output and get it by doing successive one-page sorts. I doubt it's worth inventing a new layer of mergejoin code to do this rather than keeping it at the index access level. 
Come to think of it, the idea also seems to map nicely into bitmap index scans: the index will directly hand back a list of potential pages to look at, but they are all marked lossy because the index doesn't know exactly which tuple(s) on the target pages match the query. The existing bitmap-heap-scan code can take it from there. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
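As a hedged sketch (mine, not from any patch or PostgreSQL source) of the structure Simon and Tom are discussing: the "index" holds only one boundary key per heap page, a probe bisects those boundaries to pick the single candidate page, and the match is lossy until the page itself is scanned, exactly as with the lossy bitmap pages above.

```python
# Toy model of a one-key-per-heap-page ("index-organized") lookup.
from bisect import bisect_right

PAGE_SIZE = 4  # tuples per heap page; tiny so the structure is visible

def build(keys):
    run = sorted(keys)
    pages = [run[i:i + PAGE_SIZE] for i in range(0, len(run), PAGE_SIZE)]
    # the whole index is one boundary key per page, not one entry per tuple
    boundaries = [p[0] for p in pages]
    return pages, boundaries

def lookup(pages, boundaries, key):
    # bisect over per-page minimums picks the only page that can hold the key
    i = bisect_right(boundaries, key) - 1
    if i < 0:
        return False
    # lossy match: we must scan the page to confirm the tuple is really there
    return key in pages[i]

pages, boundaries = build(range(0, 100, 3))  # keys 0, 3, ..., 99
print(len(boundaries))                        # 9 index entries cover 34 tuples
print(lookup(pages, boundaries, 42), lookup(pages, boundaries, 43))  # True False
```

The 100x shrink claimed upthread corresponds to PAGE_SIZE tuples per boundary entry; as Tom notes, real heap tuples are wider than index entries, so the actual ratio would be smaller.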
Re: [HACKERS] No heap lookups on index
Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation - commits, rather than rollbacks? Chris ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] No heap lookups on index
On Thu, Jan 19, 2006 at 09:18:55AM +0800, Christopher Kings-Lynne wrote: Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation - commits, rather than rollbacks? Generally true, but keep in mind this counter-argument... our MVCC performs fewer disk writes (since generally you can find some free space on the page you're modifying) and you can control when you take the hit of cleaning up dead space. In fact, you can take that hit at a reduced priority (vacuum_cost_*). -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] No heap lookups on index
Christopher Kings-Lynne [EMAIL PROTECTED] writes: Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation - commits, rather than rollbacks? The shuffling off of the data is expensive in itself, so I'm not sure you can argue that the Oracle way is more optimal for commits either. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] No heap lookups on index
On Wed, Jan 18, 2006 at 08:13:59PM -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: We only need to index the row with the lowest value on any page so the main index would get 100 times smaller. The main part of the index would not need to be written to except when a block overflows. BTW, the above is equivalent to saying that the leaf-level index pages aren't there: the downlink pointers on the level-1 index pages are pointers to heap pages, instead, and you're right that they effectively only index the lowest value per page (actually IIRC the highest value per page, but same difference). Would this open the door for allowing tables to be maintained in CLUSTER order (at least at the block level if not within the blocks)? Though I have no idea how you'd handle page splits without a lot of pain, but perhaps it would be possible to strive for a certain tuple ordering that would allow for a periodic re-cluster that doesn't have to move a lot of data. One thought is to strive for the same amount of free space on each page, so if you're touching a tuple on a page that has less than the desired free space you move its new version to either the next or previous page. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] No heap lookups on index
Jim C. Nasby [EMAIL PROTECTED] writes: Would this open the door for allowing tables to be maintained in CLUSTER order (at least at the block level if not within the blocks)? Though I have no idea how you'd handle page splits without a lot of pain I think the way you'd attack that is by building the table with a pretty low fill factor, so that there's room on each page for a number of updates before you have to split. Since the index AM is going to be dictating space allocation, this is all in its hands. The existing CLUSTER code would probably be totally inapplicable to this sort of organization --- we'd have to provide some alternate code path for index-organized heaps. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: FW: [HACKERS] Surrogate keys (Was: enums)
Maybe it goes better into Advocacy or something, but I have found a quote by database big-wigs that I strongly disagree with. From: http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf we have this: PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary key is not available. Second generation systems support the notion of a primary key, which is a user-assigned unique identifier. If a primary key exists for a collection that is known never to change, for example social security number, student registration number, or employee number, then no additional system-assigned UID is required. An immutable primary key has an extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning. Consequently, in data interchange or debugging this may be an advantage. If no primary key is available for a collection, then it is imperative that a system-assigned UID be provided. Because SQL supports update through a cursor, second generation systems must be able to update the last record retrieved, and this is only possible if it can be uniquely identified. If no primary key serves this purpose, the system must include an extra UID. Therefore, several second generation systems already obey this proposition. Moreover, as will be noted in Proposition 2.3, some collections, e.g. views, do not necessarily have system assigned UIDs, so building a system that requires them is likely to be proven undesirable. We close our discussion on Tenet 1 with a final proposition that deals with the notion of rules. This is a bad idea. Let's take the example of a Social Security Number. Not everyone has one: http://www.ssa.gov/pubs/10002.html#how2 If people do have one, they can definitely change it. 
If someone has stolen an SSN, then the wronged party is able to get their SSN changed: http://101-identitytheft.com/ssn.htm The odds of this happening are low, but if you cannot handle it, then the damage caused is considerable. Now what happens if you want to have customers outside of the USA? {Don't worry, we'll never go global...} I hope that my objections are very plain and obvious. The primary key should be immutable, meaning that its value should not be changed during the course of normal operations of the database. What natural key is immutable? The answer is that such an attribute does not exist. To use them for such a purpose is begging for trouble. I saw the argument that there is a great volume of space wasted by adding a column that does not naturally occur in the data. That argument is simply absurd. Consider a database with 10 billion rows of data in it. Each of those rows gets an 8 byte primary key added, resulting in 80 GB consumed. The cost of 80 GB is perhaps $200. With a database that large (where the extra space consumed by an artificial key column has a cost that can easily be measured) the odds of a problem arising due to a natural column changing its value are huge. The cost of such a tragedy is certainly more than the $200 pittance! If there is an argument that we also have the parent key values propagated into the child tables as foreign keys, that argument has no merit. The other attribute that would have been chosen would also be propagated. And so (for instance) there is no savings to propagating an SSN field into child tables versus propagating an 8 byte integer. I also saw an argument that the propagated ID values are confusing to end-users. That is the fault of the database designer who gave them a stupid name. If they were things like InvoiceID and LineItemID then there would not be the same sort of confusion. The meaning and purpose of the column is immediately apparent. 
As an alternative, the ubiquitous OID name for a column on a table is also very transparent. Of course, when it is used in a foreign key, it must be given a role name to avoid confusion in that case. At any rate, the use of natural keys is a mistake made by people who have never had to deal with very large database systems. IMO-YMMV. -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Leandro Guimarães Faria Corcete Dutra Sent: Wednesday, January 18, 2006 4:31 PM To: Jim C. Nasby Cc: pgsql-hackers@postgresql.org Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums) Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu: Forgive me my ignorance, but are ints inherently faster to compare than strings, or is it just an implementation detail? Ideally, if this is so a fully data-independent system would create a hash behind the back of user in order to get performance. The CPU can do an integer comparison with one instruction; it can't do that with a text string. OK. Again, data independence should be the goal here. OK, hardly a typical example. As I think
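A quick back-of-the-envelope check of the figures in the message above (an illustration only; real per-tuple and per-index storage overhead is ignored):

```python
# Verify the 80 GB estimate and the "no savings in child tables" point.
rows = 10_000_000_000
key_bytes = 8
added_gb = rows * key_bytes / 10**9   # surrogate column added to every row
print(added_gb)                       # 80.0, the ~$200 of disk in the text

# In child tables *some* key is propagated either way; an 8-byte int is no
# wider than a 9-digit SSN stored as text with even a 1-byte length header
# (the header size is an assumption for illustration).
ssn_bytes = 9 + 1
print(key_bytes <= ssn_bytes)         # True: the surrogate FK costs no more
```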
Re: [HACKERS] Surrogate keys (Was: enums)
On Jan 19, 2006, at 10:34, Dann Corbit wrote: http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary key is not available. snip / An immutable primary key has an extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning. Consequently, in data interchange or debugging this may be an advantage. If no primary key is available for a collection, then it is imperative that a system-assigned UID be provided. snip / Dann Corbit: The primary key should be immutable, meaning that its value should not be changed during the course of normal operations of the database. What natural key is immutable? The answer is that such an attribute does not exist. To use them for such a purpose is begging for trouble. As far as I can tell, the only difference between your position, Dann, and Date and Darwen's, is that you think no natural key is immutable. If you *could* find an immutable natural key, would it be an acceptable key for you? Date and Darwen say explicitly that if no immutable (natural) (primary) key is available a system-assigned UID is required. If you think there is no immutable natural key available, Darwen and Date would agree that you should use a system-generated key. Or do you think I'm misreading you or The Third Manifesto? Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Surrogate keys (Was: enums)
-Original Message- From: Michael Glaesemann [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 18, 2006 5:48 PM To: Dann Corbit Cc: Leandro Guimarães Faria Corcete Dutra; Jim C. Nasby; pgsql- [EMAIL PROTECTED] Subject: Re: [HACKERS] Surrogate keys (Was: enums) On Jan 19, 2006, at 10:34, Dann Corbit wrote: http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary key is not available. snip / An immutable primary key has an extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning. Consequently, in data interchange or debugging this may be an advantage. If no primary key is available for a collection, then it is imperative that a system-assigned UID be provided. snip / Dann Corbit: The primary key should be immutable, meaning that its value should not be changed during the course of normal operations of the database. What natural key is immutable? The answer is that such an attribute does not exist. To use them for such a purpose is begging for trouble. As far as I can tell, the only difference between your position, Dann, and Date and Darwen's, is that you think no natural key is immutable. If you *could* find an immutable natural key, would it be an acceptable key for you? Date and Darwen say explicitly that if no immutable (natural) (primary) key is available a system-assigned UID is required. If you think there is no immutable natural key available, Darwen and Date would agree that you should use a system-generated key. Or do you think I'm misreading you or The Third Manifesto? If you could find an immutable natural key, it would be the *BEST* thing to use. Unfortunately, I believe that immutable natural keys are rarer than horse feathers and pickle smoke. 
Furthermore, because of statements like the one that I collected and pasted from the above document, I believe that people will choose totally inappropriate things (I have seen it many times and had to deal with the repercussions) to use as natural keys (e.g. SSN) and cause enormous damage through those choices. But I suppose on a sort of mathematical level the statement is fully true. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] TODO-Item: B-tree fillfactor control
Hi Hackers, I'm trying the following TODO item: [Indexes] - Add fillfactor to control reserved free space during index creation I have already made a patch and it seems to work well. $ ./pgbench -i -s 10 # select relpages from pg_class where relname = 'accounts_pkey'; relpages | 2745 (default PCTFREE is 10%) # set btree_leaf_free_percent = 0; # reindex index accounts_pkey; # select relpages from pg_class where relname = 'accounts_pkey'; relpages | 2475 (about 2745 * 0.9 = 2470.5) # set btree_leaf_free_percent = 30; # reindex index accounts_pkey; # select relpages from pg_class where relname = 'accounts_pkey'; relpages | 3537 (about 2745 * 0.9 / 0.7 = 3529.3) And now, I need advice on some issues. - Is it appropriate to use GUC variables to control fillfactors? Is it better to extend CREATE INDEX / REINDEX grammar? - Should indexes remember their fillfactors when they are created? The last fillfactors will be used on the next reindex. - Is fillfactor useful for hash and gist indexes? I think hash does not need it, but gist might need it. I look forward to your comments. Thanks, --- ITAGAKI Takahiro NTT Cyber Space Laboratories btree_free_percent.patch Description: Binary data ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
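The relpages arithmetic in the message can be sketched as a small scaling rule (my sanity check, not part of the patch): with f the leaf free-space fraction, index size scales as (1 - default_free) / (1 - f) relative to the default 10%-free build.

```python
# Estimate leaf page count of a btree rebuilt at a different free-space setting.
def estimated_relpages(base_pages, base_free, new_free):
    """Scale a page count measured at base_free to the new_free setting."""
    return base_pages * (1 - base_free) / (1 - new_free)

# Figures from the pgbench run above: 2745 pages at the default 10% free.
print(round(estimated_relpages(2745, 0.10, 0.0), 1))   # 2470.5; observed 2475
print(round(estimated_relpages(2745, 0.10, 0.30), 1))  # 3529.3; observed 3537
```

The small gap between estimated and observed pages is expected: page boundaries quantize the free space, and non-leaf pages don't follow the leaf setting.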
Re: [HACKERS] No heap lookups on index
On Thu, 2006-01-19 at 09:18 +0800, Christopher Kings-Lynne wrote: Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation Yes. - commits, rather than rollbacks? Commits are most common because most databases are optimized for them. Lots of programs go through a ton of pre-checking to avoid a rollback that they don't need to do under PostgreSQL. I've found that for small systems I tend to rely very heavily on frequent vacuums and database-level exceptions for virtually all data checking. Rollbacks are nearly as common as commits in those environments if not more so. -- ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] pgxs/windows
Andrew Dunstan wrote: Bruce Momjian wrote: Could this be related to the fact that pre-8.2 makefiles were not space-safe? I am unsure how pgxs worked on Win32 without being space-safe. I don't see how. In fact, pgxs seems to use short form paths anyway. Example (from previous email): dllwrap -o rainbow.dll --def rainbow.def rainbow.o c:/PROGRA~1/POSTGR~1/8.1/lib/pgxs/src/MAKEFI~1/../../src/utils/dllinit.o -Lc:/PROGRA~1/POSTGR~1/8.1/bin -lpostgres No spaces there. The problem is it says bin instead of lib before -lpostgres. OK, thanks. Next question --- are the installed file locations the same for a MinGW install and a pginstaller install? I don't think pginstaller does a MinGW install because it doesn't have the build environment in the tarball. However, the big difference seems to be that Magnus has -Llib and -Lbin, while you have only the -Lbin. I have MinGW and pginstaller installed here. How can I set things up to test this? -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: FW: [HACKERS] Surrogate keys (Was: enums)
Dann, The primary key should be immutable, meaning that its value should not be changed during the course of normal operations of the database. Why? I don't find this statement to be self-evident. Why would we have ON UPDATE CASCADE if keys didn't change sometimes? At any rate, the use of natural keys is a mistake made by people who have never had to deal with very large database systems. Oh, I guess I'm dumb then. The biggest database system I ever had to deal with was merely 5 TB ... Anyway, my opinion on this, in detail, will be on the ITToolBox blog. You can argue with me there. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] suppress output for benchmarking
I am testing the performance of postgresql on a set of workloads. However, the output significantly affects the performance evaluation. Is there a way to by-pass all output of select statements so the timing reflects only the query evaluation process? thanks a lot
Re: [HACKERS] suppress output for benchmarking
On Wed, Jan 18, 2006 at 10:35:48PM -0500, uwcssa wrote: I am testing the performance of postgresql on a set of workloads. However, the output significantly affects the performance evaluation. Is there a way to by-pass all output of select statements so the timing reflects only the query evaluation process? SELECT count(*) FROM (SELECT ...) a; If you're using psql \timing will probably be useful as well. And this is better suited for -general... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
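The count(*) wrapper suggested above can be demonstrated with any SQL engine; here it is sketched against SQLite (not PostgreSQL) so it stays self-contained. The server still evaluates the full subquery, but only one row ever reaches the client, so result transfer and rendering drop out of the timing.

```python
# Compare fetching a full result set vs wrapping it in count(*).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

rows = conn.execute("SELECT x FROM t").fetchall()   # client materializes 100k rows
(n,) = conn.execute("SELECT count(*) FROM (SELECT x FROM t) a").fetchone()

# Same logical result-set size, but the second form returns a single row.
print(len(rows), n)  # 100000 100000
```

With psql, combining this wrapper with \timing isolates evaluation time from the cost of printing hundreds of thousands of rows to the terminal.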
Re: [HACKERS] pgxs/windows
Bruce Momjian wrote: Andrew Dunstan wrote: Bruce Momjian wrote: Could this be related to the fact that pre-8.2 makefiles were not space-safe? I am unsure how pgxs worked on Win32 without being space-safe. I don't see how. In fact, pgxs seems to use short form paths anyway. Example (from previous email): dllwrap -o rainbow.dll --def rainbow.def rainbow.o c:/PROGRA~1/POSTGR~1/8.1/lib/pgxs/src/MAKEFI~1/../../src/utils/dllinit.o -Lc:/PROGRA~1/POSTGR~1/8.1/bin -lpostgres No spaces there. The problem is it says bin instead of lib before -lpostgres. OK, thanks. Next question --- are the installed file locations the same for a MinGW install and a pginstaller install? I don't think pginstaller does a MinGW install because it doesn't have the build environment in the tarball. However, the big difference seems to be that Magnus has -Llib and -Lbin, while you have only the -Lbin. I have MinGW and pginstaller installed here. How can I set things up to test this? Now looking at Makefile.global in the 8.1.2 pginstaller install, $libdir is set in a pgxs-specific block: libdir := $(shell pg_config --libdir) and that seems to work: C:\Program Files\PostgreSQL\8.1\bin> pg_config --libdir C:/PROGRA~1/POSTGR~1/8.1/lib and that is set to LDFLAGS, which is later propagated to SHLIB_LINK, though SHLIB_LINK moves all the -L flags to the front, so what you see on the link line is not the ordering used to create the value. Andrew, can you try echoing $libdir and $SHLIB_LINK in the Makefile to find those values? -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Surrogate keys (Was: enums)
Michael Glaesemann [EMAIL PROTECTED] writes: As far as I can tell, the only difference between your position, Dann, and Date and Darwen's, is that you think no natural key is immutable. DD's examples of natural keys are worth a second look though: If a primary key exists for a collection that is known never to change, for example social security number, student registration number, or employee number, then no additional system-assigned UID is required. The problem with SSN is that somebody other than you controls it. If you are the college registrar, then you control the student's registration number, and you don't have to change it. In fact, guess what: you probably generated it in the same way as a surrogate key. I'd argue that all of these are in reality the exact same thing as a surrogate key --- from the point of view of the issuing authority. But from anyone else's point of view, they are external data and you can't hang your own database design on the assumption that they won't change. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] 8.0.5 Bug in unique indexes?
Hello, Odd problem with unique indexes: 8.0.5 64 bit (Quad Opteron) 100 tables, each with the same layout, 1 million rows per table. The problem persists within multiple tables but only within the set of 100 tables. I have a composite unique key on each table: uniq1 UNIQUE, btree (unit_id, email) Performing a query like the following: app=# select unit_id, email, count(*) as cnt from leads10 group by unit_id, email having count(*) > 1; unit_id | email | cnt 77212 | [EMAIL PROTECTED] | 2 app=# select unit_id,email from leads10 where unit_id = 77212 and email = '[EMAIL PROTECTED]'; unit_id | email 77212 | [EMAIL PROTECTED] (1 row) app=# reindex index uniq1; ERROR: could not create unique index DETAIL: Table contains duplicated values. app=# I have verified that we have not overrun the fsm pages and that vacuums are running daily (actually twice a day). I have also run a vacuum full on the various tables to no avail; no error, but the situation does not improve. app=# set enable_indexscan = off; SET app=# select unit_id,email from leads10 where unit_id = 77212 and email = '[EMAIL PROTECTED]'; unit_id | email 77212 | [EMAIL PROTECTED] 77212 | [EMAIL PROTECTED] (2 rows) app=# select lead_id,unit_id,email from leads10 where unit_id = 77212 and email = '[EMAIL PROTECTED]'; lead_id | unit_id | email 35867251 | 77212 | [EMAIL PROTECTED] 35864333 | 77212 | [EMAIL PROTECTED] (2 rows) Thoughts? Joshua D. Drake P.S. Should this go to -bugs? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Unique constraints for non-btree indexes
Tom Lane [EMAIL PROTECTED] writes: Martijn van Oosterhout kleptog@svana.org writes: I guess what you're talking about is a constrained index, of which a unique index is just a particular type. I suppose the actual constraint would be one of the operators defined for the operator class (since whatever the test is, it needs to be indexable). Although some would obviously be more useful than others... I think the generalization that would be appropriate for GIST is that a unique index guarantees there are no two entries x, y such that x ~ y, where ~ is some boolean operator nominated by the opclass. We'd probably have to insist that ~ is commutative (x ~ y iff y ~ x). I have no big contribution here. I just want to say this is a cool idea. These Generalized uniqueish constraints could make a lot of neat things possible. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
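Tom's generalization can be sketched in a few lines (a toy model, not GiST code): an insert is rejected when x ~ y holds against any existing entry, where ~ is a commutative boolean operator nominated by the opclass. With ~ as equality this is an ordinary unique index; here ~ is interval overlap, the classic use case.

```python
# Toy "constrained index": no two entries x, y may satisfy op(x, y).
def overlaps(a, b):
    """Commutative ~ : closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

class ConstrainedIndex:
    def __init__(self, op):
        self.op = op          # the opclass-nominated commutative operator
        self.entries = []

    def insert(self, x):
        # a real index would use the operator's indexability to avoid
        # this O(n) scan; the constraint semantics are the same
        if any(self.op(x, y) for y in self.entries):
            raise ValueError(f"conflicts with existing entry: {x}")
        self.entries.append(x)

idx = ConstrainedIndex(overlaps)
idx.insert((1, 5))
idx.insert((6, 9))
try:
    idx.insert((4, 7))        # overlaps both stored intervals
    ok = True
except ValueError:
    ok = False
print(ok, len(idx.entries))   # False 2
```

Commutativity matters because either of x or y may be the stored entry at check time; the constraint must not depend on insertion order.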
[HACKERS] Indexes vs. cache flushes
I've been working on getting the system to pass regression tests cleanly when forcing a cache flush at every possible instant. The main tests pass now (in 8.1 --- HEAD remains broken pending lookup_rowtype_tupdesc fix), but contrib is still crashing. On investigation the problem turns out to be in index_getprocinfo(), which tries to load up a cached FmgrInfo for an index support function. If the support function is not a built-in C function, then fmgr_info() will need to open pg_proc to look it up. If a cache flush occurs in the course of that lookup, the FmgrInfo we're trying to store into goes away! Havoc ensues of course. After looking at this for a bit, it seems the cleanest fix is for RelationClearRelation() to treat any open index the same way it currently handles nailed indexes --- ie, don't do anything except re-read the pg_class record. Then we won't try to flush and rebuild the cached index support info, and the problem doesn't arise. This would still support REINDEX (which changes pg_class.relfilenode in order to replace the physical file) and ALTER INDEX SET TABLESPACE. But you couldn't make any meaningful changes in the definition of an index, such as changing its column set, operator classes, partial-index predicate, etc, except by dropping and recreating it. Now this is true today, and it doesn't seem likely to me that we'd ever want to relax it (since any such change would probably require rebuilding the index anyway). But does anyone see that differently? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Surrogate keys (Was: enums)
If a primary key exists for a collection that is known never to change, for example social security number, student registration number, or employee number, then no additional system-assigned UID is required. In point of fact Social security numbers *can* change. -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] No heap lookups on index
Tom Lane [EMAIL PROTECTED] writes: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Oracle does, but you pay in other ways. Instead of keeping dead tuples in the main heap, they shuffle them off to an 'undo log'. This has some downsides: Rollbacks take *forever*, though this usually isn't much of an issue unless you need to abort a really big transaction. It's a good point though. Surely a database should be optimised for the most common operation - commits, rather than rollbacks? The shuffling off of the data is expensive in itself, so I'm not sure you can argue that the Oracle way is better optimized for commits either. You pay in Oracle when you read these records too. If there are pending updates you have to do a second read to the rollback segment to get the old record. This hits long-running batch queries especially hard since by the time they finish a large number of the records they're reading could have been updated and require a second read to the rollback segments. You also pay if the new value is too big to fit in the same space as the old record. Then you have to follow a pointer to the new location. Oracle tries to minimize that by intentionally leaving extra free space but that has costs too. And lastly rollback segments are of limited size. No matter how big you make them there's always the risk that a long running query will take long enough that data it needs will have expired from the rollback segments. Oh, and note that optimizing for the common case has limits. Rollbacks may be rare but one of the cases where they are effectively happening is on recovery after a crash. And that's one process you *really* don't want to take longer than necessary... -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
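The extra read cost described above can be sketched as follows: in an undo-log design, the heap holds only the newest version of a row, and a reader with an older snapshot reconstructs its version by walking back through the undo chain, paying one extra read per step. This is a toy model with hypothetical structures, not Oracle's actual on-disk format.

```python
def read_consistent(row, undo, snapshot_xid):
    """Return the newest version of `row` written at or before snapshot_xid.

    row:  (xid, value) as currently stored in the heap
    undo: dict mapping xid -> prior (xid, value), modelling the
          rollback-segment chain for this row
    """
    extra_reads = 0
    xid, value = row
    while xid > snapshot_xid:       # this version is too new for our snapshot
        extra_reads += 1            # each step costs a read of the undo log
        xid, value = undo[xid]      # step back to the prior version
    return value, extra_reads
```

A long-running query with an old snapshot_xid pays more and more extra reads as concurrent updates pile up, and if the chain has been recycled (the dict entry is gone) the read fails outright, which is the "expired from the rollback segments" failure mode.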
Re: [HACKERS] 8.0.5 Bug in unique indexes?
Joshua D. Drake [EMAIL PROTECTED] writes: Odd problem with unique indexes: What's the database's locale? This could be the same problem fixed in 8.0.6, if the locale has weird ideas about what string equality means. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] No heap lookups on index
David Scott [EMAIL PROTECTED] writes: Since I am sure everyone is tired of the intro by now, I'll get to the questions: ... Is there any way to modify PostgreSQL to allow index lookups without heap validation that doesn't involve re-writing the MVCC implementation of keeping dead rows on the live table? Is the additional overhead of keeping full tuple visibility information inside of the index so odious to the Postgres community as to prevent a patch with this solution from being applied back to the head? The consequences of full visibility information in indexes would indeed be pretty odious. However, the general gist of the conversation the last time this came up pointed to what sounded like a feasible compromise: Keep a very compact bitmap outside the table (not attached to any single index) with one bit per tuple indicating whether the tuple is known to be visible to every transaction. The hope is that this bitmap would be small enough to sit in memory pretty much permanently. Even if not, it should be much smaller than the table and impose a pretty small I/O overhead. If most of the records in the table are old records that are visible to every transaction then the index scan would be able to avoid reading in pages of the heap. "Most" would have to be a pretty big percentage, though, since a page containing even a single tuple of unknown visibility would still have to be read in. The bitmap would be useful for vacuum too. Any page that contained only tuples with known visibility could be skipped. That would mean running vacuum for extremely large tables that have only moderate activity wouldn't have to scan all those static pages. (There could be an issue with people whose FSM can't track all the free space but expect it to be found on subsequent vacuums, but details details.) I wonder if the bitmap could actually be one bit per page. A single update has to clear the known-visible bit for its tuple, and that will force the whole page to be read in for both vacuum and index lookups anyway. 
Only a vacuum will be able to verify that all the tuples in the page are known-visible and index entries have been cleaned up, and the vacuum is going to be operating on the whole page anyway. A one-bit-per-page bitmap will easily fit in RAM even for very large tables. -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
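The scheme above can be sketched in a few lines: an index scan consults a per-page "all tuples known visible" bitmap and only visits the heap for pages whose bit is clear. This is a minimal control-flow model with hypothetical names, not PostgreSQL code.

```python
class VisibilityBitmap:
    """One bit per heap page: 1 = every tuple on the page is known
    visible to every transaction."""

    def __init__(self, npages):
        self.bits = bytearray(npages)

    def all_visible(self, page):
        return self.bits[page] == 1

    def clear(self, page):
        # Any insert/update/delete on the page must clear its bit;
        # only a vacuum can set it again.
        self.bits[page] = 0

def index_scan(index_entries, heap, bitmap, snapshot_visible):
    """index_entries: list of (key, page, offset) heap pointers."""
    results, heap_reads = [], 0
    for key, page, offset in index_entries:
        if bitmap.all_visible(page):
            results.append(key)      # no heap visit needed
        else:
            heap_reads += 1          # must check visibility in the heap
            tup = heap[page][offset]
            if snapshot_visible(tup):
                results.append(key)
    return results, heap_reads
```

With a mostly-static table, nearly all bits stay set and heap_reads stays near zero; one recently modified tuple costs exactly one page read, which is the granularity argument for per-page rather than per-tuple bits.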
Re: [HACKERS] No heap lookups on index
Greg Stark [EMAIL PROTECTED] writes: You pay in Oracle when you read these records too. If there are pending updates you have to do a second read to the rollback segment to get the old record. This hits long-running batch queries especially hard since by the time they finish a large number of the records they're reading could have been updated and require a second read to the rollback segments. If not a third or fourth read, by the time you've found the version you're supposed to be able to see. I recall discussing this several years ago with somebody who knew quite a bit about Oracle innards (though he didn't say how he knew...) According to him, heavy accesses to the rollback segments have another problem, which is contention for ownership of locks protecting access to the rollback segments. I got the impression that it would be like us needing to take the WALWriteLock anytime we wanted to look at any not-the-very-latest row version --- there's plenty of write traffic that needs that lock, and you don't want to load it down with read traffic too. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] No heap lookups on index
Greg Stark [EMAIL PROTECTED] writes: I wonder if the bitmap can actually be one bit per page actually. Yeah, I think we'd agreed that per-page was the way to go. Per-tuple bitmaps are painful to manage because of the variable number of tuples per page. And really all you need to know is whether to read the page or not --- once you have, examining multiple tuples on it doesn't cost much. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Surrogate keys (Was: enums)
On Wed, Jan 18, 2006 at 03:58:50PM -0800, Josh Berkus wrote: Martijn, Interesting. However, in my experience very few things have natural keys. There is no combination of attributes for people, phone calls or even real events that makes a useful natural key. I certainly hope that I never have to pick up one of your projects. A table without a natural key is a data management disaster. Without a key, it's not data, it's garbage. ??? Please provide natural keys for any of the following: - A person - A phone call: (from,to,date,time,duration) is not enough - A physical address - A phone line: (phone numbers aren't unique over time) - An internet account: (usernames aren't unique over time either) In any of these, misspellings or changes of name, ownership, or even structure over time render the obvious candidates useless as keys. There are techniques for detecting and reducing duplication but the point is that for any of these duplicates *can* be valid data. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Indexes vs. cache flushes
Tom Lane [EMAIL PROTECTED] writes: This would still support REINDEX (which changes pg_class.relfilenode in order to replace the physical file) and ALTER INDEX SET TABLESPACE. But you couldn't make any meaningful changes in the definition of an index, such as changing its column set, operator classes, partial-index predicate, etc, except by dropping and recreating it. Now this is true today, and it doesn't seem likely to me that we'd ever want to relax it (since any such change would probably require rebuilding the index anyway). But does anyone see that differently? The only example that comes to mind of something you might want to be able to twiddle and wouldn't expect to be a slow operation is making a unique index a non-unique index. -- greg ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Indexes vs. cache flushes
Greg Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: But you couldn't make any meaningful changes in the definition of an index, such as changing its column set, operator classes, partial-index predicate, etc, except by dropping and recreating it. The only example that comes to mind of something you might want to be able to twiddle and wouldn't expect to be a slow operation is making a unique index a non-unique index. I think actually that that would still work, so long as you acquired exclusive lock on the parent table first (which you'd have to do anyway, because this would constitute a significant change to the table's schema --- it could invalidate plans for example). The lock would guarantee that no one has the index open. It's only in the case of an opened index that I propose not flushing the index support info. The concerns that I find more interesting are changes in the underlying objects. We don't have an ALTER OPERATOR CLASS, much less an ALTER ACCESS METHOD, but it's certainly theoretically possible to change the definition of a support function used by an index. There isn't presently any mechanism to force timely propagation of such a change, and so you'd be largely on your own --- but realistically, wouldn't such a change require rebuilding the index anyway? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend