Re: [HACKERS] PG qsort vs. Solaris

2006-10-04 Thread Zeugswetter Andreas DCP SD

  So basically, glibc's qsort is bad enough that even a 
  10%-more-comparisons advantage doesn't save it.

 Do those numbers look very different if you have lots of 
 columns or if you're sorting on something like an array or a ROW?

Imho, that also is an argument for using our own qsort.
It can be extended to deal with high comparison function cost directly.

Thus I would opt to add a comparison-function cost arg to qsort_arg iff
we find scenarios where our qsort performs too badly.
This cost could be used to switch to merge sort for very high cost values.
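
To illustrate the idea (a sketch only: qsort_arg() exists in src/port, but
the cost argument, the threshold and the mergesort_arg() fallback are
made-up names):

#include <stddef.h>

typedef int (*qsort_arg_comparator) (const void *a, const void *b, void *arg);

extern void qsort_arg(void *base, size_t nel, size_t elsize,
                      qsort_arg_comparator cmp, void *arg);
extern void mergesort_arg(void *base, size_t nel, size_t elsize,
                          qsort_arg_comparator cmp, void *arg);   /* assumed */

#define EXPENSIVE_CMP_COST 10.0     /* illustrative threshold */

static void
qsort_arg_costed(void *base, size_t nel, size_t elsize,
                 qsort_arg_comparator cmp, void *arg, double cmp_cost)
{
    /* merge sort needs fewer comparisons in the worst case, which wins
     * once each comparison (ROW, array, ...) is expensive enough */
    if (cmp_cost > EXPENSIVE_CMP_COST)
        mergesort_arg(base, nel, elsize, cmp, arg);
    else
        qsort_arg(base, nel, elsize, cmp, arg);
}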

Andreas 

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PATCHES] vcbuild bison check

2006-10-04 Thread Zeugswetter Andreas DCP SD

 Ok. So what you want is something that checks that it's 
 >=1.875 but specifically not 2.1? 
 
 Might be a while before I can submit an updated patch for 
 that, may need to rewrite the whole script in perl to do that 
 :-( .bat files are horribly limited in what they can do.

Since we are on NT or higher you could use command extensions (even in
.bat files) to avoid converting to perl:

IF %bversion% GEQ 1.875 IF %bversion% NEQ 2.1 goto use_bison

http://www.robvanderwoude.com/ntif.html

Sorry, I haven't looked at the file, so this needs adaptation.

Thank you for the work
Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] vcbuild bison check

2006-10-04 Thread Zeugswetter Andreas DCP SD
 

 -Original Message-
 From: Magnus Hagander [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, October 04, 2006 6:01 PM
 To: Zeugswetter Andreas ADI SD
 Cc: pgsql-hackers@postgresql.org
 Subject: RE: [HACKERS] [PATCHES] vcbuild bison check
 
   Ok. So what you want is something that checks that it's
   >=1.875 but specifically not 2.1?
  
   Might be a while before I can submit an updated patch for that,
  may
   need to rewrite the whole script in perl to do that :-( .bat
  files are
   horribly limited in what they can do.
  
  Since we are on NT or higher you could use extensions:
  IF %bversion% GEQ 1.875 IF %bversion% NEQ 2.1 goto use_bison 
  http://www.robvanderwoude.com/ntif.html
 
 I thought that only worked if your locale was set to 
 something that has dot as decimal separator. Mine has comma, 
 as have many others...

Um, I think it does a string compare because the point or comma is not a
decimal digit, but that would imho also be sufficient.
My locale is German, so my decimal separator should also be a comma, and it
worked for the mentioned versions.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PATCHES] vcbuild bison check

2006-10-04 Thread Zeugswetter Andreas DCP SD

 And given that, they're going to get the latest by default, 
 or 1.875 if they read the (currently being written) README.

The point was that >= 2.2 won't be allowed when it comes out for win32,
even if it should work.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Faster StrNCpy

2006-10-03 Thread Zeugswetter Andreas DCP SD

 I'm still interested to experiment with MemSet-then-strlcpy 
 for namestrcpy, but given the LENCPY results this may be a loser too.

Um, why not strlcpy then MemSet the rest ?
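
Roughly like this (a minimal sketch; namestrcpy_sketch() is an illustrative
name and strlcpy() is the src/port version):

#include <string.h>

#define NAMEDATALEN 64

extern size_t strlcpy(char *dst, const char *src, size_t siz);  /* src/port/strlcpy.c */

static void
namestrcpy_sketch(char *name, const char *src)
{
    size_t      len = strlcpy(name, src, NAMEDATALEN);   /* returns strlen(src) */

    if (len >= NAMEDATALEN)
        len = NAMEDATALEN - 1;          /* the copy was truncated */
    /* zero only the bytes strlcpy() did not already write */
    memset(name + len + 1, 0, NAMEDATALEN - len - 1);
}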

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] Bad bug in fopen() wrapper code

2006-10-03 Thread Zeugswetter Andreas DCP SD

 Magnus, is this the right fix?

Well, actually msdn states:

Return Value
 If successful, _setmode returns the previous translation mode. A return
value of -1 indicates an error

So, shouldn't we be testing for -1 instead of < 0 ?

The thing is probably academic, since _setmode is only supposed to fail
on an invalid file handle or an invalid mode.
So basically, given our code, it should only fail if filemode is
(O_BINARY | O_TEXT), i.e. both flags set.
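
For what it's worth, a minimal sketch of that check (Windows CRT;
set_binary_mode() is just an illustrative wrapper):

#include <fcntl.h>
#include <io.h>                 /* _setmode() */

static int
set_binary_mode(int fd)
{
    /* _setmode() returns the previous translation mode, which can be any
     * non-negative value; only -1 signals an error (invalid handle/mode) */
    if (_setmode(fd, _O_BINARY) == -1)
        return -1;
    return 0;
}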

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Faster StrNCpy

2006-10-03 Thread Zeugswetter Andreas DCP SD

   I'm still interested to experiment with MemSet-then-strlcpy for 
   namestrcpy, but given the LENCPY results this may be a loser too.
  Um, why not strlcpy then MemSet the rest ?
 
 That's what strncpy() is supposed to be doing.

Yes, but it obviously does not in some ports, and that was the main
problem
as I interpreted it.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] @ versus ~, redux

2006-09-06 Thread Zeugswetter Andreas DCP SD

  The existing geometric containment tests seem to be nonstrict, so
if 
  we wanted to leave room to add strict ones later, it might be best
to 
  settle on
  
  x @>= y    x contains or equals y
  x <=@ y    x is contained in or equals y
  
  reserving @> and <@ for future strict comparison operators.
 
  At first glance, it seems more intuitive to me to do:
 
  x @>= y    x contains or equals y
  x =<@ y    x is contained in or equals y
 
 Hm, I've never seen anyone spell "less than or equal to" as "=<", so I'm
 not sure where you derive "=<@" from?  Not saying no, but the other seems
 clearer to me.

Yes, but to me too =<@ seems more natural, since we started with @> and <@.
Tom, your argument would better match your original @> and <@, but then it
would imply @>= and <@=, imho.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] @ versus ~, redux

2006-09-04 Thread Zeugswetter Andreas DCP SD

x @ y means x is contained in y
  
   ltree @ ltree
  
  If you consider ltree entries to be sets containing all their
children 
  then those sound consistent.

Now we get to decide whether @ was better than the now proposed <@ :-)
I like <@. (Or we stay clear by using the inet ops.)

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] insert/update/delete returning and rules

2006-09-04 Thread Zeugswetter Andreas DCP SD

 With this approach, you still have to update your rules if 
 you want to support RETURNING on your views --- but if you 
 don't update them, you don't have a security hole.  Basically 
 the standard setup for an updatable view would use
   ON INSERT DO INSTEAD INSERT INTO ... RETURNING ...
 where today you don't write any RETURNING.

I like that approach. And if the sections allow CASE WHEN
it should be possible to cover all use cases efficiently.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Prepared statements considered harmful

2006-09-01 Thread Zeugswetter Andreas DCP SD

  I'd wish that we reconsider when and how prepared statements are 
  used.  The JDBC interface and PL/pgSQL are frequently noticed 
  perpetrators, but the problem is really all over the place.
  AFAIK those are the only two places where preparation is 
 the default 
  ... what else were you thinking of?
  
  Perl DBI (DBD::Pg) defaults to prepared plans when connecting to a 
  version 8.0 or higher server.
  
  Or at least, that's the way I read the documentation.

Yea, but if you close the statement or leave the scope of the statement
variable the plan is gone. So it is doing exactly what I would expect.

It is written $stmt->prepare('select 1'); what else would you expect?
There are enough other functions to get a result without a plan sticking
around, like $db->selectrow_array.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Win32 hard crash problem

2006-09-01 Thread Zeugswetter Andreas DCP SD

  My bet is something depending on GetTickCount to measure elapsed
time 
  (and no, it's not used in the core Postgres code, but you've got 
  plenty of other possible culprits in that stack).
 
  This doesn't quite make sense. The only reason we have to reboot is 
  because PostgreSQL no longer responds. The system itself is fine.
 
 The Windows kernel may still work, but that doesn't mean that 
 everything Postgres depends on still works.

It may be a listen socket that no longer responds. This may be because of a
handle leak. Next time it blocks, look at the handle counts (e.g. with
handle.exe from Sysinternals).

You could also look at the handle count now with Task Manager and see whether
it increases constantly (handle.exe shows you the details).

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Prepared statements considered harmful

2006-09-01 Thread Zeugswetter Andreas DCP SD

 Anyway, your plan assumes that you have information to work 
 with. The current system plans prepared queries with no 
 information at all about parameters and people are advocating 
 to keep it that way. I think a good first step would be the 
 plan on first execution, like Oracle does.

Yup, it is also possible to try to find an obvious plan and only
delay planning (or part of the plan) when different inputs make a
big difference (like MaxDB and Informix).

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Prepared statements considered harmful

2006-09-01 Thread Zeugswetter Andreas DCP SD

  I don't chime in very often, but I do think the refusal to
incorporate 
  hints into the planner system is fantastically stubborn and 
  nonsensical.
 
 What is actually fantastically nonsensical about this is that 
 the issues I outlined about prepared statements would merely 
 become worse if planner hints were used.  Then, you wouldn't 
 only have to worry about plans that were created earlier 
 during the session, you would be faced with plans that were 
 created earlier during the application's development.  In 
 general, the solutions to the prepared statement issues need 
 to effect that the plans are created more often, not less often.

I have yet to see one of our partial Informix hints (where the planner does
its usual job, only with one path's costs lowered or elevated) fall foul of
an unanticipated change in the underlying data.

Thus I don't buy the argument that hints are always bad.
Of course their use should be extremely rare and well thought out.
Most of the time SQL tuning involves a concerted effort between the
programmer and a db performance expert, usually resulting in
rewritten SQL or program logic without adding hints.

I can see arguments for hints the DBA can set himself centrally on the
server, but in my experience the chances for substantial improvement are
very limited in that case.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Prepared statements considered harmful

2006-08-31 Thread Zeugswetter Andreas DCP SD

  How about prepared means really prepared... in the sense of 
  parsed, analyzed all sensible plans, and save a meta-plan which
based 
  on current statistics and parameter values chooses one of the 
  considered (and cached) plans ?
 
 I don't think this could solve one particularly frequent 
 problem which is that pattern matching queries don't get 
 along with prepared plans if the search pattern isn't known 
 at planning time.

I think what we would actually want is knowledge about how much difference
different parameters actually make in the plan decision
(the stats show an even distribution and join correlation).
Then we could prepare the plan when there is not much difference,
and postpone planning until we know the parameters when the difference
is big.

An OLTP workload typically benefits from prepared plans, and the one plan
is good for all possible inputs, so imho we cannot just assume all plans
need replanning for different parameters.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Tricky bugs in concurrent index build

2006-08-25 Thread Zeugswetter Andreas DCP SD

  What bothers me about what we have now is that we have optional 
  keywords before and after INDEX, rather than only between 
 CREATE and INDEX.
 
 Yeah, putting them both into that space seems consistent to 
 me, and it will fix the problem of making an omitted index 
 name look like a valid command.
 
 I'm not sure I should be opening this can of worms, but do we 
 want to use a different keyword than CONCURRENTLY to make it 
 read better there?

precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
 CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;

I'd stick with that.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Tricky bugs in concurrent index build

2006-08-25 Thread Zeugswetter Andreas DCP SD

  precedent syntax (Oracle, Informix) uses the keyword ONLINE 
 at the end:
   CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;
 
 That was what the patch originally used, but it was changed 
 because it made difficult for psql to auto-complete that.

That is imho not enough of a reason to diverge.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Tricky bugs in concurrent index build

2006-08-23 Thread Zeugswetter Andreas DCP SD

  Is it not possible to brute force this adding an AM method to insert

  without the uniqueness check?
 
 Hm.  Actually there already is a feature of aminsert to allow 
 suppressing the unique check, but I'm not sure whether using 
 it for RECENTLY_DEAD tuples helps.  Seems like we have to 
 wait to see whether DELETE_IN_PROGRESS deleters commit in any case.

Um, but if we wait for the DELETE_IN_PROGRESS tuple, after the wait we can
add it either with or without the unique check (depending on commit/abort).

Then at least we don't need to wait in a 3rd pass for readers?

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] WIP archive_timeout patch

2006-08-18 Thread Zeugswetter Andreas DCP SD

 I noticed a minor annoyance while testing: when the system is 
 completely idle, you get a forced segment switch every 
 checkpoint_timeout seconds, even though there is nothing 
 useful to log.  The checkpoint code is smart enough not to do 
 a checkpoint if nothing has happened since the last one, and 
 the xlog switch code is smart enough not to do a switch if 
 nothing has happened since the last one ... but they aren't 
 talking to each other and so each one's change looks like 
 something happened
 to the other one.  I'm not sure how much trouble it's worth 
 taking to prevent this scenario, though.  If you can't afford 
 a WAL file switch every five minutes, you probably shouldn't 
 be using archive_timeout anyway ...

Um, I would have thought practical timeouts would be more than 5 minutes
rather than less. So this does seem like a problem to me :-(

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Possible Typecasting Bug with coalesce()

2006-07-19 Thread Zeugswetter Andreas DCP SD

  => select now() > coalesce('Jul 14 2006 9:16:47AM');

 The only bug I see here is that implicit coercions to text 
 are a bad idea :-( --- IMHO it would be better if your first 
 query failed instead of giving you unexpected behavior.

:-) We know that you think that, Tom, but a lot of us do not
want to go casting all our SQL, especially where other DBs don't
require it.

Would an equivalent CASE statement also do the early conversion to text
? 

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [SQL] using constraint based paritioning to fix EAV type schemas? (view inheritance and constraints)

2006-07-19 Thread Zeugswetter Andreas DCP SD

 I've seen a few EAV designs in practice. They've all been 
 problematic. I'd like to have a better way of dealing with 
 them. Which is why I'm tentatively suggesting support for 
 inheritance and constraints in views. If there's some other 
 way to achieve constraint based exclusion across a UNION of 
 heterogenous tables, I'd love to hear it.

If you do your own rules anyway, why can't you use inheritance and
create the rules
on the parent table and the constraints on the child tables ?

You can still use the child tables directly if you want.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [SQL] using constraint based paritioning to fix EAV type schemas? (view inheritance and constraints)

2006-07-19 Thread Zeugswetter Andreas DCP SD

  I've seen a few EAV designs in practice. They've all been
problematic. 
  I'd like to have a better way of dealing with them. Which is why I'm

  tentatively suggesting support for inheritance and constraints in 
  views. If there's some other way to achieve constraint based
exclusion 
  across a UNION of heterogenous tables, I'd love to hear it.
 
 If you do your own rules anyway, why can't you use 
 inheritance and create the rules on the parent table and the 
 constraints on the child tables ?

Ah, sorry, just saw that you want different column names in your
subtables.

Add me as another vote to extend the new constraint exclusion to UNION ALL
views :-)

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] RESET CONNECTION?

2006-07-19 Thread Zeugswetter Andreas DCP SD

   Will this patch make it into 8.2?
   http://archives.postgresql.org/pgsql-patches/2004-12/msg00228.php
   
   It's a really nice feature, would be extremly useful with tools
like pgpool.
  
  No, it will not because RESET CONNECTION can mess up interface code 
  that doesn't want the connection reset.  We are not sure how to
handle that.

Imho, if it were at the protocol level, that would not be such an issue.
If the interface gives access to the protocol level it is already depending
on good will.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] automatic system info tool?

2006-07-18 Thread Zeugswetter Andreas DCP SD

  If you can open a command shell you can get the OS version with the

  'ver' command under Windows:
 
  C:\> ver
 
  Microsoft Windows XP [Version 5.1.2600]
 
  How do you do this from a program though. Under UNIX uname() is a 
  function call as well as a program. It returns the os name, version,

  hostname and system type.
 
 
 GetVersionEx() will get you the windows version, service 
 pack, etc IIRC.

in perl:

use POSIX;
print join(',', POSIX::uname()), "\n";

prints:
Windows NT,hostname.domain.com,5.0,Build 2195 (Service Pack 4),x86

Works on all Platforms.

(more detail on Windows with: use Win32; print join(' ', Win32::GetOSVersion()), "\n";)

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Implied Functional Index use

2006-07-13 Thread Zeugswetter Andreas DCP SD

   - add a new boolean to pg_operator to allow us to define which 
   operators offer true equality ...
  
  This would be useful for other purposes too, as we keep coming up
  against "what's the equality operator for this datatype" problems.
  However, the restriction to true equality, such that we can assume
  x = y implies f(x) = f(y) for every immutable function f on the datatype

Maybe we could have a tri- (or more) state flag for the equality operators,
e.g.:

' ' .. not an equality op
'e' .. equality
's' .. strict equality (op only true iff the binary representation is equal)
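
As a C sketch only (the enum, the names and the encoding are all
hypothetical):

typedef enum OpEqualityLevel
{
    OPEQ_NONE   = ' ',      /* not an equality operator */
    OPEQ_EQUAL  = 'e',      /* logical equality; x = y need not imply
                             * f(x) = f(y) for every immutable f */
    OPEQ_STRICT = 's'       /* strict equality: operands are binary-identical,
                             * so f(x) = f(y) is safe to assume */
} OpEqualityLevel;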

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Implied Functional Index use

2006-07-13 Thread Zeugswetter Andreas DCP SD

 There is a problem of implication here, AFAICS:
 When a user SQL asks 
   WHERE col1 = 7
 which equality level is meant when several exist?

Well, the operator must be unique, so there is no problem.
Unique in the sense that an operator with the same name ('=' in this case)
and argument types cannot exist for more than one level of equality.
(And the level should not have an effect on operator resolution.)

So, when we see col1 = 7, we look up the equality level of the operator
and decide whether it is strict enough for the particular optimization.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] update/insert,

2006-07-05 Thread Zeugswetter Andreas DCP SD

 OK, but the point of the question is that constantly updating 
 a single row steadily degrades performance, would 
 delete/insert also do the same?

Yes, there is currently no difference (so you should do the update).
Of course performance only degrades if vacuum is not set up correctly.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] update/insert, delete/insert efficiency WRT vacuum and

2006-07-04 Thread Zeugswetter Andreas DCP SD

  Is there a difference in PostgreSQL performance between these two 
  different strategies:
  
  
  if (!exec("update foo set bar='blahblah' where name = 'xx'"))
      exec("insert into foo(name, bar) values('xx','blahblah')"); or

In pg, this strategy is generally more efficient, since an insert failing on
the pk would create a tx abort and a dead heap tuple. (So in pg, I would
choose the insert-first strategy only when the insert succeeds most of the
time, say > 95%.)

Note however that the above error handling is not enough, because two
different sessions can still both end up trying the insert; this is true for
all db systems when using this strategy.
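
A sketch of that update-first loop in libpq (table, values and the upsert()
name are only for illustration); note it must run outside an explicit
transaction, or wrap the INSERT in a savepoint, since a failed INSERT
otherwise aborts the whole transaction:

#include <string.h>
#include <libpq-fe.h>

static int
upsert(PGconn *conn)
{
    for (;;)
    {
        const char *state;
        int         ok;
        PGresult   *res;

        res = PQexec(conn, "UPDATE foo SET bar = 'blahblah' WHERE name = 'xx'");
        ok = (PQresultStatus(res) == PGRES_COMMAND_OK &&
              strcmp(PQcmdTuples(res), "0") != 0);
        PQclear(res);
        if (ok)
            return 0;               /* the row existed, the update hit it */

        res = PQexec(conn, "INSERT INTO foo(name, bar) VALUES ('xx', 'blahblah')");
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
        {
            PQclear(res);
            return 0;               /* we inserted the row */
        }
        state = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        ok = (state != NULL && strcmp(state, "23505") == 0);
        PQclear(res);
        if (!ok)
            return -1;              /* some unrelated error */
        /* 23505 = unique_violation: another session inserted first,
         * so loop back and let the UPDATE find the row */
    }
}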

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Zeugswetter Andreas DCP SD

  Here is an overview of the SITC method:
  http://momjian.us/cgi-bin/pgsitc
 
 A pretty fundamental problem is that the method assumes it's 
 OK to change the CTID of a live tuple (by swapping its item 
 pointer with some expired version).  It is not --- this will break:

I am having difficulty visualizing that. The plan is not to change CTIDs
(only the CTID's offset into the page is to be changed).
The CTID of the new version is one that is up to now invisible to all
backends, so no one can actually have remembered that CTID.

Also, you would first insert the slot content and then change the CTID
offset (this offset change might need to be made atomic).

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Single Index Tuple Chain (SITC) method

2006-06-29 Thread Zeugswetter Andreas DCP SD

  And anyway, ctid is a usable unique row identifier only within 
  read-only transactions, or not ?

Actually, for as long as no vacuum comes along. This would change with SITC.
(Maybe it would help to only reuse old versions of the same row; then anybody
holding a ctid would at least still be looking at a version of the same row,
and should thus be able to follow the update chain.)
 
 Err, no. The ctid is the only identifer of a tuple in any 
 case. When you do a delete, the tuple to be deleted is 
 indicated by the ctid field which has been passed up from the 
 base table through the rest of the query. When you reach the 
 top the ctid better refer to the same tuple or you'll delete 
 the wrong one. UPDATE is the same.

For all these purposes you will be holding the ctid of a visible
(to someone) tuple. Those don't qualify for a new SITC tuple anyway.

 For all intents and purposes, the CTID of tuple can't change 
 unless you're 100% certain no-one is using it in any way.

For all I know, no one is using dead tuples except for visibility lookup.
We would need to make sure that other backends see the new tuple either as
dead or as belonging to a still-open transaction as long as the contents are
not valid. I think we could do that without a vacuum lock on platforms that
support 4-byte atomic operations.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-27 Thread Zeugswetter Andreas DCP SD

Very nice explanation, thanks Alvaro.

 2. Mark frozen databases specially somehow.
To mark databases frozen, we need a way to mark tables as frozen.
How do we do that?  As I explain below, this allows some nice
optimizations, but it's a very tiny can full of a huge amount of
worms.
 
 Marking a Table Frozen
 ==
 
 Marking a table frozen is as simple as setting relminxid = 
 FrozenXid for a table.  As explained above, this cannot be 
 done in a regular postmaster environment, because a 
 concurrent transaction could be doing nasty stuff to a table. 
  So we can do it only in a standalone backend.

If you lock the table exclusively during vacuum, that could be done with
VACUUM FREEZE. I like that more than changing stuff that is otherwise
completely frozen/static. (I see you wrote that below.)

 On the other hand, a frozen table must be marked with 
 relminxid = a-regular-Xid as soon as a transaction writes 
 some tuples on it.  Note that this unfreezing must take 
 place even if the offending transaction is aborted, because 
 the Xid is written in the table nevertheless and thus it 
 would be incorrect to lose the unfreezing.

The other idea was to require a special unfreeze command ...

 
 This is how pg_class_nt came into existence -- it would be a 
 place where information about a table would be stored and not 
 subject to the rolling back of the transaction that wrote it.

Oh, that puts it in another league, since it must guarantee commit.
I am not sure we can do that. The previous discussion was about concurrency,
and about data that was not so important, like the tuple count.

In short:
- I'd start with #1 (no relminxid = FrozenXid), like Tom suggested
- and then implement FREEZE/UNFREEZE with exclusive locks, like Simon wrote
  (so it does not need pg_class_nt), and use that for the templates.

Simon wrote:
 Suggest that we prevent write operations on Frozen tables by revoking
all INSERT, UPDATE or DELETE rights held, then enforcing a check during
GRANT to prevent them being re-enabled. Superusers would need to check
every time. If we don't do this, then we will have two contradictory
states marked in the catalog - privileges saying Yes and freezing saying
No.

No, I'd not mess with the permissions; I'd return a different error when
trying to modify a frozen table. (It would also be complicated to unfreeze
after CREATE DATABASE.)
We should make it clear that freezing is no replacement for REVOKE.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Non-transactional pg_class, try 2

2006-06-27 Thread Zeugswetter Andreas DCP SD

   Suggest that we prevent write operations on Frozen tables by 
   revoking
  all INSERT, UPDATE or DELETE rights held, then enforcing a check 
  during GRANT to prevent them being re-enabled. Superusers would need

  to check every time. If we don't do this, then we will have two 
  contradictory states marked in the catalog - privileges saying Yes and 
  freezing saying No.
  
  No, I'd not mess with the permissions and return a different error 
  when trying to modify a frozen table. (It would also be complicated
to 
  unfreeze after create database) We should make it clear, that
freezing 
  is no replacement for revoke.
 
 That was with a mind to performance. Checking every INSERT, 
 UPDATE and DELETE statement to see if they are being done 
 against a frozen table seems like a waste.

I'd think we would have relminxid in the relcache, so I don't buy the
performance argument :-) (You could still do the actual check in the
same place where the permission is checked)

 There would still be a specific error message for frozen 
 tables, just on the GRANT rather than the actual DML statements.

I'd still prefer to see the error on modify. Those that don't can
revoke.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD

  On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index
row 
  as dead because it actually points to more than one row on the
page, 
  some are non-visible, some are visible.
  
  Back up the truck ... you mean in the current code base we have heap

  tuples that are visible in index scans because of heap tuple
chaining 
  but without index tuples pointing directly at them?
 
 I don't know where this idea came from, but it's not true.  
 All heap tuples, dead or otherwise, have index entries.  

When using CITC you would be reusing the index tuples from the current
heap tuple, so you can only reuse free space or a dead member of a CITC
chain. You cannot reuse a dead tuple that is not a member of a CITC chain,
because that has separate (now invalid) index tuples pointing at it.

Part of the trick was moving slots (== ctid) around, so I still do not
really see how you can represent the CITC chain as part of the update chain.
Unless you intend to break dead parts of the update chain? Maybe that is ok?

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD

  head of the chain yet.  With an index scan, finding the head is
easy, 
  but for a sequential scan, it seems more difficult, and we don't
have 
  any free space in the tail of the chain to maintain a pointer to the
head.
 
 Thinking some more, there will need to be a bit to uniquely 
 identify the head of a CITC.

I don't think so. It would probably be sufficient to impose an order on the
CITC, e.g. the oldest tuple version in the CITC is the head.
(An idea, just in case we can't spare a bit. :-)

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-23 Thread Zeugswetter Andreas DCP SD

 back and forth the data between an archive table and the live 
 table, based on how active the groups are, I can't imagine 
 any other way of partitioning it. And that would also mean 
 some quite big load given the pretty high dynamics of the groups.

You said the activity comes in bursts per group, so the obvious
partitioning would be per group.
If you have too many groups to have one partition per group you could
try to find some modulo or other rule to spread them into separate
partitions.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-23 Thread Zeugswetter Andreas DCP SD
  This could be a solution... but then I'm not sure how well would do 
  queries which need the first 10 records based on some criteria which

  does not include the group id. I guess limit queries across the
union 
  of the partitions don't work too well for now, and we do 
 have such queries.

You would query the parent (no union). Do you need ORDER BYs?
Without an ORDER BY it is currently no problem.

Do we push the LIMIT down to the separate tables when we have an appropriate
index for the ORDER BY (that could be a TODO item)?
(You need at most LIMIT rows per child in the outer order.)

Or we would need to implement an efficient index merge node for ORDER BY
queries on the parent (and UNION ALLs) with low limits and an appropriate
index.

Selecting the oldest x rows from a time-partitioned table is a frequent
problem we need to work around here too (Informix db).

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-22 Thread Zeugswetter Andreas DCP SD

  Each time the record is updated, a new version is created, thus 
  lengthening the correct version search each time row is accessed,

  until, of course, the next vacuum comes along and corrects the
index 
  to point to the latest version of the record.
 
  Is that a fair explanation?
 
  No, it's not.
 
  1.  The index points to all the versions, until they get vacuumed
out.

 it points to the last current version as updated by vacuum, or the
first version 
 of the row.

no, the index has one entry for each version of the row.
This is why updating only non-indexed columns is relatively expensive
in pg.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] sync_file_range()

2006-06-20 Thread Zeugswetter Andreas DCP SD

  Indeed, I've been wondering lately if we shouldn't resurrect 
  LET_OS_MANAGE_FILESIZE and make that the default on systems with 
  largefile support.  If nothing else it would cut down on open/close 
  overhead on very large relations.

I'd still put some limit on the file size, else you cannot manually
distribute a table across spindles anymore. Also, some backup solutions
are not too happy with too-large files either (they have trouble with
staging the backup). I would suggest something like 32 GB.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] sync_file_range()

2006-06-20 Thread Zeugswetter Andreas DCP SD
Tom Lane [EMAIL PROTECTED] writes: 
  Indeed, I've been wondering lately if we shouldn't resurrect 
  LET_OS_MANAGE_FILESIZE and make that the default on systems with 
  largefile support.  If nothing else it would cut down on open/close 
  overhead on very large relations.
 
  I'd still put some limit on the filesize, else you cannot manually 
  distribute a table across spindles anymore. Also some 
 backup solutions 
  are not too happy with too large files either (they have 
 trouble with 
  staging the backup). I would suggest something like 32 Gb.
 
 Well, some people would find those arguments compelling and 
 some wouldn't.  We already have a manually configurable 
 RELSEG_SIZE, so people who want a 32Gb or whatever segment 
 size can have it.
 But if you're dealing with terabyte-sized tables that's still 
 a lot of segments.
 
 What I'd be inclined to do is allow people to set RELSEG_SIZE 
 = 0 in pg_config_manual.h to select the unsegmented option.  
 That way we already have the infrastructure in pg_control etc 
 to ensure that the database layout matches the backend.

That sounds perfect. Still leaves the question of what to default to ?

Another issue is that we would probably need to detect large-file support
of the underlying filesystem, else we might fail at runtime :-(
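
For illustration, the knob Tom describes lives in src/include/pg_config_manual.h
and would look roughly like this (comments and the alternative value are only
illustrative; RELSEG_SIZE is counted in BLCKSZ-sized blocks):

#define BLCKSZ      8192

/* #define RELSEG_SIZE 0                        unsegmented: one file per relation */
#define RELSEG_SIZE (0x40000000 / BLCKSZ)    /* 1 GB segments, the usual default   */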

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] MultiXacts WAL

2006-06-19 Thread Zeugswetter Andreas DCP SD

 I would like to see some checking of this, though.  Currently 
 I'm doing testing of PostgreSQL under very large numbers of 
 connections (2000+) and am finding that there's a huge volume 
 of xlog output ... far more than 
 comparable RDBMSes.   So I think we are logging stuff we 
 don't really have to.

I think you really have to lengthen the checkpoint interval to reduce
WAL overhead (20 min or so). Also imho you cannot only compare the log
size/activity since other db's write part of what pg writes to WAL to 
other areas (physical log, rollback segment, ...).

If we cannot afford lengthening the checkpoint interval because of too heavy
checkpoint load, we need to find ways to tune the bgwriter, not reduce the
checkpoint interval.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-12 Thread Zeugswetter Andreas DCP SD

  This bothers me a bit, because in
  fact the effects if any of the tested query would have been rolled 
  back.  Not sure we have any choice though.  If we expose the error 
  then we'll have problems with clients not showing the EXPLAIN 
  results.
 
  I think we should leave it in top level, throw the error and fix the

  clients.
  As I understood, the idea was, that it only does that if you press
^C 
  or query timeout. In this case current clients would also not show
the 
  plan.
 
 Not if the clients are implemented per protocol spec.  A 
 client cannot assume that sending QueryCancel will make the 
 current query fail.

Sorry, I don't understand that comment. I did not say that it must fail;
the case I meant was when it is interrupted (and thus fails).

You stated that current clients won't show the EXPLAIN output if they get an
error response. (Does the protocol not allow both data and an error?)

We would need to teach clients to output the EXPLAIN result even if an error
is returned.

I stand by my comment: on ^C we should return the plan and return the error.
We should not misuse automatic subtransactions for this.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] More on inheritance and foreign keys

2006-06-09 Thread Zeugswetter Andreas DCP SD

  I think that the ONLY was wrong from day one :-(
 
 Well, sure, but until we have an implementation that actually 
 *works* across multiple tables, it has to be there so that we 
 can at least consistently support the current single-table 
 semantics.  Until we have some form of cross-table unique 
 constraint (index or whatever) we can't support multi-table 
 foreign keys 

 --- taking off the ONLY is not a fix.

Um, I think it would work for a special case, where the unique constraint
includes the partitioning column[s] and the partitions (check constraints)
don't overlap.
In this case you can create simple unique indexes on the subtables.

Looking at other DBs, this is not such an exceptional requirement for unique
indexes to share the same partitioning scheme as the table.
And imho the case where all indexes share the table's partitioning scheme is
the most important use case.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-09 Thread Zeugswetter Andreas DCP SD
 This bothers me a bit, because in 
 fact the effects if any of the tested query would have been 
 rolled back.  Not sure we have any choice though.  If we 
 expose the error then we'll have problems with clients not 
 showing the EXPLAIN results.

I think we should leave it at top level, throw the error and fix the clients.
As I understood it, the idea was that it only does that if you press ^C or on
query timeout. In this case current clients would also not show the plan.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] ADD/DROP constraints

2006-06-09 Thread Zeugswetter Andreas DCP SD

 On a separate note. The one major remaining piece here is in 
 constraints. I'm thinking what I have to check is that every 
 constraint present on the parent table is present on the 
 child tables. And I'm thinking I should do that by looking at 
 the constraint's textual definition (consrc).
 
 This doesn't allow you to get by with a single stronger 
 constraint -- you would still need the redundant looser 
 constraint to satisfy the inheritance.

Yes, I think you would actually want either an identical or a stronger
constraint on the child.

 But it does let you get by with constraint names that don't 
 match the parent's.
 
 I'm not sure that's such a good thing, since pg_dump would 
 then generate a redundant constraint when it generates the 
 table. Maybe that would go if constraints got conislocal and coninh.
 
 Or maybe I should insist that a matching constraint name be 
 present *and* that the source text match? That's more of a 
 pain to code though.

I think in the meantime I would check that either a source match is present
OR a constraint with the same name exists. This would allow more flexibility
and imho still enough safety checking.

Until we have (or feel a need for) check logic for a stronger constraint,
it would be the op's responsibility.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] ADD/DROP constraints

2006-06-09 Thread Zeugswetter Andreas DCP SD
  Or maybe I should insist that a matching constraint name be present 
  *and* that the source text match? That's more of a pain to  code
though.
 
 That could also break some partitioning schemes; I don't 
 think it's a given that parents and children have matching 
 constraints, and afaik a parent can have constraints that a 
 child doesn't.

Yea, but that is why we would have parent-ONLY constraints; they would only
apply when the tuple is actually stored in the parent relation.

In the typical partitioning case it does not really matter, since the parent
relation itself is typically empty.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] ADD/DROP INHERITS

2006-06-08 Thread Zeugswetter Andreas DCP SD

   But that's entirely inconsistent with the way inherited tables
work 
   in general.
  
  I don't see any basis for that conclusion.  The properties of a
table 
  are set when it's created and you need to do pretty explicit ALTERs
to 
  change them.
 
 It just seems weird for:
 
 CREATE TABLE foo (x,y,z) INHERITS (bar)
 
 to not be the equivalent to:
 
 CREATE TABLE foo (x,y,z)
 ALTER TABLE foo ADD INHERITS bar

Imho the op should only choose that path if he wants to fill the table
before adding the inheritance. It makes no sense to add columns with default
values to existing rows of the child table, especially when you inherit the
defaults from the parent.

So I agree with Tom that ADD INHERITS should not add columns.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] More on inheritance and foreign keys

2006-06-08 Thread Zeugswetter Andreas DCP SD

   The solution to the foreign key problem seems easy if I 
 modify PostgreSQL implementation and take off the ONLY word 
 from the SELECT query, but it's not an option for me, as I'm 

I think that the ONLY was wrong from day one :-(
The default in other areas is the table including its children.
(Not like in old pg where you had to use tab* to include children.)
(iirc leaving off ONLY is not sufficient because of locking problems)

Of course then we would need
REFERENCES tenk ONLY (unique1)
to allow current behavior.

Andreas 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Updatable views/with check option parsing

2006-05-29 Thread Zeugswetter Andreas DCP SD
  While I don't think that making WITH a fully reserved word would
cause
  any great damage, I'm unwilling to do it just to save a couple of
lines
  of code. 
 
 I think we should go ahead and promote WITH to a reserved keyword now

Oracle, MS-SQL, DB2, MySQL and Informix also have WITH reserved, so it
would
imho be ok to do it if it simplifies code.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] error-free disabling of individual child partition

2006-05-24 Thread Zeugswetter Andreas DCP SD

 Added to TODO:
 
 o Add ALTER TABLE tab ADD/DROP INHERITS parent

Sounds good, more inline with add/drop constraint.

   pg_attribute.attislocal has to be set to 'false' for ADD,
and
 attislocal: If you set this to False, you wouldn't be able to set it
 back again.

Just curious, why is that ?

   pg_attribute.attinhcount adjusted appropriately

Do we have a plan on what to do with RI constraints on the parent when
doing DROP INHERITS?
Seems all FK tables would need to be checked for keys in the local table.
(Not sure whether we have a PK on tab*, but it seems you could do it when
the partitioning column is part of the PK.)

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] LIKE, leading percent, bind parameters and indexes

2006-05-24 Thread Zeugswetter Andreas DCP SD
 AFAICS the problem is not restricted to LIKE, we can easily find a lot
of
 similar problems caused by the actual parameters. For example, SeqScan
vs.
 IndexScan vs. BitmapIndexScan for a range query. So an improvement is
 definitely needed.

 Another way is to generate a plan on the fly. What we do is to let
some
 REPLAN nodes sit on top of some critical plan node: at the execution,
we
 will compare the actual numbers we get and the estimated number we
have

Since we are deciding this on histogram data, it seems we could store the
ranges (and exception values) where this plan is not good, and replan in
case the new value does not fit.

This would also imply that we postpone (part of the) planning until we get
the first values, when the node cost largely depends on the supplied value.
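
Something like this tiny C sketch of what a cached plan would have to
remember (all names hypothetical):

typedef struct PlanValidity
{
    double      selec_lo;       /* selectivity range the cached plan */
    double      selec_hi;       /* was costed for                    */
} PlanValidity;

static int
plan_still_ok(const PlanValidity *v, double actual_selec)
{
    /* outside the range (or hitting an exception value): replan instead */
    return actual_selec >= v->selec_lo && actual_selec <= v->selec_hi;
}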

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] error-free disabling of individual child partition

2006-05-23 Thread Zeugswetter Andreas DCP SD

  table of another table. I propose a TODO item to allow this:
 
  ALTER TABLE childN INHERITS ( parent1, ... );

  We don't need a disinherit do we?

I propose: ALTER TABLE childN INHERITS ();
Thus I also think that the list should be complete, and is not an addition
to the existing inheritance.

 
 O, yes, I think we do. I can imagine that the ability to swap a table 

Agreed. Simon, were you testing how many people read to the end? :-)

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Zeugswetter Andreas DCP SD

 1) Use n sort areas for n tapes making everything purely sequential
access.

Some time ago, testing I did has shown that iff the IO block size is large
enough (256k), it does not really matter that much whether the blocks are at
random locations. I think that is still true for current-model disks.

So unless we parallelize, it is imho sufficient to see to it that we write
(and read) large enough blocks with single calls. This also poses no problem
in highly concurrent scenarios, where you do not have enough spindles.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Zeugswetter Andreas DCP SD

 Certainly, if you can't prototype a convincing performance win using
 that algorithm, it's unlikely to be worth anyone's time to 
 look harder.

That should be easily possible with LZO. It would need to be a lib that we
can optionally link to (--with-lzo), since the lib is GPL.

LZO even allows for in-place decompression and overlapping compression.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Zeugswetter Andreas DCP SD

 Unfortunatly, the interface provided by pg_lzcompress.c is probably
 insufficient for this purpose. You want to be able to compress tuples
 as they get inserted and start a new block once the output reaches a

I don't think anything that compresses single tuples without context is
going to be a win under realistic circumstances.

I would at least compress whole pages. Allow a max ratio of 1:n, have the pg
buffer cache be uncompressed, and only compress on write (the filesystem
cache then holds compressed pages).

The tricky part is predicting whether a tuple still fits in an n*8k
uncompressed (8k compressed) page, but since LZO is fast you might even test
it in corner cases (probably logic that also needs to be in the available
page freespace calculation).
Choosing a good n is also tricky; probably 2 (or 3?) is good.

You probably also want to always keep the header part of the page
uncompressed.
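
Roughly along these lines (a sketch only: the function, the 24-byte header
size and the error convention are illustrative; lzo_init() must have been
called once at startup, and out needs room for the LZO worst case of
BLCKSZ + BLCKSZ/16 + 64 + 3 bytes):

#include <string.h>
#include <lzo/lzo1x.h>

#define BLCKSZ       8192
#define PAGE_HDR_SZ  24         /* keep the page header uncompressed */

/* returns the number of bytes to write, or 0 to write the raw page instead */
static size_t
compress_page_for_write(const unsigned char *page, unsigned char *out)
{
    static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    lzo_uint    clen = 0;

    memcpy(out, page, PAGE_HDR_SZ);
    if (lzo1x_1_compress(page + PAGE_HDR_SZ, BLCKSZ - PAGE_HDR_SZ,
                         out + PAGE_HDR_SZ, &clen, wrkmem) != LZO_E_OK)
        return 0;
    return (PAGE_HDR_SZ + clen < BLCKSZ) ? PAGE_HDR_SZ + clen : 0;
}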

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] pg_dump and backslash escapes

2006-05-17 Thread Zeugswetter Andreas DCP SD

 Very clear.  The issue is that I can't find any of these emitted by a
 pg_dump version whose native backend doesn't understand them.
 
 I assume that it is expected that a cross-db dump/reload will generate
 errors, and it is done rarely for upgrades, but I assume same-version
 dump/restore is done more frequently and people don't expect errors.
 Is that not a significant distinction?

I thought the suggested procedure (see migration doc) was to use the 
new pg_dump to dump the older db version, so why backpatch ?

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pg_dump and backslash escapes

2006-05-17 Thread Zeugswetter Andreas DCP SD

  I thought the suggested procedure (see migration doc) was to use the

  new pg_dump to dump the older db version, so why backpatch ?
 
 Uh, you can suggest it, but I would guess < 50% do it, and once the old
 database is gone, there is no way to re-do the dump.

But you can still load the dump if you execute the two statements in the
new db psql session before loading the dump file, no?

   SET escape_string_warning = off;
   SET standard_conforming_strings = off;

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Zeugswetter Andreas DCP SD

  Given that any time that happens we end up caring much less about
CPU
  usage and much more about disk IO, for any of these cases that use
  non-random access, compressing the data before sending it to disk
would
  potentially be a sizeable win.
 
 Note however that what the code thinks is a spill to disk and what
 actually involves disk I/O are two different things.  If you think
 of it as a spill to kernel disk cache then the attraction is a lot
 weaker...

Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.

It would probably need some sort of plugin architecture, since the fastest
compression algorithms (LZO) that also reach good ratios are GPL.
LZO is proven to increase physical IO write speed with low CPU overhead.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Zeugswetter Andreas DCP SD

  Personally, I believe it would be worth it - but only to a few. And
  these most of these few are likely using Oracle. So, no gain unless
  you can convince them to switch back... :-)
 
 We do know that the benefit for commercial databases that use raw and
 file system storage is that raw storage is only a few percentage
 points faster.

Imho it is really not comparable, because they all use direct or async IO
that bypasses the OS buffer cache even when using filesystem files for
storage.
A substantial speed difference is the allocation of space for a restore
(no fs format and no file allocation needed).

I am not saying this to advocate moving in that direction, however.
I do think that there is substantial headroom in reducing the number of IO
calls and reducing on-disk storage requirements, especially in concurrent
load scenarios.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Wrong plan for simple join with index on FK

2006-05-16 Thread Zeugswetter Andreas DCP SD

 These are all minor abberations though, on the whole the estimates
are
 pretty good. Perhaps you need to tweak the values of random_page_cost
 and similar variables.
 
  Thank you. Is it a general problem or only mine? I have a 100% 
  standard current PC.

The default random_page_cost assumes some concurrent activity. If your
PC does nothing else concurrently, the performance of a seq scan will
be underestimated.

Try the statement with some concurrent disk load and you will most likely
see that the first plan is faster (assuming the tables are not fully
cached).

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] [HACKERS] Big IN() clauses etc : feature proposal

2006-05-11 Thread Zeugswetter Andreas DCP SD

 Something else worth considering is not using the normal 
 catalog methods
 for storing information about temp tables, but hacking that together
 would probably be a rather large task.

But the timings suggest that it cannot be the catalogs in the worst case he
showed.

 0.101 ms BEGIN
 1.451 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER
NOT  
 NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP

1.4 ms is not great for create table; is that what we expect?

 0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id
DESC  
 LIMIT 20
 0.443 ms ANALYZE tmp
 0.365 ms SELECT * FROM tmp
 0.310 ms DROP TABLE tmp
 32.918 ms COMMIT
 
   CREATING the table is OK, but what happens on COMMIT ? I hear
the disk  
 seeking frantically.

The 32 ms for the commit can hardly be catalog related. It seems the file is
fsynced before it is dropped.
Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] XLOG_BLCKSZ vs. wal_buffers table

2006-05-03 Thread Zeugswetter Andreas DCP SD

   I'm planning on continuing to increase XLOG_BLCKSZ and wal_buffers
to
   determine when the throughput starts to level out or drop 
  
  I think for an even better comparison you should scale wal_buffers
  down with increasing XLOG_BLCKSZ, so that the xlog buffer has a
fixed
  size in kb.
  
  Reasonable wal_buffers imho amount to at least 256kb, better yet 512
or 1 Mb,
  with sufficiently large transactions (and to try to factor out the
difference 
  between blocksizes).
 
 AFAIK all the transactions in DBT2 are pretty small. I think all DML
is
 single-row in fact, so I'm not sure that having wal_buffers much
larger
 than the number of connections would help much.

Well, but those updates wander around the whole table/index, so you will
have a lot of before-images to write. So I take back the "sufficiently large
transactions" part of my comment. You want more wal_buffers in all
higher-load scenarios.

(One test had 8 buffers of 2k each; this is not enough in any high-load
scenario.)

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] XLOG_BLCKSZ vs. wal_buffers table

2006-05-02 Thread Zeugswetter Andreas DCP SD

 I'm planning on continuing to increase XLOG_BLCKSZ and wal_buffers to
 determine when the throughput starts to level out or drop 

I think for an even better comparison you should scale wal_buffers
down with increasing XLOG_BLCKSZ, so that the xlog buffer has a fixed
size in kb.

Reasonable wal_buffers imho amount to at least 256 kB, better yet 512 kB or
1 MB, with sufficiently large transactions (and to try to factor out the
difference between block sizes).

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] ANSI-strict pointer aliasing rules

2006-04-27 Thread Zeugswetter Andreas DCP SD

  4. Find the option for disabling strict alias and get configure to
add
  that.
 
 You'll still lose performance, but the option is -qalias=noansi.

My old xlc does not show that option; it is unfortunately version specific.
The currently compatible option to turn it off would be -qnoansialias.

So we can use:
xlc -qnoansialias

The default cc options are: -qlanglvl=extended -qnoro -qnoroconst
So I guess we could also use (but above is imho clearer/better): 
cc -qro -qroconst -qlanglvl=extc89

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] ANSI-strict pointer aliasing rules

2006-04-27 Thread Zeugswetter Andreas DCP SD

 I ran afoul of these rules the other day when compiling pgsql 8.1 on
 AIX. The configure scripts are set up to look for xlc instead of
 cc, and that command invokes cc with -qalias=ansi, the ANSI-strict
 pointer aliasing mode.

Can you please explain what exactly was not working?
xlc has in the past shown warnings for code that was actually problematic,
which gcc did not show (and the cc variant of xlc also does not show).

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Concurrency problem building indexes

2006-04-25 Thread Zeugswetter Andreas DCP SD

  [ shrug... ]  Worksforme.
  
  There is a short interval at the end of the first CREATE INDEX on
the
  table where the problem would happen if another CREATE INDEX tries
to
  modify the pg_class row before the first one's committed.

 I did a pg_dumpall and removed the index creation commands.  The first
time
 I run the index build, I usually get at least one occurrence.

I think that narrows it down nicely. You create the table, load rows, then
create the indexes without an analyze; thus pg_class is not up to date, and
the update needs to be done.

My answer to this would be to (have an option to) omit this relpages and
reltuples update. It is imho not the task of CREATE INDEX to update
statistics in the first place. I have been burnt by that behavior when
creating indexes on empty tables in Informix and never liked it (iirc pg
has a workaround for empty tables though).

Wes, you could most likely solve your immediate problem if you did an
analyze before 
creating the indexes.
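
I.e., roughly (table and index names are made up):

analyze bigtable;
create index bigtable_col1_idx on bigtable (col1);
create index bigtable_col2_idx on bigtable (col2);

That way pg_class already carries current relpages/reltuples values before
the (possibly concurrent) index builds start.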

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Concurrency problem building indexes

2006-04-25 Thread Zeugswetter Andreas DCP SD

  Wes, you could most likely solve your immediate problem if you did
an
  analyze before creating the indexes.
 
 I can try that.  Is that going to be a reasonable thing to do when
there's
 100 million rows per table?  I obviously want to minimize the number
of
 sequential passes through the database.

No, I think it would only help if it gets the exact tuple count.
For large tables it only gets an exact count with a full scan 
(use vacuum instead of analyze).

Then again, when the table is large, the different create indexes 
should finish at sufficiently different times, so an analyze might
be sufficient to fix the problem for small tables.

(analyze is fast for large tables since it only does a sample)

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pg_dump -Ft failed on Windows XP

2006-04-21 Thread Zeugswetter Andreas DCP SD

  Apparently it won't work at all if TMP isn't set?
 
  I'm not *too* concerned about that, since TMP is normally set by the
OS
  itself. There's one set in the system environment (to
c:\windows\temp
  or whatrever) and then it's overridden by one set by the OS when it
  loads a user profile.
 
 OK, then maybe not having it would be equivalent to /tmp-not-writable
 on Unix, ie, admin error.
 
  Also to the point, what would you fall back to?
 
 Current directory maybe?

It tries \ (tested on Win 2000) if the dir argument is NULL and TMP is
not set. But TMP is usually set.

Attached is a working version not yet adapted to port/.
- memory leak fixed
- use _tempnam and _fdopen, not the compatibility tempnam and fdopen
  (imho only cosmetic)
- EACCES fixed (Win2000 needs _S_IREAD | _S_IWRITE or fails with EACCES,
  even as Admin)
- I suggest adding a prefix pg_temp_ (for leftover temp files after a
  crash; the name I get is then usually pg_temp_2)

Andreas
Index: bin/pg_dump/pg_backup_tar.c
===
RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_backup_tar.c,v
retrieving revision 1.50
diff -c -r1.50 pg_backup_tar.c
*** bin/pg_dump/pg_backup_tar.c 12 Feb 2006 06:11:50 -  1.50
--- bin/pg_dump/pg_backup_tar.c 21 Apr 2006 09:22:00 -
***
*** 362,368 
--- 362,388 
{
	tm = calloc(1, sizeof(TAR_MEMBER));

+ #ifndef WIN32
	tm->tmpFH = tmpfile();
+ #else
+	/* on win32, tmpfile() generates a filename in the root directory,
+	 * which requires administrative permissions to write to. */
+	while (1)
+	{
+		char	   *tmpname;
+		int			fd;
+
+		tmpname = _tempnam(NULL, "pg_temp_");
+		if (tmpname == NULL)
+			break;
+		fd = _open(tmpname, _O_RDWR | _O_CREAT | _O_EXCL | _O_BINARY | _O_TEMPORARY,
+				   _S_IREAD | _S_IWRITE);
+		free(tmpname);
+		if (fd == -1 && errno == EEXIST)
+			continue;			/* try again with a new name if the file exists */
+		if (fd != -1)
+			tm->tmpFH = _fdopen(fd, "w+b");
+		break;
+	}
+ #endif

	if (tm->tmpFH == NULL)
		die_horribly(AH, modulename, "could not generate temporary file name: %s\n",
					 strerror(errno));

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Practical impediment to supporting multiple SSL libraries

2006-04-14 Thread Zeugswetter Andreas DCP SD

 Well, the psqlODBC driver apparently ran into a number of problems
with
 libpq that resulted in them not using it for their purpose. 
 Given libpq primary purpose is to connect to PostgreSQL, it failing at
that is
 something that should be fixed.

I think you are forgetting that e.g. a JDBC driver will not want to depend
on an external C dll at all. It will want a native Java implementation
(Group 4). Thus imho it is necessary to have a defined wire protocol,
which we have.

So if a driver needs to use the wire protocol, that is imho not a problem.
If applications started using it because they cannot find a suitable
driver, now that would be a problem.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] schema-qualified SET CONSTRAINTS

2006-04-12 Thread Zeugswetter Andreas DCP SD

  The attached patch allows SET CONSTRAINTS to take a schema
qualified
  constraint name (myschema.t1_fk_t2) and when given a bare
constraint name
  it uses the search_path to determine the matching constraint
instead of
  the previous behavior of disabling all identically named
constraints.
 
  This patch seems egregiously non backwards compatible :-(.
 
 Yes, it does change the existing behavior, but egregiously? How many

 applications intentionally defer constraints in multiple schemas at
once? 

intentionally defer specifically named constraints in multiple schemas
(the default application would imho either defer all constraints, or one
specific constraint)

 Not many.  I would guess the more likely situation is that these 
 applications don't even realize that they are deferring more than one 
 constraint when it happens.

I agree. I think the new behavior is more intuitive, and would even
argue 
the old behavior gets it wrong.
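
I.e. the two patterns one would actually expect in applications are (just a
sketch; the schema-qualified form is what the patch adds):

set constraints all deferred;                -- defer everything
set constraints myschema.t1_fk_t2 deferred;  -- defer one specific constraint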

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Accessing schema data in information schema

2006-03-23 Thread Zeugswetter Andreas DCP SD

 Plan C would be to say that we don't need to preserve SELECT * FROM
 seqname, but I'll bet there would be some hollering.

I'd like to hear this hollering first, before we create tons of views :-)
Imho it is not a problem to remove it; I am for Plan C.
(Those who need the select can still create the view by hand.
A release note would be sufficient imho.)
Of course, if we still need one row in pg_class for the ACLs, that row
might as well be a view.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-10 Thread Zeugswetter Andreas DCP SD

 Two pass will create the count of subfiles proportional to:
 Subfile_count = original_stream_size/sort_memory_buffer_size
 
 The merge pass requires (sizeof record * subfile_count) memory.

That is true from an algorithmic perspective. But to make the merge
efficient you would need enough RAM to cache a reasonably large block per
subfile. Else you would need to reread the same page/block from one
subfile multiple times.
(If you had one disk per subfile you could also rely on the disk's own
cache, but I think we can rule that out.)

 Example:
 You have a 7 gigabyte table to sort and you have 100 MB sort buffer.
 The number of subfiles will be:
 7000 MB / 100 MB = 70 files

To be efficient you need (70 + 1) * max(record_size, 256 kB) = ~18 MB.

Plus you need a structure per subfile that points to the current record
in the buffer.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-10 Thread Zeugswetter Andreas DCP SD

   Two pass will create the count of subfiles proportional to:
   Subfile_count = original_stream_size/sort_memory_buffer_size
   
   The merge pass requires (sizeof record * subfile_count) memory.
  
  That is true from an algorithmic perspective. But to make the merge 
  efficient you would need to have enough RAM to cache a reasonably 
  large block per subfile_count. Else you would need to 
 reread the same 
  page/block from one subfile multiple times.
  (If you had one disk per subfile you could also rely on the 
 disk's own 
  cache, but I think we can rule that out)
 
 But what about the OS cache? Linux will read upto the next 
 128KB of a file if it's contiguous on disk, which is likely 
 with modern filesystems. It's likely to be much fairer than 
 any way we can come up with to share memory.

We were discussing how much RAM is needed, not how much the backend
allocates itself. So if the backend needs to duplicate some of the OS
cache, that will only add to the memory requirement. The most likely
scenario is that the backend additionally holds one page per subfile.

 Question is, do we want our algorithm to rely on that caching?

Currently we do, and I don't think that is so bad, actually.
The only optimization I would consider is adding a sequential-access hint
to the tape file open :-).

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] problem with large maintenance_work_mem settings and

2006-03-10 Thread Zeugswetter Andreas DCP SD

  I'll look into it, but I was already wondering if we shouldn't bound

  the number of tapes somehow.  It's a bit hard to believe that 28000 
  tapes is a sane setting.
 
 Well, since they are not actually tapes, why not?

I wonder what the OS does when we repeatedly open and close those files
because we are short on file descriptors? Will it replace cached pages
of a file that we have closed *more* aggressively?

Maybe we should limit the files to as many as we would actually be able
to hold open in parallel? Or keep more than one tape in one file
and remember a start offset into the file per tape.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Merge algorithms for large numbers of tapes

2006-03-09 Thread Zeugswetter Andreas DCP SD

  This amounts to an assumption that you have infinite work_mem, in
 which
  case you hardly need an external sort at all.  If your 
 work_mem is in 
  fact finite, then at some point you need more than two passes.  I'm
 not
  really interested in ripping out support for sort 
 operations that are 
  much larger than work_mem.
 
 No it does not.  I have explained this before.  You can have 
 one million files and merge them all into a final output with 
 a single pass.  It does not matter how big they are or how 
 much memory you have.

Huh? But if you have too many files your disk access is basically going
to be random access (since you have thousands of files per spindle).
From tests on AIX I have pretty much concluded that if you read 256 kB
blocks at a time, random access does not really hurt that much any more.
So, if you can hold 256 kB per file in memory, that should be sufficient.
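(For the 7 GB table / 100 MB sort buffer example discussed elsewhere in
this thread, that is about 70 files x 256 kB, i.e. roughly 18 MB.)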

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Automatic free space map filling

2006-03-06 Thread Zeugswetter Andreas DCP SD

  But you could do the indexes first and remember how far you 
 can vacuum 
  the heap later.
 
 But the indexes _can't_ be done first; you _first_ need to 
 know which tuples are dead, which requires looking at the 
 table itself.

If we already had the "all tuples visible" bitmap, I think we could
first scan the bitmap and decide whether we can afford to look
at the visibility info for each entry in the index.

We only collect the ctids beforehand so we don't have the inefficient
lookups, but if we can avoid the lookup in most cases it would again be
attractive.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Automatic free space map filling

2006-03-03 Thread Zeugswetter Andreas DCP SD

  But what about index clearing?  When do you scan each index?
 
 At the end of each iteration (or earlier, depending on 
 maintenance_work_mem).  So for each iteration you would need 
 to scan the indexes.
 
 Maybe we could make maintenance_work_mem be the deciding 
 factor; after scanning the indexes, do the release/reacquire 
 locks cycle.

But you could do the indexes first and remember how far you can vacuum
the heap later.

So you might as well do each index separately first and remember
how far you can go with the heap for each one.
Then do the heap with a special restriction that comes from what you
remembered from the indexes.
You can now split the heap vacuum into arbitrarily large transactions,
since the indexes are already taken care of.

(You only vacuum up to the point of the oldest vacuumed index.)

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Automatic free space map filling

2006-03-02 Thread Zeugswetter Andreas DCP SD

 I thought we had sufficiently destroyed that reuse a tuple 
 meme yesterday.  You can't do that: there are too many 
 aspects of the system design that are predicated on the 
 assumption that dead tuples do not come back to life.  You 
 have to do the full vacuuming bit (index entry removal, 
 super-exclusive page locking, etc) before you can remove a dead tuple.

One more idea I would like to throw in.
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
(If you still wanted to be able to locate index entries fast, you would
need to keep the indexed columns, but I think we agreed that there is no
real use.)

I think that would be achievable at reasonable cost (since you can avoid
one page IO) on the page of the currently active tuple (the first page
that is considered).

On this page:
if freespace available
  -- use it
elsif freespace available after reducing all dead rows
  -- use the freespace with a new slot
else
  -- fall back to the current behaviour (look for free space elsewhere)

Of course this only works when we still have free slots,
but I think that might not really be an issue.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Automatic free space map filling

2006-03-02 Thread Zeugswetter Andreas DCP SD

 I think you must keep the header because the tuple might be 
 part of an update chain (cf vacuuming bugs we repaired just a 
 few months ago).
 t_ctid is potentially interesting data even in a certainly-dead tuple.

yes, I'd still want to keep the full header.

 Andreas' idea is possibly doable but I am not sure that I see 
 the point.
 It does not reduce the need for vacuum nor the I/O load 
 imposed by vacuum.  What it does do is bias the system in the
 direction of allocating an unreasonably large number of tuple 
 line pointers on a page (ie, more than are useful when the 
 page is fully packed with normal tuples).

 Since we never 
 reclaim such pointers, over time all the pages in a table 
 would tend to develop line-pointer-bloat.  I don't know what 
 the net overhead would be, but it'd definitely impose some 
 aggregate inefficiency.

Ok, for vacuum the slot would look like any other dead row and thus be a
target for removal.

Why do we not truncate the line pointer array?
Is it that vacuum (not the full version) does not move rows to other
pages or slots? Of course vacuum full could do it, but I see your point.
Maybe we could impose an upper limit on the number of slots to allow,
after which the optimization is turned off.
But this starts to sound not so good :-(

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] GRANT/REVOKE: Allow column-level privileges

2006-01-31 Thread Zeugswetter Andreas DCP SD

 3) For every privilege descriptor in CPD whose action is 
 INSERT, UPDATE,
or REFERENCES without a column name, privilege descriptors are also
created and added to CPD for each column C in O for which 
 A holds the
corresponding privilege with grant option. For each such column, a
privilege descriptor is created that specifies the 
 identical grantee,
the identical action, object C, and grantor A. 
 
 4) For every privilege descriptor in CPD whose action is 
 SELECT without a
column name or method name, privilege descriptors are also 
 created and
added to CPD for each column C in O for which A holds the 
 corresponding
privilege with grant option. For each such column, a privilege
descriptor is created that specifies the identical grantee, the
identical action, object C, and grantor A. 
 
 As I read it, granting a table-level privilege is equivalent 
 to repeating the appropriate column-level privilege for all 
 columns.  In other words:
 
 For this table:
 
   CREATE TABLE tab (c1 int, c2 int, c3 int);
 
 This statement:
   GRANT SELECT ON tab TO grantee;
 
 ...also implies:
 
   GRANT SELECT (c1) ON tab TO grantee;
   GRANT SELECT (c2) ON tab TO grantee;
   GRANT SELECT (c3) ON tab TO grantee;
 
 This means that after the following, the grantee should have 
 no privileges on tab.c1 (but should retain them on tab.c2, tab.c3):
 
   GRANT SELECT ON tab TO grantee;
   REVOKE SELECT (c1) ON tab FROM grantee;

I don't (do not want to) read that conclusion from the above paragraphs;
anyone else?
My reasoning is that you can only revoke what has previously been granted.

e.g. grant dba to grantee;
cannot be revoked with: revoke select on tab from grantee; for that table.

I think the paragraphs have only been added to clarify what rights you
have on each column.
 
Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] gprof SELECT COUNT(*) results

2005-11-29 Thread Zeugswetter Andreas DCP SD

  OTOH DB2 and SQLServer take block level 
  read locks, so they can do this too, but at major loss of
concurrency
  and threat of deadlock.

Note that in the usual read committed isolation they do not need to
read-lock a row! E.g. Informix only verifies that it could lock the row
(that there is no write lock). Only cursor stability leaves one read lock
until the next fetch, serializable actually leaves all read locks, and
select for update leaves an intent-update lock.

Also, they usually feed a buffer of rows to the client, so if the client
does a fetch it gets a row from the client-side buffer. Only when the
buffer is empty does it get more from the server.

I think the statement holds that the optimization is pg specific and
cannot be directly compared to other DBs.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] gprof SELECT COUNT(*) results

2005-11-29 Thread Zeugswetter Andreas DCP SD

 DB2:
 Uncommitted Read (UR) mode Dirty read isn't the default, or 
 the recommended lock level for most apps. I was considering 
 Cursor Stability mode (or higher), which is the default 

Sorry, they call it read committed but actually do cursor stability,
which does keep one lock on the last fetched row. Keeping the lock would

actually not be necessary to conform with ANSI read committed.

See table 4 on Page 8 of
http://www.cs.ndsu.nodak.edu/~yawang/Snapshot.ppt

 SQLServer:
 READ COMMITTED does take share locks.

But it does not hold them. According to the docs it holds them "while
reading", which is not a very detailed description. How long is that
really, e.g. with an ODBC forward-cursor fetch?

 There's a NO LOCK hint, true, but its not a default.

That is for dirty/uncommitted reads.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Should libedit be preferred to libreadline?

2005-11-22 Thread Zeugswetter Andreas DCP SD

 With AIX 5, the easiest way to get a shared object is to pass
-bexpall
 to the linker. This results in all symbols being exported.

Yes, that is another reason not to use this broken switch.
And last time I checked (AIX 4.3.3), -bexpall did not export all needed
symbols (e.g. globals) from the backend either. And the counterpart
-bimpall also did not work.
Dynamic loading did not work without the .imp and .exp files :-(

Andreas

PS: I'd prefer if readline was only linked where it is needed, namely in
psql.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Improving count(*)

2005-11-18 Thread Zeugswetter Andreas DCP SD

  The instant someone touches a block it would no longer be marked as 
  frozen (vacuum or analyze or other is not required) and count(*)
would 
  visit the tuples in the block making the correct decision at that
time.
 
 Hmm, so the idea would be that if a block no longer contained any
tuples hidden from any active transaction,
 you could store the count and skip reading that page.

I like the approach of Informix and MaxDB, which can tell the count(*)
instantly without looking at index leaf or data pages.

Imho we could do that with a central storage of count(*) even with mvcc.
The idea is a base value for count(*) and corrective values per open xid.
To tell the count you add all corrective values whose xid is visible in
the snapshot.
Each backend is responsible for compacting xid counters below the min
open xid.
Periodically (e.g. at checkpoint time) you compact (aggregate committed
xid counters into the base value) and persist the count.
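
At the SQL level the same bookkeeping can be illustrated with a delta table
whose rows get their visibility from mvcc, which then plays the role of the
per-xid corrective counters (a rough user-level sketch only; all names are
made up, plpgsql is assumed to be installed, and the backend-internal
version would of course not use triggers):

create table xx_count_delta (delta bigint not null);
insert into xx_count_delta select count(*) from xx;  -- seed with the current count

create function xx_count_trig() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        insert into xx_count_delta values (1);
        return new;
    else
        insert into xx_count_delta values (-1);
        return old;
    end if;
end;' language plpgsql;

create trigger xx_count_maint after insert or delete on xx
    for each row execute procedure xx_count_trig();

-- count(*) then becomes a cheap aggregate over the (small) delta table:
select sum(delta) from xx_count_delta;

A background job would periodically compact the delta table by replacing
all rows visible to everyone with a single row carrying their sum, just
like the base value above.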

Since that costs, I guess I would make it optional and combine it with
materialized views that are automatically used at runtime, and can at the
same time answer other aggregates or aggregates for groups:

create materialized view xx_agg enable query rewrite as
    select count(*), sum(col1) from xx [group by col2];

Your page flag storage could possibly also be used for btree access, to
short-circuit the heap visibility lookup (e.g. for pages where all rows
are visible (vacuumed)).
I think that your proposal is too complex if it is not also used to
improve other performance areas.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] MERGE vs REPLACE

2005-11-18 Thread Zeugswetter Andreas DCP SD

 Unless you have a table lock, INSERT has to be before UPDATE, think
UPDATE, UPDATE (both fail), INSERT, INSERT.

  update
  if no rows updated
      insert
      if duplicate key
          update
          if no rows updated goto insert

That is why you have the loop. This is not a problem with the above code,
because only one insert succeeds while the others then do the update.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Improving count(*)

2005-11-18 Thread Zeugswetter Andreas DCP SD

  Since that costs, I guess I would make it optional and combine it
with 
  materialized views that are automatically used at runtime, and can
at 
  the same time answer other aggregates or aggregates for groups.
  create materialized view xx_agg enable query rewrite as select 
  count(*), sum (col1) from xx [group by col2];
  
 
 I wonder how many times you really need a count(*) w/o where clause.
 If I understand you correctly you are trying to optimize just this one
case?

I guess you have not read to the end. A materialized view with a group by,
as indicated in the example, is able to answer all sorts of queries
with or without where clauses (e.g. "... where col2 = 'x'").
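
For illustration (hypothetical, since pg does not rewrite queries against
materialized views today, and assuming the grouping column col2 is stored
in the view): a query such as

select count(*) from xx where col2 = 'x';

could then be answered from the much smaller aggregate view as

select count from xx_agg where col2 = 'x';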

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] tablespaces and non-empty directories

2005-11-17 Thread Zeugswetter Andreas DCP SD

  This is because lost+found exists. Since lost+found would be a 
  reasonably common directory to find at a mount-point on Unix-like 
  OSs*, would it make sense for CREATE TABLESPACE to ignore it if
present?

 No.  There is no reason to use a volume's root directory as a
tablespace;
 especially so since the root directory ought to be owned by root

That is not so on AIX. Only the mount point (the dir in the parent) is
owned by root.
Once mounted, it can have (and preserves) any permission you want.
But on AIX the workaround is to remove the directory after mounting and
before creating the tablespace.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] MERGE vs REPLACE

2005-11-17 Thread Zeugswetter Andreas DCP SD
 The problem I try to solve is something along: a bunch of clients try
to update a count, and ONE of them must initialize  the count if it
does not exist... this can't be done with current infrastructure without
race conditions.

The solution without MERGE, but with a unique key, in other DBs is:

update
if no rows updated
    insert
    if duplicate key
        update
        if no rows updated goto insert

Note that the counter updates need to be of the form "set x = x + ? where
key = y". Do you see a potential race condition with this?
In pg you also need a savepoint before the insert for this to work.
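
The savepoint can be spelled as a plpgsql block with an EXCEPTION clause;
a minimal sketch (table, column and function names are made up):

create function upsert_counter(k integer, d integer) returns void as '
begin
    loop
        update counters set x = x + d where key = k;
        if found then
            return;
        end if;
        begin
            insert into counters (key, x) values (k, d);
            return;
        exception when unique_violation then
            -- a concurrent insert beat us to it; loop back and do the update
        end;
    end loop;
end;' language plpgsql;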

Depending on the ratio of inserts vs. updates we also start with the
insert when the insert succeeds more than 50% of the time (I would use a
higher percentage with pg though):

insert
if duplicate key
    update
    if no rows updated goto insert

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly