Re: [HACKERS] win32 tablespace handling

2004-10-06 Thread Zeugswetter Andreas SB SD

  hardlinks and junctions don't work across physical disks, only symlinks.
 
 Where did you read this?  I just looked and can see no such restriction.

There is no such restriction for junctions; I just tried it to be safe.

Andreas



Re: [HACKERS] AIX and V8 beta 3

2004-10-04 Thread Zeugswetter Andreas SB SD

  Have you tried using cc_r for that compile line?  Does that help?
 
 Alas, that is not an available option.
 
 cc_r is specific to the AIX xlc compiler; we're using GCC, and xlc
 is not available to us.

What is missing is a -lpthread.

Andreas



Re: [HACKERS] psql questions: SQL, progname, copyright dates

2004-09-10 Thread Zeugswetter Andreas SB SD

 I think adding 'S' to \df confuses more than it helps.

Why is that? Imho it would be consistent.

Andreas



Re: [HACKERS] APR 1.0 released

2004-09-10 Thread Zeugswetter Andreas SB SD

  Well, it's easily changed, if all that's needed is a search-and-replace.
  Suggestions for a better name?
 
 MINGW32

I think that is a bad idea. That symbol sure suggests that you are using mingw.
Are you expecting someone who creates a VisualStudio project to define
MINGW32?

Andreas



Re: [HACKERS] APR 1.0 released

2004-09-10 Thread Zeugswetter Andreas SB SD

  Personally I don't think that any rename()-usleep loop is necessary.
  I'll check the archives.
 
 I agree the rename loop seems unnecessary.  I kept it in case we hadn't
 dealt with all the failure places.  Should we remove them now or wait
 for 8.1?  Seems we should keep them in and see if we get reports from
 users of looping forever, and if not we can remove them in 8.1.

What I do not understand is that Windows has rename() and _unlink().
Are we using those or not?

Looping forever is certainly not good, but I thought the current code
had a limited loop. I think a limited loop is required, since neither
rename() nor _unlink() can cope with a locked file.

Andreas



Re: [HACKERS] PITR: XLog File compression on Archive

2004-08-24 Thread Zeugswetter Andreas SB SD

 PS: but something you *could* do in 8.0 is replace cp by gzip to
 archive compressed files that way.

How about replacing the page image records with a same-size dummy record
that only contains a dummy header and all zeros? If the archiver cares about
space, it will use some sort of compression anyway.

Andreas



Re: [HACKERS] 8.0 beta 1 on linux-mipsel R5900

2004-08-24 Thread Zeugswetter Andreas SB SD

  Think harder... one processor != one process...
 
  Well sure, but you don't want a spinlock in that case.
 
 Actually you do, when the normal case is that you don't have to block.
 You want it to fall through as quickly as possible in the success case
 (the blocking case is going to suck no matter what).  Given the present
 set of available/portable technologies, spinlocks win.

I guess it could still save some CPU cycles in the single-CPU case
if you yield() instead of looping tightly around TAS in the failure case.
The problem is that yield() and detecting a single CPU are not portable.

Andreas



Re: [HACKERS] PITR: XLog File compression on Archive

2004-08-24 Thread Zeugswetter Andreas SB SD

 Re-thinking the whole purpose of the additional full page images appended to
 the xlog records, I now understand and agree with Tom's comment in the docs
 that we don't need to include those additional full page images for PITR -
 they only exist to correctly recover the database in the event of a crash.
 The reason for this is that the base backup provides a full set of blocks on
 which to recover - there is no danger that the backup contains bad blocks,
 as would be the case for crash recovery.

It is correct that the base backup cannot contain bad blocks (on OS's we know).
But it can contain newer blocks than the WAL records would expect.
Will it not matter if e.g. a page split for a btree index is already in the
index file, but a WAL record exists that references the not-yet-split page?
I think not, but I am not sure, since normal crash recovery won't see this
situation because of the page images (which can be older than what is on disk).

Also, is there not now code that logs index recreation by logging only
page images? We would still need those, no?

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-30 Thread Zeugswetter Andreas SB SD

  I was wondering about this point - might it not be just as reasonable 
  for the copied file to *be* an exact image of pg_control?  Then a very 
  simple variant of pg_controldata (or maybe even just adding switches to 
  pg_controldata itself) would enable the relevant info to be extracted
 
 We didn't do that so admins could easily read the file contents.

If you use a readable file, you will also need a restore feature (or a tool)
to create an appropriate pg_control file. Or are you intending to still require
that pg_control be the first file backed up?
Another possibility would be that the start function writes the readable file and
also copies pg_control.

Andreas



Re: [HACKERS] Sketch of extending error handling for subtransactions

2004-07-26 Thread Zeugswetter Andreas SB SD

 I was just looking around the net to see exactly what Oracle's PL/SQL
 syntax is.  It doesn't seem too unreasonable syntax-wise:
 
   BEGIN
   ... controlled statements ...
   EXCEPTION
   WHEN exception_name THEN
   ... error handling statements ...
   WHEN exception_name THEN
   ... error handling statements ...
   ...
   WHEN OTHERS THEN
   ... error handling statements ...
   END;
 
 There's nothing here we couldn't do.  However, it seems that Oracle
 thinks you should throw in explicit SAVEPOINT and ROLLBACK statements
 on top of this!  That's just weird.  It might be that we should
 deliberately *not* adopt the exact syntax they are using, just so we
 don't create compatibility gotchas.

That is because they usually use this to handle the exception of only one
potentially failing statement, which does not roll back any previous statements
(except in pg).
Thus in Oracle you need savepoints in far fewer cases. It is only in those seldom
cases that you add savepoints on top of blocks with exceptions in Oracle.

But yes, I think doing implicit savepoints for plpgsql blocks that contain an
EXCEPTION ... handler is absolutely the way to proceed. For maximum portability
that savepoint should probably travel along with every successful statement after
the BEGIN (that is of course debatable).

BEGIN
    -- implicit savepoint x
    update 1        -- succeeds
    -- implicit release of old x, new savepoint x
    update 2        -- fails
    -- position y
    update 3        -- succeeds
EXCEPTION           -- rollback to savepoint x
    WHEN ...        -- transfer control to position y
END;                -- implicit RELEASE of savepoint x

Doing this only for blocks with an exception handler would not impose any overhead
for conventional plpgsql funcs.
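
For concreteness, a sketch of what such a block could look like in PL/pgSQL,
with the savepoint left implicit (hedged: this mirrors the Oracle-style syntax
quoted above; the function, table, and error-condition names are illustrative):

CREATE FUNCTION upsert_val(integer, text) RETURNS void AS '
BEGIN
    -- implicit savepoint taken here, because the block has a handler
    INSERT INTO vals (id, val) VALUES ($1, $2);
EXCEPTION
    WHEN unique_violation THEN
        -- implicit rollback to the savepoint, then the handler runs
        UPDATE vals SET val = $2 WHERE id = $1;
END;
' LANGUAGE plpgsql;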

Andreas

PS: Can someone please help me get through the list's spam blocker, or get Marc to
contact me? I don't know what else I can do.



Re: [HACKERS] Fixing PKs and Uniques in tablespaces

2004-07-23 Thread Zeugswetter Andreas SB SD

  Also, since I checked and it seems that our syntax for putting tables
  and indexes in tablespaces at creation time is identical to oracle's,
  perhaps we should copy them on constraints as well.
 
 Since we're getting close to beta, can we have consensus on what I'm to 
 do about this?

The Oracle 10g documentation has USING INDEX TABLESPACE blabla;
none of the words are optional.
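
Carried over to our constraint syntax, that would give something like
(sketch; table and tablespace names invented):

CREATE TABLE orders (
    id integer,
    CONSTRAINT orders_pk PRIMARY KEY (id) USING INDEX TABLESPACE fast_space
);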

Andreas



Re: [HACKERS] localhost redux

2004-07-20 Thread Zeugswetter Andreas SB SD

  I don't know if the problem is isolated to just me, but I wanted to 
  suggest that we use a parameter for that, which can be configured in the 
  postgresql.conf, with a default value if it's not set, set to 
  localhost.
 
 I think you should first trace down what the problem really is --- is
 your system just misconfigured or is there some fundamental issue that
 we really ought to answer to?

The trouble on AIX is that getaddrinfo only does a nameservice lookup,
not /etc/hosts. So if your nameservice does not have localhost ...
The same goes for `hostname`, if the nameservice does not resolve `hostname` ...

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-20 Thread Zeugswetter Andreas SB SD

  Hang on, are you supposed to MOVE or COPY away WAL segments?
 
 Copy.  pg will delete them once they are archived.

Copy. pg will recycle them once they are archived.

Andreas



Re: [HACKERS] localhost redux

2004-07-20 Thread Zeugswetter Andreas SB SD

 If localhost isn't being resolved correctly are you seeing error 
 messages like this on the server log (from pqcomm.c)?
 
    could not translate host name "%s", service "%s" to address: %s

After poking around, I found the following:
The Windows PDC (== name server :-( ) really does not resolve localhost (this might
be common for Windows name servers). I do not have a services entry for 5432,
and don't have a DNS record for this RS6000.

LOG:  could not translate service "5432" to address: Host not found
WARNING:  could not create listen socket for *
LOG:  could not bind socket for statistics collector: Can't assign requested address

Setting a port that is listed in /etc/services does not change anything
(except the port, of course).

I do not have those problems on a machine where DNS lists the machine and localhost.
This machine has a long-obsolete oslevel 4.3.2, so I am not sure it is worth pursuing
the issue. I only wanted to state that it does not work here under certain
circumstances either.

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD

  I'm aiming for the minimum feature set - which means we do need to take
  care over whether that set is insufficient and also to pull any part
  that doesn't stand up to close scrutiny over the next few days.
 
 As you can see, we are still chewing on NT.  What PITR features are
 missing?  I assume because we can't stop file system writes during
 backup that we will need a backup parameter file like I described.  Is
 there anything else that PITR needs?

No, we don't need to stop writes! Not even to split a mirror;
other DBs need that to be able to restore, but we don't.
We only need to tell people to back up pg_control first. The rest was only
intended to enforce
1. that pg_control is the first file backed up
2. that the dba uses a large enough PIT (or xid) for restore

I think the idea of an extra file with the WAL start position was overly
complicated, since all you need is pg_control (plus the WAL end position to enforce 2).

If we don't want to tell people to back up pg_control first, imho the next
best plan would be to add a WAL start parameter (e.g. an xlog name)
to recovery.conf that fixes up pg_control.

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD

 then on restore once all the files are restored move the
 pg_control.backup to its original name.  That gives us the checkpoint
 wal/offset but how do we get the start/stop information.  Is that not
 required?

The checkpoint wal/offset is in pg_control, that is sufficient start 
information. The stop info is only necessary as a safeguard.

 Do we need a checkpoint after the archiving
 starts but before the backup begins?

No.

 Also, when you are in recovery mode, how do you get out of recovery
 mode, meaning if you have a power failure, how do you prevent the system
 from doing another recovery?  Do you remove the recovery.conf file?

pg_control could be updated during rollforward (only if that actually
does a checkpoint). So if pg_control is also the recovery start info, then
we can continue from there if we have a power failure.
For the first release it would imho also be ok to simply start over if
you lose power.

I think the filename 'recovery.conf' is misleading, since it is not a
static configuration file, but a command file for one recovery.
How about 'recovery.command', then 'recovery.inprogress' while it runs, and on
completion it should be renamed to 'recovery.done'?

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD

  Do we need a checkpoint after the archiving
  starts but before the backup begins?
 
  No.
 
 Actually yes.

Sorry, I incorrectly failed to connect 'archiving' with the backed-up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Zeugswetter Andreas SB SD

  Other db's have commands for:
  start/end external backup 

I see that the analogy to external backup was not good, since you are correct
that dba's would expect that to stop all writes, so they can safely split
their mirror or some such. Usually the time from 'start external backup'
until 'end external backup' is expected to be only seconds. I actually think we
do not need this functionality, since in pg you can safely split the mirror any
time you like.

My comment was meant to give dba's a familiar tool. The effect of it
would only have been to create a separate backup of pg_control.
We might as well tell people to always back up pg_control first.

I originally thought you would require restore to specify an xlog id
from which recovery will start. You would search this log for the first 
checkpoint record, create an appropriate pg_control, and start rollforward.

I still think this would be a nice feature, since then all that would be required
for restore is a system backup (that includes pg data files) and the xlogs.

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Zeugswetter Andreas SB SD

Sorry for the stupid question, but how do I get this patch if I do not
receive the patches mails?

The web interface HTML-ifies it, thus making it unusable.

Thanks
Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Zeugswetter Andreas SB SD

 The recovery mechanism doesn't rely upon you knowing 1 or 3. The
 recovery reads pg_control (from the backup) and then attempts to
 de-archive the appropriate xlog segment file and then starts 
 rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or, by chance, no checkpoint happened between backup start and
the pg_control backup).

Other db's have commands for:
start/end external backup

Maybe we should add those two commands that would initially only do 
the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control
end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would 
only help with the restore process.

Restore would then either take the existing pg_control backup, or ask
the dba where rollforward should start and create a corresponding pg_control.
A helper utility could list the possible checkpoints in a given xlog.

Andreas



Re: [HACKERS] Portals and nested transactions

2004-07-14 Thread Zeugswetter Andreas SB SD

My answers:

 Q1: Should Portals successfully created within the failed subxact
 be closed?  Or should they remain open?

No for the protocol level.

I can understand a yes to this one for the SQL level, because it will be
hard to clean up by hand :-( But I like the analogy to hold cursors,
so I would also say no for the SQL level.

Is the pro-yes ACID argument allowed here? I thought ACID is about
data integrity and not flow control, and also deals with main transactions,
not subtransactions.

 Q2: If the subxact changed the state of a pre-existing Portal, should
 that state change roll back?  In particular, can a Close Portal
 operation roll back?

NO for both the SQL and the protocol level.
The analogy is imho that closing a 'hold cursor' is also never rolled back.

 How to do it non-transactionally
 

Sounds like a good plan, but also sounds like a lot of work.

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Zeugswetter Andreas SB SD

 The starting a new timeline thought works for xlogs, but not for clogs.
 No matter how far you go into the future, there is a small (yet
 vanishing) possibility that there is a yet undiscovered committed
 transaction in the future. (Because transactions are ordered in the clog
 because xids are assigned sequentially at txn start, but not ordered in
 the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage? Otherwise the same problem
would arise at xid wraparound, no?

Andreas



Re: [HACKERS] Nested Transactions, Abort All

2004-07-09 Thread Zeugswetter Andreas SB SD

 My proposal would be:
 
 1. Begin main transaction: BEGIN { TRANSACTION | WORK }
 2. Commit main (all) transaction: COMMIT { TRANSACTION | WORK }
 3. Rollback main (all) transaction: ROLLBACK { TRANSACTION }
 
 4. Begin inner transaction: BEGIN NESTED { TRANSACTION | WORK }
 5. Commit inner transaction: COMMIT NESTED { TRANSACTION | WORK }
 6. Rollback inner transaction: ROLLBACK NESTED { TRANSACTION }
 
 
 1, 2 and 3 are not negotiable.  4, 5 and 6 are.

Hmm, 1-3 are at least negotiable for the abbreviated forms 'BEGIN'
and 'END'. I think we could differentiate those.
The standard only has 'BEGIN TRANSACTION' and 'COMMIT [WORK]'
and 'ROLLBACK [WORK]'. I agree that we are not allowed to change
the semantics of those non-abbreviated forms.

How about:
1. Begin main tx: BEGIN WORK | BEGIN TRANSACTION
2. Commit main (all) transaction: COMMIT [ TRANSACTION | WORK ]
3. Rollback main (all) transaction: ROLLBACK [ TRANSACTION | WORK ]

4. BEGIN: starts either a main or a subtransaction (for plsql BEGIN SUB)
5. END: commits a nested tx, maybe also aborts a nested tx that is already
in aborted state (for plsql END SUB)
6. ROLLBACK SUB[TRANSACTION]: rolls the subtransaction back (see the savepoint
sketch below)
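
For comparison, items 4-6 map closely onto the savepoint commands. A sketch
using savepoint syntax (table name invented):

BEGIN;                        -- 1. main transaction
SAVEPOINT s;                  -- like BEGIN SUB (4)
UPDATE t SET x = 1;
RELEASE SAVEPOINT s;          -- like END SUB (5): commits the nested tx
SAVEPOINT s;
UPDATE t SET x = 1/0;         -- suppose this statement fails
ROLLBACK TO SAVEPOINT s;      -- like ROLLBACK SUB (6)
COMMIT;                       -- 2. commits the main transaction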

Andreas



Re: [HACKERS] Nested Transactions, Abort All

2004-07-09 Thread Zeugswetter Andreas SB SD

 As far as implementing only savepoints, look at this:
 
 BEGIN;
   BEGIN;
   INSERT INTO ...;
   COMMIT;
   BEGIN;
   INSERT INTO ...;
   COMMIT;
   BEGIN;
   INSERT INTO ...;
   COMMIT;
 
 With savepoints, it looks pretty strange:
   
 BEGIN;
   SAVEPOINT x1;
   INSERT INTO ...;
   SAVEPOINT x2;

If you meant the same as your nested example, it would be:

BEGIN TRANSACTION;
SAVEPOINT x;
INSERT INTO ...;
SAVEPOINT x;    -- this implicitly commits the previous subtxn x
INSERT INTO ...;
SAVEPOINT x;
INSERT INTO ...;
COMMIT;

Andreas



Re: [HACKERS] Nested Transactions, Abort All

2004-07-09 Thread Zeugswetter Andreas SB SD

 I'd opt for BEGIN as a start of a subtransaction (no need for special 
 semantics in plpgsql), the corresponding END simply changes the 
 transaction context to the parent level.

But 'BEGIN' in plpgsql does not start a [sub]transaction; it starts a
statement block. Are we intending to change that? I think not.

Andreas



Re: [HACKERS] Nested Transactions, Abort All

2004-07-09 Thread Zeugswetter Andreas SB SD

 But 'BEGIN' in plpgsql does not start a [sub]transaction, it starts a 
 statement block. Are we intending to change that ? I think not.
 
   
 
 There are two possibilities:
 Either BEGIN *does* start a subtransaction, or BEGIN does not. I don't 
 see how two nesting level hierarchies in a function should be 
 handleable, i.e. having independent levels of statements blocks and 
 subtransactions.
 
 BEGIN [whatever] suggests that there's also a statement closing that 
 block of [whatever], but it's very legal for subtransactions to have no 
 explicit end; the top level COMMIT does it all.

An 'END SUB' after a 'BEGIN SUB' in plpgsql could be required, and could
mean the start/end of both the block and the subtx. I do not really see a downside.
But it would imho only make sense if the 'END SUB' would commit the sub,
or abort it iff the subtx is in aborted state (see my previous posting).

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-07 Thread Zeugswetter Andreas SB SD

 Well, Tom does seem to have something with regard to StartUpIds. I feel
 it is easier to force a new timeline by adding a very large number to
 the LogId IF, and only if, we have performed an archive recovery. That
 way, we do not change at all the behaviour of the system for people that
 choose not to implement archive_mode.

Imho you should take a close look at StartUpId; I think it is exactly this
large number. Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double-check
pg_control, because otherwise a crash recovery would read the wrong xlogs.
 
 Should we implement timelines?

Yes :-)

Andreas



Re: [HACKERS] Point in Time Recovery

2004-07-06 Thread Zeugswetter Andreas SB SD
 
 - by time - but the time stamp on each xlog record only specifies to the
 second, which could easily be 10 or more commits (we hope)
 
 Should we use a different datatype than time_t for the commit timestamp,
 one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position,
you will probably know it from the backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

 - when we stop, keep reading records until EOF, just don't apply them.
 When we write a checkpoint at end of recovery, the unapplied
 transactions are buried alive, never to return.
 - stop where we stop, then force zeros to EOF, so that no possible
 record remains of previous transactions.
 I'm tempted by the first plan, because it is more straightforward and
 stands much less chance of me introducing 50 weird bugs just before
 close.

But what if you restore because, after that PIT, everything went haywire,
including the log? Did you mean to apply the remaining changes but not
commit those xids? I think it is essential to leave only xlogs around
that allow a subsequent rollforward from the same old full backup.
Or is an instant new full backup required after restore?

Andreas



Re: subtransactions and FETCH behaviour (was Re: [HACKERS] PREPARE and transactions)

2004-07-06 Thread Zeugswetter Andreas SB SD

 Well, the proposal of implementing it like holdable cursors means using
 a Materialize node which, if I understand correctly, means taking the
 whole result set and storing it on memory (or disk).

Would it help to hold the lock for the record at the current cursor position,
iff this record was updated (and subsequently rolled back) by this subtxn,
and release that lock as soon as you fetch the next record?

Andreas



Re: [HACKERS] nested-xacts cursors (was Re: Performance with new nested-xacts code)

2004-07-05 Thread Zeugswetter Andreas SB SD

 begin;
   declare cursor c ...;
   fetch 1 from c; -- returns tuple 1
   begin;
 fetch 1 from c;   -- returns tuple 2
   rollback;
   fetch 1 from c; -- returns tuple 1 again
 
 This is mightly ugly but I think it's the most usable of the options
 seen so far.

Imho the most usable option would be to handle the cursor like a hold cursor.

begin;
   declare cursor c ...;
   fetch 1 from c;  -- returns tuple 1
   begin;
 fetch 1 from c;-- returns tuple 2
   rollback;
   fetch 1 from c;  -- returns tuple 3

For me the reason is that most likely you are not going to roll back
because the fetch did not work or returned something you don't like.
Most likely some subsequent action did not work out, and the next step
will be to correct (or ignore) the problem. You can do that without an
extra fetch, because you still have the values in host variables.

Resetting to tuple 1 imho opens the door to endless loops.
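
For reference, the hold-cursor behaviour the analogy is built on
(sketch; table name invented):

BEGIN;
DECLARE c CURSOR WITH HOLD FOR SELECT * FROM t ORDER BY id;
FETCH 1 FROM c;    -- returns tuple 1
COMMIT;            -- c survives the commit; its position is kept, not reset
FETCH 1 FROM c;    -- returns tuple 2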

Andreas



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-18 Thread Zeugswetter Andreas SB SD

  My only guess is that getaddrinfo in your libc has a bug somehow that is
  corrupting the stack (hance the improper backtrace), then crashing.
 
 It could be libc on AIX, I suppose, but it strikes me as sort of odd
 that nobody else ever sees this.  Unless nobody else is using AIX
 5.1, which is of course possible.

I can confirm that AIX 4.3.2 getaddrinfo is at least a bit *funny*.
getaddrinfo seems to not honour nsorder and only does DNS, even though the manual says:
'Should there be any discrepancies between this description and the POSIX description,
the POSIX description takes precedence.'
The function does return multiple entries; often the first is not the best.

Log is:
LOG:  could not translate service "5432" to address: Host not found
WARNING:  could not create listen socket for *
LOG:  could not bind socket for statistics collector: Can't assign requested address
LOG:  disabling statistics collector for lack of working socket

This area probably needs a fix/workaround on AIX :-(

Andreas



Re: [HACKERS] Tablespaces

2004-06-11 Thread Zeugswetter Andreas SB SD

  With the rule system and two underlying tables one could make it work by 
  hand I think.
 
 The rule system could be used to do this, but there was some discussion of
 using inherited tables to handle it. However neither handles the really hard
 part of detecting queries that use only a part of the table and taking that
 into account in generating the plan.

I think the consensus should be to add smarts to the planner to use
static constraint information to reduce table access.

e.g. if you have a column acol integer with check (acol < 5),
and you have a query with where acol = 10, you could reduce that
to where false. This would help in all sorts of situations, not only
partitioned/inherited tables. I am not sure what the runtime cost of
such an inclusion would be, so maybe it needs smarts to only try it in certain
cases?
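
Spelled out as SQL (sketch; hypothetical table):

CREATE TABLE part_low (acol integer CHECK (acol < 5));

-- The check constraint proves this can return no rows, so the planner
-- could replace the scan with an empty result:
SELECT * FROM part_low WHERE acol = 10;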

Andreas



Re: [HACKERS] Why hash indexes suck

2004-06-07 Thread Zeugswetter Andreas SB SD

 We could safely sort on the hash value, but I'm not sure how effective
 that would be, considering that we're talking about values that already
 hashed into the same bucket --- there's likely not to be very many
 distinct hash values there.

I think we can safely put that on the todo list.
The existing hash algorithm is very good, so I would on the contrary
believe that only a few keys share a hash value per page-sized bucket.
For the equal-keys case it does not matter, since we want all of the rows anyway.
For the equal-hash-value case it would probably be best to sort by ctid.

TODO?: order heap pointers inside hash index pages by hash value and ctid

Andreas



Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Zeugswetter Andreas SB SD

 I think an actually implementable version of this would be:
 
 1. Don't log any index operations at all in WAL.
 
 2. When recovering from WAL, restore all the table contents by WAL
 replay.  (This would of course include the system catalog contents that
 describe the indexes.)  Then sit there and do a global REINDEX to
 rebuild all the indexes.
 
 This would gain a reduction of some percentage in WAL traffic, at the
 cost of a hugely expensive recovery cycle any time you actually needed
 to use the WAL.  I guess this could be attractive to some installations,
 but I'm not sure very many people would want it ...

I think only the global part of it is not really acceptable. If we had a flag
for each index that marks it inconsistent, reindexing only those that are
marked would be great.

Could we log a WAL record that basically only marks an index for deferred reindex
after WAL recovery? During WAL replay all records for this index could be
ignored (this is not a must; because of the post-update page images in WAL,
the index would still stay inconsistent until the reindex, of course).

I think such a reindex step could also be responsible for those non-btree
indexes that don't fully support WAL (gist?).

Andreas



Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-05-27 Thread Zeugswetter Andreas SB SD

 What would be useful for that (as of 7.5) is a dummy ALTER TABLE TYPE,
 viz
 
   ALTER TABLE tab ALTER COLUMN col TYPE same-type-it-already-has

I think we should add special syntax for this purpose, since I (or someone
else later on) would like to see all possible cases of ALTER COLUMN
short-circuited.

The syntax I would see fit is something like:
ALTER TABLE tab [MOVE] TABLESPACE xy;

For the above special case the tablespace would be the same as before.
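
For reference, the no-op rewrite quoted above, spelled out (sketch; names invented):

-- rewrites the whole table even though the type does not change
ALTER TABLE big_table ALTER COLUMN id TYPE integer;   -- id is already integer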

Andreas



Re: [HACKERS] Timezone fun (bugs and a request)

2004-05-25 Thread Zeugswetter Andreas SB SD
LOG:  database system was shut down at 2004-05-25 15:15:44 GMT-12
 
 For comparison, 7.4.1 on the same system says:
 
LOG:  database system was shut down at 2004-05-25 16:03:43 NZST

 Can we keep the zic database convention unchanged but change the display 
 format in the logs to be consistent with the SQL conventions?

I think what was meant is that if it displays GMT+/-something, it should
convert the sign. Is that zic filename exposed anywhere else?

It is common practice that + is east and - is west, no?

Andreas



Re: [HACKERS] Table Spaces

2004-05-18 Thread Zeugswetter Andreas SB SD

 If you run NTFS, it's still possible to use arbitrary links. In the Windows
 world, they are called junctions. Microsoft does not provide a junction tool
 for some reason (perhaps because it's limited to NTFS). A good tool, free
 and with source, can be found here
 http://www.sysinternals.com/ntw2k/source/misc.shtml#junction 
 I use this tool myself. Works like a charm.
 
 We've looked at it before. Apart from anything else I don't think its 
 license is compatible with PostgreSQL's.

The tool was suggested for use as is, not for inclusion in the pg source.
It can be used to, e.g., symlink xlog manually.

If you want something in the pg source, the relevant lines (about 20)
can be copied from MSDN.

Andreas



Re: [HACKERS] Call for 7.5 feature completion

2004-05-17 Thread Zeugswetter Andreas SB SD

 It is too late to think about pushing back another month.  We had this
 discussion already.  June 1 is it.

I thought the outcome of that discussion was June 15?

Can we try to do the 2PC patch now instead of waiting for subtransactions?

Andreas



Re: [HACKERS] relcache refcount

2004-05-14 Thread Zeugswetter Andreas SB SD

 BTW, what are your plans for state saving/reversion for the lock manager
 and buffer manager?  The lock state, in particular, makes these other
 problems look trivial by comparison.

Why can't we keep all locks until main tx end? Locks are not self-conflicting,
are they? So the only reason to free them would be to improve concurrency,
and imho we don't need that. I guess I am just not seeing this correctly.
(I am assuming that a deadlock will still break the whole tx.)

Andreas



Re: [HACKERS] threads stuff/UnixWare

2004-05-13 Thread Zeugswetter Andreas SB SD

 I know, this sucks, but I don't see any other way, other than linking
 *ALL* libpq-using programs (including initdb and friends) with -Kpthread.

How about making a libpq.so (without pthread) and a thread-safe
libpq_r.so?

Andreas



Re: [HACKERS] XLog: how to log?

2004-05-12 Thread Zeugswetter Andreas SB SD

  I think this argument is largely a red herring ... but if it makes you
  feel better, we could change the contents of the commit timestamp to
  be gettimeofday() output (seconds+microseconds) instead of just time()
  output.  That should be precise enough for practical purposes.
 
 I am saying timestamp as used for specifying a recovery location might
 not be unique enough, no?

Maybe the API should allow an xid alone, in addition to a timestamp, for use when
a timestamp alone is not precise enough. That would solve the problem of not
finding the xid. I would stay away from microseconds.

Andreas



Re: [HACKERS] ALTER TABLE TODO items

2004-05-07 Thread Zeugswetter Andreas SB SD

 FireBird: ALTER COLUMN column TYPE type
 DB2:      ALTER COLUMN column SET DATA TYPE type
 Oracle:   MODIFY column type
 MSSQL:    ALTER COLUMN column type constraints
 MySQL:    Both Oracle and MSSQL
 Sap:      MODIFY column type
 
 Spec: Nothing (obvious) on changing column types
 
 MODIFY is horrible. It seems to drop all constraints, defaults, etc that
 are not specified in the second definition. It is essentially a
 replacement of the column.

In Oracle, MODIFY leaves omitted parts unchanged; the syntax is actually
ALTER TABLE table MODIFY (column type default constraint).
I think the parentheses are optional if only one column is modified.

Andreas



Re: [HACKERS] ANALYZE locks pg_listener in EXCLUSIVE for long time?

2004-05-03 Thread Zeugswetter Andreas SB SD

 * Is it really a good idea for database-wide ANALYZE to run as a single
 transaction?  Holding all those locks is a recipe for deadlocks, even
 if they're as inoffensive as AccessShareLocks normally are.

Wasn't one idea behind that change also to keep the planner from creating a plan
from mixed old and new statistics? I guess that could later be accomplished with
begin work; analyze; commit work; (with subtransactions), though.

Andreas



Re: [HACKERS] PITR logging control program

2004-04-30 Thread Zeugswetter Andreas SB SD
 
 Basically it is updating the logs as soon as it receives the
 notifications.  Writing 16 MB of xlogs could take some time.

In my experience with archiving logs, 16 MB is on the contrary way too
small for a single log. The overhead of starting e.g. a tape session
is so high (a few seconds) that you cannot keep up. Once the tape is
streaming it is usually quite fast. So imho it is not really practical to
have logs so small that they can fill in less than 20 seconds.

Andreas



Re: [HACKERS] pre-loading a user table.

2004-04-02 Thread Zeugswetter Andreas SB SD

 Why do you think it useful to preload something during InitPostgres,
 anyway?  Any heavily used table will certainly be present in shared
 buffers already, and even more surely present in kernel buffers.

And if you really want it preloaded, you can issue dummy SELECTs
from a client right after startup. I really think locking a table into
memory is a worthless feature if a good buffer manager is at work.
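
For instance (sketch; table name invented):

SELECT count(*) FROM hot_table;   -- the full scan pulls the table through shared buffers and the kernel cache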

Andreas



Re: [HACKERS] [pgsql-hackers-win32] fsync with sync, and Win32 unlink

2004-03-11 Thread Zeugswetter Andreas SB SD

 Consider either a box with many different postgresql instances, or one
 that run both postgresql and other software. Issuing sync() in that
 situation will cause sync of a lot of data that probably doesn't need
 syncing. 
 But it'd probably be a very good thing on a dedicated server, giving the
 kernel the chance to optimise.

It is not like the sync is done every few seconds! It is currently done
every 5 minutes (I actually think this is too frequent now that we have the
bgwriter; 10-20 min would be sufficient). So imho even on a system that is
heavily used otherwise, the sync will be better.

Andreas



Re: [HACKERS] PITR Functional Design v2 for 7.5

2004-03-10 Thread Zeugswetter Andreas SB SD

 The only way we can support file-level hot backup is in conjunction with
 PITR-style WAL log archiving.  It is okay for the data area dump to be
 inconsistent, so long as your recovery process includes replay of WAL
 starting at some checkpoint before the filesystem dump started, and
 extending to some point after the filesystem dump finished.  Replaying
 WAL will correct the inconsistencies.

And the last checkpoint info resides in pg_control, not in pg_clog, no?
So basically a PITR restore would need to adjust the pg_control file
after the filesystem restore and before starting recovery. Maybe it can take that
info from the oldest available WAL? The OP would only need to ensure
that only those logs that need to be rolled forward are visible (in the
correct directory) to the recovery.

Andreas



Re: [HACKERS] PITR Functional Design v2 for 7.5

2004-03-10 Thread Zeugswetter Andreas SB SD

  To clarify:
  I'd expect a cluster to be workable, if I
  - disable VACUUM until backup completed
  - issue CHECKPOINT
  - backup clog (CHECKPOINT and backup clog are the backup checkpoint)
  - backup all datafiles (which include at least all completed transaction 
  data at checkpoint time)
  and then
  - restore datafiles and clog
  - bring up pgsql.
 
 Why is that a useful approach?  You might as well shut down the
 postmaster and do a cold filesystem backup, because you are depending on
 the data files (including clog) not to change after the checkpoint.  You
 cannot make such an assumption in a running database.

I think there is a misunderstanding here. 

What I think is possible is the following (continuous backup of WAL assumed):
- disable VACUUM
- issue CHECKPOINT C1
- backup all files
- reenable VACUUM

- restore files
- adapt pg_control (checkpoint C1)
- recover WAL until at least end of backup

The db is inconsistent until you have recovered all the WAL (PITR) that accumulated
during the file backup.

I am not sure about clog; isn't clog logged in xlog?

Andreas



Re: [pgsql-hackers-win32] [HACKERS] Tablespaces

2004-03-05 Thread Zeugswetter Andreas SB SD

 First of all, symlinks are a pretty popular feature.  Even Windows
 supports what would be needed.  Second of all, PostgreSQL will still
 run on OSes without symlinks, tablespaces won't be available, but
 PostgreSQL will still run.  Since we are all using PostgreSQL without

My idea for platforms that don't support symlinks would be to simply create
a tblspaceoid directory in place instead of the symlink (maybe throw a warning).
My feeling is that using the same syntax on such platforms is important,
but the actual distribution is not (since they will most likely be small systems).

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it

2004-03-04 Thread Zeugswetter Andreas SB SD

 - Dispose of connection names and replace them with a pointer.

You cannot dispose of the names; you can only add something to also allow pointers.
The names are in the ESQL/C standard.

Andreas



Re: [pgsql-hackers-win32] [HACKERS] Tablespaces

2004-03-03 Thread Zeugswetter Andreas SB SD

   For tablespaces on OS's that don't support it, I think we will have to
   store the path name in the file and read it via the backend.  Somehow we
   should cache those lookups.
  
  My feeling is that we need not support tablespaces on OS's without
  symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction

I think Win2000 or XP would be a reasonable restriction for Win32 PG 
installations that want tablespaces.

Andreas



Re: [HACKERS] Tablespaces

2004-02-27 Thread Zeugswetter Andreas SB SD

  Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
  drop a table space until the directory is empty.

Agreed.

 
 How would it get to be empty?  Are you thinking of some sort of connect
 database to tablespace and disconnect database from tablespace
 commands that would respectively create and delete the per-database
 subdirectory?  That seems moderately reasonable to me.  We could then

I would only allow the drop if the directory contains nothing but empty
db oid directories.

Andreas



Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it required?

2004-02-27 Thread Zeugswetter Andreas SB SD

 I'm not sure I understand you correctly. The SQL standard says you can
 call your statement as this:
 exec sql at CONNECTION select 1;
 
 Here CONNECTION of course is a string, the name of the connection. So,
 yes, we have to maintain that list to make sure we get the right
 connection.

I thought the main problem was an exec sql set connection :hostvar,
where hostvar is a string determined at runtime. Otherwise the above could be
translated at precompile time to something more efficient than a string search.

Oh, I just noticed the above is not standard, but it sure is very useful.

Andreas



Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it

2004-02-27 Thread Zeugswetter Andreas SB SD

 I am asking for CONNECTION being a variable of data type 'connection *' rather 
 than 'const char *'. That would avoid name lookups.
 
 Is that out of spec?

Yes, but the preprocessor could still add an optimization a la 'connection *' for
the hardcoded cases (exec sql set connection 'myconn1'; exec sql at 'myconn1' ...).
It needs to maintain the string list for the non-hardcoded cases, though.

Andreas



Re: [HACKERS] Tablespaces

2004-02-27 Thread Zeugswetter Andreas SB SD

 I do not intend to undertake raw disk tablespaces for 7.5. I'd be
 interested if anyone could provide some real world benchmarking of file
 system vs. raw disk. Postgres benefits a lot from kernel file system cache
 at the moment.

Yes, and don't forget that pg also relies on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

The use of raw disks is usually paired with the use of kernel aio.
The difference is said to be up to 30% on Solaris. I can assert that
it made the difference between a bogged-down system and a much better behaved
DB on Sun here.

My experience with kaio on AIX Informix is that kaio is faster as long as IO
is not the bottleneck (disk 100% busy is the metric to watch, not MB/s), while
for an IO-bound system the Informix builtin IO threads that can be used instead
win. (They obviously do better at grouping, sorting and readahead
than the AIX kernel does for kaio.)

Overall I think the price and complexity are too high, especially since there are
enough platforms where the kernel does a pretty good job at grouping, sorting and
readahead. Additionally, the kernel takes non-PostgreSQL IO into account.

Andreas



Re: [HACKERS] OIDs, CTIDs, updateable cursors and friends

2004-02-19 Thread Zeugswetter Andreas SB SD

 I believe the ODBC driver uses CTID for this sort of problem.  CTID is
 guaranteed to exist and to be fast to access (since it's a physical
 locator).  Against this you have the problem that concurrent updates
 of the record will move it, leaving your CTID invalid.  However, that

IIRC ctid access follows the update chain up to the currently valid
tuple? I thought the only enemy of ctid access was vacuum?

Andreas



Re: [HACKERS] [PATCHES] NO WAIT ...

2004-02-19 Thread Zeugswetter Andreas SB SD

  The question is whether we should have a GUC variable to control no
  waiting on locks or add NO WAIT to specific SQL commands.
  
  Does anyone want to vote _against_ the GUC idea for nowait locking.  (We
  already have two voting for such a variable.)
 
 I vote against. We got bit by both the regex and the autocommit GUC vars
 and this is setting up to cause a similar headache with old code on new
 platforms.

I vote for the GUC. Imho it is not comparable to the autocommit case,
since it does not change the way your appl needs to react (the appl needs to
react to deadlock already).

I personally think a wait period in seconds would be more useful.
Millisecond timeouts tend to be misused with way too low values
in this case, imho.

Andreas



Re: [HACKERS] [PATCHES] NO WAIT ...

2004-02-19 Thread Zeugswetter Andreas SB SD

  I personally think a wait period in seconds would be more useful.
  Milli second timeouts tend to be misused with way too low values
  in this case, imho.
 
 I understand, but GUC lost the vote.  I have updated the TODO list to
 indicate this.  Tatsuo posted a patch to add NO WAIT to the LOCK
 command, so we will see if we can get that into CVS.

Ok, I can see the advantages of that approach too.
Too bad there is no standard for this.

And it is probably really true that statement_timeout solves
the problem of very long (indefinite :-) waits for locks.
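
Side by side, the two tools mentioned here (sketch; LOCK ... NOWAIT as in
Tatsuo's patch, table name invented):

SET statement_timeout = 5000;                    -- cancel any statement, lock waits included, after 5 seconds
LOCK TABLE accounts IN EXCLUSIVE MODE NOWAIT;    -- or: fail immediately if the lock cannot be granted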

Andreas



Re: [HACKERS] Transaction aborts on syntax error.

2004-02-13 Thread Zeugswetter Andreas SB SD

 In both cases, the transaction either commits or rollback occurs. No
 other option is possible at the end of the transaction, but in the first
 style of transaction semantics you get a mid-way decision point. This
 only refers to retryable errors, since errors like access rights
 violations and many other system errors aren't retryable. In the example

You seem to ignore the fact that a lot of errors (and I assume most of
the real-world cases where the appl actually reacts inside a transaction) are
repaired by these applications by doing something else instead.

For example, the application does an insert, gets a duplicate key error, and does
an update instead. Or it does an insert, gets a foreign key constraint violation,
populates the foreign table, and does the insert again. Note that this programming
practice is more efficient than letting the appl check beforehand, if the error cases
are seldom.

It seems to me that leaving all this to the client (which implicitly inserts
savepoints) can never be as efficient as a serverside feature.
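
The client-side version of the duplicate-key case, spelled out with explicit
savepoints (sketch; table name invented):

BEGIN;
SAVEPOINT s;
INSERT INTO items (id, val) VALUES (1, 'x');
-- on a duplicate key error:
ROLLBACK TO SAVEPOINT s;
UPDATE items SET val = 'x' WHERE id = 1;
COMMIT;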

Andreas




Re: [HACKERS] Transaction aborts on syntax error.

2004-02-13 Thread Zeugswetter Andreas SB SD

  It seems to me, that leaving all this to the client (which implicitly
  inserts savepoints) can never be as efficient as a serverside feature.
 
 I think this is an overly narrow view of efficiency.  With client
 control, the client can insert savepoints whereever it needs them,

Yes, but not if the client API does implicit savepoints. So imho if it is not
cheap, we should not start to supply APIs that do them implicitly.

 which might not be for every statement.  Savepoints that you don't
 actually need are going to be a fairly expensive overhead, AFAICS.

Well, with other DBs per-statement rollback is a no-overhead feature,
so this is pg-specific. (In the sense that nothing is done that would not need
to be done anyway, since they all undo the changes.)

Imho the 80% main use case for applications would be for duplicate key errors
to not abort. For interactive psql it would probably be the syntax error.
Maybe something can be done about those special cases to make partial
rollback cheaper for them.

Andreas




Re: [HACKERS] Transaction aborts on syntax error.

2004-02-12 Thread Zeugswetter Andreas SB SD

 Improving on not ideal would be good, and would get even closer to
 full Oracle/SQLServer migration/compatibility. However, since I've never
 looked at that section of code, I couldn't comment on any particular
 approach nor implement such a change, so I'll shut up and be patient.

 Imagine this:
 
   BEGIN WORK;
   LOCK oldtab;
   CREATE TABLE newtab AS SELECT * FROM oldtab;
   DELETE FROM oldtab;
   COMMIT
 
 In this case, you would want the database to abort on a syntax error, right?

Yeah, but in other DBs this is solved by the frontend; e.g. in Informix,
dbaccess has a mode that simply stops execution upon the first error. So I don't
think this is a no-go argument, if we added such a feature to psql.

Imagine your script continuing with an insert into newtab ... after the commit;
wouldn't you actually want that to not run either?
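
For reference, psql's ON_ERROR_STOP variable already goes in this direction,
a stop-on-first-error mode for scripts (sketch):

\set ON_ERROR_STOP on
BEGIN WORK;
LOCK oldtab;
CREATE TABLE newtab AS SELECT * FROM oldtab;
DELETE FROM oldtab;   -- never reached if anything above failed
COMMIT;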

Andreas



Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-05 Thread Zeugswetter Andreas SB SD
 I don't think the bgwriter is going to be able to keep up with I/O bound
 backends, but I do think it can scan and set those booleans fast enough
 for the backends to then perform the writes.

As long as the bgwriter does not do sync writes (which it does not,
since that would need a whole lot of work to be performant), it calls
write(), which returns more or less at once.
So the bottleneck can only be the fsync. Of those you would want
at least one per pg disk in flight in parallel.

But I think it should really be left to the OS when it actually does the IO
for the writes from the bgwriter in between checkpoints.
So imho the target should be to not have much IO outstanding at the checkpoint,
so the fsync is fast enough, even if serial.

Andreas



Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint

2004-02-05 Thread Zeugswetter Andreas SB SD

 People keep saying that the bgwriter mustn't write pages synchronously
 because it'd be bad for performance, but I think that analysis is
 faulty.  Performance of what --- the bgwriter?  Nonsense, the *point*

Imho that depends on the workload. For a normal OLTP workload this is
certainly correct. I do not think it is correct for mass loading
or an otherwise IO-bound db.

 of the bgwriter is to do the slow tasks.  The only argument that has
 any merit is that O_SYNC or immediate fsync will prevent us from having
 multiple writes outstanding and thus reduce the efficiency of disk
 write scheduling.  This is a valid point but there is a limit to how
 many writes we need to have in flight to keep things flowing smoothly.

But that is imho the main point. The difference for modern disks
is 1 MB/s for random 8k writes vs. 20 MB/s for random 256k writes.

Don't get me wrong: I think sync writing would achieve maximum performance,
but you have to try to write physically adjacent 256k chunks, and you need a vague
idea which blocks to write in parallel. And since that is not so easy, I think
we could leave it to the OS.

And as an aside, I think 20-30 minute checkpoint intervals would be sufficient
with a bgwriter.

Andreas

PS: don't most syncers have 60s intervals, not 30?



Re: [pgsql-hackers-win32] [HACKERS] [PATCHES] fork/exec patch

2003-12-18 Thread Zeugswetter Andreas SB SD
How about the typical answer on Windows? Create an invisible window
with an event handler and pass it a Windows message?

Andreas



Re: [HACKERS] 7.4 include file conflict

2003-12-18 Thread Zeugswetter Andreas SB SD

 Does anyone know how Informix, where this file comes from, 
 handles this?

Informix puts those files in $INFORMIXDIR/incl/esql (e.g. /usr/informix/incl/esql),
so imho a /usr/postgres installation could have them somewhere under /usr/postgres.

Andreas


Re: [HACKERS] fsync method checking

2003-12-17 Thread Zeugswetter Andreas SB SD

 Ideally that path isn't taken very often.  But I'm currently having a
 discussion off-list with a CMU student who seems to be seeing a case
 where it happens a lot.  (She reports that both WALWriteLock and
 WALInsertLock are causes of a lot of process blockages, which seems to
 mean that a lot of the WAL I/O is being done with both held, which would
 have to mean that AdvanceXLInsertBuffer is doing the I/O.  
 More when we figure out what's going on exactly...)

I would figure that this happens in a situation where a large transaction
fills one XLInsertBuffer while a lot of WAL buffers are not yet written.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] fsync method checking

2003-12-12 Thread Zeugswetter Andreas SB SD

 Running the attached test program shows on BSD/OS 4.3:
 
   write                  0.000360
   write & fsync          0.001391

I think the write & fsync case pays for the previous write test (same filename).

   write, close & fsync   0.001308
   open o_fsync, write    0.000924

I have tried to modify the program to more closely resemble WAL 
writes (all writes to WAL are 8k), the file is usually already open, 
and test larger (16k) transactions.

[EMAIL PROTECTED]:~ test_sync1
write                     0.000625
write & fsync             0.016748
write & fdatasync         0.006650
write, close & fsync      0.017084
write, close & fdatasync  0.006890
open o_dsync, write       0.015997
open o_dsync, one write   0.007128

For the last line xlog.c would need to be modified, but the measurements
seem to imply that it is only worth it on platforms that have O_DSYNC
but not fdatasync.  
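
For reference, a minimal sketch of the kind of timing loop used (the
real test is in the attached test_sync1.c; the iteration count and file
name here are made up):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define LOOPS 1000                      /* made up */

int main(void)
{
    char   block[8192];                 /* WAL writes are always 8k */
    struct timeval start, stop;
    int    fd = open("sync_test.sketch", O_RDWR | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(block, 'x', sizeof(block));

    gettimeofday(&start, NULL);
    for (int i = 0; i < LOOPS; i++)
    {
        lseek(fd, 0, SEEK_SET);         /* file stays open, like WAL */
        write(fd, block, sizeof(block));
        write(fd, block, sizeof(block)); /* a 16k transaction */
        fdatasync(fd);
    }
    gettimeofday(&stop, NULL);

    printf("write & fdatasync  %f s/txn\n",
           ((stop.tv_sec - start.tv_sec)
            + (stop.tv_usec - start.tv_usec) / 1e6) / LOOPS);
    close(fd);
    unlink("sync_test.sketch");
    return 0;
}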

Andreas


[Attachment: test_sync1.c]

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pg_restore and create FK without verification check

2003-11-26 Thread Zeugswetter Andreas SB SD
  Q2: New situation: Why is it not a good idea to back up the database
  files of a cluster incl. all c_log and x_log (log files last) to get a
  physically hot backup.
  In principle it is the same situation as a server which is crashing
  (not at once but over some time). After restoring, it should do a redo
  and rollback automatically like after a crash. This method (physical
  hot backup) would improve backup and restore times dramatically.
 
 The answer from Robert Treat was:
 
  Essentially I think you're right, it should behave much like a crashing 
  server.  The main reason why people don't recommend it is that (depending on 
  your os setup) there is the potential to lose data that has been committed but 
  not actually written to disk.  Note that you shouldn't get corrupted data 
  from this, but in many cases losing data is just as bad, so we don't recommend 
  it.  If you really want to do this, you should really either shut down the 
  database or get LVM going.

The key issue here is to have a pg_control file to start from, with a 
finished checkpoint from before you started the backup. Then you need to 
ensure that you have all logfiles from that checkpoint until the backup finishes.
The last thing to back up must be the last active x_log.
It would probably be a good idea to not have a vacuum running concurrently :-)

And then you need to do a lot of tests, since nobody else does it yet.
I think this is an issue: since it is such high risk, nobody will 
step up easily and say that it is safe.

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-26 Thread Zeugswetter Andreas SB SD

  There are no such libraries.  I keep hearing ICU, but that is much too
  bloated.
 
 At least it is kind of standard and also something what will be
 maintained for foreseeable future, it also has a compatible license and
 is available on all platforms of interest to postgresql.

And it is used for DB/2 and Informix, so it must be quite feature-complete
for DB-relevant stuff.

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Zeugswetter Andreas SB SD
Have you looked at what is available from 
http://oss.software.ibm.com/icu/ ?

Seems they have a compatible license, but use some C++.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


splitting WAL (was RE: [HACKERS] Providing anonymous mmap as an option of sharing memory)

2003-11-25 Thread Zeugswetter Andreas SB SD

 In case of WAL per database, the operations done on a shared catalog from a 
 backend would need flushing system WAL and database WAL to ensure such 
 transaction commit. Otherwise only flushing database WAL would do.

I don't think that is a good idea. If you want databases separated you should 
install more than one instance. That gives you way more flexibility.

Imho per-database WAL is a deficiency, not a feature.

Andreas

PS: the problem with mmap was that it has no attached-process count

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Sponsoring enterprise features

2003-11-24 Thread Zeugswetter Andreas SB SD

The main needs partitioning is useful for:
- partition elimination for queries (e.g. seq scans only scan relevant partitions)
- deleting/detaching huge parts of a table in seconds
- attaching huge parts to a table in seconds (parts that may have been loaded with
a fast-loading utility (e.g. loading without indexes, prebuilding the indexes,
then attaching the table + prebuilt partitioned indexes)) 
- achieving [heap and index] per-page data locality (for better cache hit rates)
- allowing partial restores (for defective disks) while the rest of the db is still online
- in pg, allowing partial vacuums (only partitions that see changes)

People needing those features usually have data with more than 10-50 GB per 
partition. 

 I'm a little unclear, personally, about what can be accomplished through table 
 partitioning that we can't currently do through partial indexes and inherited 
 tables, especially after Gavin finishes his tablespaces patch

Well, sure, the goal needs to be to make use of what already exists,
but a few things are still missing, e.g.:
 - unique indexes that span the hierarchy (and do not contain the partitioning 
column[s])
 - partition elimination (imho we should use check constraints for that)
 - physical backups :-)
 - tablespaces ?

Note that these would all be useful for non-partitioning 
use cases also.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Background writer process

2003-11-19 Thread Zeugswetter Andreas SB SD

 1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if 
  Without grouping WAL writes that does not fly. If, however, such grouping
  is implemented, it should deliver optimal performance. I don't think flushing 
  WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k 
  to disk with one call takes nearly the same time. The WAL write would need to be 
  done as soon as either 256k fill up or a txn commits.
 
 That means no special treatment for WAL files? If it works, great. There would be 
 a single class of files to take care of w.r.t. the sync issue. Even simpler.

No, WAL needs special handling. Either leave it as is with write + f[data]sync,
or implement O_SYNC|O_DIRECT with grouping of writes (the current O_SYNC 
implementation is only good for small (8kb) transactions).
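
A rough sketch of the grouping idea (not actual xlog.c code; names are
invented, and a real O_DIRECT implementation would additionally need
aligned buffers):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define GROUP_SIZE (256 * 1024)         /* the 256k from above */

static char   walbuf[GROUP_SIZE];
static size_t walfill = 0;
static int    walfd = -1;

/* one large write; with O_SYNC this call is also the sync point */
static void wal_flush(void)
{
    if (walfill > 0 && write(walfd, walbuf, walfill) != (ssize_t) walfill)
        perror("write");
    walfill = 0;
}

/* append a record; flush when 256k fill up or a txn commits */
static void wal_append(const char *rec, size_t len, int is_commit)
{
    if (walfill + len > GROUP_SIZE)
        wal_flush();
    memcpy(walbuf + walfill, rec, len);
    walfill += len;
    if (is_commit)
        wal_flush();
}

int main(void)
{
    char rec[8192] = {0};

    walfd = open("wal.sketch", O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (walfd < 0) { perror("open"); return 1; }
    for (int i = 0; i < 10; i++)
        wal_append(rec, sizeof(rec), i == 9);   /* last record commits */
    close(walfd);
    unlink("wal.sketch");
    return 0;
}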

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Background writer process

2003-11-18 Thread Zeugswetter Andreas SB SD

 If the background writer uses fsync, it can write and allow the buffer
 to be reused and fsync later, while if we use O_SYNC, we have to wait
 for the O_SYNC write to happen before reusing the buffer;  
 that will be slower.

You can forget O_SYNC for datafiles for now. There would simply be too much to
do currently to allow decent performance, like scatter/gather IO, ...
Imho the reasonable target should be to write from all backends but sync (fsync)
from the background writer only. (Tune the OS if it actually waits until the 
pg-invoked sync (every 5 minutes by default).) 

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Background writer process

2003-11-17 Thread Zeugswetter Andreas SB SD
 
 1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if 

Without grouping WAL writes that does not fly. If, however, such grouping
is implemented, it should deliver optimal performance. I don't think flushing 
WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k 
to disk with one call takes nearly the same time. The WAL write would need to be 
done as soon as either 256k fill up or a txn commits.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] 7.4RC2 regression failur and not running stats collector process on Solaris

2003-11-12 Thread Zeugswetter Andreas SB SD

  LOG:  could not bind socket for statistics collector: Cannot assign requested 
  address
 
 Hmm ... that's sure the problem, but what can we do about it? ISTM that
 any non-broken system ought to be able to resolve localhost.  Actually
 it's worse than that: your system resolved localhost and then refused

Are we using an API that only returns nameserver responses and not
/etc/hosts entries? At least on AIX it looks that way.

Andreas

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Zeugswetter Andreas SB SD

 that works well enough to make it uncommon for backends to have to
 write dirty buffers for themselves.  If we can, then doing all the
 writes O_SYNC would not be a problem.

One problem with O_SYNC would be that the OS does not group writes any 
more. So the code would need to either do its own sorting and grouping
(256k) or use AIO, or you won't be able to get the maximum out of the disks.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Zeugswetter Andreas SB SD

  One problem with O_SYNC would be that the OS does not group writes any 
  more. So the code would need to either do its own sorting and grouping
  (256k) or use AIO, or you won't be able to get the maximum out of the disks.
 
 Or just run multiple writer processes, which I believe is Oracle's
 solution.

That does not help, since with O_SYNC the OSes (those I know) do not group those 
writes together. Oracle uses more than one writer to keep more than one 
disk (subsystem) busy and to circumvent other per-process limitations (mainly on 
platforms without AIO). 

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-05 Thread Zeugswetter Andreas SB SD

  The only idea I have come up with is to move all buffer write operations
  into a background writer process, which could easily keep track of
  every file it's written into since the last checkpoint.  
 
 I fear this approach. It seems to limit a lot of design flexibility later. But
 I can't come up with any concrete way it limits things so perhaps that
 instinct is just fud.

A lot of modern disk subsystems can only be saturated with more than one parallel 
IO request. So it would at least need a tunable number of parallel writer processes,
or one writer that uses AIO to dump all outstanding IO requests out at once.
(Optimal would be all at once; in reality they would need to be batched into groups of 
n pages, since most systems have a maximum AIO request queue size, e.g. 8192.)
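
A minimal sketch of such batching with POSIX AIO (lio_listio; link with
-lrt on some systems; BATCH is a made-up stand-in for whatever the real
queue limit allows):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH  64                       /* pages per batch, made up */
#define PAGESZ 8192

int main(void)
{
    static char  pages[BATCH][PAGESZ];
    struct aiocb cbs[BATCH];
    struct aiocb *list[BATCH];
    int fd = open("aio.sketch", O_RDWR | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < BATCH; i++)
    {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = pages[i];
        cbs[i].aio_nbytes = PAGESZ;
        cbs[i].aio_offset = (off_t) i * PAGESZ;
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
    }

    /* submit the whole batch at once; LIO_WAIT blocks until done */
    if (lio_listio(LIO_WAIT, list, BATCH, NULL) != 0)
        perror("lio_listio");

    close(fd);
    unlink("aio.sketch");
    return 0;
}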

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Experimental ARC implementation

2003-11-05 Thread Zeugswetter Andreas SB SD
 My plan is to create another background process very similar to 
 the checkpointer and to let that run forever basically looping over that 
 BufferSync() with a bool telling that it's the bg_writer.

Why not use the checkpointer itself in between checkpoints?
Use a min and a max dirty setting like Informix: start writing
when more than max are dirty, stop when at min. This avoids writing
single pages (which is slow, since they cannot be grouped together
by the OS).

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Experimental ARC implementation

2003-11-05 Thread Zeugswetter Andreas SB SD

  Why not use the checkpointer itself in between checkpoints?
  Use a min and a max dirty setting like Informix: start writing
  when more than max are dirty, stop when at min. This avoids writing
  single pages (which is slow, since they cannot be grouped together
  by the OS).
 
 Current approach is similar ... if I stretch the IO and syncing over the 
 entire 150-300 second checkpoint interval, grouping in 50 pages then 
 sync()+nap, the system purrs pretty nicely and without any peaks.

But how do you handle a write-IO-bound system then? My thought was to 
let the checkpointer write dirty pages in between checkpoints with a min/max,
but still try to do the checkpoint itself as fast as possible. I don't think
stretching the checkpoint is good, since it needs to write hot pages, which the 
in-between IO should avoid doing. The checkpointer would have two tasks
that it handles alternately: checkpoint, or flush the LRU from max down to min.
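
In pseudo-C, the alternation could look like this (a sketch only; all
helper names are hypothetical stand-ins for the real buffer-manager
bookkeeping):

#include <stdio.h>

#define MAX_DIRTY 1000                  /* start flushing above this */
#define MIN_DIRTY  100                  /* stop flushing below this  */

static int dirty = 1500;                /* pretend buffer state */

static int  ndirty(void)         { return dirty; }
static void write_lru_page(void) { dirty--; }
static int  checkpoint_due(void) { static int n; return ++n % 100 == 0; }
static void do_checkpoint(void)  { dirty = 0; /* as fast as possible */ }

int main(void)
{
    for (int iter = 0; iter < 300; iter++)
    {
        if (checkpoint_due())
            do_checkpoint();            /* task 1: checkpoint */
        else if (ndirty() > MAX_DIRTY)
            while (ndirty() > MIN_DIRTY)
                write_lru_page();       /* task 2: LRU flush max -> min */
        /* else: nap until there is something to do */
    }
    printf("dirty buffers left: %d\n", ndirty());
    return 0;
}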

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Zeugswetter Andreas SB SD

 Or...  It seems to me that we have been observing something on the order 
 of 10x-20x slowdown for vacuuming a table.  I think this is WAY 
 overcompensating for the original problems, and would cause its own 
 problem as mentioned above.   Since the granularity of delay seems to be 
 the problem can we do more work between delays? Instead of sleeping 
 after every page (I assume this is what it's doing) perhaps we should 
 sleep every 10 pages,

I also think doing more than one page per sleep is advantageous since
it would still allow the OS to do its readahead optimizations.
I suspect those would fall flat if only one page is fetched per sleep.
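
A minimal sketch of that (delay and page counts made up;
process_one_page() is a stand-in for the real vacuum work):

#include <stdio.h>
#include <unistd.h>

#define PAGES_PER_NAP 10                /* from the suggestion above */
#define NAP_USEC  10000                 /* made up */

static void process_one_page(int pageno) { (void) pageno; }

int main(void)
{
    int npages = 1000;                  /* made up */

    for (int p = 0; p < npages; p++)
    {
        process_one_page(p);
        /* nap only every N pages, so OS readahead still sees
         * multi-page sequential reads between naps */
        if ((p + 1) % PAGES_PER_NAP == 0)
            usleep(NAP_USEC);
    }
    return 0;
}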

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] 2-phase commit

2003-10-10 Thread Zeugswetter Andreas SB SD

I was wondering whether we need to keep WAL online for 2PC,
or whether only something like clog is sufficient.

What if:
1. phase 1 commit must pass the slave xid that will be used for 2nd phase
   (it needs to return some sort of identification anyway)
2. the coordinator must keep a list of slave xid's along with 
   corresponding (commit/rollback) info

Is that not sufficient? Why would WAL be needed in the first place?
This is not replication; the slave has its own WAL anyway.

I also don't buy the lockup argument. If today somebody connects
with psql, starts a transaction, modifies something, and then never commits
or aborts, there is also no built-in automatism that will eventually kill 
it. 2PC will simply need to provide means for the administrator
to roll back/commit an in-doubt transaction manually.

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] 2-phase commit

2003-10-09 Thread Zeugswetter Andreas SB SD

  Why would you spent time on implementing a mechanism whose ultimate
  benefit is supposed to be increasing reliability and performance, when you
  already realize that it will have to lock up at the slightest sight of
  trouble?  There are better mechanisms out there that you can use instead.
 
 If you want cross-server transactions, what other methods are there that
 are more reliable?  It seems network unreliability is going to be a
 problem no matter what method you use.

And unless you have 2-phase (or 3-phase) commit, all other methods are going 
to be worse, since their time window for possible critical failure is
going to be substantially larger. (Extending 2-phase to 3-phase should not be 
too difficult.)

A lot of use cases for 2PC are not about manipulating the same data on more than 
one server (replication), but about different data that needs to be manipulated in an
all-or-nothing transaction. In this scenario it is not about reliability but about 
physically locating data (e.g. in LA vs. New York) where it is needed most often.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] new initdb.c available

2003-10-08 Thread Zeugswetter Andreas SB SD

 . using wb for writing out on Windows is so that we don't 
 get Windows' gratuitous addition of carriage returns. I will document that.

Please use the #define PG_BINARY_W from c.h, which is defined
with the correct mode string for all platforms ("wb" on Windows).

That is how Peter's comment was meant.
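
For illustration, a self-contained sketch (the #define mirrors the idea
in c.h; the exact guard macro there may differ):

#include <stdio.h>

/* on Windows the mode must contain 'b', elsewhere plain "w" is fine */
#ifdef WIN32
#define PG_BINARY_W "wb"
#else
#define PG_BINARY_W "w"
#endif

int main(void)
{
    FILE *fp = fopen("out.sketch", PG_BINARY_W);

    if (fp == NULL) { perror("fopen"); return 1; }
    /* with "wb", \n is not expanded to \r\n on Windows */
    fputs("one line\n", fp);
    fclose(fp);
    return 0;
}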

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] 2-phase commit

2003-09-29 Thread Zeugswetter Andreas SB SD

The simplest scenario (though there could be variations) is
  
[At participant(master)'s side]
  Because the commit operation is done, does nothing.
  
[At coordinator(slave)'s side]
   1) After a while
   2) re-establish the communication path between the
  participant(master)'s TM.
   3) resend the commit request to the participant's TM.
  1)2)3) would be repeated until the coordinator receives
  the commit ok message from the participant.
  
   [ scratches head ] I think you are using the terms master and slave
   oppositely than I would.
  
  Oops my mistake, sorry.
  But is it the 2-phase commit protocol in the first place ?
 
 That is, in your example below
 
  Example:
 
 Master          Slave
 ------          -----
 commit ready -->

This is the commit for phase 1. This commit is allowed to return all 
sorts of errors, like violated deferred checks, running out of disk space, ...

              <-- OK
 commit done  --> XX

This is the commit for phase 2; the slave *must* answer with success
in all but hardware failure cases. (Note that the master could 
instead send rollback, e.g. because some other slave aborted.)

 is the commit done message needed ?

So, yes this is needed.
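
To make the two phases concrete, a minimal coordinator-side sketch
(send_to_slave() is a hypothetical stand-in for the real network
transport; here every slave simply answers OK):

#include <stdio.h>

#define NSLAVES 2

typedef enum { PREPARE, COMMIT, ROLLBACK } Msg;

static int send_to_slave(int slave, Msg m)
{
    (void) slave; (void) m;
    return 1;                           /* 1 = OK */
}

int main(void)
{
    /* phase 1: "commit ready" -- any error aborts everything */
    for (int s = 0; s < NSLAVES; s++)
        if (!send_to_slave(s, PREPARE))
        {
            for (int t = 0; t < NSLAVES; t++)
                send_to_slave(t, ROLLBACK);
            printf("aborted in phase 1\n");
            return 1;
        }

    /* phase 2: "commit done" -- must succeed barring hardware
     * failure; an unanswered message here leaves an in-doubt txn */
    for (int s = 0; s < NSLAVES; s++)
        while (!send_to_slave(s, COMMIT))
            ;                           /* retry (real code: back off) */

    printf("committed\n");
    return 0;
}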

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] 2-phase commit

2003-09-29 Thread Zeugswetter Andreas SB SD

 I don't think there is any way to handle cases where the master or slave
 just disappears.  The other machine isn't under the server's control, so
 it has no way of knowing. I think we have to allow the administrator
 to set a timeout, or ask to wait indefinitely, and allow them to call an
 external program to record the event or notify administrators.
 Multi-master replication has the same issues.

Needs to wait indefinitely; a timeout is not acceptable since it leads to 
inconsistent data. Human (or monitoring software) intervention is needed
if they can't reach each other in a reasonable time.

I think this needs to be kept dumb. Different sorts of use cases will simply  
need different answers to resolve in-doubt transactions. What is needed is an
interface that allows listing and commit/rollback of in-doubt transactions 
(preferably from a newly started client, or a direct command for the postmaster).

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] 2-phase commit

2003-09-29 Thread Zeugswetter Andreas SB SD

   Master          Slave
   ------          -----
   commit ready -->
                <-- OK
   commit done  --> XX
  
   is the commit done message needed ?
  
  Of course ... how else will the Slave commit?  From my understanding, the
  concept is that the master sends a commit ready to the slave, but the OK
  back is that OK, I'm ready to commit whenever you are, at which point
  the master does its commit and tells the slave to do its ...
 
 Or the slave could reject the request.

At this point only because of a hardware error. In case of network 
problems the commit done either did not reach the slave or the success
answer did not reach the master.

That is what it's all about. Phase 2 is supposed to be low-overhead and very 
fast, to keep the time window for failure (which produces in-doubt 
transactions) as short as possible.

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] 2-phase commit

2003-09-29 Thread Zeugswetter Andreas SB SD

   Or the slave could reject the request.
  
  Huh?  The slave has that option??  In what circumstance?
 
 I thought the slave could reject if someone local already had the row
 locked.

No, not at all. The slave would need to reject the phase 1 commit ready
for this.

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Threads vs Processes

2003-09-26 Thread Zeugswetter Andreas SB SD

 When the address-of operator is applied to a thread-local variable, it 
 is evaluated at run-time and returns the address of the current thread's 
 instance of that variable. An address so obtained may be used by any 
 thread. When a thread terminates, any pointers to thread-local variables 
 in that thread become invalid.

Bummer, I would have thought one advantage of using TLS would surely be memory 
protection? So the only usage of TLS useful for pg seems to be __declspec(thread) 
and __declspec(thread) static (both for stuff that does not need runtime 
preinitialization).

Maybe the techniques of Electric Fence could be used to protect the shmem
at least a little bit.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] More Prelimiary DBT-2 Test Results with PostgreSQL 7.3.4

2003-09-25 Thread Zeugswetter Andreas SB SD

Are those response times in the right unit? 7-10s?
 
  No problem: http://developer.osdl.org/markw/misc/plana.out
 
 Ok, I guess I misunderstood you. These queries are taking 0.5ms - 300ms except
 for the last aggregate query which takes just over 1s.

Yes, but because this is a benchmark he probably has many clients running 
in parallel.

It would be interesting to see the 'explain analyze' from a session that runs 
concurrently with the benchmark.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] AIX 4.2.1 CVS head and SSL

2003-09-15 Thread Zeugswetter Andreas SB SD
 I get the following errors 
 
 
 gmake -C ecpglib all
 gmake[4]: Entering directory
 `/usr/local/postgres/pgsql/src/interfaces/ecpg/ecpglib'
 ../../../../src/backend/port/aix/mkldexport.sh libecpg.a  libecpg.exp
 gcc -O2 -pipe -Wall -Wmissing-prototypes -Wmissing-declarations -Wl,-bnoentry -
 Wl,-H512 -Wl,-bM:SRE -o libecpg.so libecpg.a 
 -L../../../../src/port -lm
 -L/usr/local/ssl/lib  -L../../../../src/port -L/usr/local/ssl/lib 
 -L../pgtypeslib
 -lpgtypes -L../../../../src/interfaces/libpq -lpq
 -Wl,-bI:../../../../src/backend/postgres.imp -Wl,-bE:libecpg.exp

 ld: 0711-317 ERROR: Undefined symbol: .SSL_new
 ld: 0711-317 ERROR: Undefined symbol: .SSL_set_ex_data
 ld: 0711-317 ERROR: Undefined symbol: .SSL_set_fd
 ld: 0711-317 ERROR: Undefined symbol: .SSL_read
 ld: 0711-317 ERROR: Undefined symbol: .SSL_get_error
 ld: 0711-317 ERROR: Undefined symbol: .SSL_write

You can get around that problem by adding the SSL library to the link line
manually (-lssl).

AIX wants all symbols resolved at link time (or at least to be told where to
load them from at runtime with an *.imp file).

All platforms with that requirement will have this problem when linking 
libecpg.so against SSL.
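
For example, the failing link line above would become something like
(untested; just appending -lssl, and -lcrypto may be needed as well):

gcc ... -L../../../../src/interfaces/libpq -lpq -lssl
    -Wl,-bI:../../../../src/backend/postgres.imp -Wl,-bE:libecpg.exp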

Sorry I don't have time to fix it right now.

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Win32 native port

2003-09-12 Thread Zeugswetter Andreas SB SD

  Below is the email that prompted me to add the derived files to
  WIN32_DEV CVS.
 
  However, most people don't want them in there, so I have removed them,
  and updated the web page to recommend the nightly snapshots (which have
  the derived files), and mentioned the tools that will be needed for a
  CVS build:
 
 Here: http://unxutils.sourceforge.net/ are ports of several unix utility
 programs (including bison and flex) for win32.  From my experiences
 compiling the Peer Direct port, this is the easiest way to 
 get started.

While I agree that those utilities are really fine, the trouble with 
this bison is that it is version 1.28, and pg needs a newer bison version.

The bison from MinGW really only seems to work under obscure 
preconditions :-(

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Vote: Adding flex/bison derived files in WIN32_DEV

2003-09-11 Thread Zeugswetter Andreas SB SD

  I'm confused.  Right on the MinGW download page is a link for bison-1.875.
 
 
 Yep, but I had problems with it. The author confirmed that there could be some 
 problems creating processes (for example M4). However if you make it work, 
 I'll be interested to know how. Check the MinGW mailing list for the last few 
 weeks, and see what I'm talking about. By the way, the last tip of the 
 author, reconfigure && make, was not working for me :-(

Download the bin package at 
http://prdownloads.sf.net/mingw/bison-1.875.0-2003.02.10-1.exe?download
and use it. You do not need to compile bison yourself.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] 2-phase commit

2003-09-10 Thread Zeugswetter Andreas SB SD

  From our previous discussion of 2-phase commit, there was concern that
  the failure modes of 2-phase commit were not solvable.  However, I think
  multi-master replication is going to have similar non-solvable failure
  modes, yet people still want multi-master replication.
 
 No.  The real problem with 2PC in my mind is that its failure modes
 occur *after* you have promised commit to one or more parties.  In
 multi-master, if you fail you know it before you have told the client
 his data is committed.

Hmm? The application cannot take the first-phase commit as its commit info. It 
needs to wait for the second-phase commit. The second phase is only finished
when all coservers have reported back. 2PC is synchronous.

The problems with 2PC arise when, after the second-phase commit was sent to all
servers and before all have reported back, one of them becomes unreachable/down ...
(did it receive and perform the 2nd commit or not?). Such a transaction must stay
open until the coserver is reachable again or an administrator commits/aborts it. 

It is multi-master replication that usually has an asynchronous mode for
performance, and there the trouble starts.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Index creation takes for ever

2003-09-08 Thread Zeugswetter Andreas SB SD

  I don't think so, because the patch does nothing to keep the sort
  order once the index is initially created.
 
 As Tom mentioned, we might not want to keep the tid's in order after the
 index is created because he wants the most recent tid's first, so the
 expired ones migrate to the end.

But on average this argument only holds true for unique indexes, no?
Is there any code that stops the heap lookup after the visible tuple is found?
At least in an index with more rows per key you will fetch all the heap tuples after the 
first one anyway to get at the next row. This is better done in heap order, no?

And the bitmap approach will not work for large result sets.

Summa summarum, I would leave the TODO item (maybe add a comment: 
only for non-unique; evaluate performance)

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Correlation in cost_index()

2003-08-14 Thread Zeugswetter Andreas SB SD

 In both cases ANALYZE will calculate correlation 1.0 for column X,
 and something near zero for column Y.  We would like to come out with
 index correlation 1.0 for the left-hand case and something much less
 (but, perhaps, not zero) for the right-hand case.  I don't really see
 a way to do this without actually examining the multi-column ordering
 relationship during ANALYZE.

The only way the second-column correlation will be irrelevant is if
the first column is already (nearly) unique (enough so that the second
column won't scatter fetches enough to fill the buffer before seeing cache hits).
Thus I think when merging correlations you could take nunique into account.

corr = corr_1 * (corr_2 * ( 1 - nunique_1 / nrows))
 
But I think one (new) correlation metric for the whole index (whole key) and the 
data pages would actually be sufficient. This metric could imho always be used 
instead of the per-column correlations to calculate index cost. This holds true 
as long as you walk an index range, and that is what it is all about, no?

???
Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Changing behavior of BEGIN...sleep...do something...COMMIT

2003-04-04 Thread Zeugswetter Andreas SB SD

 Also, per other discussions, we are removing backend autocommit support
 in 7.4.  It was the wrong way to do it.

Somehow I did not see that conclusion being made.
I thought that, at least for JDBC, it is already successfully used?
I think the backend autocommit is useful. Maybe only the 
installation/database/user-wide GUC setting should be deprecated/
disabled, so it is only used via a session SET?

Andreas


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly


Re: [HACKERS] Nested transactions: low level stuff

2003-04-01 Thread Zeugswetter Andreas SB SD

 In fact, I had proposed a simpler UNDO capability that revisited tuples
 and set their XID to a fixed aborted XID to clean up aborted
 subtransactions, but most now like the multiple XID solution.

I think for the implicit subtransactions that we will want 
(with error codes coming), using a different xid for every command 
inside a transaction is not so sexy, no?

Andreas


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Roadmap for FE/BE protocol redesign

2003-03-14 Thread Zeugswetter Andreas SB SD

  for that, we get what exactly?  Fetching one row at a time is
  *guaranteed* to be inefficient.  The correct response if that bothers
  you is to fetch multiple rows at a time, not to make a less robust
  protocol.
 I don't feel strongly either way on this one, but IIRC the SQL standard 
 for cursors only specifies fetching one record at a time (at least that 
 is how MSSQL and DB2 implement it).  Thus portable code is likely to 
 only fetch one record at a time.  The current row description isn't too 
 big, but with the changes being suggested it might become so.

Yes, I think it should be the responsibility of the higher-level interfaces
(ODBC, esql/c, ...) to fetch rows in optimal packets (e.g. 4-16k at a time).

This is what other dbs' interfaces do for you. The first fetch gets a packet 
from the db, subsequent fetches are answered from the client-side cache until 
it is empty, then the next fetch gets the next packet from the db ...

The trick would be to ask for n bytes of row data (maybe a range), which might 
be 1 - n rows (or maybe even only part of one row if it is very wide?).
I think this is one of the things the new protocol should be able to cope with. 
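
A minimal sketch of such a client-side cache (fetch_packet_from_db() is
a hypothetical stand-in for the wire call; packet and result sizes are
made up):

#include <stdio.h>

#define PACKET_ROWS 128                 /* rows per round trip, made up */

/* hypothetical wire call: fills rows[], returns how many it got */
static int fetch_packet_from_db(char rows[][32], int max)
{
    static int left = 300;              /* pretend result set size */
    int n = left < max ? left : max;

    for (int i = 0; i < n; i++)
        snprintf(rows[i], sizeof(rows[i]), "row %d", 300 - left + i);
    left -= n;
    return n;
}

static char cache[PACKET_ROWS][32];
static int  cached = 0, next_row = 0;

/* the application fetches one row at a time; the interface refills
 * its cache in whole packets, as described above */
static const char *fetch_one_row(void)
{
    if (next_row >= cached)
    {
        cached = fetch_packet_from_db(cache, PACKET_ROWS);
        next_row = 0;
        if (cached == 0)
            return NULL;                /* end of result set */
    }
    return cache[next_row++];
}

int main(void)
{
    int n = 0;

    while (fetch_one_row() != NULL)
        n++;
    printf("fetched %d rows\n", n);
    return 0;
}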

Andreas

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

