[HACKERS] Farewell

2003-08-10 Thread Vadim Mikheev



It's time for formal acknowledgement that I'm not in The Project any
more.

I'm not interested in small features/fixes and have no time for big ones.
It was this way for a very long time and I don't see how/when that could
change.

My participation in The Project was one of the 
greatest adventures in my life.
Thanks to everyone!

Good luck on your ways.
And - long live Postgres!!!

Vadim



Re: [HACKERS] WAL replay logic (was Re: [PERFORM] Mount options f

2003-02-18 Thread Vadim Mikheev
So if you do this, do you still need to store that information in
pg_control at all?
 
  Yes: to speed up the recovery process.

 If it's going to slow down the performance of my database when not doing
 recovery (because I have to write two files for every transaction,
 rather than one)

The control file is not updated for every transaction, only on a few
special events like checkpoints.
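
As a toy model of that division of labor (invented names, nothing like the
real pg_control code): ordinary commits touch only the WAL; a checkpoint is
the one event that rewrites the control file, and recovery reads it to find
where redo must start instead of scanning the whole log.

#include <stdint.h>

typedef struct ControlFileSketch
{
    uint64_t checkpoint_lsn;    /* where the last checkpoint record lives */
} ControlFileSketch;

static void
create_checkpoint(ControlFileSketch *cf, uint64_t checkpoint_record_lsn)
{
    /* ... flush dirty buffers, append the checkpoint record to WAL ... */
    cf->checkpoint_lsn = checkpoint_record_lsn;
    /* ... write and fsync the control file: the only update path ... */
}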

Vadim





[HACKERS] Vacation

2002-12-26 Thread Vadim Mikheev
I'll be on vacation from 12/27/02 till 01/20/03.

Vadim






Re: [HACKERS] PostgreSQL-R

2002-12-23 Thread Vadim Mikheev
  It seems that locking tuples via LockTable at Phase 1 is not
  required anymore, right?

 We haven't put those hooks in yet, so the current version is master/slave.

So, you are not going to use any LockTable in Phase 1 on the master right
now, but you still need some LockTable in Phase 3 on the slaves. Are you
going to use the pg lock manager table in Phase 3? Shouldn't ordering in
Phase 3 be implemented using a special LockTable, totally separate from
the pg lock manager? (If it's right that Phase 1 doesn't require the
Phase 3 LockTable at all.)

  Also it seems that we could perform Phases 2 & 3 periodically
  during transaction execution. This would make WS smaller, and
  conflicts between long-running transactions from different sites
  would be resolved faster.

And it would increase commit chances for long-running transactions:
due to async notification of other nodes about changes made by a
transaction, short transactions may have noticeably higher chances
to commit ... and abort conflicting long transactions.

 Seems like a good idea to me, but we won't know for sure until we
 implement the multi-master hooks.

Is it about periodic Phases 2 & 3 or about using Phase 3's LockTable
in Phase 1? The first one can definitely wait, but the second one should
be resolved before merging the pg-r code with the main CVS, imo.

Vadim






Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec

2002-03-15 Thread Vadim Mikheev

 |  Placing a restriction on an application that says it must treat the
values
 |  returned from a sequence as if they might not be committed is absurd.
 |
 | Why? The fact that you are not able to roll back sequences does not
 | necessarily mean that you are not required to perform a commit to ensure
 | permanent storage of changes made to the database.

 I use sequences to generate message identifiers for a simple
 external-to-database message passing system.   I also use
 them for file upload identifiers.  In both cases, if the
 external action (message or file upload) succeeds, I commit;
 otherwise I roll-back.  I assume that the database won't give
 me a duplicate sequence... otherwise I'd have to find some

So can you do SELECT nextval() in a *separate* (committed)
transaction, *before* the external action and the real transaction
where you store information (with the sequence number) about the
external action in the database?

BEGIN;
SELECT nextval('my_seq');  -- 'my_seq' stands in for your sequence
COMMIT;
BEGIN;
-- Do external actions and store info (with the fetched number) in DB --
COMMIT/ROLLBACK;

Is this totally unacceptable? Is it really *required* to call nextval()
in *the same* transaction where you store info in DB? Why?

 other way to get sequences or I'd have duplicate messages
 or non-unique file identifiers.

 With these changes is this assumption no longer valid?  If

1. It's not valid to assume that sequences will not return duplicate
numbers if there was no commit after nextval.
2. It doesn't matter as long as sequence numbers are stored in
   database objects only.
3. But if you're going to use sequence numbers in external objects
you must (pre)fetch those numbers in a separate committed
transaction.

(Can we have this in FAQ?)

 so, this change will break a lot of user programs.

 | And why? Just for the convenience of < 1% of applications which need
 | to use sequences in their own, non-database, external objects?

 I think you may be underestimating the amount of external resources
 which may be associated with a database object.  Regardless, many of the
 database features in PostgreSQL are there for 1% or less of the
 user base...

Please note that I was talking about some *inconvenience*, not about
the *inability* to use sequence numbers externally (it seems my words
were too short). Above is how to do this. And though I agree that it's
not very convenient/handy/cosy to *take care* and fetch numbers in a
separate committed transaction, it's required only in those special
cases, and I think it's better than doing an fsync() per nextval() call,
which would affect other users/applications where storing sequence
numbers outside of the database is not required.

Vadim






Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Vadim Mikheev

 I have committed changes to implement this proposal.  I'm not seeing
 any significant performance difference on pgbench on my single-CPU
 system ... but pgbench is I/O bound anyway on this hardware, so that's
 not very surprising.  I'll be interested to see what other people
 observe.  (Tatsuo, care to rerun that 1000-client test?)

What is your system? CPU, memory, IDE/SCSI, OS?
Scaling factor and # of clients?

BTW1 - shouldn't we rewrite pgbench to use threads instead of
libpq async queries? At least as an option. I'd say that with 1000
clients the current pgbench implementation is very poor.

BTW2 - shouldn't we learn whether there are really portability/performance
issues in using POSIX mutexes (and condition variables) in place of
TAS (and SysV semaphores)?
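
For what it's worth, a minimal sketch of the BTW2 idea (a hypothetical
wrapper, not proposed code): a POSIX mutex plus condition variable lets
waiters sleep and gives the kernel's wakeup ordering, instead of
busy-spinning on TAS. For PostgreSQL's multi-process model the primitives
would have to live in shared memory and be initialized with the
PTHREAD_PROCESS_SHARED attribute.

#include <pthread.h>

typedef struct
{
    pthread_mutex_t mutex;  /* init with pthread_mutex_init(); for
                             * cross-process use, PTHREAD_PROCESS_SHARED */
    pthread_cond_t  cond;
    int             locked;
} PosixLock;

static void
posix_lock_acquire(PosixLock *lk)
{
    pthread_mutex_lock(&lk->mutex);
    while (lk->locked)                  /* sleep instead of spinning */
        pthread_cond_wait(&lk->cond, &lk->mutex);
    lk->locked = 1;
    pthread_mutex_unlock(&lk->mutex);
}

static void
posix_lock_release(PosixLock *lk)
{
    pthread_mutex_lock(&lk->mutex);
    lk->locked = 0;
    pthread_cond_signal(&lk->cond);     /* wake one waiter */
    pthread_mutex_unlock(&lk->mutex);
}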

Vadim






Re: [HACKERS] Status of index location patch

2001-09-15 Thread Vadim Mikheev

 Just wondering what is the status of this patch.  It seems from comments
 that people like the idea.  I have also looked in the archives for other
 people looking for this kind of feature and have found a lot of interest.
 
 If you think it is a good idea for 7.2, let me know what needs to be
 changed and I will work on it this weekend.

Just change the index dir naming as was already discussed.

Vadim






Re: [HACKERS] Status of index location patch

2001-09-15 Thread Vadim Mikheev

 Can you explain how I would get the tblNode for existing database
 index files if it doesn't have the same OID as the database entry in
 pg_database.

Well, keeping in mind the future tablespace implementation, I would
add tblNode to pg_class, and in pg_database I'd have
defaultTblNode and indexTblNode.
If it's too late to do for 7.2 then let's wait till 7.3.

Vadim






RE: [HACKERS] User locks code

2001-08-19 Thread Vadim Mikheev

Well, the ability to lock only unlocked rows in SELECT FOR UPDATE is
useful, of course. But the unique features of user locks are:

1. They don't interfere with normal locks held by a session/transaction.
2. Share lock is available.
3. User can lock *and unlock objects* inside a transaction, which is not
(and will not be) available with locks held by transactions.

They are interesting too, and the proposed implementation will not impact
the lock manager (just an additional 4 bytes in LOCKTAG = same size of
LOCKTAG on machines with 8-byte alignment).
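
A rough sketch of how that LOCKTAG extension might look (a guess assembled
around the struct quoted in the original message below; names and layout
are illustrative, not the actual patch):

typedef unsigned int Oid;       /* stand-in for the real typedef */

typedef struct LOCKTAG_sketch
{
    Oid relId;                  /* table whose access rights gate the lock */
    Oid dbId;
    union
    {
        /* ... existing members: tuple id, transaction id, ... */
        struct
        {
            Oid RelId;
            Oid ObjId;
        } userObjId;            /* proposed addition for user locks */
    } objId;
} LOCKTAG_sketch;

With offnum and lockmethod dropped from LOCKTAG, the message argues this
nets out to only 4 extra bytes (and no size change at all with 8-byte
alignment).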

 An interesting method would be to allow users to simply avoid locked
 rows:

 SELECT * FROM queue FOR UPDATE LIMIT 1 UNLOCKED;

 Unlocked, return immediately, whatever could be used as a keyword to
 avoid rows that are locked (skipping over them).

 For update locks the row of course.  Currently for the above type of
 thing I issue an ORDER BY random() which avoids common rows enough,
 the queue agent dies if queries start taking too long (showing it's
 waiting for other things) and tosses up new copies if it goes a while
 without waiting at all (showing increased load).

 --
 Rod Taylor

 This message represents the official view of the voices in my head

 - Original Message -
 From: Mikheev, Vadim [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Friday, August 17, 2001 2:48 PM
 Subject: [HACKERS] User locks code


  1. Just noted this in contrib/userlock/README.user_locks:
 
   User locks, by Massimo Dal Zotto [EMAIL PROTECTED]
   Copyright (C) 1999, Massimo Dal Zotto [EMAIL PROTECTED]
  
   This software is distributed under the GNU General Public License
   either version 2, or (at your option) any later version.
 
  Well, anyone can put code into contrib with whatever license
  he/she wants, but the user locks package includes interface
  functions in contrib *and* changes in our lock manager, i.e.
  changes in backend code. I wonder if the backend part of the
  package is covered by the same license above? And is it good if yes?
 
  2. Not a good implementation, imho.

  It's too complex (separate lock method table, etc). Much cleaner
  would be to implement this feature the same way as transactions
  wait for another transaction's commit/abort: by locking objects in
  a pseudo table. We could get rid of offnum and lockmethod from
  LOCKTAG and add
 
  struct
  {
      Oid RelId;
      Oid ObjId;
  } userObjId;

  to the objId union of LOCKTAG.
 
  This way a user could lock whatever object he/she wants in a specified
  table, and note that we would be able to use table access rights to
  control whether a user is allowed to lock objects in a table - missed in 1.

  One could object that 1. is good because user locks never wait.
  I argue that never waiting for a lock is as bad as always waiting.
  Someday we'll have time-wait etc. features for the general lock method
  and everybody will be happy -:)
 
  Comments?
 
  Vadim
  P.S. I could add 2. very fast, no matter whether we keep 1. or not.
 







[HACKERS] Re: Strangeness in xid allocation / snapshot setup

2001-07-12 Thread Vadim Mikheev

  1. Tx Old is running.
  2. Tx S reads a new transaction ID in GetSnapshotData() and is swapped
 out before SInval is acquired.
  3. Tx New gets a new transaction ID, makes changes and commits.
  4. Tx Old changes some row R changed by Tx New and commits.
  5. Tx S gets snapshot data and now sees R changed by *both* Tx Old and
 Tx New *but* does not see *other* changes made by Tx New =>
 Tx S reads inconsistent data.

 Hmm, but that doesn't seem to have anything to do with the way that
 GetSnapshotData operates.  If Tx New has an XID >= xmax read by Tx S's
 GetSnapshotData, then Tx New will be considered uncommitted by S no
 matter which order we get the locks in; it hardly matters whether Tx New

You forget about Tx Old! The point is that changes made by Tx Old *over*
Tx New's changes effectively make those Tx New changes *visible* to
Tx S!

And this is not good: Tx New inserts a PK and the corresponding FK and
commits; Tx Old changes some field in the row with that FK and commits -
now Tx S will see the FK row *but not the PK one* (and what if Tx S was a
serializable Tx run by pg_dump...)

The SInval lock prevents Tx Old from committing (xact.c:CommitTransaction())
between points 2 and 4 above, and so Tx Old's changes will not be visible
to Tx S.

 manages to physically commit before we finish building the snapshot for
 S.  On the other side of the coin, if Tx New's XID < xmax for S, then
 *with the GetNewTransactionId change that I want* we can be sure that
 Tx New will be seen running by S when it does get the SInval lock
 (unless New has managed to finish before S gets the lock, in which case
 it's perfectly reasonable for S to treat it as committed or aborted).

And this is how it worked (MyProc->xid was updated while holding
XXXGenLockId) in varsup.c from version 1.21 (Jun 1999) till
version 1.36 (Mar 2001), when you accidentally moved it outside
of the locked code part:

http://www.ca.postgresql.org/cgi/cvsweb.cgi/pgsql/src/backend/access/transam/varsup.c.diff?r1=1.35&r2=1.36
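
In sketch form, the ordering being defended (a simplified toy, not the
varsup.c code; the lock macros are no-op stand-ins):

typedef unsigned int TransactionId;

/* no-op stand-ins so the sketch compiles; really spinlock calls */
#define SpinAcquire(lock)
#define SpinRelease(lock)

static TransactionId nextXid = 1;

static TransactionId
GetNewTransactionId_sketch(TransactionId *myProcXid)
{
    TransactionId xid;

    SpinAcquire(XidGenLockId);
    xid = nextXid++;
    *myProcXid = xid;       /* publish in MyProc->xid BEFORE releasing
                             * the lock, so GetSnapshotData can't miss
                             * a running transaction */
    SpinRelease(XidGenLockId);
    return xid;
}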

 Anyway, it seems to me that the possibility of inconsistent data is
 inherent in the way we handle updated rows in Read Committed mode ---
 you can always get to see a row that was emitted by a transaction you
 don't see the other effects of.

If I correctly understand the meaning of "emitted" then the sentence above
is not correct: the set of rows to be updated can only be shortened by
concurrent transactions. Yes, changes can be made over changes from
concurrent transactions, but only for rows from the original set defined
by the query snapshot, and only if the concurrently updated rows (from
that set) satisfy the query qual => a row must satisfy the snapshot
*and* the query qual => double satisfaction guaranteed -:))
And let's remember that this behaviour is required for the current RI
constraints implementation.

Vadim






[HACKERS] Now it's my turn...

2001-06-30 Thread Vadim Mikheev

Baby girl on Jun 27.

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Vadim Mikheev

 Yes, that is a good description. And old version is only required in the following 
 two cases:
 
 1. the txn that modified this tuple is still open (reader in default committed read)
 2. reader is in serializable transaction isolation and has earlier xtid
 
 Seems overwrite smgr has mainly advantages in terms of speed for operations
 other than rollback.

... And rollback is required for < 5% of transactions ...

Vadim






Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Vadim Mikheev

You mean it is restored in session that is running the transaction ?
  
  Depends on what you mean with restored. It first reads the heap page,
  sees that it needs an older version and thus reads it from the rollback segment.
 
 So are whole pages stored in rollback segments or just the modified data?

This is implementation dependent. Storing whole pages is much easier to do,
but obviously it's better to store just the modified data.

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Hm.  On the other hand, relying on WAL for undo means you cannot drop
 old WAL segments that contain records for any open transaction.  We've
 already seen several complaints that the WAL logs grow unmanageably huge
 when there is a long-running transaction, and I think we'll see a lot
 more.
 
 It would be nicer if we could drop WAL records after a checkpoint or two,
 even in the presence of long-running transactions.  We could do that if
 we were only relying on them for crash recovery and not for UNDO.

As you understand, this is an old, well-known problem in database practice,
described in books. Two ways - either abort too-long-running transactions
or (/and) compact old log segments: fetch and save (to use for undo) the
records of long-running transactions and remove the other records. Neither
way is perfect, but nothing is perfect at all -:)

 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
 make lightweight VACUUM work well.  (I definitely don't like the idea

Sorry, but I'm going to consider background vacuum as a temporary solution
only. As I've already pointed out, the original PG authors finally became
disillusioned with the same approach. What is good about using UNDO for 1.
is the fact that WAL records give you *direct* physical access to the
changes which should be rolled back.

 that after a very long transaction fails and aborts, I'd have to wait
 another very long time for UNDO to do its thing before I could get on
 with my work.  Would much rather have the space reclamation happen in
 background.)

Understandable, but why should other transactions read dirty data again
and again while waiting for background vacuum? I think aborted transactions
should take some responsibility for the mess made by them -:)
And keeping 2. in mind, very long transactions could be continued -:)

 2. SAVEPOINTs would be awfully nice to have, I agree.
 
 3. Reusing xact IDs would be nice, but there's an answer with a lot less
 impact on the system: go to 8-byte xact IDs.  Having to shut down the
 postmaster when you approach the 4Gb transaction mark isn't going to
 impress people who want a 24x7 commitment, anyway.

An extra 8 bytes in the tuple header is not such a tiny thing.

 4. Recycling pg_log would be nice too, but we've already discussed other
 hacks that might allow pg_log to be kept finite without depending on
 UNDO (or requiring postmaster restarts, IIRC).

We did... and didn't get agreement.

 I'm sort of thinking that undoing back to a savepoint is the only real
 usefulness of WAL-based UNDO. Is it practical to preserve the WAL log
 just back to the last savepoint in each xact, not the whole xact?

No, it's not. It's not possible in overwriting systems at all - all
transaction records are required.

 Another thought: do we need WAL UNDO at all to implement savepoints?
 Is there some way we could do them like nested transactions, wherein
 each savepoint-to-savepoint segment is given its own transaction number?
 Committing multiple xact IDs at once might be a little tricky, but it
 seems like a narrow, soluble problem.

Implicit savepoints wouldn't be possible - this is a very convenient
feature I've found in Oracle.
And additional code in tqual.c wouldn't be a good addition.

 Implementing UNDO without creating lots of performance issues looks
 a lot harder.

What *performance* issues?!
The only issue is additional disk requirements.

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

  1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
  make lightweight VACUUM work well.
 
  Sorry, but I'm going to consider background vacuum as temporary solution
  only. As I've already pointed, original PG authors finally became
  disillusioned with the same approach.
 
 How could they become disillusioned with it, when they never tried it?
 I know of no evidence that any version of PG has had backgroundable
 (non-blocking-to-other-transactions) VACUUM, still less within-relation
 space recycling.  They may have become disillusioned with the form of
 VACUUM that they actually had (ie, the same one we've inherited) --- but
 please don't call that the same approach I'm proposing.

Pre-Postgres'95 (original) versions had a vacuum daemon running in the
background. I don't know if that vacuum shrank relations or not
(there was no shrinking in the '95 version); I know that the daemon had
to do some extra work in moving old tuples to archival storage, but
anyway, as you can read in the old papers, under consistent heavy
load the daemon was not able to clean up storage fast enough. And the
reason is obvious - no matter how optimized your daemon will be
(in regard to blocking other transactions etc), it will have to
perform a huge amount of I/O just to find the space available for reclaiming.

 Certainly, doing VACUUM this way is an experiment that may fail, or may
 require further work before it really works well. But I'd appreciate it
 if you wouldn't prejudge the results of the experiment.

Why not, Tom? Why shouldn't I state my opinion?
Last summer your comment about WAL, my experiment at that time, was that
it would save just a few fsyncs. It was your right to make a prejudgment;
what's wrong with my rights? And you appealed to old papers as well, BTW.

  Understandable, but why other transactions should read dirty data again
  and again waiting for background vacuum? I think aborted transaction
  should take some responsibility for mess made by them -:)
 
 They might read it again and again before the failed xact gets around to
 removing the data, too.  You cannot rely on UNDO for correctness; at
 most it can be a speed/space optimization. I see no reason to assume
 that it's a more effective optimization than a background vacuum
 process.

Really?! Once again: WAL records give you the *physical* addresses of
tuples (both heap and index ones!) to be removed, and the size of the log
to read records from is not comparable with the size of the data files.

  Another thought: do we need WAL UNDO at all to implement savepoints?
  Is there some way we could do them like nested transactions, wherein
  each savepoint-to-savepoint segment is given its own transaction number?
 
  Implicit savepoints wouldn't be possible - this is very convenient
  feature I've found in Oracle.
 
 Why not?  Seems to me that establishing implicit savepoints is just a
 user-interface issue; you can do it, or not do it, regardless of the
 underlying mechanism.

Implicit savepoints are set by the server automatically before each
query execution - you wouldn't use transaction IDs for this.

  Implementing UNDO without creating lots of performance issues looks
  a lot harder.
 
  What *performance* issues?!
  The only issue is additional disk requirements.
 
 Not so. UNDO does failed-transaction cleanup work in the interactive
 backends, where it necessarily delays clients who might otherwise be
 issuing their next command.  A VACUUM-based approach does the cleanup
 work in the background. Same work, more or less, but it's not in the
 clients' critical path.

Not the same work but much more, and in the critical paths of all clients.
And - is the overall performance of Oracle or Informix worse than PG's?
It seems delays in clients for rollback don't affect performance so much.
But dirty storage does.

 BTW, UNDO for failed transactions alone will not eliminate the need for
 VACUUM.  Will you also make successful transactions go back and
 physically remove the tuples they deleted?

They can't do this, as you know pretty well. But using WAL to get the TIDs
to be deleted is worth considering, no?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Were you going to use WAL to get free space from old copies too?

An approach worth considering.

 Vadim, I think I am missing something.  You mentioned UNDO would be used
 for these cases and I don't understand the purpose of adding what would
 seem to be a pretty complex capability:

Yeah, we already won the title of most advanced among simple databases -:)
Yes, looking up the list of IDs assigned to a single transaction in tqual.c
is much easier to do than UNDO. Just as a couple of fsyncs is easier than WAL.

  1. Reclaim space allocated by aborted transactions.
 
 Is there really a lot to be saved here vs. old tuples of committed
 transactions?

Are you able to protect COPY FROM from abort/crash?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

  Really?! Once again: WAL records give you *physical* address of tuples
  (both heap and index ones!) to be removed and size of log to read
  records from is not comparable with size of data files.
 
 You sure?  With our current approach of dumping data pages into the WAL
 on first change since checkpoint (and doing so again after each
 checkpoint) it's not too difficult to devise scenarios where the WAL log
 is *larger* than the affected datafiles ... and can't be truncated until
 someone commits.

Yes, but note my words "size of log to read records from" - each log
record has a pointer to the previous record made by the same transaction:
rollback need not read the entire log file to get all the records of a
specific transaction.
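
A sketch of that back-chaining (made-up field names; the real XLOG record
header differs):

#include <stdint.h>

typedef struct XLogRecordSketch
{
    uint64_t prev_lsn;       /* previous record in the log (any xact)   */
    uint64_t xact_prev_lsn;  /* previous record of the SAME transaction */
    uint32_t xid;
    /* ... undo payload: relfilenode, TID, etc. ... */
} XLogRecordSketch;

/* Rollback walks only this transaction's chain: read the record at some
 * lsn, apply its undo action, jump to xact_prev_lsn, stop at 0. The
 * rest of the log is never touched. */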

  Why not?  Seems to me that establishing implicit savepoints is just a
  user-interface issue; you can do it, or not do it, regardless of the
  underlying mechanism.
 
  Implicit savepoints are set by the server automatically before each
  query execution - you wouldn't use transaction IDs for this.
 
 If the user asked you to, I don't see why not.

An example of implicit savepoint usage: skipping a duplicate key insertion.
Would you use transaction IDs when someone wants to insert a few thousand
records?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 If it's an experiment, shouldn't it be done outside of the main source
 tree, with adequate testing in a high load situation, with a patch
 released to the community for further testing/comments, before it is added
 to the source tree?  From reading Vadim's comment above (re:
 pre-Postgres95), this daemonized approach would cause a high I/O load on
 the server in a situation where there are *a lot* of UPDATE/DELETEs
 happening to the database, which should be easily recreatable, no?  Or,
 Vadim, am I misunderstanding?

It probably will not cause more I/O than vacuum does right now.
But unfortunately it will not reduce that I/O. Cleanup work will be spread
out over time and users will not experience long lockouts, but the average
impact on overall system throughput will be the same (or maybe higher).
My point is that we'll need dynamic cleanup anyway, and UNDO is
what should be implemented for dynamic cleanup of aborted changes.
Plus UNDO gives us a natural implementation of savepoints and some
abilities in transaction ID management, which we may use or not
(though 4. - pg_log size management - is a really good thing).

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Seriously, I don't think that my proposed changes need be treated with
 quite that much suspicion.  The only part that is really intrusive is

Agreed. I fight for UNDO, not against background vacuum -:)

 the shared-memory free-heap-space-management change.  But AFAICT that
 will be a necessary component of *any* approach to getting rid of
 VACUUM.  We've been arguing here, in essence, about whether a background
 or on-line approach to finding free space will be more useful; but that
 still leaves you with the question of what you do with the free space
 after you've found it.  Without some kind of shared free space map,
 there's not anything you can do except have the process that found the
 space do tuple moving and file truncation --- ie, VACUUM.  So even if
 I'm quite wrong about the effectiveness of a background VACUUM, the FSM
 code will still be needed: an UNDO-style approach is also going to need
 an FSM to do anything with the free space it finds.  It's equally clear

Unfortunately, I think that we'll need an on-disk FSM, and the FSM is
actually the most complex thing to do in the space reclamation project.

 Besides which, Vadim has already said that he won't have time to do
 anything about space reclamation before 7.2.  So even if background
 vacuum does end up getting superseded by something better, we're going
 to need it for a release or two ...

Yes.

Vadim






Re: [HACKERS] 7.1 startup recovery failure

2001-04-26 Thread Vadim Mikheev
 There's a report of startup recovery failure in Japan.
 Redo done but ...
 Unfortunately I have no time today.

Please ask them to start up with wal_debug = 1...

Vadim





Re: [HACKERS] Stuck spins in current

2001-03-21 Thread Vadim Mikheev

  BTW, I've got ~320tps with 50 clients inserting (int4, text[1-256])
  records into 50 tables (-B 16384, wal_buffers = 256) on Ultra10
  with 512Mb RAM, IDE (clients run on the same host as server).
 
 Not bad.  What were you getting before these recent changes?

As I already reported - with O_DSYNC this test shows 30% better
performance than with fsync.

(BTW, it seems in all my tests I was using the -O0 flag...)

Vadim






[HACKERS] BufferSync() & FlushRelationBuffers() conflict

2001-03-21 Thread Vadim Mikheev

Just committed changes in bufmgr.c.
Regress tests passed, but more specific tests are needed,
as usual. Description as in CVS:

 Check bufHdr->cntxDirty and call StartBufferIO in BufferSync()
 *before* acquiring shlock on buffer context. This way we should be
 protected against conflicts with FlushRelationBuffers. 
 (Seems we never do excl lock and then StartBufferIO for the same
 buffer, so there should be no deadlock here, - but we'd better
 check this very soon).

Vadim






Re: [HACKERS] BufferSync() & FlushRelationBuffers() conflict

2001-03-21 Thread Vadim Mikheev

  Tom, since you appear to be able to recreate the bug, can you comment on
  this, as to whether we are okay now?
 
 Sorry for the delay --- I was down in Norfolk all day, and am just now
 catching up on email.  I will pull Vadim's update and run the test some
 more.  However, last night I only saw the failure once in about an
 hour's worth of testing, so it's not that easy to reproduce anyway...

I saw ~ 10 failures with -B 32 in ~ 3 minutes of testing. With the old
code, of course -:)

Vadim






Re: [HACKERS] Stuck spins in current

2001-03-17 Thread Vadim Mikheev

 At this point I must humbly say "yes, you told me so", because if I

No, I didn't - I must humbly say that I didn't foresee this deadlock,
so "I didn't tell you so" -:)

Anyway, the deadlocks in my tests are strongly correlated with new log
file creation - something is probably still wrong...

Vadim






Re: [HACKERS] RE: xlog loose ends, continued

2001-03-13 Thread Vadim Mikheev

  Anyway I like the idea of StartUpID in page headers - this will help
 
 Can you please describe StartUpID for me ?
 Ideal would be a stamp that has the last (smallest open) XID, or something
 else that has more or less timestamp characteristics (without the actual
 need of wallclock) in regard to the WAL.

StartUpID counts database startups and so has timestamp characteristics.
Actually, the idea is to use the SUI in the future to allow reusing XIDs
after startup: seeing an old SUI in data pages we'll know that all
transactions on this page were committed "long ago" (i.e. visible from
the MVCC POV). This requires UNDO, of course.

 This could then also be used to scan all pages for modification since
 last backup, to make incremental backups possible. (note, that incremental
 backup is not WAL backup)

We can scan the log itself to get all pages modified since the last backup,
or whatever point we want - thanks to your idea about data page backups.

Vadim






Re: [HACKERS] RE: xlog loose ends, continued

2001-03-13 Thread Vadim Mikheev

  StartUpID counts database startups and so has timestamp characteristics.
  Actually, the idea is to use the SUI in the future to allow reusing XIDs
  after startup: seeing an old SUI in data pages we'll know that all
  transactions on this page were committed "long ago" (i.e. visible from
  the MVCC POV). This requires UNDO, of course.

 First, thanks for the description, but db startups would only count to
 5-7 per year :-), is that sufficient ? It hardly sounds like anything
 useful to include in a page header.

It will be sufficient as long as the DB does not use all 2^32 XIDs without
a shutdown.
Removing pg_log *segments* for old XIDs is another story.

 What about the xlog id, which is also used for the xlog file name? But I
 still think a xid would be the best candidate.

logid would be OK too; xid is not - we have to shorten xids' lifetime in
the near future.

Vadim






Re: AW: AW: AW: [HACKERS] WAL does not recover gracefully from out-of-disk-space

2001-03-10 Thread Vadim Mikheev

  Ok, I've made changes in xlog.c and run tests:
 
 Could you send me your diffs?

Sorry, Monday only.

Vadim






Re: [HACKERS] Proposed WAL changes

2001-03-08 Thread Vadim Mikheev

What do we debate?
I never said that we shouldn't worry about WAL's current inability to
restart. And WAL already works correctly in the situation of "failing to
write a couple of disk blocks when the system crashes".
My statement in the first place that "WAL can't help in the event of disk
errors" was to remind us that we should think over how much we are going
to guarantee, and by what means, in the event our *base* requirements are
not met (guaranteed -:)). My POV is that two checkpoints increase disk
space requirements for *everyday usage* while buying nearly nothing,
because data consistency cannot be guaranteed anyway.
On the other hand this is the fastest way to implement WAL restart-ability
- *which is the real problem we have to fix*.

Vadim






Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space

2001-03-08 Thread Vadim Mikheev
 Was the following bug already fixed ?

I was going to ask the same question.

I see that seek+write was changed to writes in XLogFileInit
(that was induced by the subject problem, right?), but what about
the problem itself?

  DEBUG:  redo starts at (0, 21075520)
  The Data Base System is starting up
  DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
 ^^^
redo started in seg 1 and shouldn't try to read seg 0...

BTW, were performance tests run after the seek+write -> writes
change? Writes were not obviously faster to me; that's why I'd
used seek+write, but I never tested that area -:(

Vadim





Re: AW: [HACKERS] Proposed WAL changes

2001-03-07 Thread Vadim Mikheev
  I thought the intended way to change a GUC parameter permanently was to
  edit data/postgresql.conf . No ?
  
 
 What I've thought is to implement a new command to
 change archdir under WAL's control.
 If it's different from Vadim's plan I don't object.

Actually, I have no concrete plans for archdir yet - this
one is for the WAL-based BAR we should discuss in the future.
So I don't see why we should remove archdir from pg_control now.

Vadim





Re: AW: AW: AW: AW: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

 Before commit or rollback the xlog is not flushed to disk, thus you can lose
 those xlog entries, but the index page might already be on disk because of
 LRU buffer reuse, no ?

No. A buffer page is written to disk *only after the corresponding records
are flushed to the log* (WAL means Write-Ahead Log - write the log before
writing out data pages).

 Another example would be a btree reorg, like adding a level, that is partway 
 through before a crash.

And this is what I hopefully fixed recently with btree runtime recovery.

Vadim






Re: [HACKERS] Proposed WAL changes

2001-03-07 Thread Vadim Mikheev

 I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - the pgsql-patches archive only goes up to
Feb -:(

 proposed diffs for the WAL code.  These changes:
 
 * Store two past checkpoint locations, not just one, in pg_control.
   On startup, we fall back to the older checkpoint if the newer one
   is unreadable.  Also, a physical copy of the newest checkpoint record

And what do we do if the older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64 ? -:))
And what do we do if pg_control is lost? (We already discussed that we
should read all logs from newest to oldest to find the checkpoint.)
And why keep old log files with the older checkpoint?

   is kept in pg_control for possible use in disaster recovery (ie,
   complete loss of pg_xlog).  Also add a version number for pg_control

Mmmm, how is recovery possible if the log is lost? All that can be done
with the DB in the event of a corrupted/lost log is dumping data from
tables *as is*, without any guarantee of consistency. How could the
checkpoint's content be useful?

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

   itself.  Remove archdir from pg_control; it ought to be a GUC
   parameter, not a special case (not that it's implemented yet anyway).

I would discuss WAL-based BAR management before deciding how to
store/assign archdir. On the other hand, it's easy to add archdir
to pg_control later -:)

 * Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
   on alternate bytes.

Great, if you've found a reliable CRC64 implementation!

 * Fix XLOG record length handling so that it will work at BLCKSZ = 32k.

A case I overlooked -:(
(Though I always considered BLCKSZ > 8K a temp hack -:))

 * Change XID allocation to work more like OID allocation, so that we
   can flush XID alloc info to the log before there is any chance an XID
   will appear in heap files.

I haven't read your postings about this yet.

 * Add documentation and clean up some coding infelicities; move file
   format declarations out to include files where planned contrib
   utilities can get at them.

Thanks for that!
 
 Before committing this stuff, I intend to prepare a contrib utility that
 can be used to reset pg_control and pg_xlog.  This is mainly for
 disaster recovery purposes, but as a side benefit it will allow people

Once again, I would call this "disaster *dump* purposes" -:)
After such an operation the DB shouldn't be used for anything but a dump!

 to update 7.1beta installations to this new code without doing initdb.
 I need to update contrib/pg_controldata, too.

Vadim






Re: AW: AW: AW: AW: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

   Before commit or rollback the xlog is not flushed to disk, thus you can lose
   those xlog entries, but the index page might already be on disk because of
   LRU buffer reuse, no ?
  
  No. A buffer page is written to disk *only after the corresponding
  records are flushed to the log* (WAL means Write-Ahead Log - write the
  log before writing out data pages).
 
 You mean, that for each dirty buffer that is reused, the reusing backend fsyncs
 the xlog before writing the buffer to disk ?

In short - yes.
To be accurate - XLogFlush is called to ensure that the records reflecting
the buffer's modifications are on disk. That's how it works everywhere.
And that's why LRU is not a good policy for the bufmgr anymore (we
discussed this already).

Vadim






Re: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

 Consider the following scenario:
 
 1. A new transaction inserts a tuple.  The tuple is entered into its
 heap file with the new transaction's XID, and an associated WAL log
 entry is made.  Neither one of these are on disk yet --- the heap tuple
 is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
 
 2. Now do a lot of read-only operations, in the same or another backend.
 The WAL log stays where it is, but eventually the shmem disk buffer will
 get flushed to disk so that the buffer can be re-used for some other
 disk page.
 
 3. Assume we now crash.  Now, we have a heap tuple on disk with an XID
 that does not correspond to any XID visible in the on-disk WAL log.

Impossible (with fsync ON -:)).

Seems my description of the core WAL rule was bad, I'm sorry -:(
WAL = Write-*Ahead*-Log = write data pages *only after* the log records
reflecting the data page modifications are *flushed* to disk =>
if a modification was not logged then it's not in the data pages either.

No matter when the bufmgr writes a data buffer (at commit time or to
re-use it), the bufmgr first ensures that the buffer's modifications are
logged.
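
In toy form (hypothetical helpers; the real XLogFlush/bufmgr interfaces
take more arguments):

#include <stdint.h>

typedef uint64_t Lsn;

static Lsn log_flushed_upto = 0;

/* Toy stand-in for XLogFlush(): force log records up to 'lsn' to disk. */
static void
log_flush(Lsn lsn)
{
    if (lsn > log_flushed_upto)
    {
        /* ... write() + fsync() the log file here ... */
        log_flushed_upto = lsn;
    }
}

/* The write-ahead rule: a dirty page may hit the data file only after
 * every log record that modified it is safely on disk. */
static void
flush_dirty_page(Lsn page_lsn /* LSN of the last record touching the page */)
{
    log_flush(page_lsn);  /* log first ... */
    /* ... only now write the data page itself */
}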

Vadim






Re: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

 The point is to make the allocation of XIDs and OIDs work the same way.
 In particular, if we are forced to reset the XLOG using what's stored in
 pg_control, it would be good if what's stored in pg_control is a value
 beyond the last-used XID/OID, not a value less than the last-used ones.

If we're forced to reset the log (i.e. it's corrupted/lost) then we're
forced to dump, and only dump, data *because they are not consistent*.
So I wouldn't worry about XID/OID/anything - we can only provide the user
with a way to restore data ... *manually*.

If the user really cares about his data he must

U1. Buy good disks for WAL (data may be on not-so-good disks).
U2. Set up a distributed DB if U1 is not enough.

To help the user with the above we must

D1. Avoid bugs in WAL
D2. Implement WAL-based BAR (so U1 will make sense).
D3. Implement a distributed DB.

There will be no D2 & D3 in 7.1, and who knows about D1.
So, manually restoring data is the best we can do for 7.1.
And actually, "manual restoring" is what we had before,
anyway.

Vadim






Re: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

 On third thought --- we could still log the original page contents and
 the modification log record atomically, if what were logged in the xlog
 record were (essentially) the parameters to the operation being logged,
 not its results. That is, make the log entry before you start doing the
 mod work, not after. This might also simplify redo, since redo would be
 no different from the normal case. I'm not sure why Vadim didn't choose
 to do it that way; maybe there's some other fine point I'm missing.

There is one - indices over user-defined data types: the catalog is not
available at recovery time, so, e.g., we can't know how to order keys
of "non-standard" types. (This is also why we have to recover aborted
index split ops at runtime, when the catalog is already available.)

Also, there is no reason why we should log the original page content and
the next modification record separately.

Vadim






Re: [HACKERS] WAL-based allocation of XIDs is insecure

2001-03-07 Thread Vadim Mikheev

  Hm, wasn't it handling non-atomic disk writes, Andreas?
 
 Yes, but for me, that was only one (for me rather minor) issue.
 I still think that the layout of PostgreSQL pages was designed to
 reduce the risk of a (heap) page being inconsistent because it is
 only partly written to an acceptable minimum. If your hw and os can

I believe that I explained why it's not a minor issue (and never was).
E.g. PageRepairFragmentation "compacts" a page exactly like other,
overwriting, DBMSes do, and a partial write of a modified page means
lost page content.

  And for what else "physical log" could be used?
 
 1. create a consistent state if rollforward bails out for some reason
 but log is still readable

What is the difference between the consistent state as it was before the
checkpoint and after it? Why should we log old page images? New
(after-modification) page images are also consistent and can be used to
create a consistent state.

 2. have an easy way to handle index rollforward/abort
 (might need to block some index modifications during checkpoint though)

There are no problems now. A page is either split (new page created/
properly initialized, right sibling updated) or not.

 3. ease the conversion to overwrite smgr

?

 4. ease the creation of BAR to create consistent snapshot without 
 need for log rollforward

Isn't it the same as 1. with "snapshot" == "state"?

  Now, why should we log page as it was *before* modification?
  We would log modification anyway (yet another log record!) and
 
 Oh, so currently you only do either ? I would at least add the
 info which slot was inserted/modified (maybe that is already there (XID)).

Relfilenode + TID are saved, as well as anything else that would be
required to UNDO the operation, in the future.

  would apply it to the page, so the result would be the same as now when
  we log the page after modification - a consistent *modified* page.
 
 Maybe I am so focused on the implementation of one particular db
 that I am not able to see this without prejudice,
 and all is well as is :-)
  ^
I hope so -:)

Vadim






[HACKERS] I'm back

2001-03-06 Thread Vadim Mikheev

Hi!

Snow in New York - I arrived only today.
Reading mail...

Vadim






[HACKERS] Re: Uh, this is *not* a 64-bit CRC ...

2001-03-06 Thread Vadim Mikheev

 This isn't a 64-bit CRC.  It's two independent 32-bit CRCs, one done
 on just the odd-numbered bytes and one on just the even-numbered bytes
 of the datastream.  That's hardly any stronger than a single 32-bit CRC;

I believe that the longer the data, the more chance of getting the same
CRC/hash for different data sets (if data length > CRC/hash length). Or
am I wrong? Having no crc64 implementation (see below) I decided to use
two CRC32s instead of one - it looked better, without any additional cost
(the record header is 8-byte aligned anyway on, mmm, most platforms).

 it's certainly not what I thought we had agreed to implement.

I've asked if anyone can send a crc64 implementation to me and got only
one, from Nathan Myers. Unfortunately, the SWISS-PROT implementation
assumes that long long is 8 bytes - is that portable?
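
For reference, a minimal table-driven CRC-64 sketch (the ECMA-182
polynomial is picked arbitrarily here); using C99 stdint.h types would
sidestep the "is long long 8 bytes?" worry:

#include <stddef.h>
#include <stdint.h>

#define CRC64_POLY UINT64_C(0x42F0E1EBA9EA3693)  /* ECMA-182 */

static uint64_t crc64_table[256];

/* Precompute the byte-at-a-time lookup table (call once). */
static void
crc64_init(void)
{
    for (int i = 0; i < 256; i++)
    {
        uint64_t crc = (uint64_t) i << 56;

        for (int j = 0; j < 8; j++)
            crc = (crc & (UINT64_C(1) << 63)) ? (crc << 1) ^ CRC64_POLY
                                              : (crc << 1);
        crc64_table[i] = crc;
    }
}

/* MSB-first CRC-64 over an arbitrary byte buffer. */
static uint64_t
crc64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t crc = 0;

    while (len-- > 0)
        crc = crc64_table[(crc >> 56) ^ *p++] ^ (crc << 8);
    return crc;
}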

Vadim






[HACKERS] RE: [CORE] WAL RC1 status

2001-03-05 Thread Vadim Mikheev

 I've reported the major problems to the mailing lists
 but gotten almost no feedback about what to do.

I can't comment without access to the code -:(

 commit: 2001-02-26 17:19:57
 0/0059996C: prv 0/00599948; xprv 0/; xid 0;
 RM  0 info 00 len 32
 checkpoint: redo 0/0059996C; undo 0/; sui 29;
 nextxid 18903; nextoid 35195; online
 -- this is the last normal-looking checkpoint record.
 -- Judging from the commit timestamps surrounding prior
 -- checkpoints, checkpoints were happening every five
 -- minutes approximately on the 5-minute mark, so

You can't count on this: the postmaster runs the checkpoint
"maker" 5 minutes *after* the previous checkpoint was created,
not from the moment the "maker" started. And a checkpoint can
take *minutes*.

 -- this one happened about 17:20.
 -- (There really should be a timestamp
 -- in the checkpoint records...)

Agreed.

 commit: 2001-02-26 17:26:02
 ReadRecord: record with zero len at 0/005A4B4C
 -- My dump program is unhappy here because the rest
 -- of the page is zero. Given that there is a
 -- continuation record at the start of the next
 -- page, there certainly should have been record(s)
 -- here.  But it's worse than that: check the commit
 -- timestamps and the xid numbers before and after the
 -- discontinuity.  Did time go backwards here?

Commit timestamps are created *before* the XLogInsert call, which can
suspend a backend for some time (in a multi-user environment). Random
xids are also OK, generally.

 -- Also notice the back-pointers in the first valid
 -- record on the next page; they point not into the
 -- zeroed space, which would suggest a mere failure
 -- to write a buffer after filling it, but into the
 -- middle of one of the valid records on the prior
 -- page. It almost looks like page 5A6000 came from
 -- a completely different run than page 5A4000.
 Unexpected page info flags 0001 at offset 5A6000
 Skipping unexpected continuation record at offset 5A6000
 0/005A6904: prv 0/005A48B4(?); xprv 0/005A48B4; xid 19047;
  ^^  ^^
Same. So TX 19047 really did insert a record at position
0/005A48B4.

 -- What's even nastier (and the immediate cause of
 -- Scott's inability to restart) is that the pg_control
 -- file's checkPoint pointer points to 0/005AF9F0, which
 -- is *not* the location of this checkpoint, but of
 -- the record after it.

Well, well. The checkpoint position is taken from
MyLastRecord - I wonder how this internal variable could
take "invalid" data from a concurrent backend.

Ok, we're leaving Krasnoyarsk in 8 hrs and should
arrive SF Feb 5 ~ 10pm.

Vadim







Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance

2001-02-27 Thread Vadim Mikheev

 So 7.0.3 is twice as fast only with fsync off.

Are there FK updates/deletes in pgbench's tests?
Remember how SELECT FOR UPDATE in FK triggers
affects performance...
Also, 5 clients is a small number.

Vadim
P.S. Sorry for the delays with my replies -
the internet connection is a pain here: it takes
5-10 minutes to read each message -:(((






Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space

2001-02-27 Thread Vadim Mikheev

 Regardless of whether this particular behavior is fixable,
 this brings up something that I think we *must* do before
 7.1 release: create a utility that blows away a corrupted
 logfile to allow the system to restart with whatever is in
 the datafiles.  Otherwise, there is no recovery technique
 for WAL restart failures, short of initdb and restore from
 last backup.  I'd rather be able to get at data of
 questionable up-to-dateness than not have any chance of
 recovery at all.

 I've asked 2 or 3 times how to recover from recovery failure
 but got no answer. We need some recipe for the failure
 before the 7.1 release.

And I answered 2 or 3 times with fixes for each reported
recovery failure -:) (And asked for help with testing...)

It seems to me that "fixing" is the only "answer" you would
get asking the same question of Oracle, Informix or any
other system with a transaction log. Does anybody know how
the "big boys" deal with this issue?

Vadim






Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance

2001-02-23 Thread Vadim Mikheev

   It may be that WAL has changed the rollback
   time-characteristics to worse than pre-wal ?
 
  Nothing changed ... yet. And in future rollbacks
  of read-only transactions will be as fast as now,
  anyway.
 
  What about rollbacks of a bunch of inserts/updates/deletes?

 I remember a scenario where an empty table was used
 by several backends for gathering report data, and
 when the report is done they will rollback to keep
 the table empty.

 Should this kind of usage be replaced in the future by 
 having backend id as a key and then doing delete by that 
 key in the end ?

Isn't that what we have right now?
But I believe that in the future we must remove
modifications made by aborted transactions
immediately, without keeping them till vacuum.
So - yes: rollback of read-write transactions
will take a longer time.

Vadim






Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance

2001-02-23 Thread Vadim Mikheev

 It may be that WAL has changed the rollback
 time-characteristics to worse than pre-wal ?

Nothing changed ... yet. And in the future, rollbacks
of read-only transactions will be as fast as now,
anyway.

  So my guess is that the 7.1 updates (with default
  fsync) are significantly slower than 7.0.3 fsync=off
  now.

Do you update tables with foreign keys?
Did you run the tests in a multi-user or single-user
environment?

Vadim






[HACKERS] Re: Recovery of PGSQL after system crash failing!!!

2001-02-14 Thread Vadim Mikheev

  It removes the need to disable fsync to get best performance! 
 
 -F performance is still better, only the difference is not so big as before.

Well, when "checkpoint seek in logs" will be implemented difference
will be the same - lost consistency.

  Since there is a fundamental recovery problem if the WAL file
  disappears, then perhaps we should have a workaround which can ignore
  the requirement for that file on startup? Or maybe we do already?
  Vadim??
 
 This was discussed, but iirc not yet implemented.

Yes & yes.

  Also, could the "-F" option be disabled now that WAL is enabled? Or is
  there still some reason to encourage/allow folks to use it?

I've used it when testing btree runtime recovery to increase concurrency.

 I use it, since I restore after a system crash (which never happens).
 I think all that is missing in -F mode is probably 2-3 fsyncs
 during checkpoint. One for the xlog, and one for pg_control (maybe also pg_log).
 All other fsyncs are only to not buffer transactions.

Probably we could just force fsync during checkpoint, for the moment.

Thanks to all for help!

Vadim





Re: [HACKERS] Open 7.1 items

2001-02-14 Thread Vadim Mikheev

 As you can see from the current open items list, there isn't much left
 to do for the 7.1 release.  I am going to suggest we remove the LAZY
 VACUUM option at this point.  I know Tom Lane posted an item about the

Well, leaving for vacation tomorrow I have to agree -:(
The LAZY patch will be available a few days after the 7.1 release.

Vadim





Re: [HACKERS] RE: [ADMIN] SOS !!: Porstgress forgot all ! Help !

2001-02-14 Thread Vadim Mikheev

  Oh, your system reached the max transaction ID -:(
 
 That's two reports now of people who have managed to wrap around the XID
 counter.  It doesn't seem that hard to do in a heavily used database.
 
 Does anyone want to take more seriously the stopgap solution I proposed
 for this problem (pghackers archives around 3-Nov-00)?  I believe you
 shot it down that time, but I don't think that ignoring the problem
 for another release cycle is a better answer.

Actually, I believed that you had done this temp solution until I found
out a couple of weeks ago that it's not true. If you do this, please
don't forget about reusing IDs of *committed* transactions and crashes -
this should be handled somehow on recovery.

Vadim





[HACKERS] Leaving for vacation

2001-02-14 Thread Vadim Mikheev

from Feb 15 till Mar 6...
I'll not be able to read mail lists, so
in the event of needs please use
[EMAIL PROTECTED] address.

Regards!

Vadim





Re: [HACKERS] Recovery of PGSQL after system crash failing!!!

2001-02-13 Thread Vadim Mikheev

  #2  0x20dc71 in abort () from /lib/libc.so.6
  #3  0x8080495 in XLogFileOpen ()
 
 Hm.  Evidently it's failing to open the xlog file, but the code is set
 up in such a way that it dies before telling you why :-(  Take a look
 at XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code
 to tell you the path and errno it's failing on before it abort()s.

I don't remember why there is an abort() in XLogFileOpen just before
the appropriate elog(STOP) - I'll remove it in a few moments - but
it's already obvious why the open failed: the file with the checkpoint
record pointed to by pg_control - data/pg_xlog/002F - does not exist.
So, the question is who removed this file - PG or Linux?
Ryan, do you have the postmaster's log from before the crash (where MoveOfflineLogs
reports WAL files to be deleted) and/or some logs from Linux' startup?
And meanwhile I'll take a look around MoveOfflineLogs...
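
(A minimal sketch of the diagnostic Tom asked for - report the path and
errno instead of a bare abort(); the names are hypothetical, this is not
the actual xlog.c code:)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
XLogFileOpenVerbose(const char *path)
{
    int fd = open(path, O_RDWR, 0);

    if (fd < 0)
    {
        /* stand-in for elog(STOP, ...): name the file and the reason */
        fprintf(stderr, "open(\"%s\") failed: %s\n", path, strerror(errno));
        exit(2);
    }
    return fd;
}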

Vadim





Re: [HACKERS] Recovery of PGSQL after system crash failing!!!

2001-02-11 Thread Vadim Mikheev

 DEBUG: starting up
 DEBUG: database system was interrupted at 2001-02-11 04:08:12
 DEBUG: Checkpoint record at (0, 805076492)
 postmaster: reaping dead processes...
 Startup failed - abort
 
 And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get
 the same thing each time I run it. I assume that WAL is for some reason
 failing to restore/recover the database. 
 The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql
 7.1beta4. The system crash occured during the nightly update of the
 databases (from another, internal, non-pgsql, database system). Is there
 anyway to recover the database, or do I need to do a 'rm -rf

Please try to restart with the option wal_debug = 1 so the postmaster log
will be more informative, and send this log to me.

 data; initdb'? A quick response would be greatly appreciated. Thanks.

Please archive PG's data dir - it will probably be useful for finding the bug.

Vadim





Re: [HACKERS] Btree runtime recovery. Stuck spins.

2001-02-09 Thread Vadim Mikheev

  Hm.  It was OK to use spinlocks to control buffer access when the max
  delay was just the time to read or write one disk page.  But it sounds

Actually, a btree split requires 3 simultaneous buffer locks, and after that
_bt_getstackbuf may read *many* parent buffers while holding locks on
2 buffers. AFAIR, things are even worse in hash.

And anyway, there is always a probability that someone else will grab a
just-freed lock while you're waiting out the next 0.01 sec. The problem is that
there is no priority/ordering while waiting for a spinlock.

  like we've pushed the code way past what it was designed to do.  I think
  this needs some careful thought, not just a quick hack like increasing
  the timeout interval.

I fear there is not enough time -:(

 After thinking more about this, simply increasing S_MAX_BUSY is clearly
 NOT a good answer.  If you are under heavy load then processes that are
 spinning are making things worse, not better, because they are sucking
 CPU cycles that would be better spent on the processes that are holding
 the locks.
 
 It would not be very difficult to replace the per-disk-buffer spinlocks
 with regular lockmanager locks.  Advantages:
   * Processes waiting for a buffer lock aren't sucking CPU cycles.
   * Deadlocks will be detected and handled reasonably.  (The more stuff
 that WAL does while holding a buffer lock, the bigger the chances
 of deadlock.  I think this is a significant concern now.)

I disagree. Lmgr needs deadlock detection code because deadlocks
may be caused by *user application* design, and we must not count on
*user application* correctness. But we must not use deadlock detection
code when we are protected from deadlock by *our* design. Well, anyone can
make a mistake and break the order of lock acquisition - we should just fix
those bugs -:)
So, it doesn't matter *how much stuff WAL does while holding buffer
locks* as long as WAL itself doesn't acquire buffer locks.

 Of course the major disadvantage is:
   * the lock setup/teardown overhead is much greater than for a
 spinlock, and the overhead is just wasted when there's no contention.

Exactly.

 A reasonable alternative would be to stick with the spinlock mechanism,
 but use a different locking routine (maybe call it S_SLOW_LOCK) that is
 designed to deal with locks that may be held for a long time.  It would
 use much longer delay intervals than the regular S_LOCK code, and would
 have either a longer time till ultimate timeout, or no timeout at all.
 The main problem with this idea is choosing an appropriate timeout
 behavior.  As I said, I am concerned about the possibility of deadlocks
 in WAL-recovery scenarios, so I am not very happy with the thought of
 no timeout at all.  But it's hard to see what a reasonable timeout would

And I'm unhappy with timeouts -:) They are no solution at all. We should
do the right design instead.

 be if a minute or more isn't enough in your test cases; seems to me that
 that suggests that for very large indexes, you might need a *long* time.
 
 Comments, preferences, better ideas?

For any spins held while doing I/O ops we should have a queue of
waiting backends' PROCs. As I said - some kind of lightweight lock manager.
Just two kinds of locks - shared & exclusive. No structures to find locked
objects. No deadlock detection code. Backends should wait on their
semaphores, without timeouts.

For "true" spins (held for a really short time when accessing control
structures in shmem) we should not sleep 0.01 sec! tv_usec == 1 would be
reasonable - just to yield the CPU. Actually, mutexes would be much better...
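
(To make the idea concrete: a rough sketch of such a lightweight
shared/exclusive lock, modeled here with pthreads instead of PG's PROC
semaphores. A real implementation would also want strict FIFO ordering of
waiters, which this sketch doesn't provide; all names are illustrative:)

#include <pthread.h>
#include <stdbool.h>

typedef struct LightLock
{
    pthread_mutex_t mutex;      /* guards the fields below */
    pthread_cond_t  wakeup;     /* waiters sleep here instead of spinning */
    int             shared;     /* number of shared holders */
    bool            exclusive;  /* true if an exclusive holder exists */
} LightLock;

#define LIGHTLOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

void
LightLockAcquire(LightLock *lock, bool exclusive)
{
    pthread_mutex_lock(&lock->mutex);
    if (exclusive)
        while (lock->exclusive || lock->shared > 0)
            pthread_cond_wait(&lock->wakeup, &lock->mutex);
    else
        while (lock->exclusive)
            pthread_cond_wait(&lock->wakeup, &lock->mutex);

    if (exclusive)
        lock->exclusive = true;
    else
        lock->shared++;
    pthread_mutex_unlock(&lock->mutex);
}

void
LightLockRelease(LightLock *lock)
{
    pthread_mutex_lock(&lock->mutex);
    if (lock->exclusive)
        lock->exclusive = false;
    else
        lock->shared--;
    pthread_cond_broadcast(&lock->wakeup);  /* wake everyone; they recheck */
    pthread_mutex_unlock(&lock->mutex);
}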

Vadim





Re: [HACKERS] Btree runtime recovery. Stuck spins.

2001-02-08 Thread Vadim Mikheev

  Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL?
 
 No. If you have delays exceeding a minute, or that are even a visible
 fraction of a minute, then a spinlock is NOT the correct mechanism to be
 using to wait ... because guess what, it's spinning, and consuming
 processor time to no purpose.  You should be using a lock instead for
 anything that involves more than a trivial amount of delay.

"Amount of delay" depends on system load - something we can't control...

Btree uses spins to lock buffers (as do all other access methods), so
I could use only spins in the new code. And though tree recovery locks buffers
for a longer time than normal insert operations, it's possible to get
"stuck" spins when using concurrent buffer locks *everywhere* under
heavy load (especially with WAL, which requires holding buffer locks
for the duration of logging).

So, probably we will have to use some kind of light locks (without deadlock
detection) for buffers in the future.

Vadim





Re: [HACKERS] WAL Crash during index vacuuming 7.1beta4

2001-02-05 Thread Vadim Mikheev

 during the nightly vacuum pgsql closed and do not start any more.
 Attached the log.
 
 Seems the problem was rebuilding an index.
 Is there a way to force WAL to ignore indexes?

The problem was in redoing tuple movement in the *table*.

 Can I delete it ?

...

 DEBUG:  redo starts at (6, 4144414796)
 NOTICE:  PageAddItem: tried overwrite of used ItemId
^^^
 FATAL 2:  heap_update_redo: failed to add tuple

I think that I've just fixed this problem (we must not check the itemid's
flag in PageAddItem in overwrite mode when the offset number
== maxoff + 1). I hope that Giuseppe will check the new code soon.

Thanks to Giuseppe for help!

Vadim





Re: [HACKERS] Open 7.1 items

2001-01-27 Thread Vadim Mikheev

 Here are the open items for 7.1.  Much shorter:

+ Runtime btree recovery

Vadim





[HACKERS] Out of town these long weekends...

2001-01-13 Thread Vadim Mikheev






Re: [HACKERS] Pg7.1beta3: connect failed: The DB System is starting up.

2001-01-12 Thread Vadim Mikheev

 With Apache Mod Perl, Apache::DBI, stress testing with apache bench (ab -n
 10 -c 4), in the apache error_log I've got:
 
 [Pg7.1beta3 with standard conf files.]

And how many simultaneous connections did you use?

 ..
 [Fri Jan 12 07:48:58 2001] [error] DBI->connect(dbname=mydb) failed: The 
 Data Base System is starting up
 
 
 Also messages: "DB in recovery ...".

Looks like the server crashed and is now in recovery.

Vadim





Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Vadim Mikheev

  No, I thought we agreed disk block CRC was way overkill.  If the CRC on
  the WAL log checks for errors that are not checked anywhere else, then
  fine, but I thought disk CRC would just duplicate the I/O subsystem/disk
  checks.
 
 A disk-block CRC would detect partially written blocks (ie, power drops
 after disk has written M of the N sectors in a block).  The disk's own
 checks will NOT consider this condition a failure.  I'm not convinced
 that WAL will reliably detect it either (Vadim?).  Certainly WAL will

The idea proposed by Andreas about a "physical log" is implemented!
WAL now saves whole data blocks on their first modification
after a checkpoint. This way, on recovery, modified data blocks will be
restored first *as a whole*. Isn't that much better than mere
detection of partial writes?

Only one type of modification isn't covered at the moment -
updates of heap tuples' t_infomask.
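
(A sketch of the rule, with hypothetical names - roughly where the real
logic would live in XLogInsert; this is only an illustration:)

typedef unsigned long XLogRecPtr_;          /* stand-in for XLogRecPtr */

extern XLogRecPtr_ RedoRecPtr_;             /* redo point of the last checkpoint */
extern void XLogInsertRecord_(const void *rec, int reclen);
extern void XLogInsertFullPage_(const char *block, int len);

#define BLCKSZ_ 8192

static void
LogBlockChange_(const char *block, XLogRecPtr_ page_lsn,
                const void *rec, int reclen)
{
    XLogInsertRecord_(rec, reclen);         /* the usual change record */

    if (page_lsn <= RedoRecPtr_)
        /* first change since the checkpoint: back up the whole block,
         * so a partially written page can be restored as a whole */
        XLogInsertFullPage_(block, BLCKSZ_);
}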

 not help for corruption caused by external agents, away from any updates
 that are actually being performed/logged.

What do you mean by "external agents"?

Vadim





[HACKERS] Re: CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Vadim Mikheev

  But CRC is used in WAL records only.
 
 Oh.  I thought we'd agreed that a CRC on each stored disk block would
 be a good idea as well.  I take it you didn't do that.
 
 Do we want to consider doing this (and forcing another initdb)?
 Or shall we say "too late for 7.1"?

I personally never agreed to this. Reasons?

Vadim





Re: [HACKERS] Quite strange crash

2001-01-09 Thread Vadim Mikheev

  Well, it's not a good idea because SIGTERM is used for ABORT + EXIT
  (pg_ctl -m fast stop), but shouldn't ABORT clean up everything?
 
 Er, shouldn't ABORT leave the system in the exact state that it's
 in so that one can get a crashdump/traceback on a wedged process
 without it trying to clean up after itself?

Sorry, I meant "transaction abort"...

Vadim





[HACKERS] New WAL version (CRC data pages backup)

2000-12-28 Thread Vadim Mikheev

Just committed; initdb is required.
*I haven't done any serious tests yet* - simple regression only.
I'll run more tests in the next couple of days.
Please, help with this.

Vadim





Re: [HACKERS] switching txlog file in 7.1beta

2000-12-14 Thread Vadim Mikheev

 when doing txlog switches there seems to be a problem with remembering the 
 correct = active logfile, when the postmaster crashes.
 
 This is one of the problems I tried to show up previously:
 You cannot rely on writes to other files except the txlog itself !!!

Why? You can, if you handle those files specially, like the txlog itself.

 Thus the current way of recording the active txlog seg and position in 
 pg_control is busted, and must be avoided. I would try to not use pg_control
 for this at all, but scan the pg_xlog directory for this purpose.
 
 cusejoua=# update journaleintrag set txt_funktion=trim(txt_funktion);
 FATAL 2:  write(logfile 0 seg 2 off 4612096) failed: No such file or directory
 pqReadData() -- backend closed the channel unexpectedly.
 This probably means the backend terminated abnormally
 before or while processing the request.
 The connection to the server was lost. Attempting reset: Failed.

Can you start up the db with --wal_debug=1 and send me the output?

Vadim





Re: [HACKERS] beta testing version

2000-12-03 Thread Vadim Mikheev

 I totally missed your point here. How is closing the source of ERserver related
 to closing the code of the PostgreSQL DB server? Let me clarify things:
 
 (not based on WAL)
 
 That's wasn't clear from the blurb.
 
 Still, this notion that PG, Inc will start producing closed-source products
 poisons the well.  It strengthens FUD arguments of the "open source can't
 provide enterprise solutions" variety.  "Look, even PostgreSQL, Inc realizes
 that you must follow a close sourced model
 in order to provide tools for the corporate world."


Did you miss Thomas' answer? Wasn't it clear that the goal is to provide
income?

Vadim





Re: [HACKERS] beta testing version

2000-12-02 Thread Vadim Mikheev

 There is risk here.  It isn't so much in the fact that PostgreSQL, Inc
 is doing a couple of modest closed-source things with the code.  After
 all, the PG community has long acknowleged that the BSD license would
 allow others to co-op the code and commercialize it with no obligations.
 
 It is rather sad to see PG, Inc. take the first step in this direction.
 
 How long until the entire code base gets co-opted?

I totally missed your point here. How is closing the source of ERserver related
to closing the code of the PostgreSQL DB server? Let me clarify things:

1. ERserver isn't based on WAL. It will work with any version >= 6.5

2. WAL was partially sponsored by my employer, Sectorbase.com,
not by PG, Inc.

Vadim





Re: [HACKERS] beta testing version

2000-12-01 Thread Vadim Mikheev

   As for replaying logs against a restored snapshot dump... AIUI, a 
   dump records tuples by OID, but the WAL refers to TIDs.  Therefore, 
   the WAL won't work as a re-do log to recover your transactions 
   because the TIDs of the restored tables are all different.   
  
  True for the current way of backing up - i.e. saving data in an "external"
  (sql) format. But there is another way - saving data files in their
  natural (binary) format. WAL records may be applied to
  such a dump, right?
 
 But (AIUI) you can only safely/usefully copy those files when the 
 database is shut down.

No. You can read/save datafiles at any time. But block reads must be
"atomic" - no one should be able to change any part of a block while
we read it. cp & tar are probably not suitable for this, but an internal
BACKUP command could do it.

Restoring from such a backup will be like recovering after pg_ctl -m i stop: all
data blocks are consistent and WAL records may be applied to them.
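
(A sketch of how a hypothetical BACKUP command could copy one block
atomically; every name here is illustrative, standing in for whatever the
real bufmgr routines would be:)

#include <string.h>

#define BLCKSZ_ 8192

typedef int Buffer_;                        /* stand-ins for bufmgr types */

extern Buffer_ ReadBuffer_(unsigned relid, unsigned blkno);
extern void    LockBufferShared_(Buffer_ buf);
extern void    UnlockBuffer_(Buffer_ buf);
extern void    ReleaseBuffer_(Buffer_ buf);
extern char   *BufferGetPage_(Buffer_ buf);

static void
BackupCopyBlock_(unsigned relid, unsigned blkno, char *out)
{
    Buffer_ buf = ReadBuffer_(relid, blkno);

    LockBufferShared_(buf);                 /* no one may change the block now */
    memcpy(out, BufferGetPage_(buf), BLCKSZ_);
    UnlockBuffer_(buf);
    ReleaseBuffer_(buf);
}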

 Many people hope to run PostgreSQL 24x7x365.  With vacuuming, you 
 might just as well shut down afterward; but when that goes away 
 (in 7.2?), when will you get the chance to take your backups?  

The ability to shut down 7.2 will be preserved -:))
But it's not required for backup.

   To get replaying we need an "update log", something that might be
   in 7.2 if somebody does a lot of work.
  
  What did you mean by "update log"?
  Are you sure that WAL is not "update log" ? -:)
 
 No, I'm not sure.  I think it's possible that a new backup utility 
 could be written to make a hot backup which could be restored and 
 then replayed using the current WAL format.  It might be easier to
 add another log which could be replayed against the existing form
 of backups.  That last is what I called the "update log".

Consistent read of data blocks is easier to implement, sure.

 The point is, WAL now does one job superbly: maintain a consistent
 on-disk database image.  Asking it to do something else, such as 
 supporting hot BAR, could interfere with it doing its main job.  
 Of course, only the person who implements hot BAR can say.

There will be no interference, because BAR will not ask WAL to do
anything beyond what it does right now - redoing changes.

Vadim





Re: [HACKERS] Are pg_control contents really variable-length?

2000-11-25 Thread Vadim Mikheev

 In xlog.c, the declaration of struct ControlFileData says:
 
 /*
  * MORE DATA FOLLOWS AT THE END OF THIS STRUCTURE - locations of data
  * dirs
  */
 
 Is this comment accurate?  I don't see any sign in the code of placing
 extra data after the declared structure.  If you're planning to do it
 in future, I think it would be a bad idea.  I'd prefer to see all the

That was my thought but as you see nothing was done for it, so
feel free to change anything you want there.

Vadim






Re: [HACKERS] WAL

2000-11-20 Thread Vadim Mikheev

Now WAL is ON by default. make distclean + initdb are required.

Vadim





Re: [HACKERS] Re: [COMMITTERS] pgsql/src/backend/access/transam (xlog.c)

2000-11-20 Thread Vadim Mikheev

 Nope.  Still fails...

I know, but looks better, eh? -:)

 
 *** ./expected/opr_sanity.out Tue Nov 14 13:32:58 2000
 --- ./results/opr_sanity.out Mon Nov 20 20:27:46 2000
 ***
 *** 482,489 
 (p2.pronargs = 1 AND p1.aggbasetype = 0)));
 oid  | aggname | oid |   proname   
   ---+-+-+-
 !  16998 | max | 768 | int4larger
 !  17012 | min | 769 | int4smaller
   (2 rows)
   
   -- Cross-check finalfn (if present) against its entry in pg_proc.
 --- 482,489 
 (p2.pronargs = 1 AND p1.aggbasetype = 0)));
 oid  | aggname | oid |   proname   
   ---+-+-+-
 !  16997 | max | 768 | int4larger
 !  17011 | min | 769 | int4smaller
   (2 rows)
   
   -- Cross-check finalfn (if present) against its entry in pg_proc.
 
 ==
 
 * [EMAIL PROTECTED] [EMAIL PROTECTED] [001120 20:11]:
Date: Monday, November 20, 2000 @ 21:11:06
  Author: vadim
  
  Update of /home/projects/pgsql/cvsroot/pgsql/src/backend/access/transam
   from hub.org:/home/projects/pgsql/tmp/cvs-serv62721/src/backend/access/transam
  
  Modified Files:
  xlog.c 
  
  -  Log Message  -
  
  Init ShmemVariableCache in BootStrapXLOG()
  (should fix OID bootstraping).
 
 




Re: [HACKERS] Re: [COMMITTERS] pgsql/src/backend/access/transam (xlog.c)

2000-11-20 Thread Vadim Mikheev

  Larry Rosenman [EMAIL PROTECTED] writes:
  Nope.  Still fails...
 
 You should've said that the OIDs are now just off-by-one from where they
 were before, instead of off by several thousand.  That I'm willing to
 accept as an implementation change ;-)  I've updated the expected file.

Actually, the pg_shadow oid for the DBA inserted by initdb is 2 now - I'm fixing
this now...

Vadim





Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)

2000-11-19 Thread Vadim Mikheev

   Ok, so with CHECKPOINTS, we could move the offline log files to
   somewhere else so that we could archive them, in my
   understanding. Now the question is, how could we recover from a disaster
   like losing every table files except log files. Can we do this with
   WAL? If so, how can we do it?
  
  Not currently. WAL based BAR is required. I think there will be no BAR
  in 7.1, but it may be added in 7.1.X (no initdb will be required).
  Anyway BAR implementation is not in my plans. All in your hands, guys -:)
  
  Vadim
 
 Can I ask what BAR is ?

Backup And Restore.

Vadim





Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

2000-11-15 Thread Vadim Mikheev

   Earlier, Vadim was talking about arranging to share fsyncs of the WAL
   log file across transactions (after writing your commit record to the
   log, sleep a few milliseconds to see if anyone else fsyncs before you
   do; if not, issue the fsync yourself).  That would offer less-than-
   one-fsync-per-transaction performance without giving up any 
   guarantees.
   If people feel a compulsion to have a tunable parameter, let 'em tune
   the length of the pre-fsync sleep ...
  
  Already implemented (without ability to tune this parameter - 
  xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
  backend sleeps 1/200 sec before checking/forcing log fsync.
 
 But it returns _completed_ to the client before sleeping, right?

No.
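
(That is, the client hears "COMMIT" only after the log is known to be
flushed. A sketch of the ordering, with hypothetical names - not the
actual xact.c code:)

#include <unistd.h>

typedef unsigned long XLogRecPtr_;

extern unsigned    CommitDelay_;                 /* pre-fsync nap, microseconds */
extern XLogRecPtr_ XLogInsertCommitRecord_(void);
extern XLogRecPtr_ LogFlushedUpTo_(void);
extern void        XLogFlush_(XLogRecPtr_ upto);

static void
RecordTransactionCommit_(void)
{
    XLogRecPtr_ myrec = XLogInsertCommitRecord_();

    usleep(CommitDelay_);               /* hope a concurrent commit fsyncs */

    if (LogFlushedUpTo_() < myrec)      /* nobody did it for us */
        XLogFlush_(myrec);              /* write + fsync up to our record */

    /* only now is "COMMIT" sent back to the client */
}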

Vadim





[HACKERS] WAL

2000-11-11 Thread Vadim Mikheev

Hi!

I'll be in Las Vegas (Comdex) till next week
Wednesday.
I wasn't able to implement redo for sequences but
was going to turn WAL on by default anyway.
Unfortunately, I got a core in the regress:opr_sanity
test (in the FileSeek code), so WAL is still not the
default.

See you!

Vadim





Re: [HACKERS] Transaction ID wraparound: problem and proposed solution

2000-11-05 Thread Vadim Mikheev

 One idea I had from this is actually truncating pg_log at some point if
 we know all the tuples have the special committed xid.  It would prevent
 the file from growing without bounds.

Not truncating, but implementing pg_log as a set of files - we could remove
files for old xids.

 Vadim, can you explain how WAL will make pg_log unnecessary someday?

First, I mentioned only that, having undo, we could remove the old pg_log after
postmaster startup, because only committed changes would be in the data
files and they would be visible to new transactions (small changes in tqual
will be required to take the page's startup id into account), which could reuse xids.
When changing a page for the first time in the current startup, the server would do
exactly what Tom is going to do at vacuum time - just update xmin/xmax to "1" in all
items (or set some flag in t_infomask) - and change the page's startup id to the current one.

I understand that this is not a complete solution to the xid problem; I just wasn't
going to solve it at that time. Now, after Tom's proposal, I see how to reuse xids
without vacuuming (but having undo): we will add XidWrapId (XWI) - an xid wrap
counter - to pages and set it when we change a page. The first time we do this for
a page with an old XWI, we'll mark the old items (to know later that they were changed
by xids with the old XWI). Each time we change a page we can mark old xmin/xmax
with xid <= current xid as committed long ago (based on xact TTL restrictions).

All of the above assumes that there will be no xids from aborted transactions in pages,
so we won't need to look in pg_log to know whether a xid committed or aborted - there
will be only xids from running or committed xactions there.

And we need undo for this.
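
(For illustration only - a toy model of the "first touch in this startup"
freezing described above, using plain structs instead of real page/tuple
headers. The convention that xmax == 0 means "no deleter" is an assumption
of this sketch, and the constant 1 is the "committed long ago" xid from
the text:)

#define FROZEN_XID_ 1                   /* "committed long ago" marker */
#define MAX_ITEMS_  64

typedef struct ToyItem_ { unsigned xmin, xmax; } ToyItem_;
typedef struct ToyPage_
{
    unsigned startup_id;                /* id of the startup that last touched us */
    int      nitems;
    ToyItem_ items[MAX_ITEMS_];
} ToyPage_;

/* With undo, a page holds only committed changes at startup, so on the
 * first touch after a restart every stored xid can be overwritten. */
static void
freeze_on_first_touch_(ToyPage_ *page, unsigned current_startup_id)
{
    if (page->startup_id == current_startup_id)
        return;                         /* already touched in this startup */

    for (int i = 0; i < page->nitems; i++)
    {
        page->items[i].xmin = FROZEN_XID_;
        if (page->items[i].xmax != 0)   /* deleted by a committed xact */
            page->items[i].xmax = FROZEN_XID_;
    }
    page->startup_id = current_startup_id;
}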

Vadim





Re: [HACKERS] Transaction ID wraparound: problem and proposed solution

2000-11-04 Thread Vadim Mikheev

  So, we'll have to abort some long running transaction.
 
 Well, yes, some transaction that continues running while ~ 500 million
 other transactions come and go might give us trouble.  I wasn't really
 planning to worry about that case ;-)

Agreed, I just don't like to rely on assumptions -:)

  Required frequency of *successful* vacuum over *all* tables.
  We would have to remember something in pg_class/pg_database
  and somehow force vacuum over "too-long-unvacuumed-tables"
  *automatically*.
 
 I don't think this is a problem now; in practice you couldn't possibly
 go for half a billion transactions without vacuuming, I'd think.

Why not?
And once again - assumptions are not good for transaction area.

 If your plans to eliminate regular vacuuming become reality, then this
 scheme might become less reliable, but at present I think there's plenty
 of safety margin.

  If undo would be implemented then we could delete pg_log between
  postmaster startups - startup counter is remembered in pages, so
  seeing old startup id in a page we would know that there are only
  long ago committed xactions (ie only visible changes) there
  and avoid xid comparison. But ... there will be no undo in 7.1.
  And I foresee problems with WAL based BAR implementation if we'll
  follow proposed solution: redo restores original xmin/xmax - how
  to "freeze" xids while restoring DB?
 
 So, we might eventually have a better answer from WAL, but not for 7.1.
 I think my idea is reasonably non-invasive and could be removed without
 much trouble once WAL offers a better way.  I'd really like to have some
 answer for 7.1, though.  The sort of numbers John Scott was quoting to
 me for Verizon's paging network throughput make it clear that we aren't
 going to survive at that level with a limit of 4G transactions per
 database reload.  Having to vacuum everything on at least a
 1G-transaction cycle is salable, dump/initdb/reload is not ...

Understandable. And probably we can get BAR too, but it would require a full
backup every WRAPLIMIT/2 (or better, /4) transactions.

Vadim





Re: [HACKERS] Alternative database locations are broken

2000-11-04 Thread Vadim Mikheev

  I think that to handle locations we could symlink catalogs - ln -s
  path_to_database_in_some_location .../base/DatabaseOid
 
  But that's a kludge.  We ought to discourage people from messing with the
  storage internals.
 
 It's not a kluge, it's a perfectly fine implementation.  The only kluge
 here is if people have to reach in and establish such symlinks by hand.
 We want to set up a user interface that hides the implementation.

Agreed. And I don't see problems with handling this at CREATE DATABASE
time: create the database dir in the specified location, create a symlink from the
base dir, and remember the location name in pg_database.datpath.
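
(A sketch of the filesystem side of that, with illustrative paths and
error handling trimmed; `location` is assumed to be an absolute path:)

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int
create_db_in_location(const char *location, unsigned dboid)
{
    char real[1024], link[1024];

    snprintf(real, sizeof(real), "%s/%u", location, dboid);
    snprintf(link, sizeof(link), "base/%u", dboid);

    if (mkdir(real, 0700) < 0)
        return -1;
    if (symlink(real, link) < 0)        /* base/<oid> -> location/<oid> */
        return -1;
    /* ...and record `location` in pg_database.datpath */
    return 0;
}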

Vadim





Re: [HACKERS] WAL status update

2000-11-02 Thread Vadim Mikheev

 "Mikheev, Vadim" [EMAIL PROTECTED] writes:
  I think that at least 1  2 from WAL todo (checkpoints and port to
  machines without TAS) is required before beta.
 
 I'm not sure that you do need to add support for machines without TAS.
 I pointed out a couple months ago that the non-TAS support code hasn't
 even compiled for the past release or three, and proposed ripping it
 all out rather than fixing code that clearly isn't being used.
 No one objected.
 
 I haven't got round to doing the ripping yet, but as far as I know
 there is no reason to expend work on adding more code for the non-TAS
 case.

Oh, that's great! Thanks!
One todo item is gone -:)

 I have a few other things I still need to do before 7.1 beta.  Do you
 think you'll be ready in, say, a week?

yes.

Vadim





Re: [HACKERS] WAL status update

2000-10-29 Thread Vadim Mikheev

  First, as I've already mentioned in my answer to Tom about DROP TABLE, undo
  logic will not be implemented in 7.1 -:( Doable for tables, but for indices we
  would need either compensation records or xmin/cmin in index tuples.
  So, we'll still live with dust from aborted xactions in our tables/indices.

 Does it mean that there would still be inconsistency between
 tables and their indexes ?

Not related. I just meant to say that tuples inserted into tables/indices by
aborted transactions will stay there till vacuum.
Redo should guarantee that index tuples will not be lost in a split operation
(which is possible now), but not that an index will have a correct structure
after a crash - a parent page may be left un-updated, which can be handled
at run time.

Vadim


Re: [HACKERS] Proposal for DROP TABLE rollback mechanism

2000-10-28 Thread Vadim Mikheev

  Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr
  doesn't look like the right thing?
 
 Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we
 will likely be doing some useless writes (to flush out pages that we are
 only going to throw away anyway).  If we leave the buffers alone till
 commit, then we'll only write out pages if we need to recycle a buffer
 for another use during that transaction.

BTW, why do we force buffers to disk in FlushRelationBuffers at all?
It seems all that is required is to flush them *from* the pool, not *to* disk
immediately. In the WAL bufmgr version (temporarily in xlog_bufmgr.c)
I've changed this. Actually, I wouldn't worry about these writes at all -
dropping a relation is a rare case - but OK.

 Also, I don't feel comfortable with the idea of doing
 FlushRelationBuffers mid-transaction and then relying on the buffer
 cache to still be empty of pages for that relation much later on when
 we finally commit.  Sure, it *should* be empty, but I'll be happier
 if we flush the buffer cache immediately before deleting the file.

(It *must* be empty -:) Sorry, I don't understand "*immediately* before"
in a multi-user environment. The relation must be exclusively locked, and that
must be our only guarantee - not the timing of doing anything with this relation.)

 What might make sense is to make a pass over the buffer cache at the
 time of DROP (inside the transaction) to make sure there are no pinned
 buffers for the rel --- if so, we want to elog() during the transaction
 not after commit.  We could also release any non-dirty buffers at
 that point.  Then after commit we know we don't care about the dirty
 buffers anymore, so we come back and discard them.

Please note that there is xlog_bufmgr.c. If you add/change something in
bufmgr, please let me know later.
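
(A sketch of the two-phase scheme quoted above, modeled on a toy buffer
pool; all names are hypothetical:)

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Buf_ { unsigned relid; int pincount; bool dirty, valid; } Buf_;

#define NBUFFERS_ 16
static Buf_ pool_[NBUFFERS_];

/* inside the transaction: fail early on pins, drop clean pages now */
static void
drop_rel_buffers_phase1_(unsigned relid)
{
    for (int i = 0; i < NBUFFERS_; i++)
    {
        Buf_ *b = &pool_[i];

        if (!b->valid || b->relid != relid)
            continue;
        if (b->pincount > 0)
        {
            fprintf(stderr, "ERROR: rel %u has pinned buffers\n", relid);
            exit(1);            /* stand-in for elog(ERROR, ...) */
        }
        if (!b->dirty)
            b->valid = false;   /* clean pages can go immediately */
    }
}

/* after commit: the file is gone, dirty pages are simply thrown away */
static void
drop_rel_buffers_phase2_(unsigned relid)
{
    for (int i = 0; i < NBUFFERS_; i++)
        if (pool_[i].valid && pool_[i].relid == relid)
            pool_[i].valid = false;
}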

Vadim





Re: [HACKERS] Proposal for DROP TABLE rollback mechanism

2000-10-28 Thread Vadim Mikheev

  BTW, why do we force buffers to disk in FlushRelationBuffers at all?
  Seems all what is required is to flush them *from* pool, not *to* disk
  immediately.

 Good point.  Seems like it'd be sufficient to do a standard async write
 rather than write + fsync.

 We'd still need some additional logic at commit time, however, because
 we want to make sure there are no BufferDirtiedByMe bits set for the
 file we're about to delete...

Note that there is no BufferDirtiedByMe in the WAL bufmgr version.

 How about this: change FlushRelationBuffers so that it does standard
 async writes for dirty buffers and then removes all the rel's buffers

^
When it's used from vacuum there is no reason to do this.

 from the pool.  This is invoked inside the transaction, same as now.
 Make DROP TABLE call this routine, *not* RemoveRelationBuffers.
 Then call RemoveRelationBuffers from smgr during transaction commit or
 abort.  In the commit case, there really shouldn't be any buffers for
 the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?)

^
Sure, but it's a good time for an Assert -:)

 if we find any.  But in the abort case, we'd not be surprised to find
 buffers, even dirty buffers, and we just want to throw 'em away.  This
 would also be the place to clean out the BufferDirtiedByMe state.

BTW, currently smgrcommit is called twice on commit, *before* the xact
status in pg_log is updated, so you can't use it to remove files. In the
WAL version smgrcommit isn't called at all now, but we can easily
add an smgrcommit call after the commit record is logged.

Vadim





Re: [HACKERS] Numeric file names

2000-10-28 Thread Vadim Mikheev

  Now that we have numeric file names, I would like to have a command I
  can run from psql that will dump a mapping of numeric file name to table
  name, i.e.,
 
  121233 pg_proc
  143423 pg_index

 select oid, relname from pg_class;

No: select relfilenode, relname from pg_class - in theory relfilenode may
differ from the relation oid.

Vadim





Re: [HACKERS] relation ### modified while in use

2000-10-24 Thread Vadim Mikheev
   In my understanding,locking levels you provided contains
   an implicit share/exclusive lock on the corrsponding
   pg_class tuple i.e. AccessExclusive Lock acquires an
   exclusive lock on the corresping pg_class tuple and
   other locks acquire a share lock, Is it right ?
 
  No. Access...Locks are acquired over the target table
  (the table's oid is used as the key for the lmgr hash table),
  not over the corresponding pg_class tuple, in which case
  we would use pg_class's oid + the table's oid as the key
  (a possibility I've described below).
 
 
 Yes,I know that "lock table" doesn't touch the correpon
 ding pg_class tuple at all.  However isn't it equivalent ?

From what POV?
The lock manager will allow two simultaneous exclusive locks acquired via these
different methods (keys), and so we can interpret (use) them differently.

Vadim


Re: [HACKERS] relation ### modified while in use

2000-10-23 Thread Vadim Mikheev
  in general.  What I'm proposing is that once an xact has touched a
  table, other xacts should not be able to apply schema updates to that
  table until the first xact commits.
 
 
 I agree with you.

I don't know. We discussed this issue just after 6.5 and decided to
allow concurrent schema modifications.
Oracle has dictionary locks but runs each DDL statement in a separate
xaction, so there's no deadlock condition there. OTOH, I wouldn't worry
about deadlock - one just has to follow common anti-deadlock rules.

 I've wondered why AccessShareLock is a short term lock.

MUST BE. AccessShare-/Exclusive-Locks are *data* locks.
If one wants to protect the schema, then new schema share/excl locks
must be introduced. There is no conflict between data and
schema locks - they are orthogonal.

We use AccessShare-/Exclusive-Locks for schema because...
we allow concurrent schema modifications, so no true schema
locks were required.

 If we have a mechanism to acquire a share lock on a tuple,we
 could use it for managing system info generally. However the
 only allowed lock on a tuple is exclusive.  Access(Share/Exclusive)

Actually, just look at lock.h:LTAG structure - lock manager supports
locking of "some objects" inside tables:

typedef struct LTAG
{
    Oid             relId;
    Oid             dbId;
    union
    {
        BlockNumber   blkno;
        TransactionId xid;
    } objId;
...
 - we could add an oid to the union above and lock tables by acquiring a lock
on pg_class with objId.oid = the table's oid. The same way we could lock indices
and whatever we want... if we want -:)
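
(A hedged sketch of that extension - not the actual lock.h, just the tag
with an extra oid member in the union as suggested above:)

typedef unsigned int Oid;
typedef unsigned int BlockNumber;
typedef unsigned int TransactionId;

typedef struct LTAG_
{
    Oid relId;
    Oid dbId;
    union
    {
        BlockNumber   blkno;
        TransactionId xid;
        Oid           oid;   /* e.g. a table's oid, keyed under pg_class */
    } objId;
} LTAG_;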

 Lock on tables would give us a restricted solution about pg_class
 tuples.
 
 Thers'a possibility of deadlock in any case but there are few
 cases when AccessExclusiveLock is really needed and we could
 acquire an AccessExclusiveLock manually from the first if
 necessary.
 
 I'm not sure about the use of AccessShareLock in parse-analyze-
 optimize phase however.

There is notion about breakable (parser) locks in Oracle documentation -:)

Vadim


Re: AW: [HACKERS] relation ### modified while in use

2000-10-23 Thread Vadim Mikheev
  As for locks, weak locks don't pass intensive locks.  A dba
  seems to be able to alter a table at any time.

 Sorry, I don't understand this sentence. Tom suggested placing a shared lock on
 any table that is accessed until end of tx. No one can alter the table until all users
 have closed their txns and not accessed the tables again.

More than that - while one xaction waits to alter a table, no new xaction will be
allowed to access this table either.

 Remember that this would include creating an index ...

I don't think so. Index creation requires
1. share lock on schema
2. share lock on data

Vadim


Re: [HACKERS] Re: [COMMITTERS] pgsql/src/test/regress/expected (plpgsql.out inet.out foreign_key.out errors.out)

2000-10-23 Thread Vadim Mikheev

 Bruce Momjian [EMAIL PROTECTED] writes:
  Speaking of error messages, one idea for 7.2 might be to prepended
  numbers to the error messages.
 
 Isn't that long since on the TODO list?  I know we've had long
 discussions about a thoroughgoing revision of error reporting.

Yes, yes, yes! We need numbers especially because we
hopefully will have savepoints in 7.2, and so we would get powerful
error handling by *applications*, not only by *humans*.

Vadim





Re: [HACKERS] SetQuerySnapshot() for utility statements

2000-10-19 Thread Vadim Mikheev

 I notice that ProcessUtility() calls SetQuerySnapshot() for FETCH
 and COPY TO statements, and nothing else.

 Seems to me this is very broken.  Isn't a query snapshot needed for
 any utility command that might do database accesses?

Not needed. We don't support multi-versioning for schema operations.
More than that, sometimes it would be better to read *dirty* data from
the system tables - so, no snapshot is required.

What is really, hm, not good is that the first SetQuerySnapshot defines a
serializable snapshot for *all* transactions, even for ones with the read
committed isolevel: back in the 6.5 times I thought about the ability to switch
between isolevels inside a single xaction - this is not required by the standard
and is *bad* for the system: just remember that vacuum doesn't clean up deleted
tuples if there is some transaction *potentially* interested in them. For read
committed xactions no serializable snapshot must be defined, and MyProc->xmin
must be updated when *each* top-level query begins.
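
(A sketch of that rule, with hypothetical names - not the actual
tqual/xact code:)

typedef unsigned int TransactionId_;

typedef struct Snapshot_ { TransactionId_ xmin, xmax; } Snapshot_;
typedef struct Proc_     { TransactionId_ xmin; } Proc_;

extern int        XactIsoLevel_;            /* 0 = read committed */
extern Proc_     *MyProc_;
extern Snapshot_ *QuerySnapshot_;
extern Snapshot_ *GetSnapshotData_(void);

/* read committed takes a fresh snapshot per top-level query and advances
 * the backend's advertised xmin so vacuum isn't held back */
static void
StartTopLevelQuery_(void)
{
    if (XactIsoLevel_ == 0)                 /* read committed */
    {
        QuerySnapshot_ = GetSnapshotData_();
        MyProc_->xmin  = QuerySnapshot_->xmin;
    }
    /* serializable xacts keep the snapshot from their first query */
}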

Vadim





Re: [HACKERS] time stops within transaction

2000-10-19 Thread Vadim Mikheev
   Snapshot is made per top-level statement and functions/subqueries
   use the same snapshot as that of top-level statement.
 
  Not so.  SetQuerySnapshot is executed per querytree, not per top-level
  statement --- for example, if a rule generates multiple queries from
  a user statement, SetQuerySnapshot is called again for each query.

This is true. I just made it work as it did in pre-6.5 times - each
query of the *top level* query list uses its own snapshot (in read committed mode
only), as if they were submitted by the user one by one.

But functions/subqueries called while executing a query use the same snapshot
as the query itself.

  With the current structure of pg_exec_query_string(), an operation
  executed in the outer loop, rather than the inner, would more or less
  correspond to one "top level" query --- if you want to assume that
  pg_exec_query_string() is only called from PostgresMain.  That's
  true today but hasn't always been true --- I believe it used to be
  used to parse SPI commands, and someday it may be again.

It was never used in SPI. Just look at _SPI_execute. The same parent query
snapshot is used in SPI functions. *But* SPI queries *see* changes
made by the parent query - I was never sure about this, and I think I asked for
other opinions. No opinions - no changes -:)

 If there's no concept of top-level statement,there's no
 concept of read consistency and MVCC isn't needed.

Except for the fact that SPI queries see changes made by the parent, the same
snapshot is used the whole time while executing a top-level query (a single query,
not a query list).

Vadim


Re: [HACKERS] SetQuerySnapshot() for utility statements

2000-10-19 Thread Vadim Mikheev

  Seems to me this is very broken.  Isn't a query snapshot needed for
  any utility command that might do database accesses?
 
  Not needed. We don't support multi-versioning for schema operations.
 
 No?  Seems to me we're almost there.  Look for instance at that DROP
 USER bug I just fixed: it was failing because it wasn't careful to make
 sure that during "DROP USER foo,bar", the loop iteration to delete user
 bar would see the changes the first loop iteration had made.  So even
 ^^^
A snapshot defines the visibility of changes made by other transactions.
It seems that you're talking here about self-visibility, defined by CommandId.

 though we use a lot of table-level locking rather than true MVCC
 behavior for schema changes, ISTM that we still have to play by all the
 rules when it comes to tuple visibility.  In particular I suspect we
 ought to be using standard query snapshot behavior...

What would it buy us? MVCC lies to the user - it returns a view of the data
as it was some time ago. What would we get by seeing an old
view of the catalog?

  More of that, sometimes it would be better to read *dirty* data from
  system tables - so, no snapshot required.
 
 There may be a small number of places like that, but for generic utility
 operations like CREATE/DROP USER, I don't see that this is a good idea.

But your fix for DROP USER didn't change anything about the snapshot used by
the scans, so it's not a good example.

Vadim





Re: [HACKERS] time stops within transaction

2000-10-19 Thread Vadim Mikheev

 I am inclined to think that we should do SetQuerySnapshot in the outer
 loop of pg_exec_query_string, just before calling
 pg_analyze_and_rewrite.  This would ensure that parse/plan accesses to
   ^^
Actually not - the snapshot is passed as a parameter to heap_beginscan...
And currently SnapshotNow is used everywhere.

 If we did that, then SetQuerySnapshot would be called once per user-
 written command (defining a command as whatever the grammar produces
 a single parsetree for, which is probably OK) so long as SPI functions
 don't try to use pg_exec_query_string...

SPI hasn't tried this since its birth in ~6.2.

Vadim





[HACKERS] WAL status todo

2000-10-14 Thread Vadim Mikheev

Well, hopefully WAL will be ready for alpha testing in a few days. Unfortunately,
at the moment I have to step aside from the main stream to implement the new
file naming, the biggest todo for integrating WAL into the system.

I would really appreciate any help with the following issues (testing can start
regardless of their status, but they must be resolved anyway):

1. BTREE: sometimes WAL can't guarantee the right order of items on leaf pages
   after recovery - a new flag, BTP_REORDER, was introduced to mark such pages.
   Btree should be changed to handle this case in normal processing mode.
2. HEAP: like 1, this issue is the result of the attempt to go without compensation
   records (i.e. without logging undo operations): it's possible that sometimes in
   redo there will be no space for new records, because in recovery we don't undo
   changes for aborted xactions immediately - a function like BTREE's
   _bt_cleanup_page_ is required for HEAP, as well as a general inspection of all
   places where HEAP redo ops try to insert records (initially I thought that in
   recovery we'd undo changes immediately after reading an abort record from the
   log - this wouldn't work for BTREE: splits must be redone before undo).
3. There are no redo/undo routines for HASH, RTREE & GIST yet. It would be
   *really really great* if someone could implement them using BTREE's redo/undo
   code as a prototype (see the skeleton sketched below). These are the most
   complex parts of this todo.
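
(To make item 3 more approachable, a skeleton of what an access-method
redo routine could look like - illustrative names only, not the actual
7.1 rmgr interface:)

typedef struct XLogRecord_
{
    unsigned char xl_info;  /* which operation this record describes */
    char          xl_data[1];
} XLogRecord_;

#define XLOG_HASH_INSERT_ 0x00
#define XLOG_HASH_SPLIT_  0x10

static void
hash_redo_(XLogRecord_ *record)
{
    switch (record->xl_info & 0xF0)
    {
        case XLOG_HASH_INSERT_:
            /* reinsert the tuple described by xl_data */
            break;
        case XLOG_HASH_SPLIT_:
            /* redo the bucket split before any undo, as with btree */
            break;
        default:
            /* stand-in for elog(STOP, "hash_redo: unknown op") */
            break;
    }
}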

Probably, something else will follow later.

Regards,
Vadim