Re: [HACKERS] Nested transactions: low level stuff

2003-03-21 Thread Mikheev, Vadim
 I see no concurrency problems.  If two or more backends visit the same
 tuple, they either write the same value to the same position which
 doesn't hurt, or one sees the other's changes which is a good thing.

AFAIR, on multi-CPU platforms it's possible that the second transaction could
see the COMMITTED state but still the old (subtransaction id) value in xmin:
it's not guaranteed that changes made on CPU1 (V1 changed first, then V2)
will become visible in the same order on CPU2 (V2 may appear first, then V1).
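
A minimal sketch of the hazard in C11-atomics terms (hypothetical names,
not backend code; the fix is a release/acquire pair):

--
#include <stdatomic.h>

static _Atomic int xmin_value = 0;    /* stand-in for the tuple's xmin */
static _Atomic int commit_flag = 0;   /* stand-in for the COMMITTED state */

void writer(void)
{
    atomic_store_explicit(&xmin_value, 1, memory_order_relaxed);
    /* release: all earlier stores become visible before commit_flag does */
    atomic_store_explicit(&commit_flag, 1, memory_order_release);
}

int reader(void)
{
    /* acquire: seeing commit_flag == 1 guarantees seeing xmin_value == 1;
     * without the release/acquire pair the CPU may expose the stores
     * in the opposite order */
    if (atomic_load_explicit(&commit_flag, memory_order_acquire) == 1)
        return atomic_load_explicit(&xmin_value, memory_order_relaxed);
    return -1;    /* commit not visible yet */
}
--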

Vadim




Re: [HACKERS] WAL replay logic (was Re: [PERFORM] Mount options f

2003-02-18 Thread Mikheev, Vadim
   Added to TODO:
  
 * Allow WAL information to recover corrupted pg_controldata
  ...
 Using pg_control to get the checkpoint position speeds up the
 recovery process, but to handle possible corruption of pg_control,
 we should actually implement the reading of existing log segments
 in reverse order -- newest to oldest -- in order to find the last
 checkpoint. This has not been implemented, yet.
  
  So if you do this, do you still need to store that information in
  pg_control at all?

Yes: to speed up the recovery process.
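
A rough sketch of that reverse scan (all helper names here are
hypothetical, not backend code):

--
XLogRecPtr
FindLastCheckpoint(void)
{
    Segment *seg;

    /* walk log segments newest to oldest */
    for (seg = newest_segment(); seg != NULL; seg = prev_segment(seg))
    {
        XLogRecPtr  last = {0, 0};
        XLogRecord *rec;

        /* within one segment, scan forward and remember the last checkpoint */
        for (rec = first_record(seg); rec != NULL; rec = next_record(seg, rec))
            if (is_checkpoint_record(rec))
                last = record_lsn(rec);

        if (last.xrecoff != 0)
            return last;    /* newest segment containing a checkpoint wins */
    }
    elog(STOP, "no checkpoint record found in existing log segments");
}
--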

Vadim




[HACKERS] PostgreSQL-R

2002-12-20 Thread Mikheev, Vadim
 http://www.cs.mcgill.ca/~kemme/papers/vldb00.html

Thanks for the link, Darren - I think everyone interested
in this discussion should read it.
First, I like the approach. Second, I don't understand why
ppl oppose pg-r & 2pc. 2pc is just a simple protocol to
perform distributed commits *after* distributed conflicts
have been resolved. It says nothing about *how* to resolve
conflicts. Commonly, distributed locks are used; pg-r uses
GCS & a kind of batch locking to order distributed transactions
and serialize execution of the conflicting ones. Actually, this
serialization is the only drawback I see at the moment: due
to the batching of writes/locks, pg-r will not allow execution
of transactions from different sites in read committed mode -
one of the conflicting transactions will be aborted instead of
waiting for the abort/commit of the other one and continuing
execution after that. Because it resolves conflicts *before*
commit, pg-r is not an async solution. But it's not true sync
replication either, due to async commit (read the Jan vs Darren
discussion in
http://archives.postgresql.org/pgsql-hackers/2002-12/msg00754.php).
What's the problem with using 2pc for commit in pg-r? We could
make it optional (and can discuss it later).
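
For reference, a minimal sketch of plain 2pc itself (all names here are
hypothetical; this is only the commit protocol, run after conflicts
have been resolved):

--
#include <stdbool.h>

typedef enum { VOTE_COMMIT, VOTE_ABORT } Vote;

bool
two_phase_commit(Node *nodes, int n)
{
    int i, j;

    /* Phase 1: every participant must prepare and vote */
    for (i = 0; i < n; i++)
    {
        if (send_prepare(&nodes[i]) != VOTE_COMMIT)
        {
            for (j = 0; j <= i; j++)    /* someone voted no: roll back all */
                send_abort(&nodes[j]);
            return false;
        }
    }

    /* Phase 2: all are prepared, so commit is now guaranteed everywhere */
    for (i = 0; i < n; i++)
        send_commit(&nodes[i]);
    return true;
}
--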
Next, pg-r was originally based on 6.4, so what has changed for
current pg versions where MVCC is used for CC? It seems that locking
tuples via the LockTable at Phase 1 is not required anymore, right?
Upon receiving its local WS in Phase 3, a local transaction should just
check that there are no conflicting locks from remote transactions
in the LockTable, and it can commit after that. Remote transactions will
not see conflicts from local ones in the LockTable but will notice them
during execution and will be able to abort local transactions.
(I hope I didn't miss something here.) Also, it seems that we could
perform Phases 2 & 3 periodically during transaction execution.
This would make the WS smaller, and conflicts between long-running
transactions from different sites would be resolved faster.

Comments?

Vadim





Re: [HACKERS] RI_FKey_check: foreign key constraint blocks parall

2002-11-16 Thread Mikheev, Vadim
   void
   heap_mark4fk_lock_acquire(Relation relation, HeapTuple tuple) {

Just wondering how you are going to implement it - is it by using
some kind of read-locks, ie the FK transaction locks the PK to prevent
delete (this is known as the pessimistic approach)?
About two years ago Jan and I discussed an optimistic approach
using dirty reads, where PK/FK transactions do not check the
existence of the FK/PK until the constraint must be checked (after
the statement is processed for immediate mode; at commit time /
SET CONSTRAINTS IMMEDIATE for deferred constraints).

So, at check time, the FK transaction uses dirty reads to learn
about the existence/status of the PK:
1. No PK - abort.
2. PK (inserted?/)deleted/updated/selected for update by concurrent
transaction P - wait for P's commit/abort (just as transactions do
for concurrent same-row updates); go to 1.
3. Else (PK exists and no one is changing it right now) - proceed.

The PK transaction does the same:
1. No FK - proceed.
2. FK inserted/updated/selected for update by concurrent transaction
F - wait for F's commit/abort; go to 1.

This would be more in MVCC style -:)
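
A sketch of the FK-side loop under these rules (steps 1-3 above;
dirty_read_pk and friends are hypothetical helpers, though
XactLockTableWait is a real backend call):

--
for (;;)
{
    HeapTuple pk = dirty_read_pk(key);   /* read ignoring the MVCC snapshot */

    if (pk == NULL)
        elog(ERROR, "insert or update violates foreign key constraint");

    if (TupleBeingChanged(pk))           /* concurrent transaction P touched it */
    {
        XactLockTableWait(ChangingXact(pk));  /* sleep until P commits/aborts */
        continue;                             /* go to 1: re-check existence */
    }

    break;    /* PK exists and no one is changing it right now: proceed */
}
--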

Vadim




Re: [HACKERS] Idea for better handling of cntxDirty

2002-11-15 Thread Mikheev, Vadim
 Wouldn't it work for cntxDirty to be set not by LockBuffer, but by
 XLogInsert for each buffer that is included in its argument list?

I thought of adding a separate call to mark the context dirty, but
the above should work if all callers of XLogInsert always pass all
modified buffers - please check.

Vadim




Re: [HACKERS] [GENERAL] Database replication... - Mission Critica

2002-11-07 Thread Mikheev, Vadim
 My presumption would be that if you initialize 2 databases to
 a known identical start, have all the same triggers and rules
 on both, then send all queries to both databases, you will
 have 2 identical databases at the end. 

This is a wrong assumption. If

1st client executes UPDATE t SET a = 1 WHERE b = 2;
2nd client executes UPDATE t SET a = 2 WHERE b = 2;

at the same time, you don't know in which order these
queries will be executed on the two different servers (because
you can't control which transaction will lock the record(s)
for update first).

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  The predicate for files we MUST (fuzzy) copy is: 
File exists at start of backup && File exists at end of backup
 
 Right, which seems to me to negate all these claims about needing a
 (horribly messy) way to read uncommitted system catalog entries, do
 blind reads, etc.  What's wrong with just exec'ing tar after having
 done a checkpoint?

Right.

It looks like insert/update/etc ops on local relations are
WAL-logged, and that's Ok (we have to do this).

So, we only have to use the shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  (In particular, I *strongly* object to using the buffer manager at all
  for reading files for backup.  That's pretty much guaranteed to blow out
  buffer cache.  Use plain OS-level file reads.  An OS directory search
  will do fine for finding what you need to read, too.)
 
 How do you get atomic block copies otherwise?

You don't need it - as long as the whole block is saved in the
log on the first change to that block after the checkpoint
(which you made before the backup).
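
A sketch of that rule as applied at log-insert time (RedoRecPtr and
PageGetLSN are real names; the Register* helpers are hypothetical):

--
if (XLByteLE(PageGetLSN(page), RedoRecPtr))
    RegisterFullPageImage(page);    /* first change since the checkpoint:
                                     * put the whole 8K block in the log */
else
    RegisterDelta(page, change);    /* block already imaged: log only the
                                     * tuple-level change */
--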

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  So, we only have to use shared buffer pool for local (but probably
  not for temporary) relations to close this issue, yes? I personally
  don't see any performance issues if we do this.
 
 Hmm.  Temporary relations are a whole different story.
 
 It would be nice if updates on temp relations never got WAL-logged at
 all, but I'm not sure how feasible that is.  Right now we don't really

There is no point in logging them.

 distinguish temp relations from ordinary ones --- in particular, they
 have pg_class entries, which surely will get WAL-logged even if we
 persuade the buffer manager not to do it for the data pages.  Is that
 a problem?  Not sure.

It was not about any problem. I just mean that the local buffer pool
could still be used for temporary relations, if someone thinks
that makes any sense. Anyone?

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  You don't need it.
  As long as whole block is saved in log on first after
  checkpoint (you made before backup) change to block.
 
 I thought half the point of PITR was to be able to turn
 off pre-image logging so you can trade potential recovery

Correction - *after*-image.

 time for speed without fear of data-loss. Didn't we have
 this discussion before?

Sorry, I missed this.

So, has it already been discussed what to do about partial
block updates? When the system crashes just after the LSN,
but not the actual tuple etc, was stored in the on-disk block,
then on restart you compare the log record's LSN with the
data block's LSN, they are equal, and so you *assume*
that the actual data are in place too - which is not the case?
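
A sketch of the replay test being questioned (hypothetical names) - a
torn write can persist the page header, including its LSN, while the
rest of the 8K block stays stale, so LSN equality proves nothing:

--
if (XLByteLE(record_lsn, PageGetLSN(page)))
    continue;                    /* assumed already applied - unsafe if torn */
apply_redo_record(rec, page);
--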

I always thought that the whole point of PITR is to be
able to restore the DB fast (faster than pg_restore) *AND*
up to the last committed transaction (assuming that the
log is Ok).

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  How do you get atomic block copies otherwise?
 
 Eh?  The kernel does that for you, as long as you're reading the
 same-size blocks that the backends are writing, no?

Good point.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

   As long as whole block is saved in log on first after
   checkpoint (you made before backup) change to block.
 
 I thought half the point of PITR was to be able to
 turn off pre-image logging so you can trade potential
 recovery time for speed without fear of data-loss.
 Didn't we have this discussion before?

 Suppose you can turn off/on PostgreSQL's atomic write on
 the fly.  Which means turning on or off whether XLogInsert
 writes a copy of the block into the log file upon first
 modification after a checkpoint.
 So ALTER SYSTEM BEGIN BACKUP would turn on atomic write
 and then checkpoint the database.
 So while the OS copy of the data files is going on, the
 atomic write would be enabled, so any read of a partial
 write would be fixed up by the usual crash recovery mechanism.

Yes, a simple way to satisfy everyone.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

 Are you sure this is true for all ports?

Well, maybe you're right and it's not.
But with after-image blocks in the log after a checkpoint,
you really shouldn't worry about block atomicity, right?
And the ability to turn block logging on/off, as suggested
by Richard, looks appropriate for everyone, no?

 And if so, why would it be cheaper for the kernel to do it in
 its buffer manager, compared to us doing it in ours? This just
 seems bogus to rely on. Does anyone know what POSIX has to say
 about this? 

Does "doing it in ours" mean reading all data files through
our shared buffer pool? Sorry, I just don't see the point of this
when tar etc will work just fine. At least for the first release
tar is SuperOK, because there must be and will be other
problems/bugs, unrelated to how we read data files, and so
the sooner we start testing the better.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

 So I think what will work then is pg_copy (hot backup) would:
 1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic
 write, checkpoints the database and disables further checkpoints
 (so wal files won't be reused) until the backup is complete.
 2) Change ALTER SYSTEM BACKUP DATABASE TO directory to read the
 database directory to find which files it should backup rather than
 pg_class, and for each file just use system(cp...) to copy it to the
 backup directory.

Did you consider saving the backup on the client host (ie where
pg_copy was started from)?

 3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the
 pg_xlog directory and re-enables database checkpointing.

Well, wouldn't a single ALTER SYSTEM BACKUP command be enough?
What's the point of having 3 commands?

(If all of this has already been discussed then sorry - I'm not going
to start a new discussion.)

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

 I really dislike the notion of turning off checkpointing.  What if the
 backup process dies or gets stuck (eg, it's waiting for some operator to
 change a tape, but the operator has gone to lunch)?  IMHO, backup
 systems that depend on breaking the system's normal operational behavior
 are broken.  It should be sufficient to force a checkpoint when you
 start and when you're done --- altering normal operation in between is
 a bad design.

But you have to prevent log file reuse while you copy the data files.
That's why I asked whether the 3 commands from pg_copy are required and
whether backup couldn't be accomplished by issuing a single command

ALTER SYSTEM BACKUP dir | stdout (to copy data to the client side)

(even from psql), so the backup process would die with the entire
system -:) As for tape changing, maybe we could use some timeout and
then just stop the backup process.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  It should be sufficient to force a checkpoint when you
  start and when you're done --- altering normal operation in between is
  a bad design.
 
  But you have to prevent log files reusing while you copy data files.
 
 No, I don't think so.  If you are using PITR then you presumably have
 some process responsible for archiving off log files on a continuous
 basis.  The backup process should leave that normal operational
 behavior in place, not muck with it.

Well, PITR without log archiving could be an alternative to
pg_dump/pg_restore, but I agree that it's not a big
feature to worry about.

Vadim




Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread Mikheev, Vadim

  Well, PITR without log archiving could be alternative to
  pg_dump/pg_restore, but I agreed that it's not the big
  feature to worry about.
 
 Seems like a pointless feature to me.  A pg_dump dump serves just
 as well to capture a snapshot --- in fact better, since it's likely
 smaller, definitely more portable, amenable to selective restore, etc.

But pg_restore will probably take longer than copying the data files
back and re-applying the log.

 I think we should design the PITR dump to do a good job for PITR,
 not a poor job of both PITR and pg_dump.

As I already said - agreed -:)

Vadim




Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec

2002-03-15 Thread Mikheev, Vadim

 Attached is a patch against current CVS that fixes both of the known
 problems with sequences: failure to flush XLOG after a transaction

Great! Thanks... and sorry for missing these cases a year ago -:)

Vadim




Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec

2002-03-14 Thread Mikheev, Vadim

  This isn't an issue for a SELECT nextval() standing on
  its own AFAIK the result will not be transmitted to the
  client until after the commit happens. But it would be
  an issue for a select executed inside a transaction
  block (begin/commit).
 
 The behavior of SELECT nextval() should not be conditional
 on being in or out of a transaction block.

And it's not. But the behaviour of the application *must* be
conditional on whether the transaction committed or not.

What's the problem for an application that needs nextval() for
external (out-of-database) purposes to use sequence values
only after transaction commit? What's *wrong* with such an application
behaving the same way as when dealing with other database objects
that are under transaction control (eg only after commit can you
report to the user that $100 was successfully added to his/her account)?

---

I agree that if nextval-s were the only write actions in a transaction
and they made some XLogInsert-s, then WAL must be flushed at commit
time. But that's it. Was this fixed? It's very easy to do.
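
A sketch of the needed commit-time test (the flag name is a guess, not
necessarily what the actual fix uses; XLogFlush is real):

--
/* at commit: flush WAL if this xact wrote any log records at all,
 * including records written only by nextval() */
if (MyXactMadeXLogEntry)                /* hypothetical flag name */
    XLogFlush(lsn_of_last_record);      /* force the log to disk before
                                         * reporting commit to the client */
--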

Vadim




Re: [HACKERS] [GENERAL] Database corruption?

2001-10-24 Thread Mikheev, Vadim

  Um, Vadim? Still of the opinion that elog(STOP) is a good
  idea here? That's two people now for whom that decision has
  turned localized corruption into complete database failure.
  I don't think it's a good tradeoff.
 
  One is able to use pg_resetxlog so I don't see point in
  removing elog(STOP) there. What do you think?

 Well, pg_resetxlog would get around the symptom, but at the cost of
 possibly losing updates that are further along in the xlog than the
 update for the corrupted page. (I'm assuming that the problem here
 is a page with a corrupt LSN.) I think it's better to treat flush
  
On restart, the entire content of all pages modified after the last
checkpoint should be restored from WAL. In Denis' case it looks like the
page newly allocated for the update was somehow corrupted before
heapam.c:2235 (7.1.2 src), and so there was no XLOG_HEAP_INIT_PAGE flag
in the WAL record => the page content was not initialized on restart.
Denis reported a system crash - very likely due to a memory problem.

 request past end of log as a DEBUG or NOTICE condition and keep going.
 Sure, it indicates badness somewhere, but we should try to have some
 robustness in the face of that badness.  I do not see any reason why
 XLOG has to declare defeat and go home because of this condition.

Ok - what about setting some flag there on restart and aborting the
restart after all records from WAL have been applied? Then the DBA will
have the choice either to run pg_resetxlog after that and try to dump
the data, or to restore from an old backup. I still object to just a
NOTICE there - it's easy to miss. And in normal processing mode I'd
leave the elog(STOP) there.

Vadim
P.S. Further discussions will be in hackers-list, sorry.




Re: [HACKERS] Bulkloading using COPY - ignore duplicates?

2001-10-01 Thread Mikheev, Vadim

 The effects don't stop propagating there, either. The decision
 not to insert the tuple must be reported up still further, so
 that the executor knows not to run any AFTER INSERT/UPDATE
 triggers and knows not to count the tuple as inserted/updated
 for the command completion report.

But what about BEFORE insert/update triggers which could insert
records too?

Vadim




Re: [HACKERS] Bulkloading using COPY - ignore duplicates?

2001-10-01 Thread Mikheev, Vadim

  But what about BEFORE insert/update triggers which could
  insert records too?
 
 Well, what about them?  It's already possible for a later
 BEFORE trigger to cause the actual insertion to be suppressed,
 so I don't see any difference from what we have now.
 If a BEFORE trigger takes actions on the assumption that the
 insert will happen, it's busted already.

This problem can be solved now by implementing a *single* trigger.
In the future, we could give users the ability to specify trigger
execution order.
But with the proposed feature ...

 Mind you, I'm not actually advocating that we do any of this ;-).

I understand -:)

 I was just sketching a possible implementation approach in
 case someone wants to try it.

And I'm just sketching possible problems -:)

Vadim




Re: [HACKERS] Index location patch for review (more pgbench resul

2001-09-13 Thread Mikheev, Vadim

 Moving the test to a system with SCSI disks gave different results.
 There is NO difference between having the indexes on the same disk or
 a different disk from the data while running pgbench. So I
 leave it up to you guys whether to include the patch or not. I do believe
 that even if performance doesn't increase, this patch has a lot of other
 benefits for admins.

Agreed.

Vadim




Re: [HACKERS] Index location patch for review

2001-09-12 Thread Mikheev, Vadim

  The more general and standard way to go is TABLESPACEs.
  But probably the proposed feature will be compatible with
  tablespaces when we've got them:
 
 Will it be?  I'm afraid of creating a backwards-compatibility
 problem for ourselves when it comes time to implement tablespaces.

As I said, INDEX_LOCATION in CREATE DATABASE could, in the future, mean
the location of the default tablespace for indices, and one would be
able to override the tablespace for a particular index with a
TABLESPACE clause in the CREATE INDEX command.

 At the very least I'd like to see some information demonstrating
 how much benefit there is to this proposed patch, before we
 consider whether to adopt it.  If there's a significant performance
 benefit to splitting a PG database along the table-vs-index divide,
 then it's interesting as a short-term improvement ... but Jim didn't
 even make that assertion, let alone provide evidence to back it up.

Agreed. He mentioned a significant performance difference, but it would
be great to see results of pgbench tests with a scaling factor of >= 10.
Jim?

Also, after reviewing the patch I have to say that it will NOT work
with WAL. Jim, please do not name the index dir TBL_NODE_index.
Instead, just use a different TBL_NODE for indices (a different number).
It's not good to put if (reln->rd_rel->relkind == RELKIND_INDEX)
stuff into the storage manager - only two numbers (tblnode & relnode)
must be used to identify a file, and no other logical information
totally unrelated to storage issues.

Vadim




Re: [HACKERS] Index location patch for review

2001-09-12 Thread Mikheev, Vadim

 I don't understand the WAL issue below, can you explain. The dir name
 is the same name as the database with _index added to it. This is how
 the current datpath stuff works.  I really just copied the datpath
 code to get this patch to work...

At the time of after-crash recovery, WAL is not able to read the
relation description from the catalog, and so only the relfilenode is
provided to the storage manager in the relation structure (look at
backend/access/transam/xlogutils.c:XLogOpenRelation). Well, we could
add an Index/Table file type identifier to RmgrData (rmgr.c in the same
dir) to set relkind in the relation structure, but I don't see any
reason to do so when we can just use a different tblnode number for
indices and name the index dirs just like the other dirs under 'base'
are named - ie only the tblnode number is used for dir names, without
any additions unrelated to storage issues.

Vadim




Re: [HACKERS] Index location patch for review

2001-09-12 Thread Mikheev, Vadim

 Also I have been running this patch (both 7.1.3 and 7.2devel) against
 some of my company's applications.  I have loaded a small database 10G

We are not familiar with your company's applications. It would be
better to see results from a test suite available to the community;
pgbench is the first to come to mind. Such tests would be more valuable.

Vadim




Re: [HACKERS] Index location patch for review

2001-09-12 Thread Mikheev, Vadim

  Attached is a patch that adds support for specifying a
  location  for indexes via the create database command.
  
  I believe this patch is complete, but it is my first .
 
 This patch allows index locations to be specified as
 different from data locations. Is this a feature direction
 we want to go in?  Comments?

The more general and standard way to go is TABLESPACEs.
But probably the proposed feature will be compatible with
tablespaces when we've got them: we could use the new CREATE
DATABASE syntax to specify the default tablespace for indices.
Unfortunately I deleted the message with the patch - can you send it
to me, Bruce?

Vadim




Re: [HACKERS] Problems starting up postgres

2001-09-06 Thread Mikheev, Vadim

  Sep  6 02:09:30 mx postgres[13468]: [9] FATAL 2:
  XLogFlush: request(1494286336, 786458) is not satisfied -- 
  flushed to (23, 2432317444) 

First, note that Denis could just restart with wal_debug = 1
to see the bad request, without any code change. (We should ask ppl
to set wal_debug ON in the case of any WAL problem...)
Denis, could you provide us with the debug output?

 Yeek.  Looks like you have a page somewhere in the database
 with a bogus LSN value (xlog pointer) ... and, most likely,
 other corruption as well.

We got the error during checkpoint, when the backend flushes pages
changed by REDO (and *only those pages*). So, that page X (with the
bad LSN) was recovered from the log. We didn't see CRC errors,
so the log is Ok physically. We should find out what the X page is
(by setting a breakpoint as suggested by Tom) and then look
into the debug output to see where we got the bad LSN.
Maybe it comes from restored pages or from the checkpoint LSN,
due to errors in XLogCtl initialization, but it sure looks
like a bug in the WAL code.

 Vadim, what do you think of reducing this elog from STOP to a notice
 on a permanent basis?  ISTM we saw cases during 7.1 beta where this

And increase the probability that ppl will just miss/ignore the NOTICE
and the bug in WAL will continue to harm others?

 STOP prevented people from recovering, so I'm thinking it does more

And we fixed the bug in WAL that time...

 harm than good to overall system reliability.

There is no reliability while there are bugs in the WAL code, so I
object. But I'd move the check into the XLogWrite code, to STOP if the
flush request is beyond the write point.

Denis, please help us fix this bug. Some GDB-ing will probably be
required. If you don't have enough time/disk resources but are able to
give us a copy of the data dir, that would be great (I have RedHat 7.?
and Solaris 2.6 hosts; Tom?). In any case the debug output is the first
thing I'd like to see. If it's big, please send it to Tom and me only.
And of course you can contact me in Russian -:)

Vadim




Re: [HACKERS] [GENERAL] getting the oid for a new tuple in a BEFORE trigger

2001-08-31 Thread Mikheev, Vadim

 we need to control database changes within BEFORE triggers.
 There is no problem with triggers called by update, but there is
 a problem with triggers called by insert.
 
 We strongly need to know the oid of a newly inserted tuple.
 In this case, we use tg_newtuple of the TriggerData structure
 passed to the trigger function, and its t_data->t_oid will
 have the value '0'.
 
 Using BEFORE and AFTER triggers would make our lives much harder.
 
 Is there any way (even hack) to get the oid the newly
 inserted tuple will receive?

Just set t_data->t_oid = newoid() - this is what the backend does
in heapam.c:heap_insert().
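
A sketch of a BEFORE INSERT trigger doing this (7.x-era trigger API from
memory; header paths and the newoid() declaration may differ by version):

--
#include "postgres.h"
#include "fmgr.h"
#include "commands/trigger.h"

Datum
assign_oid_early(PG_FUNCTION_ARGS)
{
    TriggerData *trigdata = (TriggerData *) fcinfo->context;
    HeapTuple    tuple = trigdata->tg_trigtuple;   /* tuple being inserted */

    if (tuple->t_data->t_oid == InvalidOid)
        tuple->t_data->t_oid = newoid();   /* same call heap_insert() makes */

    return PointerGetDatum(tuple);   /* hand the now-oid-bearing tuple on */
}
--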

Vadim




Re: [HACKERS] CURRENT OF cursor without OIDs

2001-08-24 Thread Mikheev, Vadim
 Oops I'm referring to client side cursors in our ODBC
 driver. We have no cross-transaction cursors yet though
 I'd like to see a backend cross-transaction cursor also.

Oops, sorry.
BTW, what are the "visibility" rules for an ODBC cross-tx cursor?
No repeatable reads, no serializability?
Do you hold some locks on the table while the cursor is open
(I noticed session locking in lmgr recently)?
Could ODBC cross-tx cursors be implemented using server
cross-tx cursors?

  I think we'll be able to restore old tid along with other tuple
  data from rollback segments, so I don't see any problem from
  osmgr...
 
 How do we detect the change of tuples from clients ?

Which version of the tuple must the client see? The new one?

 TIDs are invariant under osmgr. xmin is about to be
 unreliable for the purpose.

Seems I have to learn more about ODBC cross-tx cursors -:(
Anyway, *MSQL*, Oracle, Informix - they all have osmgr. Do they
have cross-tx cursors in their ODBC drivers?

Vadim



RE: [HACKERS] User locks code

2001-08-24 Thread Mikheev, Vadim

  Besides, anyone who actually wanted to use the userlock
  code would need only to write their own wrapper functions
  to get around the GPL license.
 
 This is a part of copyright law that eludes me - can I write
 a replacement function for something so simple that it can
 essentially be done in one way only (like incrementing a
 value by one)?

Yes, this is what bothers me in the user-lock case.
On the other hand, contrib/userlock's licence
cannot cover usage of LOCKTAG and LockAcquire
(because this code is from the backend), and that is
all that's used in the user_lock funcs. So, that licence
is unenforceable for everything... except the func names -:)

Vadim




RE: [OT] Re: [HACKERS] User locks code

2001-08-24 Thread Mikheev, Vadim

 Because the code we got from Berkeley was BSD licensed, we
 can't change it, and because many of us like the BSD license
 better because we don't want to require them to release the
 source code, we just want them to use PostgreSQL. And we
 think they will release the source code eventually anyway.

And we think that no one will try to fork and commercialize
the server code - today, when SAP & InterBase have opened their DB
code, that seems a no-brainer.

Vadim




RE: [HACKERS] User locks code

2001-08-24 Thread Mikheev, Vadim

 So, rather than going over everone's IANAL opinons about mixing
 licenses, let's just let Massimo know that it'd just be a lot
 easier to PostgreSQL/BSD license the whole thing, if he doesn't
 mind too much.

Yes, it would be better.

Vadim




RE: [HACKERS] User locks code

2001-08-23 Thread Mikheev, Vadim

  If the licence becomes a problem I can easily change it,
  but I prefer the GPL if possible.
 
 We just wanted to make sure the backend changes were not
 under the GPL.

No, Bruce - the backend part of the code is useless without the
interface functions, and I wonder whether the GPL-ed interface
implementation doesn't prevent the use of user-locks in *commercial*
applications.
For example, one could use user-locks for processing incoming
orders by multiple operators:
select * from orders where user_lock(orders.oid) = 1 LIMIT 1
- so each operator would lock one order for processing, and
operators wouldn't block each other. So, could such an application
be commercial under the current licence of
user_lock()? (Sorry, I'm not a licence guru.)

DISCLAIMER (to avoid ungrounded rumors -:))
I have no plans to use user-locks in applications
of *any* kind (free/commercial). It's just a matter of
principle - anything in/from the backend code may be used
for any purpose - that's the nature of our licence.

DISCLAIMER II (to avoid ungrounded rumors in the future -:))
I would be interested in using the proposed key-locking
in one particular commercial application, but this
feature is not a must-have for that application -
for my purposes this is enough:

--
LOCKTAG tag;
tag.relId = XactLockTableId;
tag.dbId = _tableId_;
// XactLockTableInsert uses tag.dbId = InvalidOid, so there is no
// way a valid OID here can collide with XactLock entries - no problem
tag.objId.xid = _user_key_;
--

- but I like standard solutions more -:)
(BTW, key-locking was requested by others long ago.)

Vadim




RE: [HACKERS] User locks code

2001-08-23 Thread Mikheev, Vadim

  For example, one could use user-locks for processing incoming
  orders by multiple operators:
  select * from orders where user_lock(orders.oid) = 1 LIMIT 1
  - so each operator would lock one order for processing and
  operators wouldn't block each other. So, could such
  application be commercial with current licence of
  user_lock()? (Sorry, I'm not licence guru.)
 
 I assume any code that uses contrib/userlock has to be GPL'ed,
 meaning it can be used for commercial purposes but can't be sold
 as binary-only, and actually can't be sold for much because you
 have to make the code available for near-zero cost.

I'm not talking about selling contrib/userlock separately, but
about the ability to sell applications which use contrib/userlock.
Sorry if that was not clear.

Vadim




RE: [HACKERS] User locks code

2001-08-23 Thread Mikheev, Vadim

   I assume any code that uses contrib/userlock has to be GPL'ed,
   meaning it can be used for commercial purposes but can't be sold
   as binary-only, and actually can't be sold for much because you
   have to make the code available for near-zero cost.
  
  I'm talking not about solding contrib/userlock separately, but
  about ability to sold applications which use contrib/userlock.
  Sorry, if it was not clear.
 
 No, you were clear.

So I missed your "near-zero cost" sentence.

 My assumption is that once you link that code into
 the backend, the entire backend is GPL'ed and any other
 application code you link into it is also (stored procedures,
 triggers, etc.) I don't think your client application will
 be GPL'ed, assuming you didn't link in libreadline.

The application would explicitly call user_lock() functions in
queries, so the issue is still not clear to me. And once again -
compare the complexity of contrib/userlock with the backend's userlock
code: what's the reason to cover contrib/userlock by the GPL?

Vadim




RE: [HACKERS] User locks code

2001-08-23 Thread Mikheev, Vadim

  Application would explicitly call user_lock() functions in
  queries, so issue is still not clear for me. And once again -
 
 Well, yes, it calls user_lock(), but the communication is not
 OS-linked, it is linked over a network socket, so I don't think
 the GPL spreads over a socket. Just as telnet'ing somewhere and
 typing 'bash' doesn't make your telnet GPL'ed, so I think the
 client code is safe. To the client, the backend is just
 returning information. You don't really link to the query
 results.

Ah, ok.

  compare complexities of contrib/userlock and backend' userlock
  code: what's reason to cover contrib/userlock by GPL?
 
 Only because Massimo prefers it. I can think of no other reason.
 It clearly GPL-stamps any backend that links it in.

Ok, let's do one step back - you wrote:

 My assumption is that once you link that code into the backend,
 the entire backend is GPL'ed and any other application code
 you link into it is also (stored procedures, triggers, etc.)

So, one would have to open-source and GPL all procedures/triggers
used by an application just because the application uses user_lock()
in queries?! Is that good?

Vadim




RE: [HACKERS] CURRENT OF cursor without OIDs

2001-08-23 Thread Mikheev, Vadim
  AFAICS, if you are holding an open SQL cursor, it is sufficient
  to check that ctid hasn't changed to know that you have the
  same, un-updated tuple. Under MVCC rules, VACUUM will be unable
  to delete any tuple that is visible to your open transaction,
  and so new-style VACUUM cannot recycle the ctid.
...
 
 As Tom mentiond once in this thread, I've referred to non-SQL
 cursors which could go across transaction boundaries.
 TIDs aren't that reliable across transactions.

We could avoid reassignment of MyProc->xmin while cursors are
open across tx boundaries, and so new-style vacuum wouldn't
remove old tuple versions...

 OIDs and xmin have already lost a part of its nature. Probably
 I have to guard myself beforehand and so would have to mention
 repeatedly from now on that if we switch to an overwriting smgr,
 there's no system item to detect the change of tuples. 

So, is the tid ok to use for your purposes?
I think we'll be able to restore the old tid along with other tuple
data from rollback segments, so I don't see any problem from
osmgr...

Vadim



[HACKERS] RE: User locks code

2001-08-21 Thread Mikheev, Vadim

 yep:
 lock tablename.colname.val=1
 select count(*) from tablename where colname=1
 If no rows, insert, else update.
 (dunno if the locks would scale to a scenario with hundreds
 of concurrent inserts - how many user locks max?).

I don't see a problem here - just a few bytes in shmem per
key. An auxiliary table would keep refcounters for the keys.

 Why wouldn't it work with serializable isolevel?

Because selects see an old database snapshot, and so you
wouldn't see a key inserted+committed by a concurrent tx.

Vadim
 




RE: [HACKERS] User locks code

2001-08-21 Thread Mikheev, Vadim

 Regarding the licencing of the code, I always release my code
 under GPL, which is the licence I prefer, but my code in the
 backend is obviously released under the original postgres
 licence. Since the module is loaded dynamically and not linked
 into the backend I don't see a problem here.

The problem is how to use user-locks in commercial projects.
Some loadable interface functions are required to use the in-backend
user lock code, but the interface is so simple that if one wrote
new functions, they would look the same as yours, covered by the GPL.

 If the licence becomes a problem I can easily change it, but
 I prefer the GPL if possible.

Actually, I don't see why your contrib module should be covered by
the GPL. There is not much IP (intellectual property) there. The real
new things which make the new feature possible are in the lock manager.

Vadim




RE: [HACKERS] RE: User locks code

2001-08-21 Thread Mikheev, Vadim

  I don't see problem here - just a few bytes in shmem for
  key. Auxiliary table would keep refcounters for keys.
 
 I think that running out of shmem *would* be a problem for such a
 facility.  We have a hard enough time now sizing the lock table for

The auxiliary table would have a fixed size, and so no new keys would
be added if there were no space. I don't see a problem with a default
8Kb aux table, do you?

 system locks, even though they use fixed-size keys and the system as
 a whole is designed to ensure that not too many locks will be held
 simultaneously.  (For example, SELECT FOR UPDATE doesn't try to use
 per-tuple locks.)  Earlier in this thread, someone proposed using
 user locks as a substitute for SELECT FOR UPDATE.  I can guarantee
 you that that someone will run out of shared memory before long,
 if the userlock table resides in shared memory.

How is the proposed key locking different from the user locks we
have right now? Anyone can try to acquire many, many user locks.

Vadim




[HACKERS] RE: User locks code

2001-08-20 Thread Mikheev, Vadim

 Would your suggested implementation allow locking on an
 arbitrary string?

Well, placing a string in LOCKTAG is not good, so we could
create an auxiliary hash table in shmem to keep such strings
and use the string's address as part of the LOCKTAG. A new function
(LockRelationKey?) in lmgr.c would first find/place the
key in that table, then construct the LOCKTAG and call
LockAcquire.
Possible syntax:

LOCK TABLE relation IN {SHARE | EXCLUSIVE} MODE
ON KEY user-string [FOR TRANSACTION | FOR SESSION];

UNLOCK (RELEASE?) TABLE relation {SHARE | EXCLUSIVE} LOCK
ON KEY user-string;

(or just some built-in functions).
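
A sketch of the proposed function (field usage and the LockAcquire
signature are guesses; shmem_key_lookup_or_insert - the auxiliary hash
table - is hypothetical):

--
void
LockRelationKey(Relation relation, const char *key, LOCKMODE lockmode)
{
    LOCKTAG tag;

    MemSet(&tag, 0, sizeof(tag));
    tag.relId = RelationGetRelid(relation);
    tag.dbId = MyDatabaseId;
    /* the string's address in the shmem table stands in as the object id */
    tag.objId.xid = (TransactionId) shmem_key_lookup_or_insert(key);

    LockAcquire(LockTableId, &tag, lockmode);    /* signature is a guess */
}
--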

 If it does then one of the things I'd use it for is to insert
 unique data without having to lock the table or rollback on
 failed insert (unique index still kept as a guarantee).

(A classic example of how SAVEPOINTs could be used -:))

So, in your application you would first lock a key in exclusive mode
(for the duration of the transaction), then try to select, and insert
unless found? (Note that this will not work with serializable isolevel.)

Comments?

Vadim




[HACKERS] User locks code

2001-08-17 Thread Mikheev, Vadim

1. Just noted this in contrib/userlock/README.user_locks:

 User locks, by Massimo Dal Zotto [EMAIL PROTECTED]
 Copyright (C) 1999, Massimo Dal Zotto [EMAIL PROTECTED]
 
 This software is distributed under the GNU General Public License
 either version 2, or (at your option) any later version.

Well, anyone can put code into contrib with whatever license
he/she wants, but the user locks package includes interface
functions in contrib *and* changes in our lock manager, ie
changes in backend code. I wonder if the backend part of the package
is covered by the same license above? And is that good, if so?

2. Not a good implementation, imho.

It's too complex (a separate lock method table, etc). Much cleaner
would be to implement this feature the same way transactions wait
for another transaction's commit/abort: by locking objects in a
pseudo table. We could get rid of offnum and lockmethod from
LOCKTAG and add

struct
{
Oid RelId;
Oid ObjId;
} userObjId;

to objId union of LOCKTAG.

This way a user could lock whatever object he/she wants in a specified
table, and note that we would be able to use table access rights to
control whether the user is allowed to lock objects in the table -
something missing in 1.

One could object that 1. is good because user locks never wait.
I argue that never waiting for a lock is as bad as always waiting.
Someday we'll have timed-wait etc features for the general lock method
and everybody will be happy -:)

Comments?

Vadim
P.S. I could add 2. very fast, no matter whether we keep 1. or not.




[HACKERS] Using POSIX mutex-es

2001-08-01 Thread Mikheev, Vadim

1. Just changed
TAS(lock) to pthread_mutex_trylock(lock)
S_LOCK(lock) to pthread_mutex_lock(lock)
S_UNLOCK(lock) to pthread_mutex_unlock(lock)
(and S_INIT_LOCK to share mutex-es between processes).
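
For the process-shared part, a minimal sketch of the init call (standard
POSIX API; the wrapper name is mine):

--
#include <pthread.h>

void
s_init_lock(pthread_mutex_t *lock)    /* mutex must live in shared memory */
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
--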

2. pgbench was initialized with scale 10.
   SUN WS 10 (512Mb), Solaris 2.6 (I'm unable to test on E4500 -:()
   -B 16384, wal_files 8, wal_buffers 256,
   checkpoint_segments 64, checkpoint_timeout 3600
   50 clients x 100 transactions
   (after initialization the DB dir was saved, and before each test
   copied back and vacuum-ed).

3. No difference.
   The mutex version is maybe 0.5-1% faster (eg: 37.264238 tps vs 37.083339 tps).

So - no gain, but no performance loss from using the pthread library
(I've also run tests with 1 client), at least on Solaris.

And so - it looks like we can use POSIX mutexes and condition variables
(not semaphores; man pthread_cond_wait) and should implement a light lmgr,
probably with priority locking.

Vadim




RE: [HACKERS] Performance TODO items

2001-07-30 Thread Mikheev, Vadim

 * Order duplicate index entries by tid
  
  In other words - add the tid to the index key: a very old idea.
 
 I was thinking during index creation, it would be nice to
 order them by tid, but not do lots of work to keep it that way.

I hear this "not do lots of work" so often from you -:)
The days of simplicity are gone, Bruce. To continue, this project
requires more and more complex solutions.

 * Add queue of backends waiting for spinlock
  
  We shouldn't mix two different approaches for different
  kinds of short-time internal locks - in some cases we need a
  light lmgr (when we're going to keep a lock long enough, eg for IO),
  and in other cases we'd do better with POSIX mutexes
  or semaphores instead of spinlocks. Queueing backends waiting
  for a spinlock sounds like nonsense - how are you going to protect
  such a queue? With spinlocks? -:)
 
 Yes, I guess so but hopefully we can spin waiting for the queue lock
 rather than sleep. We could use POSIX spinlocks/semaphores now but we
 don't because of performance, right?

No - as long as no one has proved with a test that mutexes are bad for
performance...
Funny, such a test would require ~1 day of work.

 Should we be spinning waiting for spinlock on multi-cpu machines?  Is
 that the answer?

What do you mean?

Vadim




RE: [HACKERS] Performance TODO items

2001-07-30 Thread Mikheev, Vadim

   We could use POSIX spinlocks/semaphores now but we
   don't because of performance, right?
  
  No. As long as no one proved with test that mutexes are bad for
  performance...
  Funny, such test would require ~ 1 day of work.
 
 Good question. I know the number of function calls to spinlock stuff
 is huge. Seems real semaphores may be a big win on multi-cpu boxes.

Ok, being tired of endless discussions, I'll try using mutexes instead
of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4-CPU) boxes.

Vadim




RE: [HACKERS] Performance TODO items

2001-07-30 Thread Mikheev, Vadim

 New TODO entries are:
 
   * Order duplicate index entries by tid

In other words - add the tid to the index key: a very old idea.

   * Add queue of backends waiting for spinlock

We shouldn't mix two different approaches for different
kinds of short-time internal locks - in some cases we need a
light lmgr (when we're going to keep a lock long enough, eg for IO),
and in other cases we'd do better with POSIX mutexes
or semaphores instead of spinlocks. Queueing backends waiting
for a spinlock sounds like nonsense - how are you going to protect
such a queue? With spinlocks? -:)

Vadim




RE: OID wraparound (was Re: [HACKERS] pg_depend)

2001-07-19 Thread Mikheev, Vadim

 Yes, nowhere near, and yes.  Sequence objects require disk I/O to
 update; the OID counter essentially lives in shared memory, and can
 be bumped for the price of a spinlock access.

Sequences also cache values (32, afair) - ie one log record is required
per 32 nextval-s. A sequence's data file is updated at checkpoint time,
so there is not much IO. I really think that using sequences for system
table IDs would be good.
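
A sketch of the caching scheme (SEQ_LOG_VALS really is 32 in sequence.c;
the rest is simplified and the helper is hypothetical):

--
#define SEQ_LOG_VALS 32

long
nextval_sketch(SeqTableData *seq)
{
    if (seq->last_value >= seq->logged_upto)
    {
        /* one WAL record covers the next 32 values */
        seq->logged_upto = seq->last_value + SEQ_LOG_VALS;
        log_sequence_record(seq, seq->logged_upto);
    }
    return ++seq->last_value;    /* cheap: no WAL for the next 31 calls */
}
--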

Vadim




RE: OID wraparound (was Re: [HACKERS] pg_depend)

2001-07-18 Thread Mikheev, Vadim

 If you want to make oids optional on user tables,
 we can vote on that.

Let's vote. I've been proposing optional oids for 2-3 years,
so you know how I'll vote -:)

 However, OID's keep our system tables together.

How?! If we want to find the function with oid X we query
pg_proc; if we want to find the table with oid Y we query
pg_class - we always use oids in the context of the class to
which an object belongs. This means that two tuples
from different system tables could have the same oid values
and everything would work perfectly.

There is no magic around OIDs.

Vadim




RE: OID wraparound (was Re: [HACKERS] pg_depend)

2001-07-18 Thread Mikheev, Vadim

 OK, we need to vote on whether Oid's are optional,
 and whether we can have them not created by default.

Optional OIDs:  YES
No OIDs by default: YES

   However, OID's keep our system tables together.
  
  How?! If we want to find function with oid X we query
  pg_proc, if we want to find table with oid Y we query
  pg_class - we always use oids in context of class
  to what an object belongs. This means that two tuples
  from different system tables could have same oid values
  and everything would work perfectly.
 
 I meant we use them in many cases to link entries, and in
 pg_description for descriptions and lots of other things
 that may use them in the future for system table use.

So, add the class's ID (the unique id from pg_class) when linking.

Vadim




RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup

2001-07-12 Thread Mikheev, Vadim

  You forget about Tx Old! The point is that changes made by
  Tx Old *over* Tx New' changes effectively make those Tx New'
  changes *visible* to Tx S!
 
 Yes, but what's that got to do with the order of operations in
 GetSnapshotData?  The scenario you describe can occur anyway.

Try to describe it step by step.

 Only if Tx Old is running in Read Committed mode, of course.
 But if it is, then it's capable of deciding to update a row updated
 by Tx New.  Whether Tx S's xmax value is before or after Tx New's ID
 is not going to change the behavior of Tx Old.

1.  I consider the particular case when Tx S' xmax is before Tx New's ID.
1.1 For this case, acquiring the SInval lock before ReadNewTransactionId()
changes the behavior of Tx Old: it postpones the change of Tx Old's
(and Tx New's) MyProc->xid in xact.c:CommitTransaction(), so Tx S
will see Tx Old as running, ie Tx Old's changes will be invisible
to Tx S on the basis of analyzing the MyProc->xid-s, just as Tx New's
changes will be invisible on the basis of analyzing the next Tx ID.
2.  If you can find examples where the current code is not able to provide
a consistent snapshot of running (out of interest) transactions,
let's think about how to fix the code. Until then my example shows why
we cannot move the SInval lock request after ReadNewTransactionId().

 I'd still like to change GetSnapshotData to read the nextXid before
 it acquires SInvalLock, though.  If we did that, it'd be safe to make
 GetNewTransactionId be
 
   SpinAcquire(XidGenLockId);
   xid = nextXid++;
   SpinAcquire(SInvalLockId);
   MyProc->xid = xid;
   SpinRelease(SInvalLockId);
   SpinRelease(XidGenLockId);
 
 which is really necessary if you want to avoid assuming that
 TransactionIds can be fetched and stored atomically.

To avoid that assumption one would have to add a per-MyProc spinlock.

Vadim




RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup

2001-07-12 Thread Mikheev, Vadim

 Oh, now I get it: the point is to prevent Tx Old from exiting the set
 of still running xacts as seen by Tx S.  Okay, it makes sense.
 I'll try to add some documentation to explain it.

TIA! I've had no time since '99 -:)

 Given this, I'm wondering why we bother with having a separate
 XidGenLock spinlock at all.  Why not eliminate it and use SInval
 spinlock to lock GetNewTransactionId and ReadNewTransactionId?

Reading all the MyProc entries in GetSnapshotData may take a long time -
why disallow new Tx-s from starting meanwhile?

 What did you think about reordering the vacuum qual tests and
 AbortTransaction sequence?

Sorry, no time at the moment.

 BTW, I'm starting to think that it would be really nice if we could
 replace our spinlocks with not just a semaphore, but something that has
 a notion of shared and exclusive lock requests.  For example,
 if GetSnapshotData could use a shared lock on SInvalLock, it'd
 improve concurrency.

Yes, we've already talked about a light lock manager (no deadlock detection etc).

Vadim




RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup

2001-07-12 Thread Mikheev, Vadim

  Isn't spinlock just a few ASM instructions?... on most platforms...
 
 If we change over to something that supports read vs write locking,
 it's probably going to be rather more than that ... right now, I'm
 pretty dissatisfied with the performance of our spinlocks under load.

We shouldn't use light locks everywhere. Updating/reading MyProc->xid
is a very good place to use simple spinlocks... or, even better, mutexes.

Vadim




RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup

2001-07-12 Thread Mikheev, Vadim

  Given this, I'm wondering why we bother with having a separate
  XidGenLock spinlock at all.  Why not eliminate it and use SInval
  spinlock to lock GetNewTransactionId and ReadNewTransactionId?
 
  Reading all the MyProc entries in GetSnapshotData may take a long
  time - why disallow new Tx-s from starting meanwhile?
 
 Because we need to synchronize?  It bothers me that we're assuming
 that fetching/storing XIDs is atomic.  There's no possibility at all
 of going to 8-byte XIDs as long as the code is like this.
 
 I doubt that a spinlock per PROC structure would be a better answer,
 either; the overhead of getting and releasing each lock would be
 nontrivial, considering the small number of instructions spent at
 each PROC in these routines.

Isn't a spinlock just a few ASM instructions... on most platforms?

Vadim




RE: [HACKERS] Rule recompilation

2001-07-12 Thread Mikheev, Vadim

 Anyway, what's the preferred syntax for triggering the rule
 recompilation?  I thought about
 
 ALTER RULE {rulename|ALL} RECOMPILE;
 
 Where ALL triggers only those rules where the user actually
 has RULE access right on a relation.

In a good world, rules (PL functions etc.) would be automatically
marked as dirty (i.e. recompilation required) whenever the objects
they reference are changed.

Vadim




[HACKERS] RE: Strangeness in xid allocation / snapshot setup

2001-07-11 Thread Mikheev, Vadim

 I am trying to understand why GetSnapshotData() needs to acquire the
 SInval spinlock before it calls ReadNewTransactionId, rather than after.
 I see that you made it do so --- in the commit at

http://www.ca.postgresql.org/cgi/cvsweb.cgi/pgsql/src/backend/storage/ipc/shmem.c.diff?r1=1.41&r2=1.42
 but I don't understand why the loss of concurrency is necessary.
 Since we are going to treat all xids = xmax as in-progress anyway,
 what's wrong with reading xmax before we acquire the SInval lock?

AFAIR, I made it so to prevent the following:

1. Tx Old is running.
2. Tx S reads the new transaction ID in GetSnapshotData() and is swapped
   away before SInval is acquired.
3. Tx New gets a new transaction ID, makes changes and commits.
4. Tx Old changes some row R changed by Tx New and commits.
5. Tx S gets snapshot data and now sees R changed by *both* Tx Old and
   Tx New *but* does not see *other* changes made by Tx New =>
   Tx S reads inconsistent data.
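
In other words, the ordering that prevents this race is (a simplified
sketch of the idea, not the literal code):

    /* GetSnapshotData: lock out commits *before* reading nextXid */
    SpinAcquire(SInvalLockId);            /* no xact can commit from here on */
    snapshot_xmax = ReadNewTransactionId();
    /* scan the PROC array: every xid < snapshot_xmax is either seen
     * here as running, or has already committed/aborted -- there is
     * no window for a commit between the two steps */
    SpinRelease(SInvalLockId);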

-

As for the issue below - I don't remember why I decided that
it's not important; I'll need some time to remember.

 Also, it seems to me that in GetNewTransactionId(), it's important
 for MyProc->xid to be set before releasing XidGenLockId, not after.
 Otherwise there is a narrow window for failure:
 
 1. Process A calls GetNewTransactionId.  It allocates an xid of, say,
 1234, and increments nextXid to 1235.  Just after releasing the
 XidGenLock spinlock, but before it can set its MyProc->xid, control
 swaps away from it.
 
 2. Process B gets to run.  It runs GetSnapshotData.  It sees nextXid =
 1235, and it does not see xid = 1234 in any backend's proc->xid.
 Therefore, B will assume xid 1234 has already terminated, when it
 hasn't.
 
 Isn't this broken?  The problem would be avoided if GetNewTransactionId
 sets MyProc->xid before releasing the spinlock, since then after
 GetSnapshotData has called ReadNewTransactionId, we know that all older
 XIDs that are still active are recorded in proc structures.
 
 Comments?
 
   regards, tom lane
 




RE: [HACKERS] Re: Buffer access rules, and a probable bug

2001-07-03 Thread Mikheev, Vadim

 On further thought, btbuild is not that badly broken at the moment,
 because CREATE INDEX acquires ShareLock on the relation, so
 there can be no concurrent writers at the page level. Still, it
 seems like it'd be a good idea to do LockBuffer(buffer,
BUFFER_LOCK_SHARE)
 here, and probably also to invoke HeapTupleSatisfiesNow() via the
 HeapTupleSatisfies() macro so that infomask update is checked for.
 Vadim, what do you think?

Looks like there is no drawback to locking the buffer, so let's lock it.

Vadim




RE: [HACKERS] RE: [BUGS] Update is not atomic

2001-06-21 Thread Mikheev, Vadim

  Incrementing the command counter is not enough - dirty reads are required
  to handle concurrent PK updates.
 
 What's that with you and dirty reads? Every so often you tell
 me that something would require them - you really like to
 read dirty things - no? :-)

Dirty things occur - I like to handle them -:)
All the MVCC stuff is just the ability to handle dirty data, unlike the
old locking behaviour, where a transaction closed the doors to a table
while doing its dirty things. Welcome to the open world, but be ready to
handle dirty things -:)

 So let me get it straight: I execute the entire UPDATE SET
 A=A+1, then increment the command counter and don't see my
 own results? So an index scan with heap tuple check will
 return OLD (+NEW?) rows? Last time I fiddled around with
 Postgres it didn't, but I could be wrong.

How are you going to see concurrent PK updates without dirty reads?
If two transactions insert the same PK and perform the duplicate check at
the same time - how will they see the duplicates if no one has committed
yet? Look - there is a very good example of using dirty reads in the
current system: unique indices, from where we started this thread. So, how
does the unique btree handle concurrent (and its own!) duplicates? Btree
calls heap_fetch with SnapshotDirty to see valid and *going to be valid*
tuples with a duplicate key. If VALID -- ABORT; if UNCOMMITTED (going to
be valid) -- wait for the concurrent transaction's commit/abort (note that
for obvious reasons heap_fetch(SnapshotDirty) doesn't return OLD rows
modified by the current transaction). I had to add all this SnapshotDirty
stuff to get the unique btree working right with MVCC. All I propose now
is to add the ability to perform dirty scans to SPI (and so to PL/*), to
be able to make the right decisions in SPI functions and triggers, and
make those decisions *at the right time*, unlike the unique btree, which
makes its decision too soon. Is it clear now how to use dirty reads for
PK *and* FK?
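
Roughly, the unique-btree check works like this (a simplified sketch;
heap_fetch_dirty and the result codes are hypothetical stand-ins for
heap_fetch under SnapshotDirty):

    /* hypothetical result codes and helper standing in for
     * heap_fetch(..., SnapshotDirty, ...) */
    typedef enum { TUPLE_DEAD, TUPLE_VALID, TUPLE_UNCOMMITTED } DirtyState;
    extern DirtyState heap_fetch_dirty(void *rel, void *tid, int *xwait);
    extern void xact_wait(int xid);          /* assumed: sleep until that
                                              * xact commits or aborts */
    extern void duplicate_key_error(void);   /* stand-in for elog(ERROR...) */

    /* for each index entry with an equal key: */
    void check_unique(void *rel, void *tid)
    {
        int xwait;
        switch (heap_fetch_dirty(rel, tid, &xwait))
        {
            case TUPLE_VALID:           /* committed duplicate */
                duplicate_key_error();
                break;
            case TUPLE_UNCOMMITTED:     /* "going to be valid" duplicate */
                xact_wait(xwait);       /* then recheck the key */
                break;
            case TUPLE_DEAD:            /* dead, or our own old version */
                break;                  /* not a conflict, skip it */
        }
    }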

You proposed using shared *row* locks for FK before. I objected then and
object now. It will not work for PK because PK rows do not exist
for concurrent transactions. What would work here is *key* locks (locks
placed on some key in a table, no matter whether a row with this key exists
or not). This is what good locking systems, like Informix, use. But
PG is not a locking system; there's no reason to add key-lock overhead,
because PG internals are able to handle dirty data and we just need to add
the same abilities to the externals.

Vadim
 




RE: [HACKERS] Good name for new lock type for VACUUM?

2001-06-21 Thread Mikheev, Vadim

 Any better ideas out there?

Names were always hard for me -:)

 Where did the existing lock type names
 come from, anyway?  (Not SQL92 or SQL99, for sure.)

Oracle. Except for Access Exclusive/Share Locks.

Vadim




[HACKERS] RE: [BUGS] Update is not atomic

2001-06-20 Thread Mikheev, Vadim

 Problem can be demonstrated by following example
 
 create table a (a numeric primary key);
 insert into a values (1);
 insert into a values (2);
 insert into a values (3);
 insert into a values (4);
 update a set a=a+1 where a>2;
 ERROR:  Cannot insert a duplicate key into unique index a_pkey

We use a unique index for UK/PK but shouldn't. Jan?

Vadim




[HACKERS] RE: [BUGS] Update is not atomic

2001-06-20 Thread Mikheev, Vadim

  update a set a=a+1 where a>2;
  ERROR:  Cannot insert a duplicate key into unique index a_pkey
 
 This is a known problem with unique constraints, but it's not
 easy to fix it.

Yes, it requires dirty reads.

Vadim




RE: [HACKERS] RE: [BUGS] Update is not atomic

2001-06-20 Thread Mikheev, Vadim

   update a set a=a+1 where a>2;
   ERROR:  Cannot insert a duplicate key into unique index a_pkey
 
  We use uniq index for UK/PK but shouldn't. Jan?
 
 What else can you use than an index? A deferred until
 statement end trigger checking for duplicates? Think it'd
 have a real bad performance impact.

AFAIR, the standard requires deferred (until statement/transaction(?)
end) as the default behaviour for RI (all?) constraints. But no matter
what the default is, deferred *must* be available => unique indices
must not be used.

 Whatever the execution order might be, the update of '3' to
 '4' will see the other '4' as existent WRT the scan commandId
 and given snapshot - right? If we at the time we now fire up
 the ERROR add the key, the index and heap to a list of
 possible dupkeys, that we'll check at the end of the actual
 command, the above would work. The check at statement end
 would have to increment the commandcounter and for each entry
 do an index scan with the key, counting the number of found,
 valid heap tuples.

Incrementing the command counter is not enough - dirty reads are required
to handle concurrent PK updates.

 Well, with some million rows, doing a set a = a + 1 could
 run out of memory. So this would be something that'd work in
 the sandbox and for non-broken applications (tm). Maybe at

How is this different from the (deferred) updates of a million FKs we allow
right now? Let the user decide what behaviour (deferred/immediate) he
needs. The point is that right now the user has no ability to choose
what's right for him.

 some level (when we escalate the lock to a full table lock?)
 we simply forget about single keys, but have a new index
 access function that checks the entire index for uniqueness.

I wouldn't bother to implement this. The user always has the ability to
exclusively lock the table, drop the constraints, update whatever he wants
and recreate the constraints again.

Vadim




RE: AW: [HACKERS] Postgres Replication

2001-06-12 Thread Mikheev, Vadim

 Here are some disadvantages to using a trigger based approach:
 
 1) Triggers simply transfer individual data items when they 
 are modified, they do not keep track of transactions.

I don't know about other *async* replication engines but Rserv
keeps track of transactions (if I understood you correctly).
Rserv transfers not individual modified data items but a
*consistent* snapshot of changes, to move the slave database from
one *consistent* state (where all RI constraints are satisfied)
to another *consistent* state.

 4) The activation of triggers in a database cannot be easily
 rolled back or undone.

What do you mean?

Vadim




RE: [HACKERS] Baby girl

2001-06-11 Thread Mikheev, Vadim

 I had a baby girl on Tuesday.  I am working through my backlogged
 emails today.

Congratulations -:)

Vadim




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   So are whole pages stored in rollback segments or just
   the modified data?
  
  This is implementation dependent. Storing whole pages is
  much easier to do, but obviously it's better to store just
  the modified data.
 
 I am not sure it is necessarily better. Seems to be a tradeoff here.
 pros of whole pages:
   a possible merge with physical log (for first
   modification of a page after checkpoint
   there would be no overhead compared to current 
   since it is already written now)

Using WAL as RS data storage is questionable.

   in a clever implementation a page already in the
   rollback segment might satisfy the 
   modification of another row on that page, and 
   thus would not need any additional io.

This would be possible only if there were no commit (same SCN)
between the two modifications.

But aren't we too deep into overwriting-smgr (O-smgr) implementation
details? It's doable. It has advantages in terms of the IO active
transactions must do to follow MVCC. It has a drawback in terms of
required disk space (and, oh yeah, it's not easy to implement -:)).
So, any other opinions about the value of O-smgr?

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   Seems overwrite smgr has mainly advantages in terms of
   speed for operations other than rollback.
  
  ... And rollback is required for < 5% of transactions ...
 
 This obviously depends on application. 

The small number of aborted transactions was used to show the
uselessness of UNDO in terms of space cleanup - that's why
I use the same argument to show the usefulness of O-smgr -:)

 I know people who rollback most of their transactions
 (actually they use it to emulate temp tables when reporting). 

Shouldn't they use TEMP tables? -:)

 OTOH it is possible to do without rolling back at all as
 MySQL folks have shown us ;)

Not with SDB tables which support transactions.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   OTOH it is possible to do without rolling back at all as
   MySQL folks have shown us ;)
  
  Not with SDB tables which support transactions.
 
 My point was that MySQL was used quite a long time without it 
 and still quite many useful applications were produced.

And my point was that it's needless to talk about rollbacks in a
non-transactional system, while in a transactional system one has to
implement rollback somehow.

 BTW, do you know what strategy is used by BSDDB/SDB for 
 rollback/undo ?

AFAIR, they use O-smgr => UNDO is required.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

 Do we want to head for an overwriting storage manager?
 
 Not sure.  
 
 Advantages:  UPDATE has easy space reuse because usually done
 in-place, no index change on UPDATE unless key is changed.
 
 Disadvantages:  Old records have to be stored somewhere for MVCC use. 
 Could limit transaction size.

Really? Why is it assumed that we *must* limit the size of rollback
segments? We can let them grow without bound, as we do now, keeping old
records in data files and letting them eat all of the disk space.

 UNDO disadvantages are:
 
   Limit size of transactions to log storage size.

Don't be kidding - in any system transaction size is limited
by available storage. So we should say that more disk space
is required for UNDO. From my POV, putting up $100 to buy a 30Gb
disk is no big deal, keeping in mind that PGSQL requires
$ZERO to be used.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

   Oracle has MVCC?
   
   With restrictions, yes.
  
  What restrictions? Rollback segments size?
 
 No, that is not the whole story. The problem with their
 rollback segment approach is that they do not guard against
 overwriting a tuple version in the rollback segment.
 They simply recycle each segment in a wrap around manner.
 Thus there could be an open transaction that still wanted to
 see a tuple version that was already overwritten, leading to the
 feared snapshot too old error.
 
 Copying their rollback segment approach is imho the last 
 thing we want to do.

So, they limit the size of rollback segments, and we don't limit
how big our data files may grow if there is some long-running
transaction in serializable mode. We could allow our rollback
segments to grow without limit as well.

  Non-overwriting smgr can eat all disk space...
  
   You didn't know that?  Vadim did ...
  
  Didn't I mention a few times that I was inspired by Oracle? -:)
 
 Looking at what they supply in the feature area is imho good.
 Copying their technical architecture is not so good in general.

Copying is not inspiration -:)

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

  Removing dead records from rollback segments should
  be faster than from datafiles.
 
 Is it for better locality or are they stored in a different way ?

Locality - all dead data would be localized in one place.

 Do you think that there is some fundamental performance advantage
 in making a copy to rollback segment and then deleting it from
 there vs. reusing space in datafiles ?

As shown by WAL, additional writes don't mean worse performance.
As for deleting from the RS (rollback segment) - we could remove or reuse
RS files as a whole.

   How does it do MVCC with an overwriting storage manager ?
  
  1. System Change Number (SCN) is used: system increments it
 on each transaction commit.
  2. When scan meets data block with SCN > SCN as it was when
 query/transaction started, old block image is restored
 using rollback segments.
 
 You mean it is restored in session that is running the transaction ?
 
 I guess that it could be slower than our current way of doing it.

Yes, for older transactions which *really* need *particular*
old data, but not for newer ones. Look - right now transactions have to
read dead data again and again, even if some of them (the newer ones) need
not see those data at all, and we keep dead data as long as required for
other old transactions *just in case* they will look there.
But who knows?! Maybe those old transactions will not read from the table
with the big amount of dead data at all! So - why keep dead data in data
files for a long time? This obviously affects overall system performance.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

  Impractical ? Oracle does it.
 
 Oracle has MVCC?
 
 With restrictions, yes.

What restrictions? Rollback segment size?
Non-overwriting smgr can eat all disk space...

 You didn't know that?  Vadim did ...

Didn't I mention a few times that I was
inspired by Oracle? -:)

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

 If PostgreSQL wants to stay MVCC, then we should imho forget
 overwriting smgr very fast.
 
 Let me try to list the pros and cons that I can think of:
 Pro:
   no index modification if key stays same
   no search for free space for update (if tuple still
 fits into page)
   no pg_log
 Con:
   additional IO to write before image to rollback segment
   (every before image, not only first after checkpoint)
   (also before image of every index page that is updated !)

I don't think that Oracle writes the entire page as a before-image - just
tuple data and some control info. As for additional IO - we'll do it
anyway to remove the before-image (deleted tuple data) from data files.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim
 I think so too. I've never said that an overwriting smgr
 is easy, and I don't love it particularly.
 
 What I'm objecting to is avoiding UNDO without giving up
 an overwriting smgr. We shouldn't be noncommittal now. 

Why not? We could decide to do an overwriting smgr later
and implement UNDO then. For the moment we could just
change the checkpointer to use checkpoint.redo instead of
checkpoint.undo when deciding what log files should be
deleted - it's a few minutes' work, and so is changing it
back.

Vadim



RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

  - A simple typo in psql can currently cause a forced 
  rollback of the entire TX. UNDO should avoid this.
 
 Yes, I forgot to mention this very big advantage, but undo is
 not the only possible way to implement savepoints. Solutions
 using CommandCounter have been discussed.

This would be hell.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

   Oracle has MVCC?
  
   With restrictions, yes.
  
  What restrictions? Rollback segments size?
  Non-overwriting smgr can eat all disk space...
 
 Isn't the same true for an overwriting smgr? ;)

Removing dead records from rollback segments should
be faster than from datafiles.

   You didn't know that?  Vadim did ...
  
  Didn't I mention a few times that I was
  inspired by Oracle? -:)
 
 How does it do MVCC with an overwriting storage manager ?

1. System Change Number (SCN) is used: system increments it
   on each transaction commit.
2. When scan meets data block with SCN > SCN as it was when
   query/transaction started, old block image is restored
   using rollback segments.
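
As a sketch, the read-side check could look like this (hypothetical
names; this describes Oracle's mechanism, not our code):

    typedef unsigned long SCN;

    /* assumed helper: fetch the next older image of a block from the
     * rollback segment, updating its SCN as it walks back */
    extern char *rollback_prior_image(char *block, SCN *block_scn);

    char *block_for_snapshot(char *block, SCN block_scn, SCN snap_scn)
    {
        /* walk back until the image is old enough for our snapshot;
         * if the RS slot was already recycled we get "snapshot too old" */
        while (block_scn > snap_scn)
            block = rollback_prior_image(block, &block_scn);
        return block;
    }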

 Could it possibly be a Postgres-inspired bolted-on hack 
 needed for better concurrency ?

-:)) Oracle has had MVCC for years, probably from the beginning
and for sure before Postgres.

 BTW, are you aware how Interbase does its MVCC - is it more 
 like Oracle's way or like PostgreSQL's ?

Like ours.

Vadim




RE: [HACKERS] ? potential bug in LockBuffer ?

2001-05-22 Thread Mikheev, Vadim

 (buf->r_locks)--;
 if (!buf->r_locks)
 *buflock &= ~BL_R_LOCK;
 
 
 Or I am missing something...

buflock is a per-backend flag; it's not in shmem. A backend is
allowed only a single lock per buffer.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Mikheev, Vadim

  And, I cannot say that I would implement UNDO because of
  1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
  but because of ALL of 1., 2., 4.
 
 OK, I understand your reasoning here, but I want to make a comment.
 
 Looking at the previous features you added, like subqueries, MVCC, or
 WAL, these were major features that greatly enhanced the system's
 capabilities.
 
 Now, looking at UNDO, I just don't see it in the same league as those
 other additions.  Of course, you can work on whatever you want, but I
 was hoping to see another major feature addition for 7.2.  We know we
 badly need auto-vacuum, improved replication, and point-in-time recover.

I don't like the auto-vacuum approach in the long term; WAL-based BAR is
too easy to do -:) (and you know that there is a man who will do it,
probably); bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
I'll have enough time for another major feature in 7.2 anyway.

 It would be better to put work into one mechanism that would
 reuse all tuples.

This is what we're discussing now -:)
If the community doesn't like UNDO then I'll probably try to implement a
dead space collector which will read log files and so on. It's easy to
#ifdef it in 7.2 for use in 7.3 (or so) with an on-disk FSM. Also, I have
to implement logging for non-btree indices (required anyway for UNDO,
WAL-based BAR, and WAL-based space reuse).

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  We could keep the shared buffer lock (or add some other kind of lock)
  until the tuple is projected - after projection we need not read data
  for the fetched tuple from the shared buffer, and the time between
  fetching the tuple and projection is very short, so keeping the lock
  on the buffer will not impact concurrency significantly.
 
 Or drop the pin on the buffer to show we no longer have a pointer to it.

This is not good for seqscans which will return to that buffer anyway.

  Or we could register a callback cleanup function with the buffer so
  bufmgr would call it when the refcnt drops to 0.
 
 Hmm ... might work.  There's no guarantee that the refcnt
 would drop to zero before the current backend exits, however.
 Perhaps set a flag in the shared buffer header, and the last guy
 to drop his pin is supposed to do the cleanup?

This is what I meant - set (register) a pointer in the buffer header
to a cleanup function.
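
A minimal sketch of that registration idea (hypothetical fields, locking
omitted; not the real buffer header):

    typedef struct BufferDesc
    {
        int   refcount;
        void (*cleanup)(struct BufferDesc *buf);    /* set by vacuum */
    } BufferDesc;

    void release_buffer(BufferDesc *buf)
    {
        if (--buf->refcount == 0 && buf->cleanup != NULL)
        {
            /* the last unpinner performs the deferred page compaction */
            void (*fn)(struct BufferDesc *) = buf->cleanup;
            buf->cleanup = NULL;
            fn(buf);
        }
    }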

 But then you'd be pushing VACUUM's work into productive transactions,
 which is probably not the way to go.

Not much work - I wouldn't worry about it.

  Two ways: hold the index page lock until the heap tuple is checked,
  or (rough schema) store info in shmem (just IndexTupleData.t_tid
  and a flag) that an index tuple is used by some scan, so the cleaner
  could change the stored TID (get one from the previous index tuple) and
  set the flag to help the scan restore its current position on return.
 
 Another way is to mark the index tuple gone but not forgotten, so to
 speak --- mark it dead without removing it. (We could know that we need
 to do that if we see someone else has a buffer pin on the index page.)

Register a cleanup function, just like with the heap above.

 None of these seem real clean though.  Needs more thought.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  We could keep the shared buffer lock (or add some other kind of lock)
  until the tuple is projected - after projection we need not read data
  for the fetched tuple from the shared buffer, and the time between
  fetching the tuple and projection is very short, so keeping the lock
  on the buffer will not impact concurrency significantly.
 
 Or drop the pin on the buffer to show we no longer have a pointer
 to it. I'm not sure that the time to do projection is short though
 --- what if there are arbitrary user-defined functions in the quals
 or the projection targetlist?

Well, while we are on this subject, I finally should mention an issue
that has bothered me for a long time: only simple functions should be
allowed to deal with data in shared buffers directly. Simple means: no
SQL queries there. Why? One reason: we hold a shared lock on the buffer
while evaluating a seqscan qual - what if the qual's SQL queries try to
acquire an exclusive lock on the same buffer? Another reason -
concurrency. I think that such heavy functions should be provided with a
copy of the data.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  It probably will not cause more IO than vacuum does right now.
  But unfortunately it will not reduce that IO.
 
 Uh ... what?  Certainly it will reduce the total cost of vacuum,
 because it won't bother to try to move tuples to fill holes.

Oh, you're right here, but the daemon will most likely read data files
again and again with an in-memory FSM. Also, if we do partial table
scans then we'll probably re-read indices > 1 time.

 The index cleanup method I've proposed should be substantially
 more efficient than the existing code, as well.

Not in the IO area.

  My point is that we'll need dynamic cleanup anyway, and UNDO is
  what should be implemented for dynamic cleanup of aborted changes.
 
 UNDO might offer some other benefits, but I doubt that it will allow
 us to eliminate VACUUM completely.  To do that, you would need to

I never said that -:)

 keep track of free space using exact, persistent (on-disk) bookkeeping
 data structures.  The overhead of that will be very substantial: more,
 I predict, than the approximate approach I proposed.

I doubt that the big guys use an in-memory FSM. If they were able to do this...

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 I hope we can avoid on-disk FSM.  Seems to me that that would create
 problems both for performance (lots of extra disk I/O) and reliability
 (what happens if FSM is corrupted?  A restart won't fix it).

We can use WAL for FSM.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  Really?! Once again: WAL records give you the *physical*
  address of tuples (both heap and index ones!) to be
  removed, and the size of the log to read records from is not
  comparable with the size of the data files.
 
 So how about a background vacuum-like process that reads
 the WAL and does the cleanup? Seems that would be great,
 since it then does not need to scan, and does not make
 foreground cleanup necessary.
 
 Problem is when cleanup cannot keep up with cleaning WAL
 files that already want to be removed. I would envision a
 config that says how many Mb of WAL are allowed to queue
 up before clients are blocked.

Yes, some daemon could read the logs and gather cleanup info.
We could activate it when switching to a new log file segment,
and synchronization with the checkpointer is not a big deal. That
daemon would also archive log files for WAL-based BAR,
if archiving is ON.

But this will be useful only with on-disk FSM.
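
A rough sketch of one pass of such a daemon (hypothetical record layout
and helpers, just to show the flow):

    typedef struct WalRec
    {
        unsigned rel;        /* which relation */
        unsigned block;      /* physical page number */
        int      kind;       /* delete, update, ... */
    } WalRec;

    extern int  next_wal_record(WalRec *rec);        /* 0 at segment end */
    extern void remember_dead(unsigned rel, unsigned block);
    extern void archive_segment(const char *path);   /* for WAL-based BAR */

    void cleanup_pass(const char *segment_path)
    {
        WalRec rec;

        /* harvest physical addresses of dead tuples; no table scan needed */
        while (next_wal_record(&rec))
            if (rec.kind == 1 /* delete/update */)
                remember_dead(rec.rel, rec.block);

        archive_segment(segment_path);  /* only after cleanup info is saved */
    }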

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  My point is that we'll need dynamic cleanup anyway, and UNDO is
  what should be implemented for dynamic cleanup of aborted changes.
 
 I do not yet understand why you want to handle aborts differently than
 outdated tuples.

Maybe because aborted tuples have a shorter time-to-live,
and the probability of finding their pages in the buffer pool is higher.

 The ratio in a well tuned system should well favor outdated tuples.
 If someone ever adds dirty read it is also not the case that it
 is guaranteed, that nobody accesses the tuple you currently want
 to undo. So I really miss to see the big difference.

It will not be guaranteed anyway as soon as we start removing
tuples without exclusive access to the relation.

And, I cannot say that I would implement UNDO because of
1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
but because of ALL of 1., 2., 4.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 From: Mikheev, Vadim 
 Sent: Monday, May 21, 2001 10:23 AM
 To: 'Jan Wieck'; Tom Lane
 Cc: The Hermit Hacker; 'Bruce Momjian';
 [EMAIL PROTECTED]

Strange address, Jan?

 Subject: RE: [HACKERS] Plans for solving the VACUUM problem
 
 
  I think the in-shared-mem FSM could have some max-per-table
  limit and the background VACUUM just skips the entire table
  as long as nobody reused any space. Also it might only
  compact pages that lead to 25 or more percent of freespace in
  the first place. That makes it more likely that if someone
  looks for a place to store a tuple that it'll fit into that
  block (remember that the toaster tries to keep main tuples
  below BLKSZ/4).
 
 This should be a configurable parameter like PCFREE (or something
 like that) in Oracle: consider a page for insertion only if it's
 at least PCFREE % empty.
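
A sketch of such a check (hypothetical function name; PageGetFreeSpace is
assumed here to return the free bytes on a page):

    #define BLCKSZ 8192

    extern int PageGetFreeSpace(void *page);    /* assumed */

    int page_accepts_inserts(void *page, int pcfree)
    {
        /* consider the page for new tuples only while at least
         * pcfree percent of it is still empty */
        return PageGetFreeSpace(page) >= (BLCKSZ * pcfree) / 100;
    }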
 
 Vadim
 




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 Correct me if I am wrong, but both cases do present a problem
 currently in 7.1.  The WAL log will not remove any WAL files
 for transactions that are still open (even after a checkpoint
 occurs). Thus if you do a bulk insert of gigabyte size you will
 require a gigabyte sized WAL directory. Also if you have a simple
 OLTP transaction that the user started and walked away from for
 his one week vacation, then no WAL log files can be deleted until
 that user returns from his vacation and ends his transaction.

Todo:

1. Compact log files after checkpoint (save records of uncommitted
   transactions and remove/archive others).
2. Abort long running transactions.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Mikheev, Vadim

 I have been thinking about the problem of VACUUM and how we 
 might fix it for 7.2.  Vadim has suggested that we should
 attack this by implementing an overwriting storage manager
 and transaction UNDO, but I'm not totally comfortable with
 that approach: it seems to me that it's an awfully large
 change in the way Postgres works. 

I'm not sure we should implement an overwriting smgr at all.
I was (and am) going to solve the space-reusing problem with the
non-overwriting one, though I'm sure that we have to reimplement it
(> 1 table per data file, FSM stored on disk, etc).

 Second: if VACUUM can run in the background, then there's no
 reason not to run it fairly frequently. In fact, it could become
 an automatically scheduled activity like CHECKPOINT is now,
 or perhaps even a continuously running daemon (which was the
 original conception of it at Berkeley, BTW).

And the original authors concluded that the daemon was very slow in
reclaiming dead space, BTW.

 3. Lazy VACUUM processes a table in five stages:
A. Scan relation looking for dead tuples;...
B. Remove index entries for the dead tuples...
C. Physically delete dead tuples and compact free space...
D. Truncate any completely-empty pages at relation's end.  
E. Create/update FSM entry for the table.
...
 If a tuple is dead, we care not whether its index entries are still
 around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove a dead heap tuple until
we know that there are no index tuples referencing it, and your A, B, C
reflect this, so ..?

 Another place where lazy VACUUM may be unable to do its job completely
 is in compaction of space on individual disk pages.  It can physically
 move tuples to perform compaction only if there are not currently any
 other backends with pointers into that page (which can be tested by
 looking to see if the buffer reference count is one).  Again, we punt
 and leave the space to be compacted next time if we can't do it right
 away.

We could keep the shared buffer lock (or add some other kind of lock)
until the tuple is projected - after projection we need not read data
for the fetched tuple from the shared buffer, and the time between
fetching the tuple and projection is very short, so keeping the lock on
the buffer will not impact concurrency significantly.

Or we could register a callback cleanup function with the buffer so bufmgr
would call it when the refcnt drops to 0.

 Presently, VACUUM deletes index tuples by doing a standard index
 scan and checking each returned index tuple to see if it points
 at any of the tuples to be deleted. If so, the index AM is called
 back to delete the tested index tuple. This is horribly inefficient:
...
 This is mainly a problem of a poorly chosen API. The index AMs
 should offer a bulk delete call, which is passed a sorted array
 of main-table TIDs. The loop over the index tuples should happen
 internally to the index AM.
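
One possible shape for such a bulk-delete helper (a hypothetical sketch,
not an actual index AM interface):

    #include <stdlib.h>

    typedef struct ItemPointerData
    {
        unsigned block;
        unsigned offset;
    } ItemPointerData;

    static int tid_cmp(const void *a, const void *b)
    {
        const ItemPointerData *x = a, *y = b;
        if (x->block != y->block)
            return (x->block < y->block) ? -1 : 1;
        return (x->offset < y->offset) ? -1 : (x->offset > y->offset);
    }

    /* inside the AM's single pass over its leaf pages: */
    int tid_is_dead(ItemPointerData tid,
                    ItemPointerData *dead, int ndead)  /* dead is sorted */
    {
        return bsearch(&tid, dead, ndead,
                       sizeof(ItemPointerData), tid_cmp) != NULL;
    }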

I agree with others who think that the main problem of index cleanup
is reading all the index data pages to remove some index tuples. You said
yourself about partial heap scanning - so for each scanned part of the
table you'll have to read all index pages again and again - a very good
way to trash the buffer pool with big indices.

Well, probably it's OK for a first implementation, and you'll win some CPU
with bulk delete - I'm not sure how much, though, and there is a more
significant issue with index cleanup if the table is not locked
exclusively: a concurrent index scan returns a tuple (and unlocks the
index page), heap_fetch reads the table row and finds that it's dead; now
the index scan *must* find the current index tuple to continue, but a
background vacuum could already have removed that index tuple =>
elog(FATAL, "_bt_restscan: my bits moved...");

Two ways: hold the index page lock until the heap tuple is checked, or
(rough schema) store info in shmem (just IndexTupleData.t_tid and a flag)
that an index tuple is used by some scan, so the cleaner could change the
stored TID (get one from the previous index tuple) and set the flag to
help the scan restore its current position on return.

I'm particularly interested in discussing this issue because it must be
resolved for UNDO, and the chosen way will affect to what extent we'll be
able to implement dirty reads (the first way doesn't allow implementing
them in full - i.e. selects with joins - but is good enough to resolve
the RI constraints concurrency issue).

 There you have it.  If people like this, I'm prepared to commit to
 making it happen for 7.2.  Comments, objections, better ideas?

Well, my current TODO looks like (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reusing.

and I cannot commit to anything about 3 at this point. So, why not refine
vacuum if you want it. I, personally, was never able to convince myself
to spend time on this.

Vadim




RE: [HACKERS] Re: [SQL] possible row locking bug in 7.0.3 7.1

2001-05-18 Thread Mikheev, Vadim

 I see postgres 7.1.1 is out now.  Was the fix for this
 problem included in the new release?

I fear it will be in 7.2 only.

 On Thursday 29 March 2001 20:02, Philip Warner wrote:
  At 19:14 29/03/01 -0800, Mikheev, Vadim wrote:
   Reported problem is caused by a bug (only one tuple version must be
   returned by SELECT) and this is the way to fix it.
  
   I assume this is not possible in 7.1?
  
  Just looked in heapam.c - I can fix it in two hours.
  The question is - should we do this now?
  Comments?
 
  It's a bug; how confident are you of the fix?
 




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Mikheev, Vadim

 Vadim, can you remind me what UNDO is used for?

Ok, last reminder -:))

On transaction abort, read the WAL records and undo (roll back) the
changes made in storage. This would allow us to:

1. Reclaim space allocated by aborted transactions.
2. Implement SAVEPOINTs.
   Just to remind you -:) - in the event of an error discovered by the
   server - duplicate key, deadlock, command mistyping, etc. - the
   transaction will be rolled back to the nearest implicit savepoint set
   just before query execution; or the transaction can be rolled back to
   some explicit savepoint set by the user with a ROLLBACK TO
   savepoint_name command. A transaction rolled back to a savepoint may
   be continued.
3. Reuse transaction IDs on postmaster restart.
4. Split pg_log into small files with the ability to remove old ones
   (which do not hold statuses for any running transactions).
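
With pg_log split that way, finding a transaction's status becomes simple
arithmetic - a sketch with hypothetical segment sizes (2 status bits per
transaction, as today):

    #define SEGMENT_BYTES  (256 * 1024)
    #define XACTS_PER_SEG  (SEGMENT_BYTES * 4)    /* 4 xacts per byte */

    void locate_xact_status(unsigned xid,
                            unsigned *segno, unsigned *byteno, unsigned *shift)
    {
        *segno  = xid / XACTS_PER_SEG;             /* which small file */
        *byteno = (xid % XACTS_PER_SEG) / 4;       /* byte within the file */
        *shift  = (xid % 4) * 2;                   /* 2 bits within the byte */
        /* segments older than the oldest running xact can be removed */
    }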

Vadim




RE: [HACKERS] 7.1 startup recovery failure

2001-04-27 Thread Mikheev, Vadim

  There's a report of startup recovery failure in Japan.
 
  DEBUG:  redo done at (1, 3923880100)
  FATAL 2:  XLogFlush: request is not satisfied
  postmaster: Startup proc 4228 exited with status 512 - abort
 
 Is this person using 7.1 release, or a beta/RC version?  That looks
 just like the last WAL bug Vadim fixed before final ...

No, it doesn't. That bug was related to cases where there is no room
on the last log page for the startup checkpoint. ~5k is free in this case.

Vadim




RE: [HACKERS] Re: 7.1 vacuum

2001-04-27 Thread Mikheev, Vadim

 What's the deal with vacuum lazy in 7.1? I was looking
 forward to it. It was never clear whether or not you guys
 decided to put it in.
 
 If it is in as a feature, how does one use it?
 If it is a patch, how does one get it?
 If it is neither a patch nor an existing feature, has
 development stopped?

I still haven't had time to port it to 7.1 -:(
I'll post a message to -hackers when it's ready.

Vadim




RE: [HACKERS] WAL feature

2001-04-27 Thread Mikheev, Vadim

 WAL was a difficult feature to add to 7.1.  Currently, it is only used
 as a performance benefit, but I expect it will be used in the future to

Not only. Did you forget about btree stability?
Partial disk writes?

 add new features like:
 
   Advanced Replication

I'm for sure not a fan of this.

   Point-in-time recovery
   Row reuse without vacuum

Yes, it will help to remove uncommitted rows.

And don't forget about SAVEPOINTs.

Vadim




RE: [HACKERS] WAL feature

2001-04-27 Thread Mikheev, Vadim

 Yep, WAL collects all database changes into one file.  Easy to see how
 some other host trying to replicate a different host would find the
 WAL contents valuable.

Unfortunately, the slave database(s) would have to be on the same platform
(hardware+OS) to be able to use information about *physical*
changes in data files. Also, this would still be *async* replication.
Maybe faster than rserv, maybe with smaller space requirements (no rserv
log table), but maybe not.

I believe that making efforts to implement (bi-directional) *sync*
replication would be more valuable.

Vadim




RE: [HACKERS] v7.1.1 branched and released on Tuesday ...

2001-04-27 Thread Mikheev, Vadim

 As Tom's mentioned the other day, we're looking at doing up v7.1.1 on
 Tuesday, and starting in on v7.2 ...
 
 Does anyone have any outstanding fixes for v7.1.x that they
 want to see in *before* we do this release? Any points unresolved
 that anyone knows about that we need to look at?

Hiroshi reported a startup problem yesterday - we should fix that
for 7.1.1...

Vadim




RE: AW: [HACKERS] timeout on lock feature

2001-04-18 Thread Mikheev, Vadim

 This is the real reason why I've been holding out for restricting the
 feature to a specific LOCK TABLE statement: if it's designed that way,
 at least you know which lock you are applying the timeout to, and have
 some chance of being able to estimate an appropriate timeout.

As I pointed out before - it's half useless.

And I totally do not understand why one would object to a feature

1. that affects users *only when explicitly requested*;
2. whose implementation costs nothing - i.e. has no drawbacks
   for the overall system.

It was general practice in the project so far: if users want some
feature and it doesn't affect others - let's do it.
What's changed?

Vadim




RE: AW: [HACKERS] timeout on lock feature

2001-04-18 Thread Mikheev, Vadim

 One idea Tom had was to make it only active in a transaction, 
 so you do:
 
   BEGIN WORK;
   SET TIMEOUT TO 10;
   UPDATE tab SET col = 3;
   COMMIT
 
 Tom is concerned people will do the SET and forget to RESET 
 it, causing all queries to be affected by the timeout.

And what? Queries would just be aborted. It's not such a critical event
that we must take responsibility away from the user.

Vadim




RE: AW: [HACKERS] timeout on lock feature

2001-04-17 Thread Mikheev, Vadim

  Added to TODO:
  * Add SET parameter to timeout if waiting for lock too long
 
 I repeat my strong objection to any global (ie, affecting all locks)
 timeout.  Such a "feature" will have unpleasant consequences.

But LOCK TABLE T IN ROW EXCLUSIVE MODE WITH TIMEOUT X will not give the
required results, and not only due to parser/planner locks - what if
UPDATE T has to wait for another transaction's commit/abort (same-row
update)? A lock on a pseudo-table is acquired in this case...

Vadim




RE: AW: AW: [HACKERS] timeout on lock feature

2001-04-17 Thread Mikheev, Vadim

  The timeout will be useful to let the client or user decide
  on an alternate course of action other that killing his
  application (without the need for timers or threads in the
  client program).
 
 This assumes (without evidence) that the client has a good
 idea of what the timeout limit ought to be. I think this "feature"
 has no real use other than encouraging application programmers to
 shoot themselves in the foot. I see no reason that we should make
 it easy to misdesign applications.

AFAIR, the Big Boys have this feature. If its implementation is safe,
i.e. will not affect applications not using it, why not implement it?

Vadim




RE: [HACKERS] Re: TODO list

2001-04-05 Thread Mikheev, Vadim

  So, what could a CRC be used for? To catch disk damage?
  The disk has its own CRC for this.
 
 OK, this was already discussed, maybe while Vadim was absent.  
 Should I re-post the previous text?

Let's return to this discussion *after* the 7.1 release.
My main objection was (and is) - no time to deal with
this issue for 7.1.

Vadim




RE: [HACKERS] Re: TODO list

2001-04-05 Thread Mikheev, Vadim

  Blocks that have recently been written, but failed to make
  it down to the disk platter intact, should be restorable from
  the WAL log.  So we do not need a block-level CRC to guard
  against partial writes.
 
 If a block is missing some sectors in the middle, how would you know
 to reconstruct it from the WAL, without a block CRC telling you that
 the block is corrupt?

On recovery we unconditionally copy the *entire* block content from the
log for each block modified since the last checkpoint. And we do not write
a new checkpoint record (i.e. do not advance the recovery start point)
until we know that all data blocks are flushed to disk (including blocks
modified before the checkpointer started).
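
A sketch of that redo rule (hypothetical record layout and helper, just
to show the idea):

    #include <string.h>

    #define BLCKSZ 8192

    typedef struct XLogRec
    {
        int  has_backup_block;     /* first change since checkpoint? */
        char image[BLCKSZ];        /* full page image, if so */
    } XLogRec;

    extern char *buffer_for_record(XLogRec *rec);   /* assumed lookup */

    void redo_record(XLogRec *rec)
    {
        if (rec->has_backup_block)
        {
            /* overwrite the whole page: a torn/partial disk write
             * cannot hurt us because we never look at the old content */
            memcpy(buffer_for_record(rec), rec->image, BLCKSZ);
        }
        /* else: apply the logical change to an already-intact page */
    }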

Vadim



