Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-03 Thread Dieter Maurer
Jim Fulton wrote at 2008-10-1 13:40 -0400:
 ...
 It may well be that a restart *may* not lead into a fully functional
 state (though this would indicate a storage bug)

A failure in tpc_finish already indicates a storage bug.

Maybe -- although 'file system full' might not be so easy to avoid
in all cases.



-- 
Dieter

Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-03 Thread Dieter Maurer
Christian Theune wrote at 2008-10-3 10:32 +0200:
On Fri, 2008-10-03 at 09:55 +0200, Dieter Maurer wrote:
 Jim Fulton wrote at 2008-10-1 13:40 -0400:
  ...
  It may well be that a restart *may* not lead into a fully functional
  state (though this would indicate a storage bug)
 
 A failure in tpc_finish already indicates a storage bug.
 
 Maybe -- although 'file system full' might not be so easy to avoid
 in all cases.

That should be easy to avoid by allocating the space you need in the
first phase and either releasing it on an abort or writing your
'committed' marker into it in the second phase.

That's true for a FileStorage -- but it may not be that easy for
other storages (e.g. BSDDB storage).



-- 
Dieter


Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-03 Thread Christian Theune
On Fri, 2008-10-03 at 10:51 +0200, Dieter Maurer wrote:
 That's true for a FileStorage -- but it may not be that easy for
 other storages (e.g. BSDDB storage).

Storages that use another system as their backend have to rely on it
providing a two-phase commit API that implements the second-phase
guarantee, don't they?
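
The mapping described here might look roughly like the sketch below. The
backend object and its begin/prepare/commit/abort methods are a
hypothetical two-phase-commit interface standing in for whatever the
underlying database actually provides; this is not BSDDB's real API.

    class BackendDelegatingStorage:
        """Sketch: map ZODB's two-phase commit onto a backend that
        offers its own two-phase commit API (hypothetical interface)."""

        def __init__(self, backend):
            self._backend = backend

        def tpc_begin(self, transaction):
            self._txn = self._backend.begin()

        def store(self, oid, serial, data, version, transaction):
            self._txn.write(oid, data)

        def tpc_vote(self, transaction):
            # First phase: the backend must do everything that can fail
            # here (allocate space, acquire locks, flush its log).
            self._txn.prepare()

        def tpc_finish(self, transaction):
            # Second phase: a prepared backend transaction is expected
            # to commit without failing (short of hardware loss).
            self._txn.commit()

        def tpc_abort(self, transaction):
            self._txn.abort()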

-- 
Christian Theune · [EMAIL PROTECTED]
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 7 · fax +49 345 1229889 1
Zope and Plone consulting and development




Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-03 Thread Jim Fulton

On Oct 3, 2008, at 3:55 AM, Dieter Maurer wrote:

 Jim Fulton wrote at 2008-10-1 13:40 -0400:
 ...
 It may well be that a restart *may* not lead into a fully functional
 state (though this would indicate a storage bug)

 A failure in tpc_finish already indicates a storage bug.

 Maybe -- although file system is full might not be so easy to avoid
 in all cases


This is why FileStorage writes data to the end of the file before
returning from tpc_vote. tpc_finish only updates a single byte of the
file.  This is why FileStorage no longer tries to write its index
during tpc_finish.
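
The pattern described here might look roughly like the sketch below. It
is a minimal illustration of the idea, not FileStorage's actual code:
the real file format and method names differ, and self._file and
self._serialize are placeholders.

    import os

    class SingleByteFinishStorage:
        """Sketch: do all work that can fail in the first phase, so the
        second phase only flips one already-allocated byte."""

        def tpc_vote(self, transaction):
            # First phase: append the full commit record plus a
            # placeholder status byte and force it to disk.  Running
            # out of disk space surfaces here, where aborting is safe.
            self._file.seek(0, os.SEEK_END)
            self._txn_start = self._file.tell()
            self._file.write(self._serialize(transaction))  # data records
            self._status_pos = self._file.tell()
            self._file.write(b"?")                 # "not yet committed"
            self._file.flush()
            os.fsync(self._file.fileno())

        def tpc_finish(self, transaction):
            # Second phase: only the marker byte changes; the space is
            # already allocated, so a full file system cannot break it.
            self._file.seek(self._status_pos)
            self._file.write(b"C")                 # "committed"
            self._file.flush()
            os.fsync(self._file.fileno())

        def tpc_abort(self, transaction):
            # Give the reserved space back if the transaction is dropped.
            self._file.truncate(self._txn_start)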

Jim

--
Jim Fulton
Zope Corporation




Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-01 Thread Dieter Maurer
Jim Fulton wrote at 2008-9-30 18:30 -0400:
 ...
  c. Close the file storage, causing subsequent reads and writes to
 fail.

 Raise an easily recognizable exception.

I raise the original exception.

Sad.

The original exception may have many consequences -- most probably
harmless. The special exception would express that the consequence was
very harmful.

 In our error handling we look out for some nasty exceptions and  
 enforce
 a restart in such cases. The exception above might be such a nasty
 exception.

The critical log entry should be easy enough to spot.

For humans, but I had in mind that software recognizes the exception
automatically and forces a restart.

Or do you have a logger customization in mind that intercepts the
log entry and then forces a restart?

It may not be trivial to get this right (in a way such that
the log entry does appear in the logfile before the restart starts).
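
One way to address that ordering concern is sketched below: a logging
handler that flushes the critical record to disk before triggering the
restart. This assumes the process runs under a supervisor (e.g. zdaemon)
that restarts it when it exits; the handler name and wiring are invented
for the example.

    import logging
    import os

    class RestartOnCriticalHandler(logging.Handler):
        """Sketch: force a process restart when a critical storage error
        is logged, but only after the record has reached the log file."""

        def __init__(self, file_handler):
            # Only react to CRITICAL records.
            logging.Handler.__init__(self, level=logging.CRITICAL)
            self._file_handler = file_handler  # handler owning the logfile

        def emit(self, record):
            # Make sure the critical entry is actually on disk first ...
            self._file_handler.flush()
            try:
                os.fsync(self._file_handler.stream.fileno())
            except (AttributeError, OSError):
                pass
            # ... then exit so the supervisor restarts the process.
            os._exit(1)

The handler would be attached to the storage server's root logger next to
the normal file handler, so the record is written before emit() exits.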

...
 - Have a storage server restart when a tpc_finish call fails.  This
 would work fine for FileStorage, but might be the wrong thing to do
 for another storage.  The server can't know.

 Why do you think that a failing tpc_finish is less critical
 for some other kind of storage?


It's not a question of criticality.  It's a question of whether a  
restart will fix the problem.  I happen to know that a file storage  
would be in a reasonable state after a restart.  I don't know this to  
be the case for some other storage.

But what should an administrator do when this is not the case?
Either a stop or a restart

It may well be that a restart *may* not lead into a fully functional
state (though this would indicate a storage bug), but a definitely
non-working system is not much better than one that may potentially not
be fully functional but usually will be, apart from storage bugs.



-- 
Dieter


Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-10-01 Thread Jim Fulton

On Oct 1, 2008, at 1:21 PM, Dieter Maurer wrote:

 Jim Fulton wrote at 2008-9-30 18:30 -0400:
 ...
 c. Close the file storage, causing subsequent reads and writes to
 fail.

 Raise an easily recognizable exception.

 I raise the original exception.

 Sad.

 The original exception may have many consequences -- most probably
 harmless. The special exception would express that the consequence was
 very harmful.

The fact that it occurs in this place at all indicates this.


 In our error handling we look out for some nasty exceptions and
 enforce
 a restart in such cases. The exception above might be such a nasty
 exception.

 The critical log entry should be easy enough to spot.

 For humans, but I had in mind that software recognizes the exception
 automatically and forces a restart.

I suppose we could define such an exception.  A storage that raises it  
is indicating that it will come back in some sort of consistent state  
after a restart.
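
Such an exception and its caller-side handling might look roughly like
the sketch below. The exception name and the handling function are
invented for the example; only the base class is assumed to be ZODB's
existing POSException.StorageError.

    import sys

    from ZODB.POSException import StorageError

    class StorageRestartRequired(StorageError):
        """Hypothetical: the storage is no longer usable, but promises
        to come back in a consistent state after a process restart."""

    def finish_or_restart(storage, transaction):
        """Caller-side handling: a restart-requiring failure turns into
        a process exit; a supervisor is assumed to restart the server."""
        try:
            storage.tpc_finish(transaction)
        except StorageRestartRequired:
            # The storage vouches that a restart restores consistency;
            # exit and let the supervisor bring the server back up.
            sys.exit(1)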


 Or do you have a logger customization in mind that intercepts the
 log entry and then forces a restart?

No
...

 - Have a storage server restart when a tpc_finish call fails.  This
 would work fine for FileStorage, but might be the wrong thing to do
 for another storage.  The server can't know.

 Why do you think that a failing tpc_finish is less critical
 for some other kind of storage?


 It's not a question of criticality.  It's a question of whether a
 restart will fix the problem.  I happen to know that a file storage
 would be in a reasonable state after a restart.  I don't know this to
 be the case for some other storage.

 But what should an administrator do when this is not the case?
 Either a stop or a restart

Yes

 It may well be that a restart *may* not lead into a fully functional
 state (though this would indicate a storage bug)

A failure in tpc_finish already indicates a storage bug.

 but a definitely non-working system is not much better than one that
 may potentially not be fully functional but usually will be, apart
 from storage bugs.


If the alternative to a non-working system is a system with  
inconsistent data, I'll take the former.

I can see some benefit from raising a special error to indicate that a  
restart would be beneficial.  If I hadn't already done the proposed  
work, I might even pursue this idea. :)  At this point, I think I've  
reduced the probability of a failure in FileStorage._finish enough  
that further effort, at least by me, isn't warranted.

Jim

--
Jim Fulton
Zope Corporation




Re: [ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

2008-09-30 Thread Jim Fulton

On Sep 30, 2008, at 1:38 PM, Dieter Maurer wrote:

 Jim Fulton wrote at 2008-9-19 13:45 -0400:
 ...
 2. We (ZC) are moving to 64-bit OSs.  I've resisted this for a while
 due to the extra memory overhead of 64-bit pointers in Python
 programs, but I've finally (too late) come around to realizing that
 the benefit far outweighs the cost.  (In this case, the process was
 around 900MB in size.

 That is very strange.
 On our Linux systems (Debian etch), the processes can use 2.7 to 2.9 GB
 of memory before the OS refuses to allocate more.

Yeah. Strange.

 It was probably trying to malloc a few hundred
 MB.  The malloc failed despite the fact that there was more than 2GB
 of available process address space and system memory.)

 3. I plan to add code to FileStorage's _finish that will, if there's
 an error:

  a. Log a critical message.

  b. Try to roll back the disk commit.

I decided not to do this. Too complicated.



  c. Close the file storage, causing subsequent reads and writes to
 fail.

 Raise an easily recognizable exception.

I raise the original exception.

 In our error handling we look out for some nasty exceptions and  
 enforce
 a restart in such cases. The exception above might be such a nasty
 exception.

The critical log entry should be easy enough to spot.
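
The behaviour described in this exchange might look roughly like the
sketch below: log a critical message, close the storage so later reads
and writes fail fast, and re-raise the original exception. This is an
illustration of the pattern, not FileStorage's actual code; _finish and
close stand in for the real internals.

    import logging

    logger = logging.getLogger("ZODB.FileStorage")

    class FileStorageSketch:

        def tpc_finish(self, transaction):
            try:
                # Update the in-memory index and other metadata; this is
                # the step that failed in the incident discussed here.
                self._finish(transaction)
            except Exception:
                # (a) Log a critical message so operators notice.
                logger.critical(
                    "Failure in the second phase of two-phase commit; "
                    "closing the storage.", exc_info=True)
                # (c) Close the storage: subsequent reads/writes fail.
                self.close()
                # Re-raise the original exception to the caller.
                raise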

...

 I considered some other ideas:

 - Try to get FileStorage to repair its metadata.  This is certainly
 theoretically doable.  For example, it could rebuild its in-memory
 index. At this point, that's the only thing in question. OTOH,
 updating it is the only thing left to fail at this point.  If updating
 it fails, it seems likely that rebuilding it will fail as well.

 - Have a storage server restart when a tpc_finish call fails.  This
 would work fine for FileStorage, but might be the wrong thing to do
 for another storage.  The server can't know.

 Why do you think that a failing tpc_finish is less critical
 for some other kind of storage?


It's not a question of criticality.  It's a question of whether a  
restart will fix the problem.  I happen to know that a file storage  
would be in a reasonable state after a restart.  I don't know this to  
be the case for some other storage.

Jim

--
Jim Fulton
Zope Corporation

