Resolved.  Thanks to all that responded. 

The symptoms were due to an external method which runs a program (htmldoc)
that converts html to pdf.  The external method uses the "old" subprocess
interface to fire off the process and then wait() for it to complete.  To
fix another problem, we had upgraded to the latest version, but that
htmldoc version apparently is confused by some of our html input and never
completes so the external method just sits there waiting for the child
process to terminate.  Multiple invokations of the blocked htmldoc process
eventually brought Zope to its knees and no further progress was made.

Confession is good for the soul

The problem took a long time to find because failures were infrequent (we
only saw three or for instances) so there was limited data.  Moreover,the
external method the was failing had been in service for almost two years
and has not been a problem.  And, other very similar external methods
which also use htmldoc were known to work with the upgraded htmldoc.  It
was not until the external method was identified as the only commonality
in the failures that I thought to look closely at the external method.  
And then I noticed the blocked htmldoc processes, etc.  etc.

The fix

Our solution was to revert htmldoc to the earlier version.  

When time permits we'll add a timeout to the external method and try to
extablish what in the current release htmldoc triggered the failure.  We
probably should move to the new subprocess module rather than use popen2
in the external method.

On Sun, 27 Nov 2005, Dennis Allison wrote:

> Python2.4.2
> Zope 2.8.4
> ZODB/Zeo 2.3.4
> Mysql 4.0
> Dual Athalon processors
> Linux RH7.3
> We have two recent instances in our production sites where Zope suddenly
> stops responding.  It is not a new problem, but we've now been confronted
> with two clean examples and nothing to blame them on.  The problem appears
> to be independent of load as both incidents were on lightly loaded
> machines.
> A check of the logs (Linux and Zope) shows nothing obviously amiss except
> that the trace log (the old -M log) shows a sudden increase in active
> requests from the typical 0 or 1 to 1300 or more.  In this context an
> "active request' is total number of requests pending at the end of this
> request and is computed by post-processing.  We front-end Zope with pound 
> and make heavy use of MySQL.  Both show a plethora of incomplete 
> transactions.  
> Examination of the raw trace log shows that Zope is continuing to accept
> requests, but nothing getting done.  The raw log date-stamps four internal
> states for each transaction.  The states are Begin (B), Input (I),
> action (A), and End (E).  Inputs are gathered between B and I, outputs is
> made between A and E.  The raw log shows B and I transactions, but
> apparently no processing is completing.  I suspect that nothing is getting
> scheduled.
> I am at a loss as to where to begin to track this one down.  The failure
> is spontaneous and apparently not triggered by any readily distinguishable
> inputs or pattern of inputs.  The behavior smells a bit of resource limits
> or process synchronization problems, but there is not real evidence for
> either being the root cause.   I am not sure what monitoring I should be 
> doing to help locate the source of the problem.
> Has anyone seen seen a similar problem?  Any advice as to how to proceed?


Zope maillist  -
**   No cross posts or HTML encoding!  **
(Related lists - )

Reply via email to