Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Jens Vagelpohl



On 10 Apr 2007, at 20:19, Alan Runyan wrote:


 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (113, 'No route to host')



 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (110, 'Connection timed out')


With errors like this I would concentrate on the network first  
(hardware as well as network settings/routes etc) and make 100% sure  
you don't have anything strange happening on that end.
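
As a rough illustration of why these two errors point at the network, the sketch below maps the numeric codes from the tracebacks to their symbolic names. The `describe` helper and the `NETWORK_ERRNOS` grouping are hypothetical names introduced here for illustration; they are not part of ZEO or asyncore.

```python
import errno
import os

# Errno values that usually implicate the network path rather than the
# application. This grouping is an editorial assumption, not a ZEO rule.
NETWORK_ERRNOS = frozenset([
    errno.EHOSTUNREACH,  # "No route to host" (113 on Linux)
    errno.ETIMEDOUT,     # "Connection timed out" (110 on Linux)
    errno.ENETUNREACH,
    errno.ECONNRESET,
])


def describe(code):
    """Return (symbolic name, OS message, looks_like_network) for an errno."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code), code in NETWORK_ERRNOS


if __name__ == "__main__":
    for code in (errno.EHOSTUNREACH, errno.ETIMEDOUT):
        print(describe(code))
```

The symbolic constants are used instead of the literal 113 and 110 because the numeric values differ between operating systems.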


jens


___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Russ Ferriday


On 10 Apr 2007, at 21:39, Chris Withers wrote:


I'd look at the switches and maybe even the nics and cables :-S
And if you are doing that, also check your network to make sure all  
IP addresses are unique.



(..but I'm not a sysadmin.)

And neither am I!

--r

Russ Ferriday - Topia Systems - Open Source content management with  
Plone and Zope
[EMAIL PROTECTED] - office: +44 2076 1777588 - mobile: +44 7789 338868  
- skype: ferriday

a member of
Zea Partners



Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Chris Withers

Alan Runyan wrote:

>data = self.socket.recv(buffer_size)
> error: (113, 'No route to host')

That *is* very odd, anything other than pound being used for load
balancing or traffic shaping?


This has to be a major problem maker in the system.  Pound is simply
round robin connections to pool of Zope.


I'd look at the switches and maybe even the nics and cables :-S
(..but I'm not a sysadmin.)

>  File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
>    data = self.socket.recv(buffer_size)
> error: (110, 'Connection timed out')

Storage server too busy :-(


the storage server can be "too busy" when it's only reading?


'corse - too many pickles getting sucked down...

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk


Re: [ZODB-Dev] ZEO client cache tempfile oddness

2007-04-10 Thread Tim Peters

[Paul Winkler]

...
If I understand this stuff correctly, the code in question on a
filesystem that *doesn't* have the sparse file optimization would
equate to "write N null bytes to this file as fast as possible."
True?


[Dieter Maurer]

Posix defines the semantics.

I have not looked it up, but a possible interpretation would also be:
write the "n.th" byte and let all other bytes undefined.


POSIX specifies that uninitialized file positions must "act as if"
they contained NUL bytes.  NTFS isn't POSIX, but NTFS in fact does
"write N NUL bytes to the file as fast as possible" for a non-sparse
file (and for a sparse file, NTFS acts like a POSIX file in this
respect).
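
A minimal sketch of the "act as if they contained NUL bytes" behaviour, assuming nothing beyond the Python standard library. The helper name `hole_reads_as_nul` is hypothetical. It writes a single byte at the last offset and reads the whole file back:

```python
import os
import tempfile


def hole_reads_as_nul(size=4096):
    """Write one byte at offset size-1; check the skipped range reads as NULs."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.seek(size - 1)   # skip over positions that are never written
            f.write(b"x")      # the only byte actually written
            f.seek(0)
            data = f.read()
        # Every position before the final byte should read back as NUL.
        return len(data) == size and data[:-1] == b"\0" * (size - 1)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(hole_reads_as_nul())
```

This only shows what a reader of the file observes; it says nothing about whether the filesystem physically allocated the skipped blocks.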


Re: [ZODB-Dev] ZEO client cache tempfile oddness

2007-04-10 Thread Tim Peters

[Paul Winkler]

Our experiments suggest that ext2, ext3,
and reiserfs optimize for sparse files so there is no such guarantee.
AFAICT from some quick googling and wikipediaing, the same is true
for NTFS, XFS, JFS, ZFS. I suspect we've accounted for the majority
of the production Zope installations in the world.


[Benji York]

In that case it would seem better to just remove the ineffectual
code altogether.


[Jim Fulton]

+1


-0.  See my other response.  +1 on beefing up the comment, though.  On
all systems the code does no harm and does /inform/ the OS of how
large a file is needed.  At least on NTFS it also does all that
reasonably can be done to ensure enough space actually exists (while
NTFS supports sparse files, it's not the default, and the ZEO code
does not create a sparse file under NTFS).

The Windows behavior could be effectively gotten on most other
platforms by adding a loop to write a non-NUL byte to every (say)
512th byte position.  This would be redundant on Windows (NTFS
physically writes NUL bytes to every "missing" position in a dense
file whenever the EOF pointer advances), but would do no harm there
either.

Or it could be that people are happy to have ZEO blow up later ;-)
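
The loop suggested above could look roughly like the following sketch. It is not the ZEO code; `touch_every_block` is a hypothetical helper, and the 512-byte stride is simply the "(say) 512th byte position" from the message.

```python
import os
import tempfile


def touch_every_block(path, size, stride=512):
    """Create `path` with `size` bytes, writing one non-NUL byte per stride.

    Returns the number of single-byte writes issued.
    """
    writes = 0
    with open(path, "wb") as f:
        for offset in range(0, size, stride):
            f.seek(offset)
            f.write(b"x")      # force the block holding this offset to exist
            writes += 1
        if size > 0:
            f.seek(size - 1)
            f.write(b"x")      # final byte sets the EOF position
            writes += 1
    return writes


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "cache.bin")
    print(touch_every_block(demo_path, 1024 * 1024))
    os.remove(demo_path)
    os.rmdir(demo_dir)
```

Even this gives no hard guarantee: the writes may sit in I/O buffers for a while, and a filesystem with a block size smaller than the stride could still leave holes.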


Re: [ZODB-Dev] ZEO client cache tempfile oddness

2007-04-10 Thread Tim Peters

[Tim Peters]

The ZEO client cache is stored in a fixed-size disk file.  When a ZEO
client needs to create this file for the first time, it's trying to
ensure there's enough space on disk for it at the start, and reserve
that disk space then, rather than risk dying with a "no space left on
device" error umpteen hours later.  Alas, there is no portable way to
do so (the C standard says nothing about physical devices such as
disks).


[Paul Winkler]

So it might work as intended on some systems but inspire false
confidence on others. That's sad, but I guess we can't do much about
it.


The full intent is almost met on Windows.  On other systems it at
least /informs/ the OS about the intended size of the file.  In any
case it does no harm.

[Paul]

...
"saves" means. Does the behavior vary on different filesystems?



Yes.

Sounds like it optimizes for sparse files.  Not all filesystems do.



OK... I'm still wondering on which filesystems this code actually does
guarantee sufficient space. Our experiments suggest that ext2, ext3,
and reiserfs optimize for sparse files so there is no such guarantee.
AFAICT from some quick googling and wikipediaing, the same is true for
NTFS, XFS, JFS, ZFS. I suspect we've accounted for the majority of the
production Zope installations in the world.  The only fs I found that
has no sparse file support is HFS+.  Not sure about UFS (I found
people claiming both no and yes).
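
One way to repeat this kind of experiment on a POSIX system is to compare a file's apparent size with the space actually allocated for it. The sketch below assumes `os.stat` exposes `st_blocks` (it does not on Windows); `allocation_report` and `looks_sparse` are hypothetical helper names.

```python
import os
import tempfile


def allocation_report(path):
    """Return (apparent_size, allocated_bytes) for `path`.

    st_blocks counts 512-byte units on POSIX systems; it is unavailable
    on Windows, so this sketch is POSIX-only.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    """True when fewer bytes are allocated than the file claims to hold."""
    apparent, allocated = allocation_report(path)
    return allocated < apparent


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.seek(10 * 1024 * 1024 - 1)
            f.write(b"x")
        print(allocation_report(path), looks_sparse(path))
    finally:
        os.remove(path)
```

A result of "allocated < apparent" is only a hint: filesystems with compression or inline data can report the same thing for files that are not sparse.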


NTFS has "optional" sparse-file support, meaning there is a way to
create sparse files under NTFS, but it's not the default.  It isn't
exposed at the C stdio level (creating a sparse file under NTFS
requires additional Windows-specific API calls), and Python builds on
C stdio.

In any case there's no benefit to sparse files in this context:  while
the ZEO cache file starts out "almost empty", every byte is eventually
used, so it seems good to make whatever cheap efforts can be made to
minimize the chance that ZEO will die after umpteen hours if there's
not enough space for the client cache file it needs at the start.


If you care, the intent could probably be better served by changing
the code to write a junk byte at every (say) thousandth byte offset
throughout the file (at least one non-NUL byte per disk block).  Alas,
there would still be no guarantee that there's actually enough space
on the physical disk to store it.



OK, then I don't see how we could get a real guarantee without
actually writing junk to every byte, which might be just a little
slower :)


Still no guarantee (e.g., there might be enough RAM to hold all the
bytes in I/O buffers, but not enough disk space remaining to
materialize them), but writing a non-NUL byte per disk block would be
as effective in practice as writing to every byte position.


Short of that, since crystal ball technology is still not widely
deployed outside the Python Secret Underground, we can't know if
currently free space will still be available when we need it.


That's right.  On all platforms the current dance suffices to inform
the I/O system of how large a file ZEO needs (note that just seeking
to the max size does not suffice -- a byte needs to be written at the
end to set the EOF pointer).  On NTFS that's almost reliable (see
above); on other systems it may or may not be.
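
The point about seeking versus writing can be checked with a short sketch. This is not the ZEO implementation; `reserve` is a hypothetical helper that reports the file size after the seek and again after the one-byte write, and it assumes `size` is at least 1.

```python
import os
import tempfile


def reserve(path, size):
    """Seek to the last byte and write it.

    Returns (size_after_seek, size_after_write) as seen by os.fstat.
    """
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.flush()
        after_seek = os.fstat(f.fileno()).st_size   # seeking alone changes nothing
        f.write(b"x")
        f.flush()
        after_write = os.fstat(f.fileno()).st_size  # the write moves EOF
    return after_seek, after_write


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "cache.bin")
    print(reserve(demo_path, 1000))
    os.remove(demo_path)
    os.rmdir(demo_dir)
```

Whether the reported size is backed by real disk blocks is a separate question, as discussed above.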


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Alan Runyan

I've not had anything but bad experiences with pound myself, lvs seems a
much more preferable alternative...


We have not had such negative experiences with pound.


>data = self.socket.recv(buffer_size)
> error: (113, 'No route to host')

That *is* very odd, anything other than pound being used for load
balancing or traffic shaping?


This has to be a major problem maker in the system.  Pound is simply
round robin connections to pool of Zope.


>  File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
>    data = self.socket.recv(buffer_size)
> error: (110, 'Connection timed out')

Storage server too busy :-(


the storage server can be "too busy" when it's only reading?

cheers
alan


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Chris Withers

Alan Runyan wrote:

We have 10 ZEO clients that are for public consumption "READ ONLY".
We have a separate ZEO client that is writing that is on a separate box.


I'd put money on the client doing the writing causing problems.
That or client side cache thrash caused by zcatalog or similar ;-)


The ZEO Server was consistently at 2% CPU.


What was the network and disk i/o like?


getting through the cache and back to pound which was then load


I've not had anything but bad experiences with pound myself, lvs seems a
much more preferable alternative...



The customer was posting content throughout the slashdot.  The problem
was that when the clients would update they would end up through 500s


not sure what that last bit means...


   data = self.socket.recv(buffer_size)
error: (113, 'No route to host')


That *is* very odd, anything other than pound being used for load 
balancing or traffic shaping?



 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (110, 'Connection timed out')


Storage server too busy :-(

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Jim Fulton


On Apr 10, 2007, at 2:19 PM, Alan Runyan wrote:
...

For Jim: We did not adjust the transaction timeout.  Would that have
helped in the case of READ's?


Possibly, I'm not sure and I don't have time now to dig.  It might be  
worth trying, however:



The customer was posting content throughout the slashdot.  The problem
was that when the clients would update they would end up through 500s

I am seeing something *very* strange in zeo.log:
2007-04-10T12:20:53 ERROR ZEO.zrpc.Connection(S) (172.16.235.120:49351) Error caught in asyncore
Traceback (most recent call last):
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 69, in read
   obj.handle_read_event()
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 391, in handle_read_event
   self.handle_read()
 File "/usr/local/zope/lib/python/ZEO/zrpc/smac.py", line 147, in handle_read
   d = self.recv(8192)
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (113, 'No route to host')


but several more of these:
2007-04-10T13:55:36 ERROR ZEO.zrpc.Connection(S) (172.16.235.119:44322) Error caught in asyncore
Traceback (most recent call last):
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 69, in read
   obj.handle_read_event()
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 391, in handle_read_event
   self.handle_read()
 File "/usr/local/zope/lib/python/ZEO/zrpc/smac.py", line 147, in handle_read
   d = self.recv(8192)
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (110, 'Connection timed out')
--


Obviously, both of these are very suspicious.

Jim

--
Jim Fulton           mailto:[EMAIL PROTECTED]    Python Powered!
CTO                  (540) 361-1714              http://www.python.org
Zope Corporation     http://www.zope.com         http://www.zope.org





Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Benji York

Alan Runyan wrote:

The website got slashdotted this morning [...]



Just FYI: Varnish didn't go over 3% CPU during the traffic surge; over
200 req/second.


Off topic: 200 requests a second seems a bit light for a slashdotting, 
any more details you can divulge there?

--
Benji York
Senior Software Engineer
Zope Corporation


Re: [ZODB-Dev] ZEO client cache tempfile oddness

2007-04-10 Thread Dieter Maurer
Paul Winkler wrote at 2007-4-6 13:30 -0400:
> ...
>If I understand this stuff correctly, the code in question on a
>filesystem that *doesn't* have the sparse file optimization would
>equate to "write N null bytes to this file as fast as possible."
>True?

Posix defines the semantics.

I have not looked it up, but a possible interpretation would also be:
write the "n.th" byte and let all other bytes undefined.

-- 
Dieter


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Alan Runyan

How does data get into the ZEO storage then?


We have 10 ZEO clients that are for public consumption "READ ONLY".
We have a separate ZEO client that is writing that is on a separate box.

The website got slashdotted this morning and we had 4 ZEO clients go
out.  They were waiting on the ZEO server for many minutes (effectively
hung and reporting 500s back to browsers).

4/10/07 - delta ah2 lock up - http://paste.plone.org/13919
4/10/07 - epsilon ah2 lock up - http://paste.plone.org/13920
4/10/07 - delta ah1 lock up - http://paste.plone.org/13936
4/10/07 - epsilon ah3 lock up - http://paste.plone.org/13935

Just FYI: Varnish didn't go over 3% CPU during the traffic surge; over
200 req/second.

The ZEO Server was consistently at 2% CPU.  Lots of traffic was
getting through the cache and back to pound which was then load
balancing to the ZEO clients.

For Jim: We did not adjust the transaction timeout.  Would that have
helped in the case of READ's?

The customer was posting content throughout the slashdot.  The problem
was that when the clients would update they would end up through 500s

I am seeing something *very* strange in zeo.log:
2007-04-10T12:20:53 ERROR ZEO.zrpc.Connection(S) (172.16.235.120:49351) Error caught in asyncore
Traceback (most recent call last):
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 69, in read
   obj.handle_read_event()
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 391, in handle_read_event
   self.handle_read()
 File "/usr/local/zope/lib/python/ZEO/zrpc/smac.py", line 147, in handle_read
   d = self.recv(8192)
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (113, 'No route to host')


but several more of these:
2007-04-10T13:55:36 ERROR ZEO.zrpc.Connection(S) (172.16.235.119:44322) Error caught in asyncore
Traceback (most recent call last):
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 69, in read
   obj.handle_read_event()
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 391, in handle_read_event
   self.handle_read()
 File "/usr/local/zope/lib/python/ZEO/zrpc/smac.py", line 147, in handle_read
   d = self.recv(8192)
 File "/usr/local/python-2.4.4/lib/python2.4/asyncore.py", line 343, in recv
   data = self.socket.recv(buffer_size)
error: (110, 'Connection timed out')
--
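
To see which client addresses account for which socket errors, a log like the one above could be tallied with a sketch along these lines. The regular expressions are assumptions derived only from the two excerpts shown, and `tally` is a hypothetical helper; a real zeo.log may contain other layouts.

```python
import re
from collections import Counter

# Matches the log header line and captures the client IP address.
HEADER = re.compile(
    r"ERROR ZEO\.zrpc\.Connection\(S\)\s+\((\d+\.\d+\.\d+\.\d+):\d+\)"
)
# Matches the final traceback line, e.g.: error: (113, 'No route to host')
ERROR = re.compile(r"^error: \((\d+), '([^']*)'\)")


def tally(log_text):
    """Count (client_ip, errno) pairs found in zeo.log-style text."""
    counts = Counter()
    client = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            client = header.group(1)
            continue
        error = ERROR.match(line.strip())
        if error and client is not None:
            counts[(client, int(error.group(1)))] += 1
            client = None  # wait for the next header before counting again
    return counts
```

If one address dominates the counts, that host's NIC, cable, or switch port would be a natural first thing to inspect.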


Re: [ZODB-Dev] more lockup information / zope2.9.6+zodb 3.6.2

2007-04-10 Thread Chris Withers

Alan Runyan wrote:

Do you have anything that is committing very large transactions?


No. In fact; these clients could be running in read only mode.  As far
as I'm concerned.


How does data get into the ZEO storage then?

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk