I have been building an "ExternalFile" class which stores the body of
the file in an external file, mirroring the Zope path/hierarchy. This
will allow easy integration with servers that can mount the external
representation of the content and serve it with a consistent namespace.
To make life zimple, I tried to move all file manipulation to Zope,
including upload/download/copy/cut/paste/delete and permissions. These
external files are transaction aware, blah blah..
Working with files > 20MB I noticed some serious performance/scalability
issues and investigated. Here are the results.
A diff with my changes against version 2.2.2 is available at
Zope objects like File require data as a seekable file or as a
coherent block, rather than as a stream. Initializing/updating
these objects *may* require loading the entire file into memory.
In memory buffering of request or response data could cause
excessive swapping of the working set.
Multi-service architecture (ZServer->ZPublisher) could limit the
reuse of stream handles.
Creating temporary files as FIFO buffers between the services
causes significant swapping.
Using pipes
I found that FTPServer.ContentCollector was using a
StringIO to buffer the uploads from FTP clients. I changed this
into a TemporaryFile for a while which revealed the leaked file
descriptor bug (see below). This intermediary temp file caused 1
extra file copy for each request. The goal is to not have any
intermediary files at all, and pipeline the content directly into
the Zope objects.
To remove this FTP upload file buffer, I converted the FTP collector
again from a TemporaryFile into a pipe with a reader and writer file
objects. The FTPRequest receives the reader from which it can
process the input on the publish thread in processInputs.
Since we are dealing with blocking pipes it is OK to have a reader
on the publish thread and a writer on the ZServer thread. The major
considerations were regarding the proper way to read from a pipe
through the chain of control, especially in cgi.FieldStorage.
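
As a rough illustration of the reader/writer split described above, here is a minimal sketch in present-day Python 3 (not the Python that Zope 2.2.2 runs on). The names make_pipe, collect, consume and demo are hypothetical stand-ins, not the actual FTPServer or FTPRequest code. The writer thread plays the ZServer side and the caller plays the publish side.

```python
import os
import threading

CHUNK = 8192  # hypothetical read size


def make_pipe():
    """Return (reader, writer) binary file objects wrapped around an OS pipe."""
    rfd, wfd = os.pipe()
    return os.fdopen(rfd, "rb"), os.fdopen(wfd, "wb")


def collect(writer, chunks):
    """Stand-in for the ZServer side: write incoming data, then close."""
    try:
        for chunk in chunks:
            writer.write(chunk)
    finally:
        # Closing the writer is what delivers EOF to the reader.
        writer.close()


def consume(reader):
    """Stand-in for the publish side: sequential reads only, no seek/tell."""
    total = 0
    while True:
        data = reader.read(CHUNK)
        if not data:
            break
        total += len(data)
    reader.close()
    return total


def demo(n_chunks=64, size=4096):
    """Push n_chunks * size bytes through the pipe and count what arrives."""
    reader, writer = make_pipe()
    chunks = (b"x" * size for _ in range(n_chunks))
    worker = threading.Thread(target=collect, args=(writer, chunks))
    worker.start()
    total = consume(reader)
    worker.join()
    return total
```

Because the pipe blocks, the writer simply stalls when the OS pipe buffer is full and resumes as the reader drains it, so no intermediate file or in-memory copy of the whole upload is needed.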
Stdin is treated as the reader of the pipe throughout the code. All
seek()s and tell()s on sys.stdin type objects (a tty not a seekable
file) should be considered illegal and removed.
Usage of FieldStorage from FTP (Unknown content-length)
To gain access to the body of a request, one typically calls
REQUEST['BODY'] or REQUEST['BODYFILE']. This returns the file
object the FieldStorage copied from stdin.
To prevent FieldStorage from copying the file from stdin to a
temporary file, we can set the CONTENT_LENGTH header to '0' in the
FTP _get_env for a STOR.
In this case, FieldStorage creates a temporary file but doesn't read
any data from stdin so we can return stdin directly when BODYFILE is
requested and 'content-length' is '0'. However, BODYFILE could be a
pipe which doesn't support 'seek' or 'tell'. The code used to suck
the data off the BODYFILE needs to be modified to adapt to the
possibility of being passed a pipe.
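
The selection logic above might look roughly like the following Python 3 sketch. select_bodyfile and copy_body are hypothetical helper names, and the environ dict stands in for what _get_env builds; this is not the actual HTTPRequest code. The point is that the consumer only ever calls read().

```python
def select_bodyfile(environ, stdin, spooled):
    """Pick the object to hand out as BODYFILE.

    Assumption: a CONTENT_LENGTH of '0' means nothing was copied off
    stdin, so the raw stdin (possibly a pipe) is returned instead of the
    empty spooled temporary file.
    """
    if environ.get("CONTENT_LENGTH") == "0":
        return stdin
    return spooled


def copy_body(bodyfile, sink, chunk_size=65536):
    """Copy bodyfile into sink using read() only, so pipes work too."""
    copied = 0
    while True:
        data = bodyfile.read(chunk_size)
        if not data:
            break
        sink.write(data)
        copied += len(data)
    return copied
```

Code that currently does a seek(0, 2) / tell() to find the body size would need to count bytes while reading, as copy_body does here.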
Updating Image.File to play with pipes
The _read_data method of Image.File pulls the data out of the
BODYFILE and sticks it in the instance as a string, pdata object, or
a linked list of pdata objects. The existing code reads and builds
the list in one clean sweep back-to-front. I believe this keeps the
pdata.data chunks out of memory, quickly (sub)committing then
deactivating (_p_changed = None) them.
Since we can no longer safely assume 'seek' is valid for BODYFILE, I
tried to read and build the list front-to-back. This kept the data
in memory, even though I tried to deactivate the objects quickly.
As a tradeoff, I read the data front-to-back then built the list
back-to-front taking another pass to reverse the list so it is in
the correct order.
Memory usage appears to be steady, meaning the whole file is not
loaded into the working set. This also prevents unnecessary reading
into a temporary FieldStorage file during an FTP upload.
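
The two-pass approach can be sketched as follows in Python 3. Chunk is a hypothetical plain-object stand-in for a pdata node. The real code would also (sub)commit and deactivate each node as it goes, which is left out here, so this only shows the ordering trick, not the memory behaviour.

```python
import io


class Chunk:
    """Hypothetical stand-in for a pdata node: a block of data plus a link."""

    def __init__(self, data):
        self.data = data
        self.next = None


def read_chunks(stream, chunk_size=65536):
    """Read a non-seekable stream into a linked list of chunks."""
    # Pass 1: read front-to-back, prepending each new node. Every node
    # points at the one read before it, so the list comes out reversed.
    head = None
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        node = Chunk(data)
        node.next = head
        head = node

    # Pass 2: reverse the links in place so the list is in file order.
    prev = None
    while head is not None:
        following = head.next
        head.next = prev
        prev = head
        head = following
    return prev


def join(head):
    """Concatenate the chunk data in list order (for checking only)."""
    parts = []
    while head is not None:
        parts.append(head.data)
        head = head.next
    return b"".join(parts)
```

Only read() is called on the stream, so the same function works whether BODYFILE is a temporary file or a pipe.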
Web based uploads...
...suck. I do not recommend doing a web based upload for files
> 1MB. First, a content-length is known, so we don't get the
advantage of pipelining the data directly from the socket; a
temporary file must be created, written and read. Second, I believe
the content is encoded so the transferred bitcount is much higher
than using FTP.
Plus, most browsers today do not support a progress bar for posts,
so there is no indication of status, causing most people to click
'Upload' multiple times.
I haven't done any optimization for this case, but have tested that
it still works properly.
Cleaning up (leaked file descriptor bug)
I noticed that when uploading 20+ MB a couple of times, I ran out of
hard drive space. This didn't make sense and I looked into what
files were open by Zope. Doing an 'lsof' I found that the temporary
files, which are immediately unlinked after creation, were still open
until the end of the Zope process. These files (created by
tempfile.TemporaryFile) needed to be closed after the end of the
REQUEST and RESPONSE, rather than at the end of the Zope process.
After publishing, the close method of the REQUEST gets called. Here
I added closing of stdin and the FieldStorage created TemporaryFile.
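
A minimal Python 3 sketch of that cleanup, assuming a hypothetical DemoRequest class rather than the real HTTPRequest: the request owns both stdin and the spooled temporary file, and close() releases both so the unlinked file's disk space is returned as soon as the request ends.

```python
import io
import tempfile


class DemoRequest:
    """Hypothetical request that owns stdin and a spooled temporary file."""

    def __init__(self, stdin):
        self.stdin = stdin
        # TemporaryFile space is only reclaimed once the file is closed.
        self.spool = tempfile.TemporaryFile()

    def close(self):
        """Close every file this request opened; safe to call twice."""
        for f in (self.stdin, self.spool):
            if f is not None and not f.closed:
                f.close()


def handle(body):
    """Spool a request body, then always close the request afterwards."""
    request = DemoRequest(io.BytesIO(body))
    try:
        request.spool.write(request.stdin.read())
        return request.spool.tell()
    finally:
        request.close()
```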
The ZServer.HTTPResponse object makes a good attempt at keeping
large results out of memory but does so by creating a temporary file
and copying any written data to it then pushing a file_part_producer
onto the channel output queue.
If Zope objects know how to produce the data themselves, they
could push producer(s) directly to the channel. I added a single
check in ZServer.HTTPResponse(256) where a temporary file is only
created if the data is larger than the in-memory buffer *and*
doesn't already look like a producer with 'more' as a method.
If the temporary file doesn't exist the rest of the code simply
writes the data to the channel and the channel produces the output
directly from the producer created by the Zope object.
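
The check described above can be sketched like this in Python 3. DemoResponse, FileProducer, looks_like_producer and drain are hypothetical names, and a plain list stands in for the channel's output queue; the actual change is the one-line test in ZServer.HTTPResponse, not this code.

```python
import io
import tempfile


class FileProducer:
    """Producer in the medusa style: more() returns the next block, b'' at end."""

    def __init__(self, fileobj, chunk_size=65536):
        self.fileobj = fileobj
        self.chunk_size = chunk_size

    def more(self):
        if self.fileobj is None:
            return b""
        data = self.fileobj.read(self.chunk_size)
        if not data:
            self.fileobj.close()
            self.fileobj = None
        return data


def looks_like_producer(data):
    """True if data has a callable 'more' attribute."""
    return callable(getattr(data, "more", None))


class DemoResponse:
    """Hypothetical response; 'channel' is a list standing in for the queue."""

    def __init__(self, channel, in_memory_limit=1 << 16):
        self.channel = channel
        self.in_memory_limit = in_memory_limit

    def write(self, data):
        if looks_like_producer(data):
            # The object produces its own output: no temporary file, no copy.
            self.channel.append(data)
        elif len(data) > self.in_memory_limit:
            # Large plain data is still spooled to a temporary file.
            spool = tempfile.TemporaryFile()
            spool.write(data)
            spool.seek(0)
            self.channel.append(FileProducer(spool))
        else:
            self.channel.append(data)


def drain(channel):
    """Pull everything off the channel the way the output loop would."""
    out = []
    for item in channel:
        if looks_like_producer(item):
            while True:
                block = item.more()
                if not block:
                    break
                out.append(block)
        else:
            out.append(item)
    return b"".join(out)
```

An object such as ExternalFile would hand a FileProducer wrapped around its backing file to write(), and the channel would then read from that file directly.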
Using a file producer from my Zope object cuts out a file copy, and
those get expensive when one is dealing with 20+ MB files. The
response time is also dramatically reduced because the file copy
step before streaming to the client was removed.
I would like to apply the same concept to Image.File.index_html:
rather than creating a temporary file in the RESPONSE to queue
the contents, it would create a producer to pull the data directly
out of the backend when it is ready to write. I am experiencing a
10 second latency (233MHz laptop) between requesting a 10MB file and
receiving the first byte with the current code. If an output producer
were used, this latency would drop to under 1 second.
I made an attempt to create a pdata_producer but failed because of
ZODB errors reloading the object. I get a traceback like:
  2000-10-24T09:19:08 ERROR(200) ZODB Couldn't load state for
  '\000\000\000\000\000\000&\370'
  Traceback (innermost last):
    File /usr/local/zope/lib/python/ZODB/Connection.py, line 442, in setstate
  AttributeError: 'None' object has no attribute 'load'
My hunch is that the Image, pdata_producer or pdata object gets
deactivated and can't find its DB to load itself. I tried setting a
_p_jar on the pdata_producer, but I don't really know what happens
when the object context leaves publish_module. Since the object
activation happens in the ZServer thread, some voodoo may be needed
to get the proper state in the pdata_producer.... any takers?
I have only tested these changes with FTPServer and HTTPServer, not
PCGIServer or FCGIServer.
I have tested round-trip coherency because of the change in
I haven't completely tested boundary conditions, where
Image.File._read_data makes decisions. Testing so far has covered
large files (10+ MB) and small files (< 64K).
I haven't tested HTTPRequest.retry which will probably fail because
HTTPRequest.stdin now may be a pipe.
3rd party products which treat BODYFILE as a seekable file object
may fail during FTP uploads.
Most of these efforts are geared towards FTP, as HTTP form uploads
don't seem to be worth the effort.
I haven't taken a look at HTTP PUT, for webdav clients etc...
Similar pipelining could be used, however I doubt it would be
possible without modifying cgi.FieldStorage.
Zope seems to be doing a lot with TempStorage and other ZODB magic I
didn't care about checking out. Some performance improvements could
be included here.
FTP I/O with my changes, including my ExternalFile custom output
producer, dramatically increases Zope's performance and scalability.
Zope-Dev maillist - [EMAIL PROTECTED]