Re: zlib interface semi-broken
I've come up with a good test for issue5210 and uploaded it to the bug tracker. This patch should be ready for inclusion now.

-- http://www.subspacefield.org/~travis/
-- http://mail.python.org/mailman/listinfo/python-list
Re: zlib interface semi-broken
In article mailman.9414.1234459585.3487.python-l...@python.org, Travis travis+ml-pyt...@subspacefield.org wrote:
> So I've submitted a patch to bugs.python.org to add a new member
> called is_finished to the zlib decompression object.
>
> Issue 5210, file 13056, msg 81780

You may also want to bring this up on the python-ideas mailing list for further discussion.

-- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/
Re: zlib interface semi-broken
So I've submitted a patch to bugs.python.org to add a new member called is_finished to the zlib decompression object.

Issue 5210, file 13056, msg 81780

-- http://www.subspacefield.org/~travis/
Re: zlib interface semi-broken
Travis wrote:
> On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
>> I personally would like it and bz2 to get closer to each other...
>
> Well, I like this idea; perhaps this is a good time to discuss the
> equivalent of some abstract base classes, or interfaces, for
> compression. As I see it, the fundamental abstractions are the
> stream-oriented de/compression routines. Given those, one should
> easily be able to implement one-shot de/compression of strings. In
> fact, that is the way that zlib is implemented; the base functions
> are the stream-oriented ones and there is a layer on top of
> convenience functions that do one-shot compression and decompression.

There are a couple of things here to think about. I've wanted to do some low-level (C-coded) search w/o bothering to create strings until a match. I've no idea how to push this down in, but I may be looking for a nice low-level spot to fit. Characteristics for that could be read-only access to small expansion parts w/o copying them out. Also, in case of a match, a (relatively quick) way to mark points as we proceed and a (possibly slower) way to restore from one or more marked points. Also, another programmer wants to parallelize _large_ bzip file expansion by expanding independent blocks in separate threads (we know how to find safe start points). To get such code to work, we need to find big chunks of computation, and (at least optionally) surround them with GIL release points.

> So what I suggest is a common framework of three APIs; a sequential
> compression/decompression API for streams, a layer (potentially
> generic) on top of those for strings/buffers, and a third API for
> file-like access. Presumably the file-like access can be implemented
> on top of the sequential API as well.

If we have to be able to start from arbitrary points in bzip files, they have one nasty characteristic: they are bit-serial, and we'll need to start them at arbitrary _bit_ points (not simply byte boundaries).

One structure I have used for searching is a result iterator fed by a source iterator, so rather than a read w/ inconvenient boundaries the input side of the thing calls the 'next' method of the provided source.

> ... I would rather see a pythonic interface to the libraries than a
> simple-as-can-be wrapper around the C functions

I'm on board with you here.

> My further suggestion is that we start with the sequential
> de/compression, since it seems like a fundamental primitive.
> De/compressing strings will be trivial, and the file-like interface
> is already described by Python.

Well, to be explicit, are we talking about Decompression and Compression simultaneously or do we want to start with one of them first?

> 2) The de/compression object has routines for reading de/compressed
> data and states such as end-of-stream or resynchronization points as
> exceptions, much like the file class can throw EOFError. My problem
> with this is that client code has to be cognizant of the possible
> exceptions that might be thrown, and so one cannot easily add new
> exceptions should the need arise. For example, if we add an exception
> to indicate a possible resynchronization point, client code may not
> be capable of handling it as a non-fatal exception.

Seems like we may want to say things like, synchronization points are to be silently ignored.

--Scott David Daniels scott.dani...@acm.org
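The "result iterator fed by a source iterator" structure can be sketched for zlib as a generator. This is only an illustration under assumptions: it needs Python 3.3 or later for the `eof` attribute on decompression objects, and the names `decompressed_chunks` and `in_pieces` are made up for the example.

```python
import zlib

def decompressed_chunks(source):
    """Yield decompressed pieces, pulling compressed pieces from `source`.

    `source` is any iterable of bytes objects; the piece boundaries are
    whatever the source finds convenient.
    """
    d = zlib.decompressobj()
    for piece in source:
        out = d.decompress(piece)
        if out:
            yield out
        if d.eof:          # end of the compressed stream reached
            break
    tail = d.flush()       # anything still buffered inside the object
    if tail:
        yield tail

def in_pieces(data, size):
    """Hypothetical source: hand out `data` in fixed-size slices."""
    return (data[i:i + size] for i in range(0, len(data), size))

original = b"spam and eggs " * 1000
compressed = zlib.compress(original)
result = b"".join(decompressed_chunks(in_pieces(compressed, 64)))
```

The consumer never sees the 64-byte input boundaries; it only iterates over whatever output each step produces.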
Re: zlib interface semi-broken
Scott David Daniels scott.dani...@acm.org writes:
> Seems like we may want to say things like, synchronization points
> are to be silently ignored.

That would completely break some useful possible applications, so should be avoided.
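For concreteness, here is one thing a synchronization point makes possible in zlib. The example is an illustration, not an application named in the thread, and it assumes a raw deflate stream (negative `wbits`) so that no header or checksum gets in the way. After a `Z_FULL_FLUSH` the compressor forgets its history, so a reader that lost the earlier bytes can start a fresh decompressor at that point.

```python
import zlib

# Raw deflate: wbits=-15 means no zlib header and no trailing checksum.
c = zlib.compressobj(6, zlib.DEFLATED, -15)

# Everything up to and including the synchronization point.
first = c.compress(b"first record\n") + c.flush(zlib.Z_FULL_FLUSH)

# Everything after it, up to the end of the stream.
second = c.compress(b"second record\n") + c.flush()

# A fresh decompressor that never saw `first` can still decode `second`,
# because the full flush reset the compressor's history window.
d = zlib.decompressobj(-15)
recovered = d.decompress(second)
```

An interface that silently swallowed such points would hide the place where a restart is possible, which may be the kind of breakage meant above.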
Re: zlib interface semi-broken
Paul Rubin wrote:
> Scott David Daniels scott.dani...@acm.org writes:
>> Seems like we may want to say things like, synchronization points
>> are to be silently ignored.
>
> That would completely break some useful possible applications, so
> should be avoided.

No, I mean that we, _the_users_of_the_interface_, may want to say that; that is, I'd like that behavior as an option.

-Scott David Daniels scott.dani...@acm.org
Re: zlib interface semi-broken
Paul Rubin wrote:
> Scott David Daniels scott.dani...@acm.org writes:
>> I suspect that is why such an interface never came up. If you can
>> clone states, then you can say: compress this, then use the
>> resultant state to compress/decompress others.
>
> The zlib C interface supports something like that. It is just not
> exported to the python application. It should be.

Right, we are gathering ideas we'd like to see available to the Python programmer in the new zlib / bz2 agglomeration that we are thinking of building / proposing.

--Scott David Daniels scott.dani...@acm.org
Re: zlib interface semi-broken
Scott David Daniels wrote:
> ... I've wanted to do some low-level (C-coded) search w/o bothering
> to create strings until a match

Here's a more common use case: signature gathering on the contents.

--Scott David Daniels scott.dani...@acm.org
Re: zlib interface semi-broken
Scott David Daniels scott.dani...@acm.org writes:
>> Seems like we may want to say things like, synchronization points
>> are to be silently ignored.
>
> No, I mean that we, _the_users_of_the_interface_, may want to say
> that; that is, I'd like that behavior as an option.

I don't see any reason to want that (rather than letting the application handle it) but I'll take your word for it.
zlib interface semi-broken
Hello all,

The zlib interface does not indicate when you've hit the end of a compressed stream. The underlying zlib functionality provides for this. With python's zlib, you have to read past the compressed data and into the uncompressed, which gets stored in Decompress.unused_data.

As a result, if you've got a network protocol which mixes compressed and non-compressed output, you may find a compressed block ending with no uncompressed data following until you send another command -- which a synchronous (non-pipelined) client will not send, because it is waiting for the [compressed] data from the previous command to be finished. The outcome is a protocol deadlock.

A simple way to fix this would be to add a finished attribute to the Decompress object.

However, perhaps this would be a good time to discuss how this library works; it is somewhat awkward and perhaps there are other changes which would make it cleaner.

What does the python community think?

-- http://www.subspacefield.org/~travis/
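The ambiguity can be shown in a few lines. This sketch assumes Python 3.3 or later, where decompression objects carry an `eof` attribute that plays the role of the `finished` attribute proposed above; the payload and trailer bytes are made up.

```python
import zlib

# Hypothetical wire data: one compressed reply, later followed by plain bytes.
compressed = zlib.compress(b"reply payload")
trailer = b"PLAIN"

# Case 1: only the compressed block has arrived so far.
d = zlib.decompressobj()
out = d.decompress(compressed)
leftover = d.unused_data       # empty: nothing has followed the stream yet
stream_done = d.eof            # the explicit end-of-stream indication

# Case 2: bytes past the end of the stream have arrived as well.
d2 = zlib.decompressobj()
out2 = d2.decompress(compressed + trailer)
leftover2 = d2.unused_data     # the plain bytes that followed the stream
```

In case 1, `unused_data` alone looks the same as it would for a stream that is still incomplete, so a client relying on it has to wait for more input. The explicit flag lets a synchronous client proceed to its next command.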
Re: zlib interface semi-broken
Travis wrote:
> The zlib interface does not indicate when you've hit the end of a
> compressed stream. The underlying zlib functionality provides for
> this. With python's zlib, you have to read past the compressed data
> and into the uncompressed, which gets stored in
> Decompress.unused_data.

... [good explanation of why this is problematic] ...

> A simple way to fix this would be to add a finished attribute to the
> Decompress object.

Perhaps you could submit a patch with such a change?

> However, perhaps this would be a good time to discuss how this
> library works; it is somewhat awkward and perhaps there are other
> changes which would make it cleaner.

Well, it might be improvable, I haven't really looked. I personally would like it and bz2 to get closer to each other in interface, rather than to spread out. So if you are really opening up a can of worms, I vote for two cans.

--Scott David Daniels scott.dani...@acm.org
Re: zlib interface semi-broken
Travis travis+ml-pyt...@subspacefield.org writes:
> However, perhaps this would be a good time to discuss how this
> library works; it is somewhat awkward and perhaps there are other
> changes which would make it cleaner. What does the python community
> think?

It is missing some other features too, like the ability to preload a dictionary. I'd support extending the interface.
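For readers on later Python versions, a preloaded dictionary might look like the sketch below. It assumes Python 3.3 or later, where `zlib.compressobj` and `zlib.decompressobj` accept a `zdict` argument; the dictionary and message contents are invented for the example.

```python
import zlib

# Made-up shared dictionary: byte sequences expected to recur in messages.
shared_dict = b'{"status": "ok", "user": "", "message": ""}'
message = b'{"status": "ok", "user": "travis", "message": "hello"}'

# Compress with the dictionary preloaded.
c = zlib.compressobj(zdict=shared_dict)
with_dict = c.compress(message) + c.flush()

# The receiver must supply the very same dictionary bytes.
d = zlib.decompressobj(zdict=shared_dict)
restored = d.decompress(with_dict)
```

Both sides have to agree on the dictionary out of band, which is part of why the interface design needs care.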
Re: zlib interface semi-broken
Paul Rubin wrote:
> Travis travis+ml-pyt...@subspacefield.org writes:
>> However, perhaps this would be a good time to discuss how [zlib]
>> works...
>
> It is missing some other features too, like the ability to preload a
> dictionary. I'd support extending the interface.

The trick to defining a preload interface is avoiding creating a brittle interface -- the saved preload should be usable across machines and versions. I suspect that is why such an interface never came up. If you can clone states, then you can say: compress this, then use the resultant state to compress/decompress others. Suddenly there is no nasty problem guessing what to parameterize and what to fix in stone.

--Scott David Daniels scott.dani...@acm.org
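Something close to "clone the state, then continue from it" can be tried with the `copy()` method that Python's zlib compression objects provide. Whether this is exactly the cloning meant above is an assumption; the prefix, the bodies, and the helper name `finish_with` are invented for the sketch.

```python
import zlib

common_prefix = b"HEADER: shared preamble that every message starts with\n"

# Compress the shared part once.
base = zlib.compressobj()
head = base.compress(common_prefix)   # may be empty while zlib buffers input

def finish_with(suffix):
    """Continue from the shared state without disturbing `base`."""
    clone = base.copy()               # duplicate the compressor's state
    return head + clone.compress(suffix) + clone.flush()

stream_a = finish_with(b"body A")
stream_b = finish_with(b"body B")
```

Each result is a complete zlib stream for prefix plus body, and the prefix was only fed to the compressor once.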
Re: zlib interface semi-broken
On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
>> A simple way to fix this would be to add a finished attribute to the
>> Decompress object.
>
> Perhaps you could submit a patch with such a change?

Yes, I will try and get to that this week.

>> However, perhaps this would be a good time to discuss how this
>> library works; it is somewhat awkward and perhaps there are other
>> changes which would make it cleaner.
>
> Well, it might be improvable, I haven't really looked. I personally
> would like it and bz2 to get closer to each other in interface,
> rather than to spread out. So if you are really opening up a can of
> worms, I vote for two cans.

Well, I like this idea; perhaps this is a good time to discuss the equivalent of some abstract base classes, or interfaces, for compression.

As I see it, the fundamental abstractions are the stream-oriented de/compression routines. Given those, one should easily be able to implement one-shot de/compression of strings. In fact, that is the way that zlib is implemented; the base functions are the stream-oriented ones and there is a layer on top of convenience functions that do one-shot compression and decompression.

After examining the bz2 module, I notice that it has a file-like interface called BZ2File, which is roughly analogous to the gzip module. That file interface could form a third API, and basically conform to what python expects of files.

So what I suggest is a common framework of three APIs: a sequential compression/decompression API for streams, a layer (potentially generic) on top of those for strings/buffers, and a third API for file-like access. Presumably the file-like access can be implemented on top of the sequential API as well.

If the sequential de/compression routines are indeed primitive, and sufficient for the implementation of the other two APIs, then that gives us the option of implementing the upper two layers in pure python, potentially reducing the amount of extension code that has to be written. I see that as desirable, since it gives us options for writing the upper two layers: in pure python, or by writing extensions to the C code where available.

I seem to recall a number of ancillary functions in zlib, such as those for loading a compression dictionary. There are also options such as flushing the compression in order to be able to resynchronize should part of the archive become garbled. Where these functions are available, they could be implemented, though it would be desirable to give them the same name in each module to allow client code to test for their existence in a compression-agnostic way.

For what it's worth, I would rather see a pythonic interface to the libraries than a simple-as-can-be wrapper around the C functions. I personally find it annoying to have to drop down to non-OOP styles in a python program in order to use a C library. It doesn't matter to me whether the OOP layer is added atop the C library in pure python or in the C-to-python binding; that is an implementation detail to me, and I suspect to most python programmers. They don't care, they just want it easy to use from python. If performance turns out to matter, and the underlying compression library supports an upper layer in C, then we have the option of using that code.

So my suggestion is that we (the python users) brainstorm on how we want the API to look, and not focus on the underlying library except insofar as it informs our discussion of the proper APIs - for example, features such as flushing state, setting compression levels/windows, or resynchronization points.

My further suggestion is that we start with the sequential de/compression, since it seems like a fundamental primitive. De/compressing strings will be trivial, and the file-like interface is already described by Python.

So my first suggestion on the stream de/compression API thread is:

The sequential de/compression needs to be capable of returning more than just the de/compressed data. It should at least be capable of returning end-of-stream conditions and possibly other states as well. I see a few ways of implementing this:

1) The de/compression object holds state in various members such as data input buffers, data output buffers, and a state for indicating conditions such as synchronization points or end-of-stream. Member functions are called and primarily manipulate the data members of the object.

2) The de/compression object has routines for reading de/compressed data and states such as end-of-stream or resynchronization points as exceptions, much like the file class can throw EOFError. My problem with this is that client code has to be cognizant of the possible exceptions that might be thrown, and so one cannot easily add new exceptions should the need arise. For example, if we add an exception to indicate a possible resynchronization point, client code may not be capable of handling it as a non-fatal exception.

Thoughts?

--
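A rough sketch of the layering, following option 1 (state held in members rather than raised as exceptions). Everything named here is hypothetical: `StreamDecompressor`, `feed`, `finished`, `unused`, and `decompress_bytes` are invented for illustration. It assumes Python 3.3 or later, where both `zlib.decompressobj()` and `bz2.BZ2Decompressor()` objects offer `decompress()`, `eof`, and `unused_data`.

```python
import bz2
import zlib

class StreamDecompressor:
    """Hypothetical sequential layer; state is exposed as attributes."""

    def __init__(self, engine_factory=zlib.decompressobj):
        self._engine = engine_factory()
        self.finished = False      # end-of-stream state
        self.unused = b""          # bytes found after the end of the stream

    def feed(self, data):
        """Decompress one piece of input and refresh the state members."""
        out = self._engine.decompress(data)
        self.finished = self._engine.eof
        self.unused = self._engine.unused_data
        return out

def decompress_bytes(data, engine_factory=zlib.decompressobj):
    """One-shot layer written purely in terms of the sequential layer."""
    stream = StreamDecompressor(engine_factory)
    out = stream.feed(data)
    if not stream.finished:
        raise ValueError("compressed data ended before end of stream")
    return out

# The same upper layer serves both libraries.
from_zlib = decompress_bytes(zlib.compress(b"abc" * 50))
from_bz2 = decompress_bytes(bz2.compress(b"hello"), bz2.BZ2Decompressor)
```

Because callers read `finished` instead of catching an exception, a later addition such as a synchronization-point flag would not break existing client code, which is the concern raised about option 2.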
Re: zlib interface semi-broken
Scott David Daniels scott.dani...@acm.org writes:
> I suspect that is why such an interface never came up. If you can
> clone states, then you can say: compress this, then use the
> resultant state to compress/decompress others.

The zlib C interface supports something like that. It is just not exported to the python application. It should be.