Re: zlib interface semi-broken

2009-08-21 Thread Travis
I've come up with a good test for issue5210 and uploaded it to the bug tracker.

This patch should be ready for inclusion now.
-- 
Obama Nation | My emails do not have attachments; it's a digital signature
that your mail program doesn't understand. | 
http://www.subspacefield.org/~travis/ 
If you are a spammer, please email j...@subspacefield.org to get blacklisted.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: zlib interface semi-broken

2009-02-21 Thread Aahz
In article mailman.9414.1234459585.3487.python-l...@python.org,
Travis  travis+ml-pyt...@subspacefield.org wrote:

So I've submitted a patch to bugs.python.org to add a new member
called is_finished to the zlib decompression object.

Issue 5210, file 13056, msg 81780

You may also want to bring this up on the python-ideas mailing list for
further discussion.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Weinberg's Second Law: If builders built buildings the way programmers wrote 
programs, then the first woodpecker that came along would destroy civilization.


Re: zlib interface semi-broken

2009-02-12 Thread Travis
So I've submitted a patch to bugs.python.org to add a new member
called is_finished to the zlib decompression object.

Issue 5210, file 13056, msg 81780
-- 
Crypto ergo sum.  http://www.subspacefield.org/~travis/
Do unto other faiths as you would have them do unto yours.
If you are a spammer, please email j...@subspacefield.org to get blacklisted.


Re: zlib interface semi-broken

2009-02-11 Thread Scott David Daniels

Travis wrote:

On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:

 I personally would like it and bz2 to get closer to each other...


Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some abstract base classes, or interfaces, for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines.  Given those, one should easily be able to
implement one-shot de/compression of strings.  In fact, that is the
way that zlib is implemented; the base functions are the
stream-oriented ones and there is a layer on top of convenience
functions that do one-shot compression and decompression.


There are a couple of things here to think about.  I've wanted to
do some low-level (C-coded) search w/o bothering to create strings
until a match.  I've no idea how to push this down in, but I may be
looking for a nice low-level spot to fit.  Characteristics for that
could be read-only access to small expansion parts w/o copying them
out.  Also, in case of a match, a (relatively quick) way to mark points
as we proceed and a (possibly slower) way to restore from one or
more marked points.

Also, another programmer wants to parallelize _large_ bzip file
expansion by expanding independent blocks in separate threads (we
know how to find safe start points).  To get such code to work, we
need to find big chunks of computation, and (at least optionally)
surround them with GIL release points.
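
A minimal sketch of the threading idea, under a simplifying assumption:
each input here is a complete, independent bz2 stream, not a block cut
out of the middle of one file (real bzip2 blocks start on bit
boundaries, which this does not handle).  expand_parallel is a
hypothetical helper name.  Whether the threads really overlap depends
on the bz2 binding releasing the GIL while it decompresses, which as
far as I know CPython's does.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def expand_parallel(streams, workers=4):
    """Decompress independent bz2 streams in a thread pool, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(bz2.decompress, streams))


# Made-up sample data: three separately compressed pieces.
parts = [b"alpha " * 1000, b"beta " * 1000, b"gamma " * 1000]
streams = [bz2.compress(p) for p in parts]
expanded = expand_parallel(streams)
```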


So what I suggest is a common framework of three APIs: a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of those for strings/buffers, and a third API for
file-like access.  Presumably the file-like access can be implemented
on top of the sequential API as well.

If we have to be able to start from arbitrary points in bzip files, they
have one nasty characteristic: they are bit-serial, and we'll need to
start them at arbitrary _bit_ points (not simply byte boundaries).

One structure I have used for searching is a result iterator fed by
a source iterator, so rather than a read w/ inconvenient boundaries
the input side of the thing calls the 'next' method of the provided
source.
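
A rough sketch of that shape, using zlib as the stand-in library:
a generator that pulls compressed pieces from whatever source iterator
it is given and yields decompressed pieces.  The name
decompressed_chunks and the 64-byte slicing are illustrative only, and
the eof attribute assumes Python 3.3 or later.

```python
import zlib


def decompressed_chunks(source):
    """Yield decompressed pieces, pulling compressed pieces from `source`."""
    d = zlib.decompressobj()
    for piece in source:
        out = d.decompress(piece)
        if out:
            yield out
        if d.eof:
            # End of the compressed stream: stop pulling from the source.
            return
    tail = d.flush()
    if tail:
        yield tail


payload = b"needle in a haystack " * 200
blob = zlib.compress(payload)
# The source iterator hands over the data at arbitrary boundaries.
source = (blob[i:i + 64] for i in range(0, len(blob), 64))
result = b"".join(decompressed_chunks(source))
```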


... I would rather see a pythonic interface to the libraries than a
simple-as-can-be wrapper around the C functions

I'm on board with you here.


My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.

Well, to be explicit, are we talking about decompression and compression
simultaneously, or do we want to start with one of them first?


2) The de/compression object has routines for reading de/compressed
data and states such as end-of-stream or resynchronization points as
exceptions, much like the file class can throw EOFError.  My problem
with this is that client code has to be cognizant of the possible
exceptions that might be thrown, and so one cannot easily add new
exceptions should the need arise.  For example, if we add an exception
to indicate a possible resynchronization point, client code may not
be capable of handling it as a non-fatal exception.


Seems like we may want to say things like, synchronization points are
to be silently ignored.

--Scott David Daniels
scott.dani...@acm.org


Re: zlib interface semi-broken

2009-02-11 Thread Paul Rubin
Scott David Daniels scott.dani...@acm.org writes:
 Seems like we may want to say things like, synchronization points are
 to be silently ignored.

That would completely break some useful possible applications, so should
be avoided.


Re: zlib interface semi-broken

2009-02-11 Thread Scott David Daniels

Paul Rubin wrote:

Scott David Daniels scott.dani...@acm.org writes:

Seems like we may want to say things like, synchronization points are
to be silently ignored.


That would completely break some useful possible applications, so should
be avoided.

No, I mean that we, _the_users_of_the_interface_, may want to say that.
That is, I'd like that behavior as an option.

-Scott David Daniels
scott.dani...@acm.org



Re: zlib interface semi-broken

2009-02-11 Thread Scott David Daniels

Paul Rubin wrote:

Scott David Daniels scott.dani...@acm.org writes:

I suspect that is why such an interface never came up.  If
you can clone states, then you can say: compress this, then use the
resultant state to compress/decompress others. 


The zlib C interface supports something like that.  It is just not
exported to the python application.  It should be.

Right, we are gathering ideas we'd like to see available to the Python
programmer in the new zlib / bz2 agglomeration that we are thinking of
building / proposing.

--Scott David Daniels
scott.dani...@acm.org


Re: zlib interface semi-broken

2009-02-11 Thread Scott David Daniels

Scott David Daniels wrote:

... I've wanted to do some low-level (C-coded) search w/o bothering
to create strings until a match

Here's a more common use case: signature gathering on the contents.
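
A small sketch of that use case with today's zlib and hashlib: the
digest is updated piece by piece as the data is expanded, so the full
uncompressed contents are never assembled into one string.
digest_of_contents is a hypothetical name and SHA-256 is just an
example choice of signature.

```python
import hashlib
import zlib


def digest_of_contents(compressed_chunks):
    """Hash the decompressed contents without joining them into one string."""
    d = zlib.decompressobj()
    h = hashlib.sha256()
    for chunk in compressed_chunks:
        h.update(d.decompress(chunk))
    h.update(d.flush())
    return h.hexdigest()


data = b"some archive member contents\n" * 500
blob = zlib.compress(data)
chunks = [blob[i:i + 128] for i in range(0, len(blob), 128)]
signature = digest_of_contents(chunks)
```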

--Scott David Daniels
scott.dani...@acm.org


Re: zlib interface semi-broken

2009-02-11 Thread Paul Rubin
Scott David Daniels scott.dani...@acm.org writes:
  Seems like we may want to say things like, synchronization points are
  to be silently ignored.
 No, I mean that we, _the_users_of_the_interface_, may want to say, 
 That is, I'd like that behavior as an option.

I don't see any reason to want that (rather than letting the application
handle it) but I'll take your word for it.


zlib interface semi-broken

2009-02-10 Thread Travis
Hello all,

The zlib interface does not indicate when you've hit the end of a compressed 
stream.

The underlying zlib functionality provides for this.

With python's zlib, you have to read past the compressed data and into
the uncompressed, which gets stored in Decompress.unused_data.
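
To illustrate the behavior described above, here is a short sketch.
The trailer bytes are made-up protocol data.  In the first case the
whole compressed stream has arrived and been expanded, yet unused_data
is still empty, so by itself it cannot distinguish "stream finished"
from "more compressed data to come".

```python
import zlib

body = b"compressed reply"
compressed = zlib.compress(body)
trailer = b"PLAIN TEXT FOLLOWS"  # made-up uncompressed protocol data

# Case 1: exactly the compressed stream has arrived, nothing after it.
d1 = zlib.decompressobj()
out1 = d1.decompress(compressed)
unused1 = d1.unused_data  # still empty

# Case 2: bytes beyond the end of the stream have arrived as well.
d2 = zlib.decompressobj()
out2 = d2.decompress(compressed + trailer)
unused2 = d2.unused_data  # now holds the trailing bytes
```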

As a result, if you've got a network protocol which mixes compressed
and non-compressed output, you may find a compressed block ending with
no uncompressed data following until you send another command -- which
a synchronous (non-pipelined) client will not send, because it is waiting
for the [compressed] data from the previous command to be finished.

As a result, you get a protocol deadlock.

A simple way to fix this would be to add a finished attribute to the
Decompress object.
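
For readers on later Python versions: 3.3 and up expose an attribute of
this kind as Decompress.eof.  The sketch below assumes that attribute;
read_compressed_block and the recv callable are hypothetical names for
illustration.  The loop stops as soon as the stream ends, without
asking the peer for more bytes, which is what avoids the deadlock.

```python
import io
import zlib


def read_compressed_block(recv, bufsize=4096):
    """Read via `recv` until the zlib stream ends; return (data, leftover)."""
    d = zlib.decompressobj()
    parts = []
    while not d.eof:
        chunk = recv(bufsize)
        if not chunk:
            raise EOFError("connection closed inside a compressed block")
        parts.append(d.decompress(chunk))
    # Anything received past the end of the stream is handed back.
    return b"".join(parts), d.unused_data


# Simulated transport: a buffer holding one compressed reply and no more.
wire = io.BytesIO(zlib.compress(b"220 reply body\r\n"))
data, leftover = read_compressed_block(wire.read)
```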

However, perhaps this would be a good time to discuss how this library
works; it is somewhat awkward and perhaps there are other changes which
would make it cleaner.

What does the python community think?


Re: zlib interface semi-broken

2009-02-10 Thread Scott David Daniels

Travis wrote:

The zlib interface does not indicate when you've hit the end of a
compressed stream.

The underlying zlib functionality provides for this.

With python's zlib, you have to read past the compressed data and into
the uncompressed, which gets stored in Decompress.unused_data.
... [good explanation of why this is problematic] ...
A simple way to fix this would be to add a finished attribute to the
Decompress object.

Perhaps you could submit a patch with such a change?


However, perhaps this would be a good time to discuss how this library
works; it is somewhat awkward and perhaps there are other changes which
would make it cleaner.

Well, it might be improvable, I haven't really looked.  I personally
would like it and bz2 to get closer to each other in interface, rather
than to spread out.  So if you are really opening up a can of worms,
I vote for two cans.

--Scott David Daniels
scott.dani...@acm.org





Re: zlib interface semi-broken

2009-02-10 Thread Paul Rubin
Travis travis+ml-pyt...@subspacefield.org writes:
 However, perhaps this would be a good time to discuss how this library
 works; it is somewhat awkward and perhaps there are other changes which
 would make it cleaner.
 
 What does the python community think?

It is missing some other features too, like the ability to preload
a dictionary.  I'd support extending the interface.  
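
For reference, later Python versions (3.3 and up) accept a preset
dictionary through a zdict argument.  A minimal sketch assuming that
argument; the dictionary and message contents are made up, and both
sides must be given the same dictionary.

```python
import zlib

# Made-up dictionary of byte strings expected to recur in messages.
shared = b"GET POST HTTP/1.1 Content-Type: Content-Length: "
message = b"POST /x HTTP/1.1 Content-Type: text/plain Content-Length: 0"

c = zlib.compressobj(zdict=shared)
packed = c.compress(message) + c.flush()

# The decompressor needs the identical dictionary to expand the data.
d = zlib.decompressobj(zdict=shared)
unpacked = d.decompress(packed) + d.flush()
```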


Re: zlib interface semi-broken

2009-02-10 Thread Scott David Daniels

Paul Rubin wrote:

Travis travis+ml-pyt...@subspacefield.org writes:

However, perhaps this would be a good time to discuss how [zlib] works...

It is missing some other features too, like the ability to preload
a dictionary.  I'd support extending the interface.


The trick to defining a preload interface is avoiding creating a brittle
interface -- the saved preload should be usable across machines and
versions.  I suspect that is why such an interface never came up.  If
you can clone states, then you can say: compress this, then use the
resultant state to compress/decompress others.  Suddenly there is no
nasty problem guessing what to parameterize and what to fix in stone.
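
A sketch of the clone-the-state idea using the copy() methods that
CPython's zlib compression and decompression objects provide.  The
shared prefix and messages are made up, and finish is a hypothetical
helper.  The prefix is compressed once and sync-flushed; each message
is then compressed from a copy of that state.

```python
import zlib

common = b"header shared by every message\n" * 20

# Compress the shared prefix once; the sync flush pushes out all
# pending output so the saved state starts at a clean boundary.
c = zlib.compressobj()
prefix = c.compress(common) + c.flush(zlib.Z_SYNC_FLUSH)


def finish(state, payload):
    """Compress `payload` from a clone of `state`, leaving `state` intact."""
    branch = state.copy()
    return branch.compress(payload) + branch.flush()


tail_a = finish(c, b"message A")
tail_b = finish(c, b"message B")

# The receiving side primes a decompressor with the prefix once and
# clones it for each message.
d = zlib.decompressobj()
d.decompress(prefix)
da = d.copy()
text_a = da.decompress(tail_a) + da.flush()
```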


--Scott David Daniels
scott.dani...@acm.org


Re: zlib interface semi-broken

2009-02-10 Thread Travis
On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
 A simple way to fix this would be to add a finished attribute to the
 Decompress object.
 Perhaps you could submit a patch with such a change?

Yes, I will try and get to that this week.

 However, perhaps this would be a good time to discuss how this library
 works; it is somewhat awkward and perhaps there are other changes which
 would make it cleaner.
 Well, it might be improvable, I haven't really looked.  I personally
 would like it and bz2 to get closer to each other in interface, rather
 than to spread out.  So if you are really opening up a can of worms,
 I vote for two cans.

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some abstract base classes, or interfaces, for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines.  Given those, one should easily be able to
implement one-shot de/compression of strings.  In fact, that is the
way that zlib is implemented; the base functions are the
stream-oriented ones and there is a layer on top of convenience
functions that do one-shot compression and decompression.

After examining the bz2 module, I notice that it has a file-like
interface called BZ2File, which is roughly analogous to the gzip
module.  That file interface could form a third API, and basically
conform to what python expects of files.
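
As a rough illustration of how close the two file-like layers already
are, the sketch below drives gzip and bz2 through the same code.  It
assumes a Python version where gzip.open and bz2.open both accept an
existing file object (3.3 or later); roundtrip is a hypothetical
helper.

```python
import bz2
import gzip
import io

text = b"line one\nline two\n"


def roundtrip(opener):
    """Write `text` through a compressing file object, then read it back."""
    raw = io.BytesIO()
    with opener(raw, "wb") as f:
        f.write(text)
    raw.seek(0)
    with opener(raw, "rb") as f:
        # Ordinary file methods work on the compressed stream.
        return f.readlines()


lines_gzip = roundtrip(gzip.open)
lines_bz2 = roundtrip(bz2.open)
```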

So what I suggest is a common framework of three APIs: a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of those for strings/buffers, and a third API for
file-like access.  Presumably the file-like access can be implemented
on top of the sequential API as well.

If the sequential de/compression routines are indeed primitive, and
sufficient for the implementation of the other two APIs, then that
gives us the option of implementing the upper two layers in pure
python, potentially reducing the amount of extension code that has to
be written.  I see that as desirable, since it gives us two options
for writing the upper layers: in pure python, or by writing
extensions to the C code where available.
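
One way the layering could look, as a sketch only.  The class and
method names (StreamDecompressor, feed, finished, decompress_string)
are invented for illustration and are not an existing or proposed API.
The one-shot string layer is pure Python and uses nothing but the
sequential interface, so it works unchanged for either library.  The
eof attributes assume Python 3.3 or later.

```python
import bz2
import zlib
from abc import ABC, abstractmethod


class StreamDecompressor(ABC):
    """Hypothetical common sequential interface."""

    @abstractmethod
    def feed(self, data):
        """Consume compressed bytes, return whatever has been expanded."""

    @property
    @abstractmethod
    def finished(self):
        """True once the end of the compressed stream has been seen."""


class ZlibStream(StreamDecompressor):
    def __init__(self):
        self._d = zlib.decompressobj()

    def feed(self, data):
        return self._d.decompress(data)

    @property
    def finished(self):
        return self._d.eof


class Bz2Stream(StreamDecompressor):
    def __init__(self):
        self._d = bz2.BZ2Decompressor()

    def feed(self, data):
        return self._d.decompress(data)

    @property
    def finished(self):
        return self._d.eof


def decompress_string(stream, data, chunk=1024):
    """One-shot layer built only on the sequential layer."""
    out = []
    for i in range(0, len(data), chunk):
        out.append(stream.feed(data[i:i + chunk]))
        if stream.finished:
            break
    if not stream.finished:
        raise ValueError("compressed data ended prematurely")
    return b"".join(out)


payload = b"the same client code for both libraries " * 100
from_zlib = decompress_string(ZlibStream(), zlib.compress(payload))
from_bz2 = decompress_string(Bz2Stream(), bz2.compress(payload))
```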

I seem to recall a number of ancillary functions in zlib, such as
those for loading a compression dictionary.  There are also options
such as flushing the compression in order to be able to resynchronize
should part of the archive become garbled.  Where these functions are
available, they could be implemented, though it would be desirable to
give them the same name in each module to allow client code to test
for their existence in a compression-agnostic way.
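
A tiny sketch of what such a test could look like from client code.
supports is a hypothetical helper; the example probes for a copy
method, which the zlib decompression object has and the bz2 one does
not.

```python
import bz2
import zlib


def supports(obj, feature):
    """Report whether a de/compression object offers an optional method."""
    return callable(getattr(obj, feature, None))


z = zlib.decompressobj()
b = bz2.BZ2Decompressor()

zlib_can_copy = supports(z, "copy")
bz2_can_copy = supports(b, "copy")
```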

For what it's worth, I would rather see a pythonic interface to the
libraries than a simple-as-can-be wrapper around the C functions.  I
personally find it annoying to have to drop down to non-OOP styles in
a python program in order to use a C library.  It doesn't matter to me
whether the OOP layer is added atop the C library in pure python or in
the C-to-python binding; that is an implementation detail to me, and I
suspect to most python programmers.  They don't care, they just want
it easy to use from python.  If performance turns out to matter, and
the underlying compression library supports an upper layer in C,
then we have the option for using that code.

So my suggestion is that we (the python users) brainstorm on how we
want the API to look, and not focus on the underlying library except
insofar as it informs our discussion of the proper APIs - for example,
features such as flushing state, setting compression levels/windows,
or for resynchronization points.

My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.

So my first suggestion on the stream de/compression API thread is:

The sequential de/compression needs to be capable of returning
more than just the de/compressed data.  It should at least be
capable of returning end-of-stream conditions and possibly
other states as well.  I see a few ways of implementing this:

1) The de/compression object holds state in various members such as
data input buffers, data output buffers, and a status member indicating
conditions such as synchronization points or end-of-stream.  Member
functions are called and primarily manipulate the data members of the
object.

2) The de/compression object has routines for reading de/compressed
data, and signals states such as end-of-stream or resynchronization
points as exceptions, much like the file class can throw EOFError.  My problem
with this is that client code has to be cognizant of the possible
exceptions that might be thrown, and so one cannot easily add new
exceptions should the need arise.  For example, if we add an exception
to indicate a possible resynchronization point, client code may not
be capable of handling it as a non-fatal exception.
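
The two options can be sketched side by side.  Everything named here
(StreamState, StatefulDecompressor, EndOfStream, feed_or_raise) is
hypothetical, and only end-of-stream is modelled, since that is the one
condition Python's zlib reports (through eof, Python 3.3 or later).
With option 1, adding a new state later does not disturb callers that
never look at it; with option 2, every caller has to know which
exceptions are non-fatal.

```python
import enum
import zlib


class StreamState(enum.Enum):
    RUNNING = "running"
    END_OF_STREAM = "end-of-stream"


class StatefulDecompressor:
    """Option 1: conditions are reported through a `state` member."""

    def __init__(self):
        self._d = zlib.decompressobj()
        self.state = StreamState.RUNNING
        self.leftover = b""

    def feed(self, data):
        out = self._d.decompress(data)
        if self._d.eof:
            self.state = StreamState.END_OF_STREAM
            self.leftover = self._d.unused_data
        return out


class EndOfStream(Exception):
    """Option 2: the same condition delivered as an exception."""

    def __init__(self, data):
        super().__init__("end of compressed stream")
        self.data = data  # output produced by the final call


def feed_or_raise(dec, data):
    out = dec.feed(data)
    if dec.state is StreamState.END_OF_STREAM:
        raise EndOfStream(out)
    return out


blob = zlib.compress(b"hello")

# Option 1: the caller inspects the state when it cares to.
s = StatefulDecompressor()
out = s.feed(blob + b"NEXT")

# Option 2: the caller must catch the exception to get the last data.
try:
    feed_or_raise(StatefulDecompressor(), blob)
    last = None
except EndOfStream as e:
    last = e.data
```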

Thoughts?

Re: zlib interface semi-broken

2009-02-10 Thread Paul Rubin
Scott David Daniels scott.dani...@acm.org writes:
 I suspect that is why such an interface never came up.  If
 you can clone states, then you can say: compress this, then use the
 resultant state to compress/decompress others. 

The zlib C interface supports something like that.  It is just not
exported to the python application.  It should be.