Re: next version of content-encoding / gzip design doc

2004-03-09 Thread Jon Kay
Here's yet another design version, following many helpful suggestions
from Henrik.


  Gzip Content-Encoding in Squid Design


Version Choice

The goal will be to get these changes into Squid3 HEAD.


Content-Encoding Protocol

The content-encoding protocol is describedi

Header field cases from client:

If Accept-Encoding field is present in client request

    If there is a cached response already available, and it
    contains a Content-Encoding field with encodings that are a
    subset of what the client accepts

        Then forward response to client unchanged

    Else (no cached response with right content-encoding)

        If uncoded response isn't available

            Then forward client request to server/cache

            If server/cache response contains Content-Encoding field

                Then forward new response to client

            Else (server/cache response doesn't have Content-Encoding)

                Then encode client response
                Send encoded response to client

        Else (uncoded server response already available)

            Then encode uncoded response
            Send encoded response to client

Else (no Accept-Encoding in client request)

    If uncoded server response already available

        Forward unchanged to client

    Else if coded server response already available

        Then decode server response
        Send decoded response to client

    Else (no response available yet)

        Then forward request to server or cache, and behave unchanged
        with respect to this protocol.
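
The case analysis above can be sketched as a single decision function.
This is only an illustration in Python, not Squid code: the function
name, its parameters, and the returned action strings are all
assumptions made for the example.

```python
from typing import Optional, Set


def choose_action(accept_encoding: Optional[Set[str]],
                  cached_coded: Optional[Set[str]],
                  has_uncoded: bool) -> str:
    """Pick an action following the case analysis in the design.

    accept_encoding: codings the client accepts, or None when the
        request carries no Accept-Encoding field.
    cached_coded: codings of a cached coded response, or None when no
        coded response is cached.
    has_uncoded: True when an uncoded response is already cached.
    """
    if accept_encoding is not None:
        # Cached coded response whose codings are a subset of what
        # the client accepts.
        if cached_coded is not None and cached_coded <= accept_encoding:
            return "forward cached response unchanged"
        if not has_uncoded:
            # The reply is forwarded as-is if it already has a
            # Content-Encoding field, and encoded otherwise.
            return "fetch from server/cache"
        return "encode cached uncoded response"
    # No Accept-Encoding in the client request.
    if has_uncoded:
        return "forward cached response unchanged"
    if cached_coded is not None:
        return "decode cached response"
    return "fetch from server/cache"
```

For example, a client sending "Accept-Encoding: gzip" when only an
uncoded copy is cached gets the "encode cached uncoded response"
branch.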

There will be no explicit links between objects that are different
codings of the same object.  Instead, StoreKeys of coded objects will be
chosen as MD5(OriginalStoreKey,Content-Encoding). This
would allow one to derive the StoreKeys of all possible encodings
including original if only knowing the original StoreKey and not the
requested URL.
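
A minimal sketch of that key derivation, in Python for illustration
only.  The design does not say how the two MD5 inputs are combined;
feeding the original key and then the lower-cased coding name into the
digest is an assumption, as are the function names and the sample key.

```python
import hashlib


def coded_store_key(original_key: bytes, content_encoding: str) -> bytes:
    """Derive the key of a coded variant from the original key.

    Assumption: the digest covers the original key bytes followed by
    the lower-cased coding name.
    """
    md5 = hashlib.md5()
    md5.update(original_key)
    md5.update(content_encoding.lower().encode("ascii"))
    return md5.digest()


def all_variant_keys(original_key: bytes, encodings=("gzip",)) -> dict:
    """Map each coding name to its key; 'identity' is the original."""
    keys = {"identity": original_key}
    for enc in encodings:
        keys[enc] = coded_store_key(original_key, enc)
    return keys


# Hypothetical original key, standing in for a real StoreKey.
original = hashlib.md5(b"GET http://example.com/").digest()
variants = all_variant_keys(original)
```

Only the original key is needed to compute every variant key, which is
the property the paragraph above relies on.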

Searching for an uncoded version of an object is done by generating an
uncoded StoreKey and looking for an object with that key.  It's needed
upon cache miss (see protocol above).

Upon original or encoded object update or PURGE, delete all the
possible encoding variants. As the encodings are applied locally the
possible combinations are known and finite so there is no problem on
purging all at once.  If the number of encodings grows nontrivially,
we may need to add an additional mechanism to keep that check under
control.
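
A rough sketch of the purge, in Python for illustration only.  The
store is modelled as a plain dict and the list of locally supported
encodings is an assumption; the sketch also assumes the original
StoreKey is in hand when the purge starts.

```python
import hashlib

# Assumption: the set of encodings this proxy can apply locally.
SUPPORTED_ENCODINGS = ("gzip",)


def variant_keys(original_key: bytes) -> list:
    """Original key plus the derived key of every supported encoding."""
    keys = [original_key]
    for enc in SUPPORTED_ENCODINGS:
        keys.append(hashlib.md5(original_key + enc.encode("ascii")).digest())
    return keys


def delete_coded_copies(store: dict, original_key: bytes) -> int:
    """Remove the original and all coded variants; return the count."""
    removed = 0
    for key in variant_keys(original_key):
        if store.pop(key, None) is not None:
            removed += 1
    return removed
```

The loop is bounded by the number of supported encodings, so the cost
per purge stays small as long as that list stays short.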

Original-update deletion will be triggered on swapout of a new
original object (when it gets a public key).

Etags: Encoded objects will be given unique new entity tags.

There will be a configuration option to turn off content-encoding.


Content-Encoding Implementation

New HttpHdrContCode module, that parses related HTTP headers, and
arranges for encoding or decoding appropriately.  Includes the
following functions:

  codeParseRequest(): Called from client_side:parseHttpRequest()
  after clientStreamInit() call.  Checks for and parses Accept-Encoding
  headers.  Instantiates content_coding appropriately, and calls
  codeClientStreamInit().

  codeClientStreamInit(): Adds a new node to clientStream with
  codeStreamRead(), codeStreamCallback(), and codeStreamStatus() functions.

  codeStreamCallback(): Sets up encoding/decoding state depending on
  combination of Content-Encoding and Accept-Encoding fields seen.

  codeStreamRead(): Calls HttpContentCoder transformation functions
  appropriately.

  codeStreamStatus(): Reports status to stream.


New HttpContentCoder abstract type, with functions:

  encodeStart()
  encodeEnd()
  encodeChunk()

  decodeStart()
  decodeEnd()
  decodeChunk()
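
To make the shape of this interface concrete, here is a sketch in
Python (for illustration only; the real type would live in Squid's own
code base).  The method signatures, the IdentityCoder class, and the
run_encode() helper are assumptions; only the six operation names come
from the list above.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class HttpContentCoder(ABC):
    """Abstract coder: start, feed chunks, then end, for each direction."""

    @abstractmethod
    def encode_start(self) -> None: ...

    @abstractmethod
    def encode_chunk(self, data: bytes) -> bytes: ...

    @abstractmethod
    def encode_end(self) -> bytes: ...

    @abstractmethod
    def decode_start(self) -> None: ...

    @abstractmethod
    def decode_chunk(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decode_end(self) -> bytes: ...


class IdentityCoder(HttpContentCoder):
    """Hypothetical pass-through coder, used only to show the call order."""

    def encode_start(self) -> None:
        pass

    def encode_chunk(self, data: bytes) -> bytes:
        return data

    def encode_end(self) -> bytes:
        return b""

    def decode_start(self) -> None:
        pass

    def decode_chunk(self, data: bytes) -> bytes:
        return data

    def decode_end(self) -> bytes:
        return b""


def run_encode(coder: HttpContentCoder, chunks: Iterable[bytes]) -> bytes:
    """Drive a coder over a sequence of chunks and collect its output."""
    coder.encode_start()
    body = b"".join(coder.encode_chunk(c) for c in chunks)
    return body + coder.encode_end()
```

The End call returns bytes because a streaming coder may hold back
buffered output until it is told the input is complete.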


New per-coded-object ContentCoderState, to handle coding state.  It'll
be referenced from the clientStream, and include fields:

  HttpContentCoder *coder
  off_t codedOffset


Objects will be stored both in unencoded and encoded formats.  An
object will stay in the format in which Squid receives it until
requested by a client requesting a different Content-Encoding which
Squid supports (this could be immediate).  Once this happens, the
object will be streamed coded into a different StoreEntry and on to
the client.


Other changes needed:

Add new content_coding field to HttpReply.

New httpHeaderGetContentEncoding(HttpReply *) function in HttpHeader.cc.

A new configuration flag to turn content-encoding off, if desired.

A new object flag, encoded.  Whenever an encoded or decoded object
is created, it's tagged as encoded.  Thus, a locally recoded
object will be identifiable as such.

A new store.cc function, storeDeleteCodedCopies(), will do the
deletion of all (un)coded copies described above.


Gzip

A new GzipContentCoder module, which will be an instance of
HttpContentCoder.

Data encoding will be handled by the gzip.org zlib library
(http://www.gzip.org/zlib/).
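
As a rough illustration of streaming gzip coding on top of zlib, here
is a sketch using Python's zlib binding rather than the C library the
design would call.  The class layout and method names are assumptions;
the only zlib-specific detail is asking for window bits of
16 + MAX_WBITS, which makes zlib emit and expect the gzip header and
trailer.

```python
import zlib

# 16 + MAX_WBITS selects the gzip container instead of raw zlib.
GZIP_WBITS = 16 + zlib.MAX_WBITS


class GzipContentCoder:
    """Chunk-at-a-time gzip encoder/decoder (illustrative sketch)."""

    def encode_start(self) -> None:
        self._enc = zlib.compressobj(
            zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, GZIP_WBITS)

    def encode_chunk(self, data: bytes) -> bytes:
        # May return b"" while zlib is still buffering input.
        return self._enc.compress(data)

    def encode_end(self) -> bytes:
        # Flushes buffered data and writes the gzip trailer.
        return self._enc.flush()

    def decode_start(self) -> None:
        self._dec = zlib.decompressobj(GZIP_WBITS)

    def decode_chunk(self, data: bytes) -> bytes:
        return self._dec.decompress(data)

    def decode_end(self) -> bytes:
        return self._dec.flush()
```

Because each chunk call may return nothing, the caller has to treat an
empty result as normal and keep feeding data until the End call.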

Functions:
  gzEncodeStart: call 

Re: next version of content-encoding / gzip design doc

2004-03-08 Thread Jon Kay
Henrik Nordstrom wrote:

> On Fri, 5 Mar 2004, Jon Kay wrote:
>
> > If Accept-Encoding field is present in client request
> >
> > If server or cache response contains Content-Encoding field with
> > encodings that are a subset of what the client accepts
>
> This must be relaxed to just "contains a Content-Encoding field", ignoring
> whether it is acceptable to the client. If not you run into ugly corner cases
> if the server ignores what the client accepts.


OOPS.  I misstated this test.

It SHOULD be:

If Accept-Encoding field is present in client request

    If there is a cached response already available, and it
    contains a Content-Encoding field with encodings that are a
    subset of what the client accepts

        Then forward response to client unchanged

    Else (no cached response with right content-encoding)

...

otherwise the same.



Re: next version of content-encoding / gzip design doc

2004-03-04 Thread Jon Kay
Henrik Nordstrom wrote:

> On Wed, 3 Mar 2004, Jon Kay wrote:
>
> > Because current browser implementations treat Content-Encoding much as
> > though it was Transfer-Encoding, we will implement Content-Encoding and
> > Accept-Encoding as though they were actually the Transfer-Encoding and
> > TE described in the HTTP specifications.
>
> This part I do not understand.
>
> Content-Encoding and Transfer-Encoding are fundamentally different in
> their operation far beyond the hop-by-hop vs end-to-end difference. You
> can not interchange one for the other.
>
> It is not safe to assume a client accepts gzip TE only because it
> accepts gzip content-encoding. For one thing the message format is
> completely different.
>
> > Etags of replies encoded by Squid will be modified to turn them into
> > weak tags if they are not already so.
>
> Why do you oppose creating new unique ETags?
>
> > There will be a configuration option to turn off content-encoding.
>
> Granted, and this will default off in the standard distribution, as any
> other option which violates the semantically transparent HTTP proxy
> requirements.
>
> > Content-Encoding Implementation
>
> No comments there.
>
> > Objects will be stored both in unencoded and encoded formats. An object
> > will stay in the format in which Squid receives it until requested by a
> > client requesting a different Content-Encoding which Squid supports
> > (this could be immediate). Once this happens, the object will be
> > streamed coded into a different StoreEntry and on to the client.
>
> Ok.
>
> > A new store_dup module will be created to manage dup store_entries and
> > make sure duplicate entries are invalidated when a new version of an
> > object is read. It consists of a circular list of StoreEntry pointers
> > named dupnext and dupprev. When a new duplicate encoding (or
> > decoding) of an object is created, it's added to the list. When any
> > StoreEntry is invalidated or updated, all dups are invalidated.
>
> Looks a little too complex to me.
>
> Wouldn't something simpler like the following work:
>
> Modify the store key to account for content encoding.
>
> Add an internal meta object listing the known content encodings of a given
> object. When a new encoding is added rewrite this object to add the new
> encoding name.
>
> On cache hits, iterate over the known acceptable encodings until a match
> is found in the cache.
>
> In recoded objects include a meta header indicating the identity of the
> original object and disregard the recoded object on a cache hit if it no
> longer matches the original.
>
> From what I can tell the above would also work for adding server-driven
> Content-Encoding negotiation to the proxy to complement the use of Vary
> (which most mod_gzip servers do not support btw).
>
> Regards
> Henrik



Re: next version of content-encoding / gzip design doc

2004-03-04 Thread Jon Kay
> Content-Encoding and Transfer-Encoding are fundamentally different in
> their operation far beyond the hop-by-hop vs end-to-end difference. You
> can not interchange one for the other.
>
> It is not safe to assume a client accepts gzip TE only because it
> accepts gzip content-encoding. For one thing the message format is
> completely different.

Yes.  I'm going to try a different tack to explanation /
underpinnings.

Now I'm going to outline it by case analysis:

Protocol:

Header field cases from client:

If Accept-Encoding field is present in client request

    If server or cache response contains Content-Encoding field with
    encodings that are a subset of what the client accepts

        Then forward response to client unchanged

    Else (no helpful content-encoding field)

        If uncoded response isn't available

            Then forward client request to server/cache

            If server/cache response contains Content-Encoding field

                Then forward new response to client
                Add this response to duplicate list for the object

            Else (server/cache response doesn't have Content-Encoding)

                Then encode client response
                Add encoded response to duplicate list for the object
                Send encoded response to client

        Else (uncoded server response already available)

            Then encode uncoded response
            Add encoded response to duplicate list for the object
            Send encoded response to client

Else (no Accept-Encoding in client request)

    If uncoded server response already available

        Forward unchanged to client

    Else if coded server response already available

        Then decode server response
        Add decoded response to duplicate list for the object
        Send decoded response to client

    Else (no response available yet)

        Then forward request to server or cache, and behave unchanged
        with respect to this protocol.





Re: next version of content-encoding / gzip design doc

2004-03-03 Thread Henrik Nordstrom
On Wed, 3 Mar 2004, Jon Kay wrote:

> Because current browser implementations treat Content-Encoding much as
> though it was Transfer-Encoding, we will implement Content-Encoding and
> Accept-Encoding as though they were actually the Transfer-Encoding and
> TE described in the HTTP specifications.

This part I do not understand.

Content-Encoding and Transfer-Encoding are fundamentally different in
their operation far beyond the hop-by-hop vs end-to-end difference. You
can not interchange one for the other.

It is not safe to assume a client accepts gzip TE only because it
accepts gzip content-encoding. For one thing the message format is
completely different.

> Etags of replies encoded by Squid will be modified to turn them into
> weak tags if they are not already so.

Why do you oppose creating new unique ETags?

> There will be a configuration option to turn off content-encoding.

Granted, and this will default off in the standard distribution, as any
other option which violates the semantically transparent HTTP proxy
requirements.

> Content-Encoding Implementation

No comments there.

> Objects will be stored both in unencoded and encoded formats. An object
> will stay in the format in which Squid receives it until requested by a
> client requesting a different Content-Encoding which Squid supports
> (this could be immediate). Once this happens, the object will be
> streamed coded into a different StoreEntry and on to the client.

Ok.

> A new store_dup module will be created to manage dup store_entries and
> make sure duplicate entries are invalidated when a new version of an
> object is read. It consists of a circular list of StoreEntry pointers
> named dupnext and dupprev. When a new duplicate encoding (or
> decoding) of an object is created, it's added to the list. When any
> StoreEntry is invalidated or updated, all dups are invalidated.

Looks a little too complex to me.


Wouldn't something simpler like the following work:

Modify the store key to account for content encoding.

Add an internal meta object listing the known content encodings of a given
object. When a new encoding is added rewrite this object to add the new
encoding name.

On cache hits, iterate over the known acceptable encodings until a match
is found in the cache.

In recoded objects include a meta header indicating the identity of the
original object and disregard the recoded object on a cache hit if it no
longer matches the original.

From what I can tell the above would also work for adding server-driven
Content-Encoding negotiation to the proxy to complement the use of Vary 
(which most mod_gzip servers do not support btw).
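
The simpler scheme outlined above might look roughly like this.  It is
a Python illustration only: the store and the meta object are modelled
as dicts, and every name here (variant_key, add_variant, lookup,
original_id) is an assumption made for the sketch.

```python
import hashlib


def variant_key(original_key: bytes, encoding: str) -> bytes:
    """Store key modified to account for the content encoding."""
    return hashlib.md5(original_key + encoding.encode("ascii")).digest()


def add_variant(store, meta, original_key, encoding, body, original_id):
    """Store a recoded object and record its encoding in the meta object."""
    store[variant_key(original_key, encoding)] = {
        "body": body,
        # Meta header naming the original this variant was made from.
        "original_id": original_id,
    }
    known = meta.setdefault(original_key, [])
    if encoding not in known:
        known.append(encoding)


def lookup(store, meta, original_key, acceptable, current_original_id):
    """Return (encoding, body) for the first usable variant, else None."""
    for enc in meta.get(original_key, []):
        if enc not in acceptable:
            continue
        entry = store.get(variant_key(original_key, enc))
        # Disregard variants made from an original that has since changed.
        if entry is not None and entry["original_id"] == current_original_id:
            return enc, entry["body"]
    return None
```

A stale variant is simply skipped on lookup, so no linked list of
duplicates has to be maintained.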

Regards
Henrik



Re: next version of content-encoding / gzip design doc

2004-03-03 Thread garana


Hi there,

I'm back with this task (again).

Jon: you are far more advanced than I am in understanding Squid.  I can start helping
with content compression by writing GzipCoder, if you want.

(Already discussed) About TE/Transfer-Encoding vs Accept-Encoding/Content-Encoding:
implementing Content-Encoding (even if it bends the standards) seems to be the
reasonable choice, since TE/Transfer-Encoding is not available in most common browsers.

If implemented as Content-Encoding, the following headers should be altered before
encoding:

Content-Length: deleted (could be updated to the actual gzipped content length, but that
is too much trouble, I guess).
ETag: modified (appending CEgz, for instance?)
Vary: append Accept-Encoding (if not already there)
Connection: if the client is 1.1, could be set to keep-alive, but Transfer-Encoding
chunked should be added/checked.
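
A small sketch of those header changes, in Python for illustration
only.  Headers are modelled as a plain dict with canonical names, the
function name is made up, and setting Content-Encoding itself is an
addition not listed above.  The Connection/chunked handling is left out
because it depends on the client's HTTP version.

```python
def adjust_headers_for_gzip(headers: dict) -> dict:
    """Return a copy of the reply headers adjusted for a gzipped body."""
    out = dict(headers)

    # Length of the gzipped body is not known up front, so drop it.
    out.pop("Content-Length", None)

    # Assumption: the reply must also be labelled with the coding.
    out["Content-Encoding"] = "gzip"

    # Make the entity tag unique by appending a marker inside the quotes.
    etag = out.get("ETag")
    if etag and etag.endswith('"'):
        out["ETag"] = etag[:-1] + 'CEgz"'

    # Append Accept-Encoding to Vary unless it is already listed.
    vary = [v.strip() for v in out.get("Vary", "").split(",") if v.strip()]
    if "accept-encoding" not in [v.lower() for v in vary]:
        vary.append("Accept-Encoding")
    out["Vary"] = ", ".join(vary)

    return out
```

The input dict is copied rather than modified, so the stored uncoded
reply keeps its original headers.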

Hope this sheds some light on possible encoding options.

Regards,

-- 
Gonzalo Arana
Ingenieria
UOLSinectis

Florida 537 Piso 6, Buenos Aires, Argentina 
+54-11-4321-9110 ext 2543
http://www.uolsinectis.com.ar/