On Wed, 2011-08-10 at 10:11 -0600, Alex Rousskov wrote:

> A) Two different URLs correspond to the same raw content bytes.
> B) A refresh of the same URL results in the same raw content bytes.

Both are very interesting, I think. And I would take a simpler approach.

Build on the HTTP Instance Digest defined by Jeff, and always add a
suitable instance digest to cached/buffered content (regardless of the
use of Want-Digest). Any received instance digest MUST be verified
before cache reuse.

If the received message has the same instance digest as a previously
cached instance, then abort the retrieval and reuse what you have in
the cache.

In requests you can optionally add a digest-based condition similar to
If-None-Match, but If-None-Match already serves the purpose quite well,
so use of the digest condition should probably be limited to cases
where there is no ETag.

To reduce the bandwidth lost to unneeded transmission, a slow-start
mechanism can be used where the sending party waits a couple of RTTs
before starting to transmit the body of a large response in which an
instance digest is presented. This allows the receiving end to check
the received instance digest and abort the request if it is not
interested in receiving the body.

I would probably not advise going the route of message digests and
hop-by-hop. The main difference between message digests and instance
digests is their meaning in 206 responses. Message digests mainly serve
the purpose of very weak integrity protection, detecting accidental
in-transit modifications to a given message, and their use outside that
scope is pretty limited.

The drawback of the above proposal is that it cannot deal well with
partial objects where the full representation is not known to the
upstream cache. But for that case I think we need to rely on ETag being
presented by the server. If that is not sufficient, then a new type of
digest needs to be defined which can be calculated over ranges of an
instance (not the 206 message representation as done in Content-MD5 if
applied at message level, which btw is something I disagree was the
intention for Content-MD5).

Note regarding Content-MD5.
Its use in 206 responses has been deprecated in HTTPbis, as there are
inconsistent implementations and no clear consensus on the meaning of
Content-MD5 in 206 responses.

> Case (A) has been studied extensively by Jeff Mogul and others. Jeff and
> his team came up with a set of HTTP extensions for caches to advertise
> "I have content with such and such checksum" information, which is then
> used to avoid sending unchanged content to the cache. Here is one of
> Jeff's papers:
> http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

The trouble with Jeff's proposal and other similar approaches is the
added overhead in discovering that there are two objects with identical
representations. I do not like Jeff's proposal, as it adds a
significant amount of latency, which is a major bottleneck today, and
optimistically sending some digests of other URLs is not practical and
has some nasty security implications (plus it adds significantly to
request bandwidth overhead).

If case (A) is to be addressed, then I would do so in a more relaxed
manner like what I describe above.

> Case (B) can be viewed as a sub-case of (A), but does not require extra
> HTTP exchanges (bad for slow links!), a database of content digests, and
> other complications of (A). The basic idea behind optimizing case (B) is
> similar though:

Case (B) is mainly about optimizing the case where servers do not
support ETag. If servers do send ETag (and do not randomly change them
for tracking purposes) then If-None-Match is sufficient for (B).

Extending (B) with an Instance-Digest based condition may be
interesting to deal with the numerous servers not sending ETag or where
ETag is used badly.

> 1) Child Squid has URL U cached. This Squid needs to request U from a
> parent Squid (because the entity has expired, because the client
> requested revalidation, etc.).
> The child Squid sends a regular request
> for U to the parent Squid and tells the parent about the cached content
> checksum:
>
> GET U HTTP/1.1
> Have-Digest: md5=foo
> ....

Have-Digest: should be an If-something, imho. If-None-Digest-Match?

> To tell the child Squid that it can use the cached body, the parent
> Squid can violate the HTTP message length rules and send the
> regular/true response header without the body, but it is probably better
> to just encapsulate the regular/true response header without violating
> HTTP.

Why not simply use 304, which already exists for this purpose? A 304
provides entity headers and a body identifier.

> Question: Can we accept a quality implementation of optimization (B)
> into Squid?

I would rather see one that can be extended to work for (A) than one
that just optimizes (B). The amount of redundant data on the web is
very large today.

Additionally, as already mentioned by Amos, If-None-Match is an
existing mechanism for dealing with (B), and a good first step is
fixing our implementation of that.

> P.S. Case (B) is also related to Reload-into-IMS and such, but it is
> more general and does not violate HTTP.

Reload-into-IMS is a bastard because it adds a quite weak validator
(If-Modified-Since) to the request when none was sent by the client,
possibly resulting in stale content being served as fresh from the
cache. Adding strong conditions to forwarded requests has a much more
limited impact, and I have a hard time seeing this cause any issues,
provided the party that adds the condition is prepared to deal with the
possible outcomes.

Regards
Henrik
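
For illustration only, the instance-digest idea from the start of this
reply (digest everything that is cached, verify any received digest
before reuse, and reuse a cached body when an incoming response
announces a digest already held) could be sketched roughly as below.
This is not Squid code. The class DigestCache, its method names, and
the choice of SHA-256 with a base64 "sha-256=..." value are all
assumptions made for the sketch.

```python
import base64
import hashlib


def instance_digest(body: bytes) -> str:
    """Digest over the full instance bytes (assumed format: sha-256=<base64>)."""
    raw = hashlib.sha256(body).digest()
    return "sha-256=" + base64.b64encode(raw).decode("ascii")


class DigestCache:
    """Hypothetical cache that indexes stored bodies by instance digest."""

    def __init__(self):
        self._by_url = {}     # url -> (digest, body)
        self._by_digest = {}  # digest -> body

    def store(self, url, body, received_digest=None):
        # Always compute our own digest, whether or not one was received.
        computed = instance_digest(body)
        # A received digest must match before the body may be cached/reused.
        if received_digest is not None and received_digest != computed:
            raise ValueError("instance digest mismatch; refusing to cache")
        self._by_url[url] = (computed, body)
        self._by_digest[computed] = body
        return computed

    def on_response_headers(self, url, received_digest):
        """Called once headers arrive, before the body is read.

        Returns the cached body if the announced digest is already known,
        in which case the caller may abort the transfer. Returns None if
        the body still has to be fetched.
        """
        body = self._by_digest.get(received_digest)
        if body is not None:
            # Same bytes under a different (or refreshed) URL: link and reuse.
            self._by_url[url] = (received_digest, body)
        return body
```

The same lookup covers both case (A), a different URL with the same
bytes, and case (B), a refresh of the same URL, since it is keyed on
the digest rather than on the URL.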
