On 04/03/11 05:23, Tsantilas Christos wrote:
On 03/02/2011 03:00 AM, Amos Jeffries wrote:
On Tue, 01 Mar 2011 20:29:51 +0200, Tsantilas Christos wrote:
This patch is forgotten too.
Any comments?
Just doing a quick audit now...
......
"
src/client_side.cc:
- the logic is wrong
- in setLogUri() you do not need to call stringHasCntl(uri). The
unescape call will do a one-pass scan and check these as it goes.
also an old bug is now visible, stringHasCntl() does not check the full
set of characters that need encoding :(
Well, does not check for rfc1738_unsafe_chars, is not difficult to fix
it. I can include a fix in this patch if required.
With that fixed its clear that stringHasWhitespace() needs to be
performed *before* any escaping and duplicating is done. Some shuffling
can avoid doing that scan pass entirely on most config setups.
This is true, the rfc1738_escape_unscaped, escapes the whitespace so
user whitespace policy are not applied.
Your patch is not correct too, because the uris must always escaped if
they contain non printable characters, before logged. Your code only
take cares for whitespaces...
Arg. Right.
A better solution requires 2 strdup/malloc if there are both whitespaces
and ctl characters in the uri. I do not know how often are URIs with
whitespaces and ctl characters but I do not like ...
The problem was to find a way to handle the cases you have unescaped
URIs with whitespaces which causes problem on log analysis.
In my original patch if there are cntl characters escape them and escape
the whitespaces too. In the case there are not cntl characters apply
user defined whitespace policy....
Yes the above is not the best, but I think it is fair for most cases.
The overall source code which is related with http client request
parsing is a little confused, and I believe that we are spending a lot
of CPU cycles checking and parsing again and again the same strings.
My opinion is that a better solution is to rewrite or add a new
rfc1738_escape_unescaped library function as follows:
rfc1738_escape_unescaped_uri(const char *uri,
const char *escaped_uri, size_t escaped_uri_len,
int whitespace_policy = URI_WHITESPACE_ENCODE)
uri: the original URI
escaped_uri: the buffer to copy the URI
escaped_uri_len: the length of the buffer
whitespace_policy: method to handle the whitespaces
(URI_WHITESPACE_ALLOW, URI_WHITESPACE_ENCODE etc.... )
No. our local whitespace policy has no place in the encoding algorithm.
It is used by many places for many non-URL things.
The current separation of a whitespace policy function setLogUri() which
happens to use rfc1738 escaping is good. The problem is limits on the
algorithm API.
Attached is a patch for the rfc1738 API which fixes its flags to enable
omitting the character groups separately.
I noticed there was no short-circuit to speed up the scan over the very
common alpha-numeric chars. Added. That should speed up the encoding a lot.
There are two extra performance boosts possible here but outside the
scope of any of this work,
* using malloc instead of calloc. If necessary at all memset() is only
needed on the final unfilled part of the array.
* update the API to receive an optional buffer to be filled, OR to
pass out the alloc'd buffer for the caller to deal with.
The above will save at list 2 full string passes (stringHasWhitespace()
and stringHasCntl() functions) for every URI to be logged and in worst
case 4 string passes and 2 strdup/mallocs
Yes, neither of those are needed at all.
Regards,
Christos
What I think is needed is:
if (!cleanUrl)
...
else {
switch (Config.uri_whitespace) {
After the rfc1738 API is fixed this can become:
int flags = 0;
switch (Config.uri_whitespace) {
case URI_WHITESPACE_ALLOW:
flags |= RFC1738_ESCAPE_NOSPACE;
case URI_WHITESPACE_ENCODE:
flags |= RFC1738_ESCAPE_UNESCAPED;
http->log_uri = xstrndup(rfc1738_do_escape(uri, flags), MAX_URL);
break;
case URI_WHITESPACE_CHOP:
http->log_uri = xstrndup(uri, min(MAX_URL,strcspn(http->log_uri,
w_space)));
break;
case URI_WHITESPACE_DENY:
case URI_WHITESPACE_STRIP:
default:
{
char *q = static_cast<char*>(xmalloc(strlen(uri)*3 + 1));
const char *t = uri;
while (*t) {
if (!xisspace(*t))
*q++ = *t;
t++;
}
*q = '\0';
http->log_uri = xstrndup(rfc1738_escape_unescaped(q), MAX_URL);
xfree(q);
}
}
Amos
--
Please be using
Current Stable Squid 2.7.STABLE9 or 3.1.11
Beta testers wanted for 3.2.0.5
=== modified file 'include/rfc1738.h'
--- include/rfc1738.h 2010-11-02 00:12:43 +0000
+++ include/rfc1738.h 2011-03-04 12:51:18 +0000
@@ -6,33 +6,49 @@
#endif
/* Encoder rfc1738_do_escape flag values. */
-#define RFC1738_ESCAPE_UNSAFE 0
-#define RFC1738_ESCAPE_RESERVED 1
-#define RFC1738_ESCAPE_UNESCAPED -1
-
+#define RFC1738_ESCAPE_CTRLS 1
+#define RFC1738_ESCAPE_UNSAFE 2
+#define RFC1738_ESCAPE_RESERVED 4
+#define RFC1738_ESCAPE_ALL (RFC1738_ESCAPE_UNSAFE|RFC1738_ESCAPE_RESERVED|RFC1738_ESCAPE_CTRLS)
+ // exclusions
+#define RFC1738_ESCAPE_NOSPACE 128
+#define RFC1738_ESCAPE_NOPERCENT 256
+ // Backward compatibility
+#define RFC1738_ESCAPE_UNESCAPED (RFC1738_ESCAPE_UNSAFE|RFC1738_ESCAPE_CTRLS|RFC1738_ESCAPE_NOPERCENT)
/**
* \group rfc1738 RFC 1738 URL-escaping library
*
* Public API is formed of a triplet of encode functions mapping to the rfc1738_do_encode() engine.
*
- * ASCII characters are split into three groups:
+ * ASCII characters are split into four groups:
* \item SAFE Characters which are safe to occur in any URL. For example A,B,C
- * \item UNSAFE Characters which are completely usafe to occur in any URL. For example; backspace, tab, space, newline
+ * \item CTRLS Binary control codes. Dangerous to include in URLs.
+ * \item UNSAFE Characters which are completely usafe to occur in any URL. For example; backspace, tab, space, newline.
* \item RESERVED Characters which are reserved for special meaning and may only occur in certain parts of a URL.
*
* Returns a static buffer containing the RFC 1738 compliant, escaped version of the given url.
*
- * \param flags RFC1738_ESCAPE_UNSAFE Only encode unsafe characters. Ignore reserved.
- * \param flags RFC1738_ESCAPE_RESERVED Encode all unsafe and reserved characters.
- * \param flags RFC1738_ESCAPE_UNESCAPED Encode all unsafe characters which have not already been encoded.
+ * \param flags RFC1738_ESCAPE_CTRLS Encode the blatantly dangerous binary codes.
+ * \param flags RFC1738_ESCAPE_UNSAFE Encode printable unsafe characters (excluding CTRLs).
+ * \param flags RFC1738_ESCAPE_RESERVED Encode reserved characters.
+ * \param flags RFC1738_ESCAPE_ALL Encode all binary CTRL, unsafe and reserved characters.
+ * \param flags RFC1738_ESCAPE_NOSPACE Ignore the space whitespace character.
+ * \param flags RFC1738_ESCAPE_NOPERCENT Ignore the escaping delimiter '%'.
*/
extern char *rfc1738_do_escape(const char *url, int flags);
/* Old API functions */
-#define rfc1738_escape(x) rfc1738_do_escape(x, RFC1738_ESCAPE_UNSAFE)
-#define rfc1738_escape_part(x) rfc1738_do_escape(x, RFC1738_ESCAPE_RESERVED)
-#define rfc1738_escape_unescaped(x) rfc1738_do_escape(x, RFC1738_ESCAPE_UNESCAPED)
+
+ /* Default RFC 1738 escaping. Escape all UNSAFE characters and binary CTRL codes */
+#define rfc1738_escape(x) rfc1738_do_escape(x, RFC1738_ESCAPE_UNSAFE|RFC1738_ESCAPE_CTRLS)
+
+ /* Escape a partial URL. Encoding every binary code, unsafe or reserved character. */
+#define rfc1738_escape_part(x) rfc1738_do_escape(x, RFC1738_ESCAPE_ALL)
+
+ /* Escape a URL. Encoding every unsafe characters but skipping reserved and already-encoded bytes.
+ * Suitable for safely encoding an absolute URL which may be encoded but is not trusted. */
+#define rfc1738_escape_unescaped(x) rfc1738_do_escape(x, RFC1738_ESCAPE_UNSAFE|RFC1738_ESCAPE_CTRLS|RFC1738_ESCAPE_NOPERCENT)
/**
=== modified file 'lib/rfc1738.c'
--- lib/rfc1738.c 2010-11-01 05:44:28 +0000
+++ lib/rfc1738.c 2011-03-04 13:11:40 +0000
@@ -34,7 +34,6 @@
#include "config.h"
#include "rfc1738.h"
-//#include "util.h"
#if HAVE_STDIO_H
#include <stdio.h>
@@ -53,6 +52,7 @@
(char) 0x22, /* " */
(char) 0x23, /* # */
#if 0 /* done in code */
+ (char) 0x20, /* space */
(char) 0x25, /* % */
#endif
(char) 0x7B, /* { */
@@ -64,8 +64,7 @@
(char) 0x5B, /* [ */
(char) 0x5D, /* ] */
(char) 0x60, /* ` */
- (char) 0x27, /* ' */
- (char) 0x20 /* space */
+ (char) 0x27 /* ' */
};
static char rfc1738_reserved_chars[] = {
@@ -97,36 +96,49 @@
buf = (char*)xcalloc(bufsize, 1);
}
for (p = url, q = buf; *p != '\0' && q < (buf + bufsize - 1); p++, q++) {
+
+ /* a-z, A-Z and 0-9 are SAFE. */
+ if ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') || (*p >= '0' && *p <= '9')) {
+ *q = *p;
+ continue;
+ }
+
do_escape = 0;
/* RFC 1738 defines these chars as unsafe */
- for (i = 0; i < sizeof(rfc1738_unsafe_chars); i++) {
- if (*p == rfc1738_unsafe_chars[i]) {
- do_escape = 1;
- break;
+ if ((flags & RFC1738_ESCAPE_UNSAFE)) {
+ for (i = 0;i < sizeof(rfc1738_unsafe_chars); i++) {
+ if (*p == rfc1738_unsafe_chars[i]) {
+ do_escape = 1;
+ break;
+ }
}
+ /* Handle % separately */
+ if (!(flags & RFC1738_ESCAPE_NOPERCENT) && *p == '%')
+ do_escape = 1;
+ /* Handle space separately */
+ else if (!(flags & RFC1738_ESCAPE_NOSPACE) && *p <= ' ')
+ do_escape = 1;
}
- /* Handle % separately */
- if (flags != RFC1738_ESCAPE_UNESCAPED && *p == '%')
- do_escape = 1;
/* RFC 1738 defines these chars as reserved */
- for (i = 0; i < sizeof(rfc1738_reserved_chars) && flags == RFC1738_ESCAPE_RESERVED; i++) {
- if (*p == rfc1738_reserved_chars[i]) {
- do_escape = 1;
- break;
+ if ((flags & RFC1738_ESCAPE_RESERVED) && do_escape == 0) {
+ for (i = 0; i < sizeof(rfc1738_reserved_chars); i++) {
+ if (*p == rfc1738_reserved_chars[i]) {
+ do_escape = 1;
+ break;
+ }
}
}
- /* RFC 1738 says any control chars (0x00-0x1F) are encoded */
- if ((unsigned char) *p <= (unsigned char) 0x1F) {
- do_escape = 1;
- }
- /* RFC 1738 says 0x7f is encoded */
- if (*p == (char) 0x7F) {
- do_escape = 1;
- }
- /* RFC 1738 says any non-US-ASCII are encoded */
- if (((unsigned char) *p >= (unsigned char) 0x80)) {
- do_escape = 1;
+ if ((flags & RFC1738_ESCAPE_CTRLS) && do_escape == 0) {
+ /* RFC 1738 says any control chars (0x00-0x1F) are encoded */
+ if ((unsigned char) *p <= (unsigned char) 0x1F)
+ do_escape = 1;
+ /* RFC 1738 says 0x7f is encoded */
+ else if (*p == (char) 0x7F)
+ do_escape = 1;
+ /* RFC 1738 says any non-US-ASCII are encoded */
+ else if (((unsigned char) *p >= (unsigned char) 0x80))
+ do_escape = 1;
}
/* Do the triplet encoding, or just copy the char */
/* note: we do not need snprintf here as q is appropriately
=== modified file 'lib/tests/testRFC1738.cc'
--- lib/tests/testRFC1738.cc 2010-03-30 17:32:03 +0000
+++ lib/tests/testRFC1738.cc 2011-03-04 12:16:08 +0000
@@ -87,10 +87,6 @@
{
char *result;
-#define RFC1738_ESCAPE_UNSAFE 0
-#define RFC1738_ESCAPE_RESERVED 1
-#define RFC1738_ESCAPE_UNESCAPED -1
-
/* TEST: Escaping only unsafe characters */
/* regular URL (no encoding needed) */