On Wed 2003-01-29 at 12.16, Joel Rowbottom wrote:

> I represent a company called Characterisation which is providing an
> interim IDN solution - Verisign are also implementing their own system
> which is similar. Both require 8-bit clean to be passed from the
> resolver, which Squid doesn't do.
One problem is that Squid receives requests over HTTP, and in HTTP only
ASCII characters are allowed in host names. This is defined both by the
URL standard and by the HTTP protocol as such. For Squid to properly
parse and understand the HTTP request and URL it must know the character
encoding used, and as the only standardized character encoding for URLs
within HTTP is limited ASCII, this is what Squid assumes is used.

The other problem is that, due to the nature of Squid being a proxy, it
does not only pass data along like a router but also needs to understand
how to read the data. To correctly understand and use the data, an
understanding of the encoding used is often required.

This said, see also

  http://www.squid-cache.org/Versions/v2/2.5/bugs/#squid-2.5.STABLE1-hostnames

for a workaround making Squid ignore most of this. However, this is far
from perfect and opens a new can of worms until there is a standard on
how to handle IDN names.

> Surely a proxy request should be transparent, rather than imposing its
> own rules? If not proxied then the requests work fine, but if through
> Squid then it whinges?

Squid aims at being semantically transparent for requests within the
defined standards. Binary FQDN hostnames are not part of any standard or
even Internet Draft (including DNS).

Note: Squid needs to understand the structure of FQDN hostnames for many
purposes:

 * Parsing of HTTP, to be able to isolate the hostname component and its
   structure in domain labels.
 * Access controls, comparing labels and pattern matching within FQDN
   names.
 * Logging.
 * Converting hostnames to DNS labels when resolving them into IP
   addresses.

> I'd be interested in the "standard" which states "The current Internet
> standards is very strict on what is an acceptable hostname and only
> accepts A-Z a-z 0-9 and - in Internet hostname labels.
> Anything outside this is outside the current Internet standards and
> will cause interoperability issues such as the problems seen with such
> names and Squid." -- an RFC would be ideal ;)

Here are the most obvious ones in this scope, but there are many, many
more if you care to study the subject:

  STD0003 Requirements for Internet Hosts
  RFC2616 Hypertext Transfer Protocol -- HTTP/1.1
  RFC1738 Uniform Resource Locators (URL)
  RFC2396 Uniform Resource Identifiers (URI): Generic Syntax

To summarise the current situation:

1. The DNS protocol allows (and has always allowed) any data to be used
   in DNS labels. This is because the DNS protocol as such is
   application-neutral and not limited to resolving Internet host names.
   However, the data is assumed to be ASCII when comparing domain
   labels, as labels are case-insensitive.

2. All standards documents which refer to Internet host names or
   Internet domains (including their namespace structure within DNS)
   limit such names to case-insensitive labels of a-z 0-9 and -. There
   are Internet Draft documents discussing various approaches on how to
   get beyond this, but to my knowledge none of these has yet been
   assigned or even proposed RFC status. The IDN IETF working group
   <url:http://www.ietf.org/html.charters/idn-charter.html> is assigned
   the task of defining international domain names, but very little
   progress seems to have been made in the last years.. (unfortunately a
   common symptom for most IETF working groups these days it seems...
   too much politics involved I think)

3. Further, there has been very little activity in addressing how the
   upper-layer Internet protocols such as HTTP and SMTP should handle
   this, but there are two clear paths. Neither involves what you call
   "8-bit clean".

   a) Application encoding of UTF-8 characters using the allowable
      character sets until each protocol has been updated. A range of
      different encodings have been proposed.

   b) Direct use of UTF-8 within the protocols.
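To make this concrete, here is a minimal sketch (hypothetical Python,
not anything from the Squid sources) of the host-name label rule from
point 2, the case-insensitive label comparison from point 1, and one
example of the kind of application encoding meant in path a) -- using
Python's "idna" codec, which implements one such ACE encoding (the
scheme that was being standardized as IDNA/Punycode around this time):

```python
import re

# STD 3 / RFC 1123 rule: a label is 1-63 characters of a-z, 0-9 and '-',
# may not begin or end with a hyphen, and compares case-insensitively.
LABEL = re.compile(r'^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$', re.IGNORECASE)

def valid_hostname(name):
    """True if every label of 'name' fits the Internet host name rules."""
    labels = name.rstrip('.').split('.')
    return len(name) <= 255 and all(LABEL.match(l) for l in labels)

def same_label(a, b):
    # DNS compares labels case-insensitively, assuming ASCII data.
    return a.lower() == b.lower()

# Path a): encode the international name into plain a-z 0-9 '-' before
# it ever reaches HTTP or the resolver.  'b\xfccher' becomes an ordinary
# ASCII label that passes the rule above.
ace = 'b\xfccher.example'.encode('idna')

print(valid_hostname('www.squid-cache.org'))   # True
print(valid_hostname('b\xfccher.example'))     # False: non-ASCII label
print(same_label('Example', 'EXAMPLE'))        # True
print(ace)                                     # b'xn--bcher-kva.example'
print(valid_hostname(ace.decode('ascii')))     # True
```

Note the encoded form needs no changes in proxies or DNS servers: to
them it is just another all-ASCII host name.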
Approach 'a' requires no change in the protocols or infrastructure such
as proxies (or even DNS servers), only in the user interfaces and in how
DNS names are registered. To the protocols, international domain labels
are just strange-looking sequences of normal a-z 0-9 - characters.
(Well, there is also a proposal for using %hh URL-escaped syntax for
URIs; not sure how much attention this will receive however.)

Approach 'b' requires the whole infrastructure to be updated to use
UTF-8 encoding, and each of the Internet protocols to be redefined in
terms of which characters are allowable, reserved or forbidden for use
in host names in the scope of UTF-8.

4. As a result of very little or no progress in the standardization
   efforts of the IDN IETF WG, most DNS registrars have grown tired and
   are starting to allow registration of "binary" DNS labels, which
   happen to work in many browsers by accident using various national
   character encodings (ISO-8859-X, UTF-8, ...), and not because such
   names are truly allowed for use on the Internet. While this is not
   strictly forbidden by the DNS standards, the use of such domain names
   within Internet application protocols such as HTTP or SMTP is.

I have a memory of seeing a kind of official document stating that UTF-8
should be used in all new Internet protocols and that the long-term goal
is to allow UTF-8 to be used anywhere by updating the existing Internet
protocols to support UTF-8 where possible, but now I don't seem to find
this document.. Also I do not remember if it was an IETF, IAB or ISOC
document..

A good in-depth reading on the subject of international domain names and
their associated problems within the Internet protocols is

  RFC2825 A Tangled Web: Issues of I18N, Domain Names, and the Other
  Internet Protocols

You are welcome to correct me if you find any errors in the above.

Regards
Henrik

-- 
Henrik Nordstrom <[EMAIL PROTECTED]>
MARA Systems AB, Sweden
