Re: Why do webforms often refuse non-ASCII characters?

Bríd-Áine Parnell via Unicode Thu, 30 Jan 2025 02:28:23 -0800

Hi all,

Thanks for the replies! This is what I figured. With the airlines, maybe it 
also has to match the machine-readable bit of the passport that runs along the 
bottom, which is always without special characters, even if the name above that 
has fadas, or umlauts or whatever.


So, follow-on question, surely this makes it difficult when moving between 
systems and when trying to identify people? What I mean is, if for example Seán 
is spelled Seán on some systems and Sean on others, wouldn't that mean that 
some attempts to summon up his records would fail? And for financial systems, 
especially with open banking attempting to make data sharing easier, would it 
be difficult to perform know-your-client checks or collect records for credit 
scoring if newer banks/digital banks are using Unicode, but older systems 
aren't?
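
As a rough illustration of the matching problem described above, one common workaround is to compare names on a folded key rather than on the stored spelling. The sketch below is an assumption about how a system might do that, not something any of the systems mentioned here is known to use; `fold_name` is a hypothetical helper built on Python's standard `unicodedata` module.

```python
import unicodedata

def fold_name(name: str) -> str:
    """Return an accent-insensitive, case-insensitive key for matching names."""
    # NFD splits "á" into "a" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(fold_name("Seán") == fold_name("Sean"))  # True
```

The folded key would only be used for lookup; the original spelling would still be stored and displayed. Letters with no canonical decomposition, such as ø or ł, pass through unchanged, so this is not a complete transliteration scheme.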

I also wonder whether people who start using natural language systems, e.g. 
natural language to SQL, to query databases might run into issues with how 
names are recorded...


Bríd-Áine Parnell

Doctoral Researcher | Designing Responsible Natural Language Processing

School of Informatics | Edinburgh Futures Institute

________________________________
From: Phil Smith III <li...@akphs.com>
Sent: 29 January 2025 6:24 PM
To: 'Alexander Lange' <alexander.la...@catrinity-font.de>; 
unicode@corp.unicode.org <unicode@corp.unicode.org>; Bríd-Áine Parnell 
<bridaine.parn...@ed.ac.uk>
Subject: RE: Why do webforms often refuse non-ASCII characters?

Airlines are a perfect example of #1. Many/most airlines—and at least some 
parts of the shared ecosystem—run their scheduling systems on IBM z/TPF, which 
is a high-performing transactional system on IBM mainframes; it was originally 
called ACP, for Airline Control Program. See 
https://en.wikipedia.org/wiki/Transaction_Processing_Facility



Key here is that IBM mainframes (and TPF) use EBCDIC for encoding. Now, EBCDIC 
has a rich set of code pages—but modulo the much-hated and rarely used DBCS 
(Double-Byte Character Set, which uses shift-in/shift-out characters to go into 
and out of double-byte mode), an EBCDIC code page is 256 characters. So you 
have a byte; suppose it’s x'43' (aka 0x43), and if you’re “in” code page 1047 
(a common U.S. code page), that’s a lowercase a-with-umlaut. If you’re in code 
page 829 (math symbols), it’s a capital A-with-Angstrom. So it’s 
context-dependent.
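
To make the context dependence concrete, here is a small sketch using the EBCDIC codecs that ship with Python. Python's standard library has no codec for code page 1047 or 829, so code pages 037 (US/Canada) and 875 (Greek) stand in here as assumptions; `decode_byte` is a hypothetical helper.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Decode a single byte under the given code page."""
    # errors="replace" yields U+FFFD if the code page leaves the byte unassigned.
    return bytes([value]).decode(code_page, errors="replace")

# The same byte, interpreted under two different EBCDIC code pages.
for code_page in ("cp037", "cp875"):
    print(code_page, repr(decode_byte(0x43, code_page)))
```

Under cp037 the byte is ä, the same as in code page 1047; under the Greek code page it decodes to something else entirely. Nothing in the byte itself says which reading is intended.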



This is a bit of a mess, needless to say. Back in 1964 when the IBM 360 was 
being developed, the plan was to support both EBCDIC (because of older systems) 
and ASCII. There was even a hardware bit in the program status word (the 
location counter, among other things) that said “Hey, we’re in ASCII mode!” 
However, due to the rushed nature of the project, the ASCII parts got left 
behind and were never resurrected. I have on my list for when I finish my time 
machine to fix that (along with “no null-terminated strings”, “no case 
sensitivity in UNIX filenames”, “forward slashes in DOS and Windows paths”, and 
for God’s sake, “consistent line endings across operating systems”!!)



The point here is that as others have noted, you just cannot assume Unicode 
support across the ecosystem. Worse, you can’t even assume that a given set of 
characters can coexist: if someone has a first name that contains a Cyrillic 
character and a Greek last name, you simply *cannot* represent that in EBCDIC, 
without metadata indicating that the two names use different code pages (which 
I’ve never seen anyone actually do, given the rarity of such use cases). 
Technically, if you were to require that support, you’d want code pages *per 
character*, since someone could have a made-up name that includes characters 
from two disparate EBCDIC code pages.
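
The coexistence problem can be sketched in a few lines. This uses Python's bundled Greek EBCDIC codec (cp875) as a stand-in for whatever single code page a real system is configured with; the names and the `can_encode` helper are made up for illustration.

```python
def can_encode(text: str, code_page: str) -> bool:
    """Report whether every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

first_name = "ИВАН"   # Cyrillic letters
last_name = "ΠΑΠΑΣ"   # Greek letters
full_name = f"{first_name} {last_name}"

print(can_encode(last_name, "cp875"))   # Greek surname alone
print(can_encode(full_name, "cp875"))   # Cyrillic + Greek in one code page
print(can_encode(full_name, "utf-8"))   # the same string in a Unicode encoding
```

The Greek surname fits in the Greek code page on its own, the mixed name does not, and UTF-8 handles both without any per-field metadata.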



Thus the limitations on characters will remain for the foreseeable future. 
There’s clearly a Western-centric slant here, but that’s historical. I assume 
that part of being an Asiana Airlines gate agent or FA includes a requirement 
to be able to at least fumble your way through reading basic ASCII names. 
Consider the inverse: a Korean name written in Korean glyphs would completely 
stump the average American Airlines employee. That’s not a justification, just 
a description of how it is.



From: Unicode <unicode-boun...@corp.unicode.org> On Behalf Of Alexander Lange 
via Unicode
Sent: Wednesday, January 29, 2025 12:26 PM
To: unicode@corp.unicode.org
Subject: Re: Why do webforms often refuse non-ASCII characters?



Hi,



I can see three reasons for this:



1. As you say, modern databases can handle this. But not all databases 
currently in use are modern. Especially state agencies, banks and some other 
large corporations are often still using pretty old systems, or need to stay 
compatible with someone else's old system. A famous example is airline tickets: 
They all run through one quirky old system that can't deal with anything but 
ASCII letters, forcing many people to misspell their names when booking a 
flight.



2. The second reason is what I would call "lazy validation". You have to make 
sure your system isn't vulnerable to query injection, code injection, and 
perhaps spoofing, i.e. you have to forbid some characters that have a special 
function in whatever query language, programming language(s) and markup 
language(s) you use. Like e.g. ' for SQL, < and " for HTML and so on. If you 
forget any of these, you have a huge security problem. So the easiest and 
safest way* to ensure this is to whitelist just the characters that you know to 
be safe, and while you can do that correctly based on Unicode's character 
properties, I've also seen people using way too simple regular expressions like 
/[A-Za-z0-9]+/ which cause the problem you described.



* Apart from using libraries that already solve the problem properly, of 
course. But surprisingly many people keep re-implementing existing things for 
some reason.



3. Inconvenience for their own staff: Even if a system can handle "special" 
characters, they may still be a hassle to work with. I once visited Japan with 
a friend whose name was Jürgen, and when the staff typed our names into their 
system, it took four people ten minutes of discussion to work out how to insert 
the ü. Also, checking that things like names are correct and match across 
different documents is much harder if people can't easily read them.
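
Point 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended production validator: `is_plausible_name`, `ALLOWED_SEPARATORS` and `store_and_fetch` are hypothetical names, the set of allowed separators is a guess, and SQLite stands in for whatever database is actually in use. The first part whitelists by Unicode general category instead of by ASCII range; the second shows that with bound parameters the apostrophe does not need to be forbidden at all.

```python
import re
import sqlite3
import unicodedata

# The overly simple pattern quoted above: ASCII letters and digits only.
ASCII_ONLY = re.compile(r"[A-Za-z0-9]+")

# Hypothetical set of separators a name field might allow.
ALLOWED_SEPARATORS = {" ", "-", "'", "\u2019", "."}

def is_plausible_name(name: str) -> bool:
    """Accept letters and combining marks from any script, plus a few separators."""
    if not name.strip():
        return False
    for ch in name:
        category = unicodedata.category(ch)  # e.g. "Lu", "Ll", "Mn"
        if category[0] in ("L", "M") or ch in ALLOWED_SEPARATORS:
            continue
        return False
    return True

def store_and_fetch(name: str) -> str:
    """Round-trip a name through SQLite using bound parameters."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT)")
        # The driver passes the value separately from the SQL text,
        # so quotes in the name cannot change the statement.
        conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
        row = conn.execute(
            "SELECT name FROM people WHERE name = ?", (name,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()

for candidate in ("Seán", "Jürgen", "O'Brien", "<script>"):
    print(candidate, bool(ASCII_ONLY.fullmatch(candidate)), is_plausible_name(candidate))
```

The ASCII-only pattern rejects Seán, Jürgen and O'Brien alike, while the category-based check accepts all three and still rejects markup characters. Output escaping for HTML remains a separate concern that this sketch does not cover.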



Kind regards,
Alexander



On 29.01.2025 16:39, Bríd-Áine Parnell via Unicode wrote:

Hi everyone,



I'm hoping someone can help me out with some information. I'm doing some 
research into the refusal of accents in names (and other multicultural naming 
conventions) in online webforms. For example, in Ireland, there was a campaign 
recently to get the government to mandate acceptance of the fada in Irish 
language names (Seán instead of Sean). The campaign was successful, and the law 
changed in 2022, but it's only a requirement for public bodies; companies do 
not have to comply.



During the campaign, reports were made to the Data Protection Commissioner on 
the right to rectify about some of the companies, including Bank of Ireland and 
Aer Lingus. They defended themselves by saying that their systems couldn't 
accept fadas in names.



I'm assuming that it's systems on the back end, such as database systems, that 
can't accept the so-called special characters. My question is, why would this 
be, given that Unicode would seem to solve this, and modern databases can use 
Unicode? Does anyone understand what the value is in continuing to retain 
legacy systems that only accept ASCII or some ISO variants? Or is there a 
different problem happening?



Appreciate any information that might shed light on this.



Thanks,



Bríd-Áine Parnell



Doctoral Researcher | Designing Responsible Natural Language Processing



School of Informatics | Edinburgh Futures Institute

