> On Feb 22, 2017, at 6:05 PM, Mohit Athwani via swift-users 
> <swift-users@swift.org> wrote:
> 
> I don't understand why we need the usedEncoding parameter? I understand that 
> it's a pointer but how do we decide what encoding to use? Do we default to 
> NSUTF8StringEncoding?

The original implementation in Foundation uses some heuristics to try to guess 
the encoding, since there are unfortunately billions of plain text files out 
there that don’t explicitly state their encoding. It’s not open source, so we 
can’t know for sure [except for the people who work at Apple], but I’m sure it 
includes things like:

- Look for a Unicode BOM at the start; if there is one, it identifies the 
encoding: EF BB BF is UTF-8, FE FF or FF FE is UTF-16, and 00 00 FE FF or 
FF FE 00 00 is UTF-32.
- If not, see whether all bytes are 0x00-0x7F ⟶ in that case use ASCII
- If not, does it contain any byte sequences that are illegal in UTF-8? ⟶ If 
not, use UTF-8
- Otherwise, does it contain any bytes in the range 0x80-0x9F?
        ⟶ If not, ISO-8859-1  (aka ISO-Latin-1) is a good guess
        ⟶ If so, CP-1252 (aka WinLatin1) is a good guess; it’s a nonstandard 
but very common variant of ISO-8859-1 that puts printable characters in that 
byte range, where ISO-8859-1 only has control codes
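
The rules above could be sketched in Swift roughly as follows. This is only an 
illustration of the list, not Foundation's actual algorithm: guessEncoding(of:) 
is a made-up name, and the sketch assumes String(bytes:encoding:) returns nil 
for malformed UTF-8. For the last step it tests 0x80-0x9F, since that is the 
only range where CP-1252 and ISO-8859-1 differ.

```swift
import Foundation

/// Hypothetical helper (not a Foundation API): guesses a text encoding
/// from raw bytes using the heuristics listed above.
func guessEncoding(of bytes: [UInt8]) -> String.Encoding {
    // 1. Byte-order marks. UTF-32 is checked before UTF-16 because the
    //    UTF-32 little-endian mark FF FE 00 00 begins with the UTF-16 one.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return .utf8 }
    if bytes.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if bytes.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }

    // 2. Nothing above 0x7F: plain ASCII.
    if bytes.allSatisfy({ $0 < 0x80 }) { return .ascii }

    // 3. No illegal UTF-8 sequences: assume UTF-8.
    if String(bytes: bytes, encoding: .utf8) != nil { return .utf8 }

    // 4. Bytes in 0x80-0x9F are printable in CP-1252 but are control
    //    codes in ISO-8859-1, so their presence suggests CP-1252.
    let hasC1Bytes = bytes.contains { (0x80...0x9F).contains($0) }
    return hasC1Bytes ? .windowsCP1252 : .isoLatin1
}
```

The guess can then be passed to String(bytes:encoding:) to decode the text. 
The order of the checks matters: ASCII text is also valid UTF-8, and valid 
UTF-8 is also decodable as Latin-1, so the strictest tests have to run first.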

There are likely other heuristics too. It used to be important to detect the 
old MacRoman encoding used in pre-OS X apps, but it’s been long enough that 
there shouldn’t be many docs like that in the wild anymore. There are multibyte 
non-Unicode encodings that used to be very common in non-Roman languages, like 
Shift-JIS, but I have no idea how to detect them or if they’re even still 
relevant.

It could also be useful to check whether the start of the file looks like XML 
or HTML, and if so, parse it enough to find where it specifies its encoding. 
(Are there other text formats that include encodings? I’ve seen special 
markings at the top of source files used for emacs or vi, specifying tab widths 
and such, but I don’t know if those can specify encodings too.)
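
For the XML case, a rough Swift sketch might look like the following. The 
function name declaredXMLEncodingName(in:) is invented for illustration, and 
the sketch assumes the file is in an ASCII-compatible encoding, so that the 
declaration itself can be read byte for byte (a UTF-16 file would normally be 
caught by its BOM first). It only pulls out the declared name; mapping that 
name to a String.Encoding is left out.

```swift
import Foundation

/// Hypothetical helper: returns the encoding name declared in an XML
/// prolog such as <?xml version="1.0" encoding="ISO-8859-1"?>, or nil.
func declaredXMLEncodingName(in bytes: [UInt8]) -> String? {
    // Map each byte of a short prefix to the code point with the same
    // value, so decoding cannot fail whatever the real encoding is.
    let prefix = String(bytes.prefix(256).map { Character(UnicodeScalar($0)) })

    guard prefix.hasPrefix("<?xml"),
          let end = prefix.range(of: "?>") else { return nil }
    let declaration = prefix[..<end.lowerBound]

    guard let key = declaration.range(of: "encoding") else { return nil }
    let rest = declaration[key.upperBound...]

    // The value is wrapped in single or double quotes.
    guard let open = rest.firstIndex(where: { $0 == "\"" || $0 == "'" })
    else { return nil }
    let quote = rest[open]
    let valueStart = rest.index(after: open)
    guard let close = rest[valueStart...].firstIndex(of: quote)
    else { return nil }

    return String(rest[valueStart..<close])
}
```

A declared encoding should probably take priority over the byte-pattern 
guesses, since it is an explicit statement rather than a heuristic.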

I’m not involved in Swift development, but IMHO a basic implementation that 
just uses the rules I sketched above would be pretty useful, and then people 
with more domain knowledge could enhance that code to add more heuristics later 
on.

—Jens