Hey Jens, Thanks so much! This is really useful! I'm going to get started on this.
Cheers!
Mohit

On Wed, Feb 22, 2017 at 9:09 PM, Jens Alfke <j...@mooseyard.com> wrote:

> On Feb 22, 2017, at 6:05 PM, Mohit Athwani via swift-users
> <swift-users@swift.org> wrote:
>
> > I don't understand why we need the usedEncoding parameter? I understand
> > that it's a pointer but how do we decide what encoding to use? Do we
> > default to NSUTF8StringEncoding?
>
> The original implementation in Foundation uses some heuristics to try to
> guess the encoding, since there are unfortunately billions of plain text
> files out there that don’t explicitly state their encoding. It’s not open
> source, so we can’t know for sure [except for the people who work at
> Apple], but I’m sure it includes things like:
>
> - Look for a Unicode BOM at the start, in which case it’s probably UTF-16
>   (or maybe UTF-32? I don’t know the details.)
> - If not, see whether all bytes are 0x00-0x7F ⟶ in that case use ASCII
> - If not, does it contain any byte sequences that are illegal in UTF-8?
>   ⟶ If not, use UTF-8
> - Otherwise, does it contain any bytes in the range 0x80-0xBF?
>   ⟶ If not, ISO-8859-1 (aka ISO-Latin-1) is a good guess
>   ⟶ If so, CP-1252 (aka WinLatin1) is a good guess; it’s a nonstandard but
>   very common superset of ISO-8859-1 with extra characters in that byte range
>
> There are likely other heuristics too. It used to be important to detect
> the old MacRoman encoding used in pre-OS X apps, but it’s been long enough
> that there shouldn’t be many docs like that in the wild anymore. There are
> multibyte non-Unicode encodings that used to be very common in non-Roman
> languages, like Shift-JIS, but I have no idea how to detect them or if
> they’re even still relevant.
>
> It could also be useful to check whether the start of the file looks like
> XML or HTML, and if so, parse it enough to find where it specifies its
> encoding. (Are there other text formats that include encodings? I’ve seen
> special markings at the top of source files used for emacs or vi,
> specifying tab widths and such, but I don’t know if those can specify
> encodings too.)
>
> I’m not involved in Swift development, but IMHO a basic implementation
> that just uses the rules I sketched above would be pretty useful, and then
> people with more domain knowledge could enhance that code to add more
> heuristics later on.
>
> —Jens
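
For reference, here is a rough Swift sketch of the rules Jens lists in the quoted message. It is only an illustration of those four steps, not Foundation's actual algorithm, which is not public. The names `guessEncoding(of:)` and `isValidUTF8(_:)` are hypothetical helpers made up for this example. One deliberate difference: the message mentions the range 0x80-0xBF, while this sketch tests 0x80-0x9F, since that is the range where CP-1252 assigns printable characters and ISO-8859-1 has only control codes.

```swift
import Foundation

/// Returns true if `bytes` decodes as UTF-8 without any error.
/// (Hypothetical helper for this sketch.)
func isValidUTF8(_ bytes: [UInt8]) -> Bool {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue: continue      // one scalar decoded, keep going
        case .emptyInput:  return true   // reached the end cleanly
        case .error:       return false  // illegal byte sequence
        }
    }
}

/// Guesses a text encoding from raw bytes using the rules sketched above.
/// (Hypothetical helper; a best-effort guess, not a guarantee.)
func guessEncoding(of bytes: [UInt8]) -> String.Encoding {
    // 1. Byte order mark. UTF-32 is checked first because the
    //    little-endian UTF-32 BOM begins with the UTF-16 one.
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) ||
       bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) {
        return .utf32
    }
    if bytes.starts(with: [0xFE, 0xFF]) || bytes.starts(with: [0xFF, 0xFE]) {
        return .utf16
    }

    // 2. Only 7-bit bytes: plain ASCII.
    if bytes.allSatisfy({ $0 < 0x80 }) {
        return .ascii
    }

    // 3. No illegal UTF-8 sequences: assume UTF-8.
    if isValidUTF8(bytes) {
        return .utf8
    }

    // 4. Bytes in 0x80-0x9F suggest CP-1252; otherwise guess ISO-8859-1.
    let hasC1Bytes = bytes.contains { (0x80...0x9F).contains($0) }
    return hasC1Bytes ? .windowsCP1252 : .isoLatin1
}
```

A caller would read the file into a byte array, pass it to `guessEncoding(of:)`, and then try decoding with the returned encoding. The MacRoman, Shift-JIS, and XML/HTML declaration checks mentioned in the message are left out here.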