Re: utf-8 bom frequency of bytes

Benjamin R. Haskell Thu, 19 Jan 2012 20:43:06 -0800

On Thu, 19 Jan 2012, John Little wrote:

Hi all
I'm revising the function f_readfile in eval.c, to speed it up when processing very long lines. (It presently grows a string every 200 bytes by allocating a new one 200 bytes longer, copying the old to the new, and deallocating the old. F.ex., for a 1 MB line, such as may be used by the yank ring plug in, there's 5000 allocations and deallocations and about 5 GB of data copies.)
I've noted also that presently its handling of CR and bom removal fails if the characters are read in different calls to fread, so I'm fixing that. One can only decide that the utf-8 bom sequence EF BB BF is present if all three bytes have been read, so I was about to code a check when the BF is encountered, but it occurred to me that if BF is common in UTF-8 text, there'd be a lot of checking the previous bytes.
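
The fixed 200-byte growth step described above is what makes the copying quadratic in the line length. A minimal sketch of the usual alternative, doubling the capacity on each growth, is below. The `growbuf` type and its functions are hypothetical names for illustration and are not taken from eval.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; not Vim's actual data structure. */
typedef struct {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
} growbuf;

/* Append n bytes, doubling the capacity whenever more room is needed.
 * Returns 0 on success, -1 on failure (the buffer is left unchanged). */
int growbuf_append(growbuf *b, const char *src, size_t n)
{
    assert(b != NULL);
    if (n == 0)
        return 0;
    if (n > b->cap - b->len) {
        size_t newcap = b->cap ? b->cap : 256;  /* assumed initial size */
        while (newcap - b->len < n) {
            if (newcap > (size_t)-1 / 2)
                return -1;                      /* doubling would overflow */
            newcap *= 2;
        }
        char *p = realloc(b->data, newcap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void growbuf_free(growbuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

Starting from an empty `growbuf` (all fields zero) and appending a 1 MB line in 200-byte chunks, this sketch calls realloc about a dozen times instead of thousands, because each growth at least doubles the capacity.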

I don't know the background of 'f_readfile', but why would the BOM be removed in positions other than at the start of the string? Isn't it only meaningful as an encoding detection when it's the first thing being read? Anywhere else U+FEFF is a zero-width, non-breaking space.

I'm also not sure checking for it is expensive enough to worry about the inefficiency of looking backward two buffer positions. The cost of disk access is so much greater than comparing it once in memory that one or two extra occasional comparisons seem insignificant. Seems like premature optimization.
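
To illustrate the point about checking only at the start, here is a rough sketch of a filter that drops a UTF-8 BOM only when it is the first thing in the stream, and still works when the three bytes arrive in different fread chunks. It is not the code in eval.c; `bom_state`, `bom_filter` and `bom_finish` are made-up names, and the sketch assumes the caller supplies an output buffer with room for n + 2 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper state; zero-initialize before the first chunk. */
typedef struct {
    unsigned char head[3];  /* leading bytes held back until decided */
    size_t nhead;           /* number of bytes currently held */
    int decided;            /* nonzero once the BOM question is settled */
} bom_state;

static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Feed one chunk of n bytes. Bytes to keep are written to out, which
 * must have room for n + 2 bytes. Returns the number of bytes written. */
size_t bom_filter(bom_state *st, const unsigned char *in, size_t n,
                  unsigned char *out)
{
    size_t i = 0, w = 0;
    assert(st != NULL);
    while (!st->decided && i < n) {
        st->head[st->nhead] = in[i++];
        st->nhead++;
        if (st->head[st->nhead - 1] != UTF8_BOM[st->nhead - 1]) {
            /* Not a BOM after all: emit everything that was held back. */
            memcpy(out, st->head, st->nhead);
            w = st->nhead;
            st->nhead = 0;
            st->decided = 1;
        } else if (st->nhead == 3) {
            /* Complete BOM at the very start: drop it. */
            st->nhead = 0;
            st->decided = 1;
        }
    }
    if (i < n) {
        /* Past the start of the stream, every byte is passed through. */
        memcpy(out + w, in + i, n - i);
        w += n - i;
    }
    return w;
}

/* Call at end of input: emits a held-back partial prefix, if any. */
size_t bom_finish(bom_state *st, unsigned char *out)
{
    size_t w = 0;
    assert(st != NULL);
    if (!st->decided) {
        memcpy(out, st->head, st->nhead);
        w = st->nhead;
        st->nhead = 0;
        st->decided = 1;
    }
    return w;
}
```

With this shape there are at most three byte comparisons per stream, however often BF occurs later in the text, and a later EF BB BF is kept as an ordinary U+FEFF.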

So, how common is the byte BF in utf-8 text? How common are EF and BB? I've little idea. Perhaps someone on vim_dev has a better idea.

They all seem to appear throughout script blocks, so it's really data-dependent. Here are the most common blocks containing codepoints whose UTF-8 encoding contains the given octet:


┌ bb ──────────────────────────────
│   11  Arabic Presentation Forms-A
│   66  Arabic Presentation Forms-B
│   65  CJK Radicals Supplement
│  706  CJK Unified Ideographs
│  166  CJK Unified Ideographs Extension A
│  363  Hangul Syllables
│   14  High Surrogates
│   65  Lao
│   67  Latin Extended Additional
│   79  Low Surrogates
│   11  No_Block
│  163  Private Use Area
│   18  Yi Syllables
└──────────────────────────────────
┌ bf ──────────────────────────────
│   11  Arabic Presentation Forms-A
│  706  CJK Unified Ideographs
│  166  CJK Unified Ideographs Extension A
│   67  Greek Extended
│   51  Halfwidth and Fullwidth Forms
│  363  Hangul Syllables
│   14  High Surrogates
│   16  Ideographic Description Characters
│   35  Kangxi Radicals
│   79  Low Surrogates
│   27  No_Block
│  163  Private Use Area
│   16  Specials
│   67  Tibetan
│   18  Yi Syllables
└──────────────────────────────────
┌ ef ──────────────────────────────
│   80  Alphabetic Presentation Forms
│  688  Arabic Presentation Forms-A
│  144  Arabic Presentation Forms-B
│   32  CJK Compatibility Forms
│  512  CJK Compatibility Ideographs
│   16  Combining Half Marks
│  240  Halfwidth and Fullwidth Forms
│ 2304  Private Use Area
│   32  Small Form Variants
│   16  Specials
│   16  Variation Selectors
│   16  Vertical Forms
└──────────────────────────────────
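
For anyone who wants to reproduce counts like these, one possible approach is sketched below: encode each code point in a range as UTF-8 and test whether the octet appears. This is not necessarily how the table was generated. The function names are made up for illustration, and the sketch assumes that every code point in a block's range is counted, assigned or not, with surrogates encoded like any other three-byte value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into out[4]. Returns the number of bytes written,
 * or 0 if cp is above U+10FFFF. Surrogates are deliberately not
 * rejected, because the table also lists the surrogate blocks. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Nonzero if the UTF-8 encoding of cp contains the given octet. */
int utf8_contains_octet(uint32_t cp, unsigned char octet)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; i++)
        if (buf[i] == octet)
            return 1;
    return 0;
}

/* Count code points in [first, last] whose encoding contains octet. */
size_t count_with_octet(uint32_t first, uint32_t last, unsigned char octet)
{
    size_t count = 0;
    if (first > last)
        return 0;
    for (uint32_t cp = first; ; cp++) {
        count += (size_t)utf8_contains_octet(cp, octet);
        if (cp == last)
            break;
    }
    return count;
}
```

Under these assumptions, count_with_octet(0xE000, 0xF8FF, 0xEF) gives 2304 for the Private Use Area and count_with_octet(0x4E00, 0x9FFF, 0xBF) gives 706 for CJK Unified Ideographs, which matches the corresponding rows above.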

--
Best,
Ben

