On Thu, 19 Jan 2012, John Little wrote:

Hi all

I'm revising the function f_readfile in eval.c, to speed it up when processing very long lines. (It presently grows a string every 200 bytes by allocating a new one 200 bytes longer, copying the old to the new, and deallocating the old. For example, for a 1 MB line, such as may be used by the yank ring plugin, there are 5000 allocations and deallocations and about 5 GB of data copies.)
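To put numbers on the growth cost described above, here is a rough sketch that counts allocations and copied bytes for a fixed-step strategy and, for comparison, for capacity doubling. The struct and function names are made up for this illustration and are not Vim code; doubling is only a common alternative, not necessarily what the revised f_readfile does.

```c
#include <assert.h>

/* Illustrative bookkeeping only; these names do not exist in Vim. */
typedef struct {
    unsigned long long allocations;   /* buffers allocated */
    unsigned long long bytes_copied;  /* old contents copied into new buffers */
} growth_cost;

/* Grow by a fixed `step` each time, copying the old contents across. */
growth_cost cost_fixed_step(unsigned long long total, unsigned long long step)
{
    growth_cost c = {0, 0};
    unsigned long long cap = 0;

    assert(step > 0);
    while (cap < total) {
        c.bytes_copied += cap;   /* copy what the old buffer held */
        cap += step;
        c.allocations++;
    }
    return c;
}

/* Start at `initial` bytes and double the capacity whenever it fills up. */
growth_cost cost_doubling(unsigned long long total, unsigned long long initial)
{
    growth_cost c = {1, 0};      /* the initial buffer is one allocation */
    unsigned long long cap = initial;

    assert(initial > 0);
    while (cap < total) {
        c.bytes_copied += cap;   /* the full old buffer is copied */
        cap *= 2;
        c.allocations++;
    }
    return c;
}
```

For a 1,000,000-byte line and a 200-byte step, cost_fixed_step reports 5000 allocations and 2,499,500,000 bytes copied, counting each byte once per copy. Counting the read and the write of each copy separately doubles that, which is one way to arrive at a figure of about 5 GB. Under the same count, cost_doubling with a 200-byte start reports 14 allocations and 1,638,200 bytes copied.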

I've also noted that its handling of CR and BOM removal presently fails if the characters are read in different calls to fread, so I'm fixing that. One can only decide that the UTF-8 BOM sequence EF BB BF is present once all three bytes have been read, so I was about to code a check when the BF is encountered, but it occurred to me that if BF is common in UTF-8 text, there'd be a lot of checking the previous bytes.

I don't know the background of 'f_readfile', but why would the BOM be removed in positions other than at the start of the string? Isn't it only meaningful as an encoding detection when it's the first thing being read? Anywhere else U+FEFF is a zero-width, non-breaking space.

I'm also not sure the check is expensive enough to worry about the inefficiency of looking back two buffer positions. The cost of disk access is so much greater than that of an in-memory comparison that one or two extra occasional comparisons seem insignificant. It seems like premature optimization.
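As a sketch of what I mean by only looking at the start of the input: the filter below drops a BOM only when it is the very first thing read, and it keeps working when EF BB BF is split across chunks, as could happen between fread calls. The bom_filter type and both functions are hypothetical helpers written for this mail, not anything in eval.c, and this is not a description of what f_readfile currently does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper state, not part of Vim. Initialize to {0, 0}. */
typedef struct {
    size_t matched;  /* leading bytes matched against EF BB BF so far (0..3) */
    int    decided;  /* nonzero once we know whether the input starts with a BOM */
} bom_filter;

static const unsigned char UTF8_BOM[3] = {0xEF, 0xBB, 0xBF};

/* Feed one chunk. Kept bytes go to `out`, which needs room for len + 2
 * bytes. Returns the number of bytes written. */
size_t bom_filter_feed(bom_filter *f, const unsigned char *in, size_t len,
                       unsigned char *out)
{
    size_t i = 0, n = 0;

    while (!f->decided && i < len) {
        if (in[i] == UTF8_BOM[f->matched]) {
            f->matched++;
            i++;
            if (f->matched == 3)
                f->decided = 1;              /* complete BOM: drop it */
        } else {
            /* Not a BOM after all: emit the bytes that were held back. */
            memcpy(out, UTF8_BOM, f->matched);
            n = f->matched;
            f->decided = 1;
        }
    }
    if (len > i)
        memcpy(out + n, in + i, len - i);    /* everything else passes through */
    return n + (len - i);
}

/* Call at end of input: flushes a partial BOM prefix (e.g. a file that
 * consists of just EF BB). `out` needs room for 2 bytes. */
size_t bom_filter_finish(bom_filter *f, unsigned char *out)
{
    size_t n = 0;

    if (!f->decided) {
        memcpy(out, UTF8_BOM, f->matched);
        n = f->matched;
        f->decided = 1;
    }
    return n;
}
```

Once the first few bytes have been seen, the loop is skipped entirely, so the per-chunk overhead is a single test of `decided`. No BF that appears later in the text triggers any look-back.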


So, how common is the byte BF in UTF-8 text? How common are EF and BB? I've little idea. Perhaps someone on vim_dev has a better idea.

They all seem to appear throughout the Unicode blocks, so it's really data-dependent. Here are the blocks containing the most code points whose UTF-8 encoding contains the given octet, with the count for each:

┌ bb ──────────────────────────────
│   11  Arabic Presentation Forms-A
│   66  Arabic Presentation Forms-B
│   65  CJK Radicals Supplement
│  706  CJK Unified Ideographs
│  166  CJK Unified Ideographs Extension A
│  363  Hangul Syllables
│   14  High Surrogates
│   65  Lao
│   67  Latin Extended Additional
│   79  Low Surrogates
│   11  No_Block
│  163  Private Use Area
│   18  Yi Syllables
└──────────────────────────────────
┌ bf ──────────────────────────────
│   11  Arabic Presentation Forms-A
│  706  CJK Unified Ideographs
│  166  CJK Unified Ideographs Extension A
│   67  Greek Extended
│   51  Halfwidth and Fullwidth Forms
│  363  Hangul Syllables
│   14  High Surrogates
│   16  Ideographic Description Characters
│   35  Kangxi Radicals
│   79  Low Surrogates
│   27  No_Block
│  163  Private Use Area
│   16  Specials
│   67  Tibetan
│   18  Yi Syllables
└──────────────────────────────────
┌ ef ──────────────────────────────
│   80  Alphabetic Presentation Forms
│  688  Arabic Presentation Forms-A
│  144  Arabic Presentation Forms-B
│   32  CJK Compatibility Forms
│  512  CJK Compatibility Ideographs
│   16  Combining Half Marks
│  240  Halfwidth and Fullwidth Forms
│ 2304  Private Use Area
│   32  Small Form Variants
│   16  Specials
│   16  Variation Selectors
│   16  Vertical Forms
└──────────────────────────────────
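Counts like these can be reproduced with a few lines of C. The sketch below is not the script that generated the table; the function names are made up here, and the block ranges have to be supplied by the caller. It encodes every code point in a range using the UTF-8 bit layout and counts those whose encoding contains a given octet. Surrogates are deliberately not rejected, since the table lists the surrogate blocks too.

```c
#include <stddef.h>

/* Encode `cp` (assumed <= 0x10FFFF) with the UTF-8 bit layout.
 * Returns the number of bytes written (1..4). Surrogate code points
 * are encoded like any other 3-byte value rather than rejected. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Count code points in [first, last] whose encoding contains `octet`.
 * Each code point is counted at most once. */
unsigned long count_with_octet(unsigned long first, unsigned long last,
                               unsigned char octet)
{
    unsigned long cp, count = 0;

    for (cp = first; cp <= last; cp++) {
        unsigned char buf[4];
        size_t i, n = utf8_encode(cp, buf);

        for (i = 0; i < n; i++) {
            if (buf[i] == octet) {
                count++;
                break;
            }
        }
    }
    return count;
}
```

For example, count_with_octet(0x4E00, 0x9FFF, 0xBF) gives 706 for CJK Unified Ideographs, count_with_octet(0xAC00, 0xD7AF, 0xBF) gives 363 for Hangul Syllables, and count_with_octet(0xE000, 0xF8FF, 0xEF) gives 2304 for the Private Use Area, all matching the table. That suggests the counts cover every code point in each block's range, assigned or not.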

--
Best,
Ben

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
