Re: Unicode String Models

Eli Zaretskii via Unicode Sun, 09 Sep 2018 12:22:50 -0700

> From: Philippe Verdy <[email protected]>
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham <[email protected]>, 
>       unicode Unicode Discussion <[email protected]>
> 
>  In Emacs, buffer text is a character string with a gap, actually.
> 
> A text buffer with gaps is a complex structure, not just a plain string.


The difference is very small, and a couple of macros allow you to
almost forget about the gap.

> I doubt it constantly uses a single gap at the end (insertions and
> deletions in the middle would constantly move large blocks and use
> excessive CPU and memory bandwidth, with very slow response: users do
> not want to see what they type appear on the screen at one keystroke
> every few seconds because each typed key causes massive block moves
> and excessive memory paging from/to disk while the move is being
> performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.
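
To make the idea concrete, here is a minimal gap-buffer sketch in
Python.  It is an illustration only, not Emacs's actual C code: the
class name, the method names and the gap size of 16 are all
assumptions made for this example.  It shows the gap being moved to
the edit position before each insertion or deletion, and a small
accessor (char_at) that lets callers index the text as if the gap
were not there.

```python
class GapBuffer:
    """Text kept in one array with a single movable gap (illustrative only)."""

    def __init__(self, text="", gap_size=16):
        # The gap initially sits at the end of the text.
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)
        self.gap_end = len(self.buf)

    def __len__(self):
        # Number of characters, not counting the gap.
        return len(self.buf) - (self.gap_end - self.gap_start)

    def char_at(self, pos):
        # Hide the gap: positions at or past it are shifted by the gap size.
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        if pos >= self.gap_start:
            pos += self.gap_end - self.gap_start
        return self.buf[pos]

    def move_gap(self, pos):
        # Only the characters between the old and new gap position move.
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        while self.gap_start > pos:      # shift the gap left
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:      # shift the gap right
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, pos, text):
        self.move_gap(pos)
        if self.gap_end - self.gap_start < len(text):
            extra = len(text) + 16       # enlarge the gap in place
            self.buf[self.gap_end:self.gap_end] = [None] * extra
            self.gap_end += extra
        for ch in text:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        # Deleting just widens the gap over the following characters.
        self.move_gap(pos)
        self.gap_end += min(count, len(self.buf) - self.gap_end)

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])


gb = GapBuffer("hello world")
gb.insert(5, ",")        # gap moves to position 5, then shrinks by one
gb.insert(6, " big")     # gap is already here: nothing needs to move
print(gb.text())
```

Consecutive edits at the same place, which is the common case when
typing, cost nothing beyond writing the new characters; text is only
copied when the edit position jumps elsewhere.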

> All editors I have seen treat the text as ordered collections of small
> buffers (these small buffers may still have small gaps), which are
> occasionally merged or split when needed (merging does not cause any
> reallocation but may free one of the buffers), some of them being
> paged out to temporary files when memory is stressed. There are some
> heuristics in the editor's code to decide when maintenance of the
> collection is really needed and useful for performance.

My point was that Emacs is not one of the editors you describe.

> But besides this, the performance cost of UTF indexing of the
> codepoints is invisible: each buffer will only need to avoid breaking
> text in the middle of a codepoint, if the current encoding of the
> edited text is a UTF. An editor may also avoid breaking buffers in the
> middle of clusters if it renders clusters (including ligatures if they
> are supported): clusters are still small in size in every encoding,
> and reasonable buffer sizes can hold at least hundreds of clusters
> (even the largest ones, which occur rarely). How editors manage
> clusters to make them editable depends on the implementation, but even
> the UTF or codepoint boundaries are not enough to handle that. In all
> cases the logical text buffer is structured with a complex backing
> store, where parts may be paged out (and which will also include more
> than just the current text, notably parts of the indexes, possibly in
> another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers.  That is when the encoding stops being "invisible".
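
A small Python sketch of why raw bytes matter.  This is not how Emacs
does it; it uses Python's standard "surrogateescape" error handler
purely as an illustration, and the sample bytes are made up.  A file
that is mostly UTF-8 but contains a stray byte cannot be decoded
strictly, so an editor has to keep that byte in the buffer in some
form if it is to write the file back unchanged.

```python
# Mostly UTF-8 data with one stray byte (0xFF is never valid in UTF-8).
data = b"abc\xff\xc3\xa9"

# Strict decoding refuses the input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps the stray byte as a lone surrogate (U+DCFF here),
# so the text has one element per character plus one per raw byte.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, len(text), round_trip == data)
```

Whatever scheme is used, the buffer now holds elements that are not
characters of the nominal encoding, and indexing, display and saving
all have to account for them.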
