> From: Philippe Verdy <verd...@wanadoo.fr>
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham <richard.wording...@ntlworld.com>,
>     unicode Unicode Discussion <unicode@unicode.org>
>
> In Emacs, buffer text is a character string with a gap, actually.
>
> A text buffer with gaps is a complex structure, not just a plain string.

The difference is very small, and a couple of macros allow you to
almost forget about the gap.

> I doubt it constantly uses a single gap at the end (insertions and
> deletions in the middle would constantly move large blocks and use
> excessive CPU and memory bandwidth, with very slow response: users do
> not want to see what they type appearing on the screen at one
> keystroke every few seconds because each typed key causes massive
> block moves and excessive memory paging from/to disk while this move
> is being performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.

> All editors I have seen treat the text as ordered collections of small
> buffers (these small buffers may still have small gaps), which are
> occasionally merged or split when needed (merging does not cause any
> reallocation but may free one of the buffers), some of them being
> paged out to temporary files when memory is stressed. There are some
> heuristics in the editor's code to decide when maintenance of the
> collection is really needed and useful for the performance.

My point was to say that Emacs is not one of these editors you
describe.

> But besides this the performance cost of UTF indexing of the
> codepoints is invisible: each buffer will only need to avoid breaking
> text between codepoint boundaries, if the current encoding of the
> edited text is a UTF. An editor may also avoid breaking buffers in the
> middle of clusters if they render clusters (including ligatures if
> they are supported): clusters are still small in size in every
> encoding and reasonable buffer sizes can hold at least hundreds of
> clusters (even the largest ones which occur rarely). How editors will
> manage clusters to make them editable is dependent on the
> implementation, but even the UTF or codepoint boundaries are not
> enough to handle that.
> In all cases the logical text buffer is structured with a complex
> backing store, where parts may be paged out (and will also include
> more than just the current text, notably it will include parts of the
> indexes, possibly in another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers. That is when the encoding stops being "invisible".
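
To make the "character string with a gap" description above concrete,
here is a minimal sketch in Python. It is an illustration only, not
Emacs's code (Emacs is written in C, and its accessors are macros); the
class GapBuffer and all of its method names are hypothetical. It shows
the two points made above: the gap is moved to wherever the edit
happens, and a small position translation lets callers forget about the
gap.

```python
class GapBuffer:
    """Text kept in one flat array with a movable gap of unused slots.

    Hypothetical illustration; names and sizes are assumptions.
    """

    def __init__(self, text="", gap_size=16):
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)      # first unused slot
        self.gap_end = len(self.buf)    # first used slot after the gap

    def __len__(self):
        return len(self.buf) - (self.gap_end - self.gap_start)

    def char_at(self, pos):
        # Translate a logical position into a physical index by skipping
        # the gap; this plays the role of the accessor macros.
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        if pos >= self.gap_start:
            pos += self.gap_end - self.gap_start
        return self.buf[pos]

    def _move_gap(self, pos):
        # Only the characters between the old and the new gap position
        # are copied, so repeated edits at one spot cost almost nothing.
        while self.gap_start > pos:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, pos, text):
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        if len(text) > self.gap_end - self.gap_start:
            # Enlarge the gap in place when it is too small.
            extra = len(text) + 16
            self.buf[self.gap_end:self.gap_end] = [None] * extra
            self.gap_end += extra
        for ch in text:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        # Deleting just widens the gap over the removed characters.
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])


b = GapBuffer("hello world")
b.insert(5, ",")        # gap moves to the middle of the text
b.insert(len(b), "!")   # gap moves to the end
b.delete(0, 1)
b.insert(0, "H")
print(b.text())
```

The sketch stores one Python character per slot, so it sidesteps the
encoding question entirely; a buffer that must also hold raw bytes
needs a representation for them, which this example does not attempt.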