Re: utf-8 bom frequency of bytes

Dominique Pellé Sun, 22 Jan 2012 02:09:59 -0800

John Little wrote:

> I can't help thinking that your linear times are not guaranteed with
> the vagaries of heap fragmentation and memory allocator
> implementation, and that calling into memory allocator code every 200
> bytes of megabytes of data is to be avoided.  Such intuitions are
> infamous, I suppose.


You might be right. I aimed at touching as little as possible
in the patch, as I found the existing implementation difficult
to understand.

My patch also did not address the BOM removal bug that you
pointed out, which occurs when the BOM sequence 0xEF 0xBB 0xBF
spans 2 distinct fread(...) calls of 200 bytes.

The BOM bug can be reproduced with:

# Create a 201-byte line where the BOM spans 2 fread() calls of 200 bytes.
$ perl -e 'print "x" x 198, chr(0xef), chr(0xbb), chr(0xbf)' > test-bom.txt

$ vim -u NONE -c ":echo readfile('test-bom.txt')"

And it prints:
['xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
'<feff>']

(Bug: the list contains <feff>, which should have been removed.)

I'm curious to see your patch when ready.

Reading in buffers of 200 bytes makes f_readfile() complicated.
I think we're better off reading byte by byte using fgetc(fd),
which is buffered in libc anyway, so performance should be
close to reading 200 bytes at a time with fread(...).
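
To illustrate the idea, here is a minimal sketch and not the actual
f_readfile() code. The function name read_strip_bom() and the
doubling buffer are assumptions made for the example. It reads with
fgetc() and drops every 0xEF 0xBB 0xBF sequence. Because there are no
200-byte chunks, a BOM can never be split across two reads.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not Vim's f_readfile()): read the whole stream
 * byte by byte with fgetc() and return a malloc'ed buffer in which every
 * UTF-8 BOM (0xEF 0xBB 0xBF) has been dropped.
 * The resulting length is stored in *len_out.
 * Returns NULL when memory allocation fails.
 */
static unsigned char *
read_strip_bom(FILE *fd, size_t *len_out)
{
    size_t cap = 256;     /* allocated size of buf */
    size_t len = 0;       /* number of bytes kept so far */
    size_t keep = 0;      /* bytes before this index are never re-examined */
    unsigned char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = fgetc(fd)) != EOF)
    {
        if (len == cap)
        {
            /* Double the buffer so the number of allocations stays small. */
            unsigned char *tmp = realloc(buf, cap * 2);

            if (tmp == NULL)
            {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (unsigned char)c;

        /* A BOM just completed: drop its three bytes.  Only bytes that
         * were contiguous in the input (at or after "keep") may match. */
        if (c == 0xbf && len - keep >= 3
                && buf[len - 3] == 0xef && buf[len - 2] == 0xbb)
        {
            len -= 3;
            keep = len;
        }
    }

    *len_out = len;
    return buf;
}
```

The caller owns the returned buffer and must free() it. Splitting the
result into lines at NL would be a separate step that this sketch leaves
out.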

-- Dominique

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
