Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-09-11 Thread Ben Fritz


On Sep 10, 10:22 pm, Benjamin Fritz fritzophre...@gmail.com wrote:
> Unfortunately, I could not find a list of widely supported charsets,
> so I just used all the ones in Vim and the IANA registry, as mentioned
> previously. If there is such a list, would it be a good idea to limit
> the automatically detected charsets to those in the list? Along those
> lines, it could be a good idea to automatically use UTF-8 in place of
> UTF-16 and UTF-32. Currently these charsets are selected as-is.


Notably, I should mention:

UTF-32 is not supported at all in Opera. In fact, they removed support
for UTF-32 in version 10: http://www.opera.com/docs/changelogs/windows/1000b1/

UTF-32 and UTF-16 do not seem to be supported by Firefox at all for
xhtml, and I had to manually select the correct encoding for the html
documents.

Google Chrome, Internet Explorer 8, and Safari seem to have no
problems (although IE8 does not support xhtml at all so I could not
test these in that browser).

I'm thinking that I will make the automatic detection from the Vim
encoding default to UTF-8 for these encodings, but will leave the
detection of encoding from charset in case the user specifies one of
them using g:html_use_encoding. The user can also use
g:html_charset_override if they want these to be automatically
detected.
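
As a rough illustration of the default proposed above, here is a minimal Python sketch (not the Vim script in 2html.vim). The function name `auto_charset` and the `POORLY_SUPPORTED` set are hypothetical; only the UTF-16/UTF-32 to UTF-8 substitution, and the fact that an explicit user choice is kept, come from the message.

```python
# Hypothetical model of the proposed default; not the actual 2html.vim code.
POORLY_SUPPORTED = {"UTF-16", "UTF-32"}


def auto_charset(detected, user_charset=None):
    """Pick the charset to declare in the generated HTML."""
    if user_charset:
        # An explicit user choice (g:html_use_encoding in the thread) is kept as-is.
        return user_charset
    if detected.upper() in POORLY_SUPPORTED:
        # Automatic detection falls back to UTF-8 for these two charsets.
        return "UTF-8"
    return detected


print(auto_charset("UTF-16"))                          # UTF-8
print(auto_charset("ISO-8859-1"))                      # ISO-8859-1
print(auto_charset("UTF-16", user_charset="UTF-16"))   # UTF-16
```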

Thoughts? There are some test files available here if you're curious:

http://code.google.com/p/vim-2html-test/source/browse/encoding_test/

-- 
You received this message from the vim_dev maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php


Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-09-10 Thread Benjamin Fritz
The attached patch against the latest 7.3.3 changeset in Mercurial
adds the requested use of 'fencoding' instead of 'encoding' when it is
set to determine the HTML charset.

Additionally, it will now support a lot more encodings, and
automatically set the file encoding of the new file to match the
charset.

All encodings that are both native to Vim (listed by name in :help
encoding-names) and appear in the IANA registry (
http://www.iana.org/assignments/character-sets ) are supported. Note
that not all of these encodings are supported by major web browsers or
the w3c validator. New options are provided to override specific
encodings in the charset detection, or there is still
g:html_use_encoding to override all automatic detection. It is
probably a good idea to use this option if publishing to a web page.

There may be some charsets that previously were automatically detected
that no longer are, and there are some encodings supported by Vim
which I could not find in the IANA registry.

Unfortunately, I could not find a list of widely supported charsets,
so I just used all the ones in Vim and the IANA registry, as mentioned
previously. If there is such a list, would it be a good idea to limit
the automatically detected charsets to those in the list? Along those
lines, it could be a good idea to automatically use UTF-8 in place of
UTF-16 and UTF-32. Currently these charsets are selected as-is.

So, consider this a beta release. PLEASE test and comment, I expect
some changes may be needed before final submission.

Patch is attached, or the files are available for download at the site
I use for the TOhtml test suite:

http://code.google.com/p/vim-2html-test/downloads/list



2html_encoding.patch
Description: Binary data


Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-29 Thread Benjamin Fritz
On Sat, Aug 28, 2010 at 10:00 PM, Tony Mechelynck
antoine.mechely...@gmail.com wrote:

> I don't know what is being done ATM, but I'd always include the line
>
> <meta http-equiv="Content-Type" content="text/html; charset=whatever" />
>
> (replacing whatever by the charset name) somewhere near the start of the
> head element. You may want to use a synonym, e.g. iso-8859-1 for Latin1,
> but that's just the finishing touch.


Yes, that's mostly what it does now, except it omits the line if it
could not determine the charset, always uses 'encoding' instead of
'fileencoding', and specifies the encoding in the <?xml ...?> line instead
when optionally using xhtml. I think using utf-8 as a fallback instead
of leaving it out entirely would be a better idea.

The user can specify the charset now, but then the fileencoding will
be wrong unless the user remembers to manually set it (or if it gets
inherited...'fileencoding' seems to act like a global-local option).



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-29 Thread JiaYanwei
Sorry, that was my oversight: I had set 'fileencoding' in my '.vimrc'...

P.S.
Sorry for replying so late. I could not access Google Groups for the
last few days.

On 2010-8-28, 03:37 Ben Fritz fritzophre...@gmail.com wrote:
> On Aug 25, 11:11 pm, JiaYanwei jia...@126.com wrote:
>
>> e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
>> 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
>> should change the HTML charset to 'iso-8859-1', or save the generated HTML
>> file by ':w ++enc=utf-8'.
>
> Hmm...unless I misunderstand, the sequence is:
>
> 1. Load text file. File encoding is latin-1, Vim encoding is utf-8.
> 2. Do :TOhtml to create a new html buffer. File encoding defaults to
> empty, Vim encoding is still utf-8.
> 3. :TOhtml sees encoding and sets the charset in the generated markup to
> UTF-8.
> 4. :w the new html buffer. Vim sees empty file encoding, so uses utf-8 as
> the new file's encoding. Thus file encoding matches the html charset.
>
> You claim that the new html buffer has latin-1 encoding. Am I
> missing something here?
>
> I still think using fileencoding might be the correct way to do it,
> but doing so would require 2html.vim to set the file encoding of the
> new html buffer explicitly to be equal to the source file.
>
> This also brings up another shortcoming of 2html, because using
> g:html_use_encoding may change the html charset meta tag, but it does
> NOT change the actual character encoding of the file. It looks like I
> will need to set the fileencoding of the new html buffer to whatever
> corresponds to the supplied user option as a separate fix.
>
> Any thoughts?



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-28 Thread Tony Mechelynck

On 26/08/10 16:40, Ben Fritz wrote:



> On Aug 25, 11:11 pm, JiaYanwei jia...@126.com wrote:
>
>> I think this will be more reasonable than before.
>>
>> If the encoding of edited text file differs from the system/vim encoding, it's
>> inconvenient to set default HTML charset to be 'encoding'. Thus, after
>> ':TOhtml', we should modify the generated HTML file to make the file encoding
>> the same as HTML charset.
>>
>> e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
>> 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
>> should change the HTML charset to 'iso-8859-1', or save the generated HTML
>> file by ':w ++enc=utf-8'. But if the default HTML charset is 'fileencoding',
>> we should do nothing after ':TOhtml'.
>
> Thanks, I'll take a look. I don't yet have a good handle on
> 'encoding', 'fileencoding', and any other related options. It looks
> like I'm going to need to.
>
> From my understanding, 'fileencoding' is the encoding Vim is supposed
> to use to read/write the file. So, it does make sense that we should
> use this instead of just 'encoding' for the charset of the generated
> html. Does anyone know why TOhtml has used 'encoding' instead? I have
> not touched the charset detection code yet, other than to move it from
> the 2html.vim file into the autoload/tohtml.vim file.


You got it right, and it does indeed make sense.
One possibility is that anything can be represented in UTF-8, including 
text not yet saved from the latest edit of the file, and possibly 
incompatible with the 'fileencoding' - such text is of course in error, 
and will cause an error if one tries to save it.




> You say you need to do nothing to the TOhtml output if we set the
> charset to the file encoding. But, don't we also need to ensure that
> the file encoding of the new html file is the same as the file
> encoding of the source file? The file encoding could be different from
> file to file, whereas Vim's encoding is always the same. I can picture
> this causing problems, if the charset says one thing, but the file
> encoding is different.


HTML metadata can be written in ASCII. If needed, one can use &#n;
entities in text (where n is the decimal representation of the
Unicode codepoint number; recent browsers accept also &#xh; where x
is the letter x as in X-Ray and h is the hex representation) or
percent-escaping in URLs (where, even in a Latin1 HTML page,
percent-escaping always escapes each byte of the UTF-8 representation
separately, with a % sign followed by exactly two hex digits: for
instance U+00E9 (Latin small letter e with acute) would be represented
as %C3%A9 and U+4E00 (Chinese number one horizontal-stroke sign) would
be represented as %E4%B8%80 in a URL, including in the query text if any).
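
The two escaping schemes described above can be checked with a short Python sketch. It is only an illustration: `numeric_reference` is a hypothetical helper written for this example, while `urllib.parse.quote` is the standard-library function, which percent-escapes the UTF-8 bytes by default.

```python
from urllib.parse import quote


def numeric_reference(ch, hexadecimal=False):
    """Return the HTML numeric character reference for one character."""
    codepoint = ord(ch)
    return f"&#x{codepoint:X};" if hexadecimal else f"&#{codepoint};"


print(numeric_reference("\u00e9"))                    # &#233;
print(numeric_reference("\u00e9", hexadecimal=True))  # &#xE9;
print(quote("\u00e9"))                                # %C3%A9
print(quote("\u4e00"))                                # %E4%B8%80
```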




> By the way, until this is fixed...you can use the g:html_use_encoding
> option to override the normal detection mechanisms, rather than
> manually editing the generated HTML file.



Best regards,
Tony.
--
If you put garbage in a computer nothing comes out but garbage.  But
this garbage, having passed through a very expensive machine, is
somehow ennobled and none dare criticize it.



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-28 Thread Benjamin Fritz
On Sat, Aug 28, 2010 at 4:16 PM, Tony Mechelynck
antoine.mechely...@gmail.com wrote:

>> From my understanding, 'fileencoding' is the encoding Vim is supposed
>> to use to read/write the file. So, it does make sense that we should
>> use this instead of just 'encoding' for the charset of the generated
>> html. Does anyone know why TOhtml has used 'encoding' instead? I have
>> not touched the charset detection code yet, other than to move it from
>> the 2html.vim file into the autoload/tohtml.vim file.
>
> You got it right, and it does indeed make sense.
> One possibility is that anything can be represented in UTF-8, including text
> not yet saved from the latest edit of the file, and possibly incompatible
> with the 'fileencoding' - such text is of course in error, and will cause an
> error if one tries to save it.


Ok, I think I'll make the edit, then.

Your response gives me an idea to fix something else that's been
bothering me. Currently, if Vim cannot determine the correct charset
to use, it defaults to not including one at all. I think I'll have it
default the charset and file encoding to UTF-8 if neither the
fileencoding nor the encoding option gives a valid charset. The user
should be able to manually leave out the charset and manually set the
encoding if desired.

Here's what I'm thinking in more detail:

For one buffer:
1. If user specified a charset, try to determine 'fileencoding' from
charset. If this fails, warn the user they will need to manually set
the fileencoding.
2. If no charset is specified, try to determine a charset from the
'fileencoding' option. If successful, use the same 'fileencoding' and
the associated charset in the generated buffer.
3. If could not determine charset from 'fileencoding', try again with
'encoding'. If successful, set 'fileencoding' to blank in the new html
buffer and use the charset from the 'encoding' option.
4. If could not determine charset from either 'encoding' or
'fileencoding', default to UTF-8 and warn the user.
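
The single-buffer steps above can be sketched as follows. This is a minimal Python model of the decision order only, not the actual autoload/tohtml.vim code. The lookup tables and the `choose` function are hypothetical, and leaving 'fileencoding' empty when step 1 fails is an assumption, since the plan only says the user is warned.

```python
# Hypothetical lookup: Vim encoding name -> HTML charset (illustrative subset).
CHARSETS = {"utf-8": "UTF-8", "latin1": "ISO-8859-1", "cp1252": "windows-1252"}
# Reverse lookup used by step 1: lower-cased charset -> Vim encoding name.
ENCODINGS = {charset.lower(): enc for enc, charset in CHARSETS.items()}


def choose(fileencoding, encoding, user_charset=None):
    """Return (charset, 'fileencoding' for the new buffer, warning or None)."""
    # Step 1: a user-specified charset wins; derive 'fileencoding' from it.
    if user_charset:
        fenc = ENCODINGS.get(user_charset.lower())
        if fenc is None:
            # Assumption: leave 'fileencoding' unset and warn.
            return user_charset, "", "set 'fileencoding' manually"
        return user_charset, fenc, None
    # Step 2: try the source buffer's 'fileencoding'.
    charset = CHARSETS.get(fileencoding.lower())
    if charset:
        return charset, fileencoding, None
    # Step 3: fall back to 'encoding' and leave 'fileencoding' blank.
    charset = CHARSETS.get(encoding.lower())
    if charset:
        return charset, "", None
    # Step 4: nothing matched, so default to UTF-8 and warn.
    return "UTF-8", "utf-8", "charset unknown, defaulting to UTF-8"


print(choose("latin1", "utf-8"))   # ('ISO-8859-1', 'latin1', None)
print(choose("", "utf-8"))         # ('UTF-8', '', None)
```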

Multiple buffers in diff mode will be done similarly, except that we
will determine the charset as above for ALL buffers. If they differ,
set 'fileencoding' to blank and use the charset from 'encoding' (or
UTF-8 if cannot determine charset from 'encoding').

What do you think? Or maybe this is too complicated and I should just
use 'encoding' as done currently?

What do you think?



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-28 Thread Tony Mechelynck

On 29/08/10 04:29, Benjamin Fritz wrote:

> On Sat, Aug 28, 2010 at 4:16 PM, Tony Mechelynck
> antoine.mechely...@gmail.com wrote:
>
>>> From my understanding, 'fileencoding' is the encoding Vim is supposed
>>> to use to read/write the file. So, it does make sense that we should
>>> use this instead of just 'encoding' for the charset of the generated
>>> html. Does anyone know why TOhtml has used 'encoding' instead? I have
>>> not touched the charset detection code yet, other than to move it from
>>> the 2html.vim file into the autoload/tohtml.vim file.
>>
>> You got it right, and it does indeed make sense.
>> One possibility is that anything can be represented in UTF-8, including text
>> not yet saved from the latest edit of the file, and possibly incompatible
>> with the 'fileencoding' - such text is of course in error, and will cause an
>> error if one tries to save it.
>
> Ok, I think I'll make the edit, then.
>
> Your response gives me an idea to fix something else that's been
> bothering me. Currently, if Vim cannot determine the correct charset
> to use, it defaults to not including one at all. I think I'll have it
> default the charset and file encoding to UTF-8 if neither the
> fileencoding nor the encoding option gives a valid charset. The user
> should be able to manually leave out the charset and manually set the
> encoding if desired.
>
> Here's what I'm thinking in more detail:
>
> For one buffer:
> 1. If user specified a charset, try to determine 'fileencoding' from
> charset. If this fails, warn the user they will need to manually set
> the fileencoding.
> 2. If no charset is specified, try to determine a charset from the
> 'fileencoding' option. If successful, use the same 'fileencoding' and
> the associated charset in the generated buffer.
> 3. If could not determine charset from 'fileencoding', try again with
> 'encoding'. If successful, set 'fileencoding' to blank in the new html
> buffer and use the charset from the 'encoding' option.
> 4. If could not determine charset from either 'encoding' or
> 'fileencoding', default to UTF-8 and warn the user.
>
> Multiple buffers in diff mode will be done similarly, except that we
> will determine the charset as above for ALL buffers. If they differ,
> set 'fileencoding' to blank and use the charset from 'encoding' (or
> UTF-8 if cannot determine charset from 'encoding').
>
> What do you think? Or maybe this is too complicated and I should just
> use 'encoding' as done currently?
>
> What do you think?



I think you're on the right track. Maybe a little too complicated but 
I'm not sure. I would just use 'fileencoding', or if empty (or if it can 
be ascertained that the current buffer contains characters which are 
invalid for it) then fall back on 'encoding' (by leaving 'fileencoding' 
empty in the tohtml output buffer). But go ahead if you think you can 
refine it more or make it better.


I don't know what is being done ATM, but I'd always include the line

<meta http-equiv="Content-Type" content="text/html; charset=whatever" />

(replacing whatever by the charset name) somewhere near the start of 
the head element. You may want to use a synonym, e.g. iso-8859-1 for 
Latin1, but that's just the finishing touch.
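
A minimal Python sketch of building that line with a synonym lookup. The `ALIASES` table and the `meta_line` function are hypothetical and cover only the example given here; a real script would need a much larger mapping.

```python
# Hypothetical alias table: Vim encoding name -> charset name to publish.
ALIASES = {"latin1": "iso-8859-1"}


def meta_line(encoding):
    """Build the Content-Type meta line for the given encoding name."""
    # Unknown names are passed through unchanged.
    charset = ALIASES.get(encoding.lower(), encoding)
    return (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={charset}" />'
    )


print(meta_line("latin1"))
print(meta_line("utf-8"))
```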



Best regards,
Tony.
--
In defeat, unbeatable; in victory, unbearable.
-- Winston Churchill, of Montgomery



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-27 Thread Ben Fritz


On Aug 25, 11:11 pm, JiaYanwei jia...@126.com wrote:

> e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
> 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
> should change the HTML charset to 'iso-8859-1', or save the generated HTML
> file by ':w ++enc=utf-8'.

Hmm...unless I misunderstand, the sequence is:

1. Load text file. File encoding is latin-1, Vim encoding is utf-8.
2. Do :TOhtml to create a new html buffer. File encoding defaults to
empty, Vim encoding is still utf-8.
3. :TOhtml sees encoding and sets the charset in the generated markup to
UTF-8.
4. :w the new html buffer. Vim sees empty file encoding, so uses utf-8 as
the new file's encoding. Thus file encoding matches the html charset.

You claim that the new html buffer has latin-1 encoding. Am I
missing something here?

I still think using fileencoding might be the correct way to do it,
but doing so would require 2html.vim to set the file encoding of the
new html buffer explicitly to be equal to the source file.

This also brings up another shortcoming of 2html, because using
g:html_use_encoding may change the html charset meta tag, but it does
NOT change the actual character encoding of the file. It looks like I
will need to set the fileencoding of the new html buffer to whatever
corresponds to the supplied user option as a separate fix.

Any thoughts?



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-26 Thread Ben Fritz


On Aug 25, 11:11 pm, JiaYanwei jia...@126.com wrote:
> I think this will be more reasonable than before.
>
> If the encoding of edited text file differs from the system/vim encoding, it's
> inconvenient to set default HTML charset to be 'encoding'. Thus, after
> ':TOhtml', we should modify the generated HTML file to make the file encoding
> the same as HTML charset.
>
> e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
> 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
> should change the HTML charset to 'iso-8859-1', or save the generated HTML
> file by ':w ++enc=utf-8'. But if the default HTML charset is 'fileencoding',
> we should do nothing after ':TOhtml'.


Thanks, I'll take a look. I don't yet have a good handle on
'encoding', 'fileencoding', and any other related options. It looks
like I'm going to need to.

From my understanding, 'fileencoding' is the encoding Vim is supposed
to use to read/write the file. So, it does make sense that we should
use this instead of just 'encoding' for the charset of the generated
html. Does anyone know why TOhtml has used 'encoding' instead? I have
not touched the charset detection code yet, other than to move it from
the 2html.vim file into the autoload/tohtml.vim file.

You say you need to do nothing to the TOhtml output if we set the
charset to the file encoding. But, don't we also need to ensure that
the file encoding of the new html file is the same as the file
encoding of the source file? The file encoding could be different from
file to file, whereas Vim's encoding is always the same. I can picture
this causing problems, if the charset says one thing, but the file
encoding is different.
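
To make that concern concrete, here is a small Python sketch (an illustration only, not part of TOhtml) of what a reader sees when the declared charset and the bytes on disk disagree. The sample text and the helper `decode_as` are made up for the example.

```python
text = "caf\u00e9"  # "café"

latin1_bytes = text.encode("latin-1")  # bytes written with a latin1 file encoding
utf8_bytes = text.encode("utf-8")      # bytes written with a utf-8 file encoding


def decode_as(data, charset):
    """Decode the way a browser trusting the declared charset would."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None  # the bytes are not valid in the declared charset


print(decode_as(latin1_bytes, "latin-1"))  # café   (charset matches the file)
print(decode_as(utf8_bytes, "latin-1"))    # cafÃ©  (mojibake)
print(decode_as(latin1_bytes, "utf-8"))    # None   (invalid UTF-8)
```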

By the way, until this is fixed...you can use the g:html_use_encoding
option to override the normal detection mechanisms, rather than
manually editing the generated HTML file.



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-26 Thread Ben Fritz


On Aug 26, 9:40 am, Ben Fritz fritzophre...@gmail.com wrote:

> From my understanding, 'fileencoding' is the encoding Vim is supposed
> to use to read/write the file. So, it does make sense that we should
> use this instead of just 'encoding' for the charset of the generated
> html. Does anyone know why TOhtml has used 'encoding' instead?


One problem with the supplied patch, is that Vim will use 'encoding'
for a file's encoding, if 'fileencoding' is empty. In my setup, it
looks like 'fileencoding' is nearly always empty.

So, the script will need to fall back to 'encoding' if 'fileencoding'
is empty. Probably it should also re-detect the charset using
'encoding' when 'fileencoding' is not blank but does not resolve to a
valid charset.

Any thoughts? Like I said, I've never needed to mess with 'encoding'
or 'fileencoding' in my daily use of Vim.



Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-26 Thread JiaYanwei
Oh, sorry, I forgot that 'fileencoding' may be empty. This should be
handled.

I ran into the opposite situation: 'fileencoding' is often different from
'encoding' when editing existing files.

Ben Fritz wrote:
> On Aug 26, 9:40 am, Ben Fritz fritzophre...@gmail.com wrote:
>
>> From my understanding, 'fileencoding' is the encoding Vim is supposed
>> to use to read/write the file. So, it does make sense that we should
>> use this instead of just 'encoding' for the charset of the generated
>> html. Does anyone know why TOhtml has used 'encoding' instead?
>
> One problem with the supplied patch, is that Vim will use 'encoding'
> for a file's encoding, if 'fileencoding' is empty. In my setup, it
> looks like 'fileencoding' is nearly always empty.
>
> So, the script will need to fall back to 'encoding' if 'fileencoding'
> is empty. Probably it should also re-detect the charset using
> 'encoding' when 'fileencoding' is not blank but does not resolve to a
> valid charset.
>
> Any thoughts? Like I said, I've never needed to mess with 'encoding'
> or 'fileencoding' in my daily use of Vim.



Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

2010-08-25 Thread JiaYanwei

I think this will be more reasonable than before.

If the encoding of the edited text file differs from the system/vim encoding, it's
inconvenient to set default HTML charset to be 'encoding'. Thus, after
':TOhtml', we should modify the generated HTML file to make the file encoding
the same as HTML charset.

e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
should change the HTML charset to 'iso-8859-1', or save the generated HTML
file by ':w ++enc=utf-8'. But if the default HTML charset is 'fileencoding',
we should do nothing after ':TOhtml'.

The changes are in the attachment.

Best regards, 
Yanwei. 


tohtml.diff
Description: Binary data