Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch
On 09.04.2013 at 23:59, Jürgen Kreileder wrote: > Jakub Narębski writes: > >> On 08.04.2013, Junio C Hamano wrote: >>> j...@blackdown.de (Jürgen Kreileder) writes: >>> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. Signed-off-by: Jürgen Kreileder >> >> I see that this patch does (or tries to do) two _independent_ things, >> and should be split into at least two separate commits: [...] >> First, it extends adding "; charset=$default_text_plain_charset" to >> other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml' >> mimetypes (without changing name of variable... though this is more >> complicated as it is configuration variable and we would want to >> preserve backward compatibility, but at least a comment would be, >> I think, needed). >> >> Originally it applied only to 'text/plain' files, which can be >> displayed inline by web browser, and which need charset in >> 'Content-Type:' HTTP header to be displayed correctly, as they >> do not include such information inside the file. > > What prompted the change is that some browsers (Chrome and Safari) also > display other file types inline: text/x-chdr, text/x-java, text/x-objc, > text/x-sh, ... > > At least on my system these files do get served with charset set! > (ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...) [...] I think this (the reason behind this change) should be explained in the commit message. >> 'text/html' and 'application/xhtml+xml' can include such information >> inside them (meta http-equiv for 'text/html' and for >> 'application/xhtml+xml'). I don't know what browser does when there >> is conflicting information about charset, i.e. which one wins, but >> it is certainly something to consider. > > As shown above, even without my patch, this already can happen! > >> It might be a good change; I don't know what web browser do when >> serving 'text/css', 'text/javascript', 'text/xml' to client to >> view without media type known. There are few options on how to handle this. First, there is an issue of $default_text_plain_charset configuration variable. Any changes to its use (or adding new configuration variables) should be accompanied with changing / adding comment near the place where the default value is set, and changing / adding to the documentation, namely gitweb.conf.txt We can either: 1.) Use $default_text_plain_charset variable both for 'text/plain' and 'text/*' etc. content types in 'blob_plain' (aka 'raw') view for backwards compatibility, just add comment that it applies more than 'text/plain' 2.) Add new configuration variable, e.g. $default_blob_plain_charset and use it in place of $default_text_plain_charset as above, initializing it to $default_text_plain_charset. This practically renames $default_text_plain_charset preserving backward compatibility. Documentation would talk now about new variable name. 3.) Add new configuration variable, e.g. $default_text_other_charset, or $default_blob_plain_charset, or $default_inline_charset, etc. which will be used for files other than 'text/plain'. This would make it possible to turn this new feature on and off. 3.a.) New feature default to on, i.e. to $default_text_plain_charset 3.b.) New feature default to off, i.e. undef Second, there is an issue of file types, like HTML or XML (or XHTML, or sometimes for LaTeX), which have the information about encoding embedded in file. We can: A.) Skip this issue, and always add charset for 'text/*' and '*/*-xml' files B.) Threat those few media types in special way, excluding them from adding charset C.) Make which types to be added charset configurable. Anyway I think this feature is much less urgent. It fixes an annoyance rather than bug, as you can always choose which charset to use to display content in a browser. We can take time to implement it well, especially that it interacts with config (and backward compatibility of config variables). >> BTW I have noticed that we do $prevent_xss dance outside >> blob_contenttype(), in it's only caller i.e. git_blob_plain()... >> which for example means that 'text/html' converted to 'text/plain' >> don't get '; charset=...' added. I guess that it *might* be >> what prompted this part of change... but if it is so, it needs >> to be fixed at source, e.g. by moving $prevent_xss to >> blob_contenttype() subroutine. > > Jep. But I see this is yet another separate issue. >> Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think >> that should be abandoned in favor of 'patch' view; but that is >> a separate issue) and 'patch' views so they use binary-safe output. >> >> Note that in all cases (I think) we use >> >>$cgi->header( >> -type => 'text/plain', >> -charset => 'utf-8', >> ... >>); >> >> promising web browser that output is as whole in 'utf-8' encoding. > > Yes. > >> It is not explained in the
Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch
Jakub Narębski writes: > On 08.04.2013, Junio C Hamano wrote: >> j...@blackdown.de (Jürgen Kreileder) writes: >> >>> Fixes the encoding for several _plain actions and for text/* and */*+xml >>> blobs. >>> >>> Signed-off-by: Jürgen Kreileder > > I see that this patch does (or tries to do) two _independent_ things, > and should be split into at least two separate commits: Agreed. >>> --- >> >> Thanks, will queue but not hold until I hear something from Jakub. >> >>> gitweb/gitweb.perl |8 +++- >>> 1 file changed, 7 insertions(+), 1 deletion(-) >>> >>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >>> index 1309196..9cfe5b5 100755 >>> --- a/gitweb/gitweb.perl >>> +++ b/gitweb/gitweb.perl >>> @@ -3823,7 +3823,7 @@ sub blob_contenttype { >>> my ($fd, $file_name, $type) = @_; >>> >>> $type ||= blob_mimetype($fd, $file_name); >>> - if ($type eq 'text/plain' && defined $default_text_plain_charset) { >>> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ >>> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >>> $type .= "; charset=$default_text_plain_charset"; >>> } > > First, it extends adding "; charset=$default_text_plain_charset" to > other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml' > mimetypes (without changing name of variable... though this is more > complicated as it is configuration variable and we would want to > preserve backward compatibility, but at least a comment would be, > I think, needed). > > Originally it applied only to 'text/plain' files, which can be > displayed inline by web browser, and which need charset in > 'Content-Type:' HTTP header to be displayed correctly, as they > do not include such information inside the file. What prompted the change is that some browsers (Chrome and Safari) also display other file types inline: text/x-chdr, text/x-java, text/x-objc, text/x-sh, ... At least on my system these files do get served with charset set! (ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...) $ curl -I "https://git.blackdown.de/old.cgi?p=contactalbum.git;a=blob_plain;f=Classes/ContactAlbum.h;hb=HEAD"; HTTP/1.1 200 OK Date: Tue, 09 Apr 2013 21:47:30 GMT Server: Apache Content-disposition: inline; filename="Classes/ContactAlbum.h" X-Frame-Options: SAMEORIGIN Vary: User-Agent X-UA-Compatible: IE=edge Strict-Transport-Security: max-age=31536000 Content-Type: text/x-chdr; charset=ISO-8859-1 > 'text/html' and 'application/xhtml+xml' can include such information > inside them (meta http-equiv for 'text/html' and for > 'application/xhtml+xml'). I don't know what browser does when there > is conflicting information about charset, i.e. which one wins, but > it is certainly something to consider. As shown above, even without my patch, this already can happen! > It might be a good change; I don't know what web browser do when > serving 'text/css', 'text/javascript', 'text/xml' to client to > view without media type known. > > > BTW I have noticed that we do $prevent_xss dance outside > blob_contenttype(), in it's only caller i.e. git_blob_plain()... > which for example means that 'text/html' converted to 'text/plain' > don't get '; charset=...' added. I guess that it *might* be > what prompted this part of change... but if it is so, it needs > to be fixed at source, e.g. by moving $prevent_xss to > blob_contenttype() subroutine. Jep. [...] > Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think > that should be abandoned in favor of 'patch' view; but that is > a separate issue) and 'patch' views so they use binary-safe output. > > Note that in all cases (I think) we use > >$cgi->header( > -type => 'text/plain', > -charset => 'utf-8', > ... >); > > promising web browser that output is as whole in 'utf-8' encoding. Yes. > It is not explained in the commit message what is the reason for this > change. Is it better handing of a situation where files being diff-ed > being in other encoding (like for example in commit that changes > encoding of files from legacy encoding such like e.g. iso-8859-2 > or cp1250 to utf-8)? I do see encoding problems when comparing utf8 to utf8 files (i.e. no encoding change). For instance: https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8 I don't claim to be an expert in Perl's utf8 handling but I guess this because Perl's internal utf8 form differs from the normal utf8 out from git commands. Switching to :raw fixes that: We write plain utf8 (and, as noted above, charset is set to utf8 already) With the patch: https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8 > But what about -charset => 'utf-8' then? That's a good question. I think I never tried git with anything besides ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files). (UTF16 defini
Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch
On 08.04.2013, Junio C Hamano wrote: > j...@blackdown.de (Jürgen Kreileder) writes: > >> Fixes the encoding for several _plain actions and for text/* and */*+xml >> blobs. >> >> Signed-off-by: Jürgen Kreileder I see that this patch does (or tries to do) two _independent_ things, and should be split into at least two separate commits: >> --- > > Thanks, will queue but not hold until I hear something from Jakub. > >> gitweb/gitweb.perl |8 +++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >> index 1309196..9cfe5b5 100755 >> --- a/gitweb/gitweb.perl >> +++ b/gitweb/gitweb.perl >> @@ -3823,7 +3823,7 @@ sub blob_contenttype { >> my ($fd, $file_name, $type) = @_; >> >> $type ||= blob_mimetype($fd, $file_name); >> -if ($type eq 'text/plain' && defined $default_text_plain_charset) { >> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ >> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; >> } First, it extends adding "; charset=$default_text_plain_charset" to other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml' mimetypes (without changing name of variable... though this is more complicated as it is configuration variable and we would want to preserve backward compatibility, but at least a comment would be, I think, needed). Originally it applied only to 'text/plain' files, which can be displayed inline by web browser, and which need charset in 'Content-Type:' HTTP header to be displayed correctly, as they do not include such information inside the file. 'text/html' and 'application/xhtml+xml' can include such information inside them (meta http-equiv for 'text/html' and for 'application/xhtml+xml'). I don't know what browser does when there is conflicting information about charset, i.e. which one wins, but it is certainly something to consider. It might be a good change; I don't know what web browser do when serving 'text/css', 'text/javascript', 'text/xml' to client to view without media type known. BTW I have noticed that we do $prevent_xss dance outside blob_contenttype(), in it's only caller i.e. git_blob_plain()... which for example means that 'text/html' converted to 'text/plain' don't get '; charset=...' added. I guess that it *might* be what prompted this part of change... but if it is so, it needs to be fixed at source, e.g. by moving $prevent_xss to blob_contenttype() subroutine. About implementation: >> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ >> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; would be better with line split >> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ >> m!^\w[-\w]*/\w[-\w]*\+xml$!) && >> +defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; Also: do we need to be strict with '\w[-\w]*', or would '.*' suffice? >> @@ -7637,7 +7637,9 @@ sub git_blobdiff { >> last if $line =~ m!^\+\+\+!; >> } >> local $/ = undef; >> +binmode STDOUT, ':raw'; >> print <$fd>; >> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd; >> } >> } >> @@ -7884,12 +7886,16 @@ sub git_commitdiff { >> >> } elsif ($format eq 'plain') { >> local $/ = undef; >> +binmode STDOUT, ':raw'; >> print <$fd>; >> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd >> or print "Reading git-diff-tree failed\n"; >> } elsif ($format eq 'patch') { >> local $/ = undef; >> +binmode STDOUT, ':raw'; >> print <$fd>; >> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd >> or print "Reading git-format-patch failed\n"; >> } Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think that should be abandoned in favor of 'patch' view; but that is a separate issue) and 'patch' views so they use binary-safe output. Note that in all cases (I think) we use $cgi->header( -type => 'text/plain', -charset => 'utf-8', ... ); promising web browser that output is as whole in 'utf-8' encoding. It is not explained in the commit message what is the reason for this change. Is it better handing of a situation where files being diff-ed being in other encoding (like for example in commit that changes encoding of files from legacy encoding such like e.g. iso-8859-2 or cp1250 to utf-8)? But what about -charset => 'utf-8' then? About implementation: I think after this change common code crosses threshold for refactoring it into subroutine, for example: sub dump_fh_raw {
Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch
j...@blackdown.de (Jürgen Kreileder) writes: > Fixes the encoding for several _plain actions and for text/* and */*+xml > blobs. > > Signed-off-by: Jürgen Kreileder > --- Thanks, will queue but not hold until I hear something from Jakub. > gitweb/gitweb.perl |8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl > index 1309196..9cfe5b5 100755 > --- a/gitweb/gitweb.perl > +++ b/gitweb/gitweb.perl > @@ -3823,7 +3823,7 @@ sub blob_contenttype { > my ($fd, $file_name, $type) = @_; > > $type ||= blob_mimetype($fd, $file_name); > - if ($type eq 'text/plain' && defined $default_text_plain_charset) { > + if (($type =~ m!^text/\w[-\w]*$! || $type =~ > m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { > $type .= "; charset=$default_text_plain_charset"; > } > > @@ -7637,7 +7637,9 @@ sub git_blobdiff { > last if $line =~ m!^\+\+\+!; > } > local $/ = undef; > + binmode STDOUT, ':raw'; > print <$fd>; > + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi > close $fd; > } > } > @@ -7884,12 +7886,16 @@ sub git_commitdiff { > > } elsif ($format eq 'plain') { > local $/ = undef; > + binmode STDOUT, ':raw'; > print <$fd>; > + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi > close $fd > or print "Reading git-diff-tree failed\n"; > } elsif ($format eq 'patch') { > local $/ = undef; > + binmode STDOUT, ':raw'; > print <$fd>; > + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi > close $fd > or print "Reading git-format-patch failed\n"; > } -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html