Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-10 Thread Jakub Narębski
On 09.04.2013 at 23:59, Jürgen Kreileder wrote:
> Jakub Narębski  writes:
> 
>> On 08.04.2013, Junio C Hamano wrote:
>>> j...@blackdown.de (Jürgen Kreileder) writes:
>>>
 Fixes the encoding for several _plain actions and for text/* and */*+xml 
 blobs. 

 Signed-off-by: Jürgen Kreileder 
>>
>> I see that this patch does (or tries to do) two _independent_ things,
>> and should be split into at least two separate commits:
[...]
>> First, it extends adding "; charset=$default_text_plain_charset" to
>> other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
>> mimetypes (without changing name of variable... though this is more
>> complicated as it is configuration variable and we would want to
>> preserve backward compatibility, but at least a comment would be,
>> I think, needed).
>>  
>> Originally it applied only to 'text/plain' files, which can be
>> displayed inline by web browser, and which need charset in
>> 'Content-Type:' HTTP header to be displayed correctly, as they
>> do not include such information inside the file.
> 
> What prompted the change is that some browsers (Chrome and Safari) also
> display other file types inline: text/x-chdr, text/x-java, text/x-objc,
> text/x-sh, ...
> 
> At least on my system these files do get served with charset set!
> (ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...)
[...]

I think this (the reason behind this change) should be explained in the
commit message.

>> 'text/html' and 'application/xhtml+xml' can include such information
>> inside them (meta http-equiv for 'text/html' and  for
>> 'application/xhtml+xml').  I don't know what browser does when there
>> is conflicting information about charset, i.e. which one wins, but
>> it is certainly something to consider.
> 
> As shown above, even without my patch, this already can happen!
> 
>> It might be a good change; I don't know what web browser do when
>> serving 'text/css', 'text/javascript', 'text/xml' to client to
>> view without media type known.

There are few options on how to handle this.

First, there is an issue of $default_text_plain_charset configuration
variable.  Any changes to its use (or adding new configuration
variables) should be accompanied with changing / adding comment near the
place where the default value is set, and changing / adding to
the documentation, namely gitweb.conf.txt

We can either:
1.) Use $default_text_plain_charset variable both for 'text/plain'
and 'text/*' etc. content types in 'blob_plain' (aka 'raw') view
for backwards compatibility, just add comment that it applies more
than 'text/plain'

2.) Add new configuration variable, e.g. $default_blob_plain_charset
and use it in place of $default_text_plain_charset as above,
initializing it to $default_text_plain_charset.

This practically renames $default_text_plain_charset preserving
backward compatibility.  Documentation would talk now about new
variable name.

3.) Add new configuration variable, e.g. $default_text_other_charset,
or $default_blob_plain_charset, or $default_inline_charset, etc.
which will be used for files other than 'text/plain'.  This would
make it possible to turn this new feature on and off.

3.a.) New feature default to on, i.e. to $default_text_plain_charset
3.b.) New feature default to off, i.e. undef


Second, there is an issue of file types, like HTML or XML (or XHTML,
or sometimes for LaTeX), which have the information about encoding
embedded in file.

We can:

A.) Skip this issue, and always add charset for 'text/*' and
'*/*-xml' files

B.) Threat those few media types in special way, excluding them from
adding charset

C.) Make which types to be added charset configurable.


Anyway I think this feature is much less urgent.  It fixes an annoyance
rather than bug, as you can always choose which charset to use to
display content in a browser.  We can take time to implement it well,
especially that it interacts with config (and backward compatibility
of config variables).

>> BTW I have noticed that we do $prevent_xss dance outside
>> blob_contenttype(), in it's only caller i.e. git_blob_plain()...
>> which for example means that 'text/html' converted to 'text/plain'
>> don't get '; charset=...' added.  I guess that it *might* be
>> what prompted this part of change... but if it is so, it needs
>> to be fixed at source, e.g. by moving $prevent_xss to
>> blob_contenttype() subroutine.
> 
> Jep.

But I see this is yet another separate issue.

>> Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
>> that should be abandoned in favor of 'patch' view; but that is
>> a separate issue) and 'patch' views so they use binary-safe output.
>>
>> Note that in all cases (I think) we use
>>
>>$cgi->header(
>>  -type => 'text/plain',
>>  -charset => 'utf-8',
>>  ...
>>);
>>
>> promising web browser that output is as whole in 'utf-8' encoding.
> 
> Yes.
> 
>> It is not explained in the 

Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-09 Thread Jürgen Kreileder
Jakub Narębski  writes:

> On 08.04.2013, Junio C Hamano wrote:
>> j...@blackdown.de (Jürgen Kreileder) writes:
>> 
>>> Fixes the encoding for several _plain actions and for text/* and */*+xml 
>>> blobs. 
>>>
>>> Signed-off-by: Jürgen Kreileder 
>
> I see that this patch does (or tries to do) two _independent_ things,
> and should be split into at least two separate commits:

Agreed.

>>> ---
>> 
>> Thanks, will queue but not hold until I hear something from Jakub.
>> 
>>>  gitweb/gitweb.perl |8 +++-
>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>>> index 1309196..9cfe5b5 100755
>>> --- a/gitweb/gitweb.perl
>>> +++ b/gitweb/gitweb.perl
>>> @@ -3823,7 +3823,7 @@ sub blob_contenttype {
>>> my ($fd, $file_name, $type) = @_;
>>>  
>>> $type ||= blob_mimetype($fd, $file_name);
>>> -   if ($type eq 'text/plain' && defined $default_text_plain_charset) {
>>> +   if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
>>> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>> $type .= "; charset=$default_text_plain_charset";
>>> }
>
> First, it extends adding "; charset=$default_text_plain_charset" to
> other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
> mimetypes (without changing name of variable... though this is more
> complicated as it is configuration variable and we would want to
> preserve backward compatibility, but at least a comment would be,
> I think, needed).
>  
> Originally it applied only to 'text/plain' files, which can be
> displayed inline by web browser, and which need charset in
> 'Content-Type:' HTTP header to be displayed correctly, as they
> do not include such information inside the file.

What prompted the change is that some browsers (Chrome and Safari) also
display other file types inline: text/x-chdr, text/x-java, text/x-objc,
text/x-sh, ...

At least on my system these files do get served with charset set!
(ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...)

$ curl -I 
"https://git.blackdown.de/old.cgi?p=contactalbum.git;a=blob_plain;f=Classes/ContactAlbum.h;hb=HEAD";
HTTP/1.1 200 OK
Date: Tue, 09 Apr 2013 21:47:30 GMT
Server: Apache
Content-disposition: inline; filename="Classes/ContactAlbum.h"
X-Frame-Options: SAMEORIGIN
Vary: User-Agent
X-UA-Compatible: IE=edge
Strict-Transport-Security: max-age=31536000
Content-Type: text/x-chdr; charset=ISO-8859-1

> 'text/html' and 'application/xhtml+xml' can include such information
> inside them (meta http-equiv for 'text/html' and  for
> 'application/xhtml+xml').  I don't know what browser does when there
> is conflicting information about charset, i.e. which one wins, but
> it is certainly something to consider.

As shown above, even without my patch, this already can happen!

> It might be a good change; I don't know what web browser do when
> serving 'text/css', 'text/javascript', 'text/xml' to client to
> view without media type known.
>
>
> BTW I have noticed that we do $prevent_xss dance outside
> blob_contenttype(), in it's only caller i.e. git_blob_plain()...
> which for example means that 'text/html' converted to 'text/plain'
> don't get '; charset=...' added.  I guess that it *might* be
> what prompted this part of change... but if it is so, it needs
> to be fixed at source, e.g. by moving $prevent_xss to
> blob_contenttype() subroutine.

Jep.

[...]

> Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
> that should be abandoned in favor of 'patch' view; but that is
> a separate issue) and 'patch' views so they use binary-safe output.
>
> Note that in all cases (I think) we use
>
>$cgi->header(
>  -type => 'text/plain',
>  -charset => 'utf-8',
>  ...
>);
>
> promising web browser that output is as whole in 'utf-8' encoding.

Yes.

> It is not explained in the commit message what is the reason for this
> change.  Is it better handing of a situation where files being diff-ed
> being in other encoding (like for example in commit that changes
> encoding of files from legacy encoding such like e.g. iso-8859-2
> or cp1250 to utf-8)?

I do see encoding problems when comparing utf8 to utf8 files (i.e. no
encoding change).  For instance:
https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

I don't claim to be an expert in Perl's utf8 handling but I guess this
because Perl's internal utf8 form differs from the normal utf8 out from
git commands.  Switching to :raw fixes that: We write plain utf8 (and,
as noted above, charset is set to utf8 already)

With the patch: 
https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

> But what about -charset => 'utf-8'  then?

That's a good question.  I think I never tried git with anything besides
ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files).
(UTF16 defini

Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-09 Thread Jakub Narębski
On 08.04.2013, Junio C Hamano wrote:
> j...@blackdown.de (Jürgen Kreileder) writes:
> 
>> Fixes the encoding for several _plain actions and for text/* and */*+xml 
>> blobs. 
>>
>> Signed-off-by: Jürgen Kreileder 

I see that this patch does (or tries to do) two _independent_ things,
and should be split into at least two separate commits:

>> ---
> 
> Thanks, will queue but not hold until I hear something from Jakub.
> 
>>  gitweb/gitweb.perl |8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 1309196..9cfe5b5 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -3823,7 +3823,7 @@ sub blob_contenttype {
>>  my ($fd, $file_name, $type) = @_;
>>  
>>  $type ||= blob_mimetype($fd, $file_name);
>> -if ($type eq 'text/plain' && defined $default_text_plain_charset) {
>> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
>> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>  $type .= "; charset=$default_text_plain_charset";
>>  }

First, it extends adding "; charset=$default_text_plain_charset" to
other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
mimetypes (without changing name of variable... though this is more
complicated as it is configuration variable and we would want to
preserve backward compatibility, but at least a comment would be,
I think, needed).
 
Originally it applied only to 'text/plain' files, which can be
displayed inline by web browser, and which need charset in
'Content-Type:' HTTP header to be displayed correctly, as they
do not include such information inside the file.

'text/html' and 'application/xhtml+xml' can include such information
inside them (meta http-equiv for 'text/html' and  for
'application/xhtml+xml').  I don't know what browser does when there
is conflicting information about charset, i.e. which one wins, but
it is certainly something to consider.

It might be a good change; I don't know what web browser do when
serving 'text/css', 'text/javascript', 'text/xml' to client to
view without media type known.


BTW I have noticed that we do $prevent_xss dance outside
blob_contenttype(), in it's only caller i.e. git_blob_plain()...
which for example means that 'text/html' converted to 'text/plain'
don't get '; charset=...' added.  I guess that it *might* be
what prompted this part of change... but if it is so, it needs
to be fixed at source, e.g. by moving $prevent_xss to
blob_contenttype() subroutine.


About implementation:

>> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
>> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>  $type .= "; charset=$default_text_plain_charset";

would be better with line split

>> +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
>> m!^\w[-\w]*/\w[-\w]*\+xml$!) &&
>> +defined $default_text_plain_charset) {
>>  $type .= "; charset=$default_text_plain_charset";

Also: do we need to be strict with '\w[-\w]*', or would '.*' suffice?

>> @@ -7637,7 +7637,9 @@ sub git_blobdiff {
>>  last if $line =~ m!^\+\+\+!;
>>  }
>>  local $/ = undef;
>> +binmode STDOUT, ':raw';
>>  print <$fd>;
>> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  close $fd;
>>  }
>>  }
>> @@ -7884,12 +7886,16 @@ sub git_commitdiff {
>>  
>>  } elsif ($format eq 'plain') {
>>  local $/ = undef;
>> +binmode STDOUT, ':raw';
>>  print <$fd>;
>> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  close $fd
>>  or print "Reading git-diff-tree failed\n";
>>  } elsif ($format eq 'patch') {
>>  local $/ = undef;
>> +binmode STDOUT, ':raw';
>>  print <$fd>;
>> +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  close $fd
>>  or print "Reading git-format-patch failed\n";
>>  }

Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
that should be abandoned in favor of 'patch' view; but that is
a separate issue) and 'patch' views so they use binary-safe output.

Note that in all cases (I think) we use

   $cgi->header(
 -type => 'text/plain',
 -charset => 'utf-8',
 ...
   );

promising web browser that output is as whole in 'utf-8' encoding.

It is not explained in the commit message what is the reason for this
change.  Is it better handing of a situation where files being diff-ed
being in other encoding (like for example in commit that changes
encoding of files from legacy encoding such like e.g. iso-8859-2
or cp1250 to utf-8)?  But what about  -charset => 'utf-8'  then?


About implementation: I think after this change common code crosses
threshold for refactoring it into subroutine, for example:

  sub dump_fh_raw {
   

Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-08 Thread Junio C Hamano
j...@blackdown.de (Jürgen Kreileder) writes:

> Fixes the encoding for several _plain actions and for text/* and */*+xml 
> blobs. 
>
> Signed-off-by: Jürgen Kreileder 
> ---

Thanks, will queue but not hold until I hear something from Jakub.

>  gitweb/gitweb.perl |8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 1309196..9cfe5b5 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -3823,7 +3823,7 @@ sub blob_contenttype {
>   my ($fd, $file_name, $type) = @_;
>  
>   $type ||= blob_mimetype($fd, $file_name);
> - if ($type eq 'text/plain' && defined $default_text_plain_charset) {
> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
> m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>   $type .= "; charset=$default_text_plain_charset";
>   }
>  
> @@ -7637,7 +7637,9 @@ sub git_blobdiff {
>   last if $line =~ m!^\+\+\+!;
>   }
>   local $/ = undef;
> + binmode STDOUT, ':raw';
>   print <$fd>;
> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>   close $fd;
>   }
>  }
> @@ -7884,12 +7886,16 @@ sub git_commitdiff {
>  
>   } elsif ($format eq 'plain') {
>   local $/ = undef;
> + binmode STDOUT, ':raw';
>   print <$fd>;
> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>   close $fd
>   or print "Reading git-diff-tree failed\n";
>   } elsif ($format eq 'patch') {
>   local $/ = undef;
> + binmode STDOUT, ':raw';
>   print <$fd>;
> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>   close $fd
>   or print "Reading git-format-patch failed\n";
>   }
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html