Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-27 Thread Jakub Narębski
W dniu 2014-05-16 19:05, Junio C Hamano pisze:
 Jakub Narębski jna...@gmail.com writes:
 
 Correct, but is where does it appear the question we are
 primarily interested in, wrt this breakage and its fix?

 That of course depends on how we want to test gitweb output.
 The simplest solution, comparing with known output with perhaps
 fragile / variable elements masked out could be done quickly...
 but changes in output (even if they don't change functionality,
 or don't change visible output) require regenerating test cases
 (expected output) to test against - which might be source of
 errors in test suite.
 
 I agree with your to test it fully, we need extra dependencies,
 but my point is that it does not have to be a full HTML-validating,
 picking the expected attribute via XPATH matching kind of test if
 what we want is only to add a new test to protect this particular
 fix from future breakages.
 
 For example, I think it is sufficient to grep for 'href=...%xx%xx'
 in the output after preparing a sample tree with one entry to show.
 The expected substring either exists (in which case we got it
 right), or it doesn't (in which case we are showing garbage).  Of
 course that depends on the assumption that its output is not too
 heavily contaminated with volatile parts outside our control, as I
 already mentioned in the message you are responding to.
 
 But it all depends on if we wanted to add a new test ;-)

I tried to add such simple test to t9502, but instead of tests
failing with current version, the test setup fails but succeeds
(i.e. test library says that it failed, but manual examination
shows that everything is O.K.).

-- 8 --
From: Jakub Narebski jna...@gmail.com
Subject: [PATCH/RFC] gitweb test: Test proper encoding of non US-ASCII 
filenames in output (WIP)

This t9502 test is intended to test for proper encoding of non
US-ASCII filenames (i.e. UTF-8 filenames) in generated links (which
need some form of URI encoding) and in generated HTML (which needs
HTML encoding / escaping).

For now it tests only 'tree' view (though incidentally it also tests
UTF-8 in commit subject), as this was the action where reportedly
there was bug in link encoding: $t{'name'} coming from the
git ls-tree -z ... command via @ntries array was not marked as
UTF-8, making Perl assume that it is in internal Perl format
i.e. iso-8859-1 encoding and URI-escaping it as if it was in
iso-8859-1 encoding (e.g. Gütekriterien.txt in UTF-8 is
Gütekriterien.txt if treated as iso-8859-1, and it then
encodes to G%C3%83%C2%BCtekriterien.txt instead of correct
G%C3%BCtekriterien.txt).

UNFORTUNATELY test does not fail as it should, even though the issue
was not fixed... OTOH it fails in setup though it is successful.

Reported-by: Michael Wagner accou...@mwagner.org
Signed-off-by: Jakub Narębski jna...@gmail.com
---
 t/t9502-gitweb-standalone-parse-output.sh |   34 +
 1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/t/t9502-gitweb-standalone-parse-output.sh 
b/t/t9502-gitweb-standalone-parse-output.sh
index 86dfee2..37246a3 100755
--- a/t/t9502-gitweb-standalone-parse-output.sh
+++ b/t/t9502-gitweb-standalone-parse-output.sh
@@ -201,4 +201,38 @@ test_expect_success 'xss checks' '
xss a=rssp=foo.gitf=$TAG
 '
 
+link_check () {
+   grep -F   %3C__%C2%A3%C3%A5%C3%AB%C3%AE%C3%B1%C3%B2%C3%BB%C3%BD%C2%B6 
\
+   gitweb.body 
+   ! grep -F %3C__%A3%E5%EB%EE%F1%F2%FB%FD%B6 \
+   gitweb.body
+}
+
+test_expect_success 'prepare UTF-8 output tests' '
+   FILENAME=__£åëîñòûý¶  +;?__ 
+   test_commit Adding $FILENAME $FILENAME $FILENAME contents
+'
+
+test_expect_success 'check URI-escaped UTF-8 filename in query-params link' '
+   cat gitweb_config.perl -\EOF 
+   $feature{pathinfo}{default} = [0];
+   EOF
+   gitweb_run p=.git;a=tree 
+   link_check
+'
+
+test_expect_success 'check URI-escaped UTF-8 filename in path_info link' '
+   cat gitweb_config.perl -\EOF 
+   $feature{pathinfo}{default} = [1];
+   EOF
+   gitweb_run  /.git/tree 
+   link_check
+'
+
+test_expect_success 'check HTML-escaped UTF-8 filename in body' '
+   gitweb_run p=.git;a=tree 
+   grep -F lt;__£åëîñòûý¶  +;?amp;__gt; gitweb.body 
+   ! grep -F  __£åëîñòûý¶  +;?__ gitweb.body
+'
+
 test_done
-- 
1.7.1


 

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-16 Thread Jakub Narębski
On Fri, May 16, 2014 at 3:26 AM, Junio C Hamano gits...@pobox.com wrote:
 Jakub Narębski jna...@gmail.com writes:
 On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote:
 Jakub Narębski jna...@gmail.com writes:

 Writing test for this would not be easy, and require some HTML
 parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
 ... or low level HTML::TreeBuilder, or other low level parser).

 Hmph.  Is it more than just looking for a specific run of %xx we
 would expect to see in the output of the tree view for a repository
 in which there is one tree with non-ASCII name?

 There is if we want to check (in non-fragile way) that said
 specific run is in 'href' *attribute* of 'a' element (link target).

 Correct, but is where does it appear the question we are
 primarily interested in, wrt this breakage and its fix?

That of course depends on how we want to test gitweb output.
The simplest solution, comparing with known output with perhaps
fragile / variable elements masked out could be done quickly...
but changes in output (even if they don't change functionality,
or don't change visible output) require regenerating test cases
(expected output) to test against - which might be source of
errors in test suite.

Another simple solution, grepping for expected strings, also
easy to create, has the disadvantage of being only positive
test - you cannot [easily] test that there are no *wrong* output,
only that right string exists somewhere.

 If gitweb output has some volatile parts that do not depend on the
 contents of the Git test repository (e.g. showing contents of
 /etc/motd, date/time of when the test was run, or the full pathname
 leading to the trash directory), then preparing a tree whose name is
 äéìõû and making sure that the properly encoded version of äéìõû
 appears anywhere in the output may not be sufficient to validate
 that we got the encoding right, as that string may appear in the
 parts that are totally unrelated to the contents being shown and not
 under our control.  But is that really the case?

Well, I guess that any test is better than no test (though OTOH
Heartbleed and goto fail bugs shows the importance of negative
tests).

 Also we may introduce a bug and misspell the attr name and produce
 an anchor element with hpef attribute with the properly encoded URL
 in it, and your parse HTML properly approach would catch it, but
 is that the kind of breakage under discussion?  You hinted at new
 tests for UTF-8 encoding in the other message in the thread earlier,
 and I've been assuming that we were talking about the encoding test,
 not a test to catch s/href/hpef/ kind of breakage.

One of tests possible with HTML parser (e.g. WWW::Mechanize::CGI)
is to check that all [internal] links leads to 200-OK pages, which
accidentally would also be a test against this breakage.

-- 
Jakub Narebski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-16 Thread Junio C Hamano
Jakub Narębski jna...@gmail.com writes:

 Correct, but is where does it appear the question we are
 primarily interested in, wrt this breakage and its fix?

 That of course depends on how we want to test gitweb output.
 The simplest solution, comparing with known output with perhaps
 fragile / variable elements masked out could be done quickly...
 but changes in output (even if they don't change functionality,
 or don't change visible output) require regenerating test cases
 (expected output) to test against - which might be source of
 errors in test suite.

I agree with your to test it fully, we need extra dependencies,
but my point is that it does not have to be a full HTML-validating,
picking the expected attribute via XPATH matching kind of test if
what we want is only to add a new test to protect this particular
fix from future breakages.

For example, I think it is sufficient to grep for 'href=...%xx%xx'
in the output after preparing a sample tree with one entry to show.
The expected substring either exists (in which case we got it
right), or it doesn't (in which case we are showing garbage).  Of
course that depends on the assumption that its output is not too
heavily contaminated with volatile parts outside our control, as I
already mentioned in the message you are responding to.

But it all depends on if we wanted to add a new test ;-)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Peter Krefting

Michael Wagner:


Decoding the UTF-8 encoded file name (again with an additional print
statement):

$ REQUEST_METHOD=GET 
QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
 ./gitweb.cgi

work/Gütekriterien.txt
Content-disposition: inline; filename=work/Gütekriterien.txt


You should fix the code path that created that URI, though, as it is 
not what you expected.


%C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde
%C2%BC decodes to U+00BC Vulgar Graction One Quarter

The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from 
looking at which two characters the sequence above yielded, C3 BC, 
which in a URI is represented as %C3%BC.


Your QUERY_STRING should thus be

  p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD

which probably works as expected.

What is happening is that whatever is generating the URI us 
UTF-8-encoding the string twice (i.e., it generates a string with the 
proper C3 BC in it, and then interprets it as iso-8859-1 data and runs 
that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence 
you see above).


--
\\// Peter - http://www.softwolves.pp.se/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Jakub Narębski
On Thu, May 15, 2014 at 7:08 AM, Michael Wagner accou...@mwagner.org wrote:
 On Thu, May 15, 2014 at 12:25:45AM +0200, Jakub Narębski wrote:
 On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote:
 Michael Wagner accou...@mwagner.org writes:

 Perl has an internal encoding used to store text strings. Currently, 
 trying to
 view files with UTF-8 encoded names results in an error (either 404 - 
 Cannot
 find file [blob_plain] or XML Parsing Error [blob]). Converting these 
 UTF-8
 encoded file names into Perl's internal format resolves these errors.

 Could you give us an example?  What is important is whether filename
 is passed via path_info or via query string.


 There is a file named Gütekriterien.txt in my repository. Trying to
 view this file as blob_plain produces an 404 error (displaying the
 file name with an additional print statement):

 $ REQUEST_METHOD=GET 
 QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
  ./gitweb.cgi

 work/Gütekriterien.txt
 Status: 404 Not Found

You have URI encoding of ü wrong! ü encodes as %C3%BC, not
as %C3%83%C2%BC (4 bytes?)

  http://www.url-encode-decode.com/

You tested with wrong input.

BTW. there probably should be test for UTF-8 encoding, similar to
the one for XSS in t9502-gitweb-standalone-parse-output
-- 
Jakub Narębski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Junio C Hamano
Peter Krefting pe...@softwolves.pp.se writes:

 What is happening is that whatever is generating the URI us
 UTF-8-encoding the string twice (i.e., it generates a string with the
 proper C3 BC in it, and then interprets it as iso-8859-1 data and runs
 that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence
 you see above).

Thanks for a quick response.  If the input was unnecessarily encoded
one extra time, it is no wonder it needed one unnecessary extra
decoding.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Michael Wagner
On Thu, May 15, 2014 at 10:04:24AM +0100, Peter Krefting wrote:
 Michael Wagner:
 
 Decoding the UTF-8 encoded file name (again with an additional print
 statement):
 
 $ REQUEST_METHOD=GET 
 QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
  ./gitweb.cgi
 
 work/Gütekriterien.txt
 Content-disposition: inline; filename=work/Gütekriterien.txt
 
 You should fix the code path that created that URI, though, as it is not
 what you expected.
 
 %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde
 %C2%BC decodes to U+00BC Vulgar Graction One Quarter
 
 The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from
 looking at which two characters the sequence above yielded, C3 BC, which in
 a URI is represented as %C3%BC.
 
 Your QUERY_STRING should thus be
 
   p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD
 
 which probably works as expected.

Obviously, you are right, thanks.

 
 What is happening is that whatever is generating the URI us UTF-8-encoding
 the string twice (i.e., it generates a string with the proper C3 BC in it,
 and then interprets it as iso-8859-1 data and runs that through a UTF-8
 encoder again, yielding the C3 83 C2 BC sequence you see above).
 

The subroutine git tree generates the tree view. It stores the output
of git ls-tree -z ... in an array named @entries. Printing the content
of this array yields the following result:

00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt

This leads to the doubled encoding. Declaring the encoding in the call
to open yields the following result:

100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt

---

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index a9f57d6..f1414e1 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -7138,7 +7138,7 @@ sub git_tree {
my @entries = ();
{
local $/ = \0;
-   open my $fd, -|, git_cmd(), ls-tree, '-z',
+   open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z',
($show_sizes ? '-l' : ()), @extra_options, $hash
or die_error(500, Open git-ls-tree failed);
@entries = map { chomp; $_ } $fd;

 -- 
 \\// Peter - http://www.softwolves.pp.se/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Jakub Narębski
On Thu, May 15, 2014 at 8:48 PM, Michael Wagner accou...@mwagner.org wrote:
 On Thu, May 15, 2014 at 10:04:24AM +0100, Peter Krefting wrote:
 Michael Wagner:

Decoding the UTF-8 encoded file name (again with an additional print
statement):

$ REQUEST_METHOD=GET 
QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
 ./gitweb.cgi

work/Gütekriterien.txt
Content-disposition: inline; filename=work/Gütekriterien.txt

 You should fix the code path that created that URI, though, as it is not
 what you expected.

 %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde
 %C2%BC decodes to U+00BC Vulgar Graction One Quarter

 The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from
 looking at which two characters the sequence above yielded, C3 BC, which in
 a URI is represented as %C3%BC.

 Your QUERY_STRING should thus be

   p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD

 which probably works as expected.

 What is happening is that whatever is generating the URI us UTF-8-encoding
 the string twice (i.e., it generates a string with the proper C3 BC in it,
 and then interprets it as iso-8859-1 data and runs that through a UTF-8
 encoder again, yielding the C3 83 C2 BC sequence you see above).

 The subroutine git tree generates the tree view. It stores the output
 of git ls-tree -z ... in an array named @entries. Printing the content
 of this array yields the following result:

 00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt

 This leads to the doubled encoding. Declaring the encoding in the call
 to open yields the following result:

 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt

Good catch.

Writing test for this would not be easy, and require some HTML
parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
... or low level HTML::TreeBuilder, or other low level parser).

 ---

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index a9f57d6..f1414e1 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -7138,7 +7138,7 @@ sub git_tree {
 my @entries = ();
 {
 local $/ = \0;
 -   open my $fd, -|, git_cmd(), ls-tree, '-z',
 +   open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z',
 ($show_sizes ? '-l' : ()), @extra_options, $hash
 or die_error(500, Open git-ls-tree failed);

Or put

   binmode $fd, ':utf8';

like in the rest of the code.

 @entries = map { chomp; $_ } $fd;


Even better solution would be to use

use open IN = ':encoding(utf-8)';

at the beginning of gitweb.perl, once and for all.

Unfortunately the output equivalent requires creating Perl
module for gitweb, to be able to use

use open OUT = ':encoding(utf-8-with-fallback)';

-- 
Jakub Narebski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Jakub Narębski
On Thu, May 15, 2014 at 9:28 PM, Jakub Narębski jna...@gmail.com wrote:
 On Thu, May 15, 2014 at 8:48 PM, Michael Wagner accou...@mwagner.org wrote:
[...]
 The subroutine git tree generates the tree view. It stores the output
 of git ls-tree -z ... in an array named @entries. Printing the content
 of this array yields the following result:

 00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 
 Gütekriterien.txt

 This leads to the doubled encoding. Declaring the encoding in the call
 to open yields the following result:

 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 
 Gütekriterien.txt

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index a9f57d6..f1414e1 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -7138,7 +7138,7 @@ sub git_tree {
 my @entries = ();
 {
 local $/ = \0;
 -   open my $fd, -|, git_cmd(), ls-tree, '-z',
 +   open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z',
 ($show_sizes ? '-l' : ()), @extra_options, $hash
 or die_error(500, Open git-ls-tree failed);

 Or put

binmode $fd, ':utf8';

 like in the rest of the code.

 @entries = map { chomp; $_ } $fd;

Though to be exact there isn't any mechanism that ensures that
filenames in tree objects use utf-8 encoding, so perhaps a safer
solution would be to use

   to_utf8($file_name)

(which respects $fallback_encoding) in appropriate places.

-- 
Jakub Narębski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Junio C Hamano
Jakub Narębski jna...@gmail.com writes:

 Writing test for this would not be easy, and require some HTML
 parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
 ... or low level HTML::TreeBuilder, or other low level parser).

Hmph.  Is it more than just looking for a specific run of %xx we
would expect to see in the output of the tree view for a repository
in which there is one tree with non-ASCII name?

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index a9f57d6..f1414e1 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -7138,7 +7138,7 @@ sub git_tree {
 my @entries = ();
 {
 local $/ = \0;
 -   open my $fd, -|, git_cmd(), ls-tree, '-z',
 +   open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z',
 ($show_sizes ? '-l' : ()), @extra_options, $hash
 or die_error(500, Open git-ls-tree failed);

 Or put

binmode $fd, ':utf8';

 like in the rest of the code.

I expect a patch to do so and can forget about this thread myself,
then, OK?

Thanks all for digging this to the root.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Jakub Narębski
On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote:
 Jakub Narębski jna...@gmail.com writes:

 Writing test for this would not be easy, and require some HTML
 parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
 ... or low level HTML::TreeBuilder, or other low level parser).

 Hmph.  Is it more than just looking for a specific run of %xx we
 would expect to see in the output of the tree view for a repository
 in which there is one tree with non-ASCII name?

There is if we want to check (in non-fragile way) that said
specific run is in 'href' *attribute* of 'a' element (link target).

-- 
Jakub Narebski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-15 Thread Junio C Hamano
Jakub Narębski jna...@gmail.com writes:

 On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote:
 Jakub Narębski jna...@gmail.com writes:

 Writing test for this would not be easy, and require some HTML
 parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
 ... or low level HTML::TreeBuilder, or other low level parser).

 Hmph.  Is it more than just looking for a specific run of %xx we
 would expect to see in the output of the tree view for a repository
 in which there is one tree with non-ASCII name?

 There is if we want to check (in non-fragile way) that said
 specific run is in 'href' *attribute* of 'a' element (link target).

Correct, but is where does it appear the question we are
primarily interested in, wrt this breakage and its fix?

If gitweb output has some volatile parts that do not depend on the
contents of the Git test repository (e.g. showing contents of
/etc/motd, date/time of when the test was run, or the full pathname
leading to the trash directory), then preparing a tree whose name is
äéìõû and making sure that the properly encoded version of äéìõû
appears anywhere in the output may not be sufficient to validate
that we got the encoding right, as that string may appear in the
parts that are totally unrelated to the contents being shown and not
under our control.  But is that really the case?

Also we may introduce a bug and misspell the attr name and produce
an anchor element with hpef attribute with the properly encoded URL
in it, and your parse HTML properly approach would catch it, but
is that the kind of breakage under discussion?  You hinted at new
tests for UTF-8 encoding in the other message in the thread earlier,
and I've been assuming that we were talking about the encoding test,
not a test to catch s/href/hpef/ kind of breakage.


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-14 Thread Michael Wagner
Perl has an internal encoding used to store text strings. Currently, trying to
view files with UTF-8 encoded names results in an error (either 404 - Cannot
find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8
encoded file names into Perl's internal format resolves these errors.

Signed-off-by: Michael Wagner accou...@mwagner.org
---
 gitweb/gitweb.perl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index a9f57d6..6046977 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params {
}
}
 
-   our $file_name = $input_params{'file_name'};
+   our $file_name = decode(utf-8, $input_params{'file_name'});
if (defined $file_name) {
if (!is_valid_pathname($file_name)) {
die_error(400, Invalid file parameter);
-- 
1.9.0

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-14 Thread Junio C Hamano
Michael Wagner accou...@mwagner.org writes:

 Perl has an internal encoding used to store text strings. Currently, trying to
 view files with UTF-8 encoded names results in an error (either 404 - Cannot
 find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8
 encoded file names into Perl's internal format resolves these errors.

 Signed-off-by: Michael Wagner accou...@mwagner.org
 ---

Cc'ing Jakub, who have been the area maintainer, for comments.

One thing I wonder is that, if there are some additional calls to
encode() necessary before we embed $file_name (which are now decoded
to the internal string form, not a byte-sequence that happens to be
in utf-8) in the generated pages, if we were to do this change.

  gitweb/gitweb.perl | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index a9f57d6..6046977 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params {
   }
   }
  
 - our $file_name = $input_params{'file_name'};
 + our $file_name = decode(utf-8, $input_params{'file_name'});
   if (defined $file_name) {
   if (!is_valid_pathname($file_name)) {
   die_error(400, Invalid file parameter);
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-14 Thread Jakub Narębski
On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote:
 Michael Wagner accou...@mwagner.org writes:

 Perl has an internal encoding used to store text strings. Currently, trying 
 to
 view files with UTF-8 encoded names results in an error (either 404 - Cannot
 find file [blob_plain] or XML Parsing Error [blob]). Converting these 
 UTF-8
 encoded file names into Perl's internal format resolves these errors.

Could you give us an example?  What is important is whether filename
is passed via path_info or via query string.

Because in evaluate_uri() there is

 our $path_info = decode_utf8($ENV{PATH_INFO});

and in evaluate_query_params() there is

$input_params{$name} = decode_utf8($cgi-param($symbol));

 Signed-off-by: Michael Wagner accou...@mwagner.org
 ---

 Cc'ing Jakub, who have been the area maintainer, for comments.

 One thing I wonder is that, if there are some additional calls to
 encode() necessary before we embed $file_name (which are now decoded
 to the internal string form, not a byte-sequence that happens to be
 in utf-8) in the generated pages, if we were to do this change.

There should be no problem with output encoding.  esc_path(), which
should be used for filenames, includes to_utf8, which in turn uses
decode($fallback_encoding, $str, Encode::FB_DEFAULT);

  gitweb/gitweb.perl | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index a9f57d6..6046977 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params {
   }
   }

 - our $file_name = $input_params{'file_name'};
 + our $file_name = decode(utf-8, $input_params{'file_name'});
   if (defined $file_name) {
   if (!is_valid_pathname($file_name)) {
   die_error(400, Invalid file parameter);

Hmm... all %input_params should have been properly decoded
already, how it was missed?

Also, branchname (hash_base etc.), search query, filename in file_parent,
project name can be UTF-8 too, so it is at best partial fix.

-- 
Jakub Narębski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

2014-05-14 Thread Michael Wagner
On Thu, May 15, 2014 at 12:25:45AM +0200, Jakub Narębski wrote:
 On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote:
  Michael Wagner accou...@mwagner.org writes:
 
  Perl has an internal encoding used to store text strings. Currently, 
  trying to
  view files with UTF-8 encoded names results in an error (either 404 - 
  Cannot
  find file [blob_plain] or XML Parsing Error [blob]). Converting these 
  UTF-8
  encoded file names into Perl's internal format resolves these errors.
 
 Could you give us an example?  What is important is whether filename
 is passed via path_info or via query string.
 

There is a file named Gütekriterien.txt in my repository. Trying to
view this file as blob_plain produces an 404 error (displaying the
file name with an additional print statement):

$ REQUEST_METHOD=GET 
QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
 ./gitweb.cgi

work/Gütekriterien.txt
Status: 404 Not Found

Decoding the UTF-8 encoded file name (again with an additional print
statement):

$ REQUEST_METHOD=GET 
QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD'
 ./gitweb.cgi

work/Gütekriterien.txt
Content-disposition: inline; filename=work/Gütekriterien.txt

 Because in evaluate_uri() there is
 
  our $path_info = decode_utf8($ENV{PATH_INFO});
 
 and in evaluate_query_params() there is
 
 $input_params{$name} = decode_utf8($cgi-param($symbol));
 
  Signed-off-by: Michael Wagner accou...@mwagner.org
  ---
 
  Cc'ing Jakub, who have been the area maintainer, for comments.
 
  One thing I wonder is that, if there are some additional calls to
  encode() necessary before we embed $file_name (which are now decoded
  to the internal string form, not a byte-sequence that happens to be
  in utf-8) in the generated pages, if we were to do this change.

The generated pages show the correct file names. 

 
 There should be no problem with output encoding.  esc_path(), which
 should be used for filenames, includes to_utf8, which in turn uses
 decode($fallback_encoding, $str, Encode::FB_DEFAULT);
 
   gitweb/gitweb.perl | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
 
  diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
  index a9f57d6..6046977 100755
  --- a/gitweb/gitweb.perl
  +++ b/gitweb/gitweb.perl
  @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params {
}
}
 
  - our $file_name = $input_params{'file_name'};
  + our $file_name = decode(utf-8, $input_params{'file_name'});
if (defined $file_name) {
if (!is_valid_pathname($file_name)) {
die_error(400, Invalid file parameter);
 
 Hmm... all %input_params should have been properly decoded
 already, how it was missed?
 
 Also, branchname (hash_base etc.), search query, filename in file_parent,
 project name can be UTF-8 too, so it is at best partial fix.
 
 -- 
 Jakub Narębski
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html