Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
W dniu 2014-05-16 19:05, Junio C Hamano pisze: Jakub Narębski jna...@gmail.com writes: Correct, but is where does it appear the question we are primarily interested in, wrt this breakage and its fix? That of course depends on how we want to test gitweb output. The simplest solution, comparing with known output with perhaps fragile / variable elements masked out could be done quickly... but changes in output (even if they don't change functionality, or don't change visible output) require regenerating test cases (expected output) to test against - which might be source of errors in test suite. I agree with your to test it fully, we need extra dependencies, but my point is that it does not have to be a full HTML-validating, picking the expected attribute via XPATH matching kind of test if what we want is only to add a new test to protect this particular fix from future breakages. For example, I think it is sufficient to grep for 'href=...%xx%xx' in the output after preparing a sample tree with one entry to show. The expected substring either exists (in which case we got it right), or it doesn't (in which case we are showing garbage). Of course that depends on the assumption that its output is not too heavily contaminated with volatile parts outside our control, as I already mentioned in the message you are responding to. But it all depends on if we wanted to add a new test ;-) I tried to add such simple test to t9502, but instead of tests failing with current version, the test setup fails but succeeds (i.e. test library says that it failed, but manual examination shows that everything is O.K.). -- 8 -- From: Jakub Narebski jna...@gmail.com Subject: [PATCH/RFC] gitweb test: Test proper encoding of non US-ASCII filenames in output (WIP) This t9502 test is intended to test for proper encoding of non US-ASCII filenames (i.e. UTF-8 filenames) in generated links (which need some form of URI encoding) and in generated HTML (which needs HTML encoding / escaping). For now it tests only 'tree' view (though incidentally it also tests UTF-8 in commit subject), as this was the action where reportedly there was bug in link encoding: $t{'name'} coming from the git ls-tree -z ... command via @ntries array was not marked as UTF-8, making Perl assume that it is in internal Perl format i.e. iso-8859-1 encoding and URI-escaping it as if it was in iso-8859-1 encoding (e.g. Gütekriterien.txt in UTF-8 is Gütekriterien.txt if treated as iso-8859-1, and it then encodes to G%C3%83%C2%BCtekriterien.txt instead of correct G%C3%BCtekriterien.txt). UNFORTUNATELY test does not fail as it should, even though the issue was not fixed... OTOH it fails in setup though it is successful. Reported-by: Michael Wagner accou...@mwagner.org Signed-off-by: Jakub Narębski jna...@gmail.com --- t/t9502-gitweb-standalone-parse-output.sh | 34 + 1 files changed, 34 insertions(+), 0 deletions(-) diff --git a/t/t9502-gitweb-standalone-parse-output.sh b/t/t9502-gitweb-standalone-parse-output.sh index 86dfee2..37246a3 100755 --- a/t/t9502-gitweb-standalone-parse-output.sh +++ b/t/t9502-gitweb-standalone-parse-output.sh @@ -201,4 +201,38 @@ test_expect_success 'xss checks' ' xss a=rssp=foo.gitf=$TAG ' +link_check () { + grep -F %3C__%C2%A3%C3%A5%C3%AB%C3%AE%C3%B1%C3%B2%C3%BB%C3%BD%C2%B6 \ + gitweb.body + ! grep -F %3C__%A3%E5%EB%EE%F1%F2%FB%FD%B6 \ + gitweb.body +} + +test_expect_success 'prepare UTF-8 output tests' ' + FILENAME=__£åëîñòûý¶ +;?__ + test_commit Adding $FILENAME $FILENAME $FILENAME contents +' + +test_expect_success 'check URI-escaped UTF-8 filename in query-params link' ' + cat gitweb_config.perl -\EOF + $feature{pathinfo}{default} = [0]; + EOF + gitweb_run p=.git;a=tree + link_check +' + +test_expect_success 'check URI-escaped UTF-8 filename in path_info link' ' + cat gitweb_config.perl -\EOF + $feature{pathinfo}{default} = [1]; + EOF + gitweb_run /.git/tree + link_check +' + +test_expect_success 'check HTML-escaped UTF-8 filename in body' ' + gitweb_run p=.git;a=tree + grep -F lt;__£åëîñòûý¶ +;?amp;__gt; gitweb.body + ! grep -F __£åëîñòûý¶ +;?__ gitweb.body +' + test_done -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Fri, May 16, 2014 at 3:26 AM, Junio C Hamano gits...@pobox.com wrote: Jakub Narębski jna...@gmail.com writes: On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote: Jakub Narębski jna...@gmail.com writes: Writing test for this would not be easy, and require some HTML parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ... or low level HTML::TreeBuilder, or other low level parser). Hmph. Is it more than just looking for a specific run of %xx we would expect to see in the output of the tree view for a repository in which there is one tree with non-ASCII name? There is if we want to check (in non-fragile way) that said specific run is in 'href' *attribute* of 'a' element (link target). Correct, but is where does it appear the question we are primarily interested in, wrt this breakage and its fix? That of course depends on how we want to test gitweb output. The simplest solution, comparing with known output with perhaps fragile / variable elements masked out could be done quickly... but changes in output (even if they don't change functionality, or don't change visible output) require regenerating test cases (expected output) to test against - which might be source of errors in test suite. Another simple solution, grepping for expected strings, also easy to create, has the disadvantage of being only positive test - you cannot [easily] test that there are no *wrong* output, only that right string exists somewhere. If gitweb output has some volatile parts that do not depend on the contents of the Git test repository (e.g. showing contents of /etc/motd, date/time of when the test was run, or the full pathname leading to the trash directory), then preparing a tree whose name is äéìõû and making sure that the properly encoded version of äéìõû appears anywhere in the output may not be sufficient to validate that we got the encoding right, as that string may appear in the parts that are totally unrelated to the contents being shown and not under our control. But is that really the case? Well, I guess that any test is better than no test (though OTOH Heartbleed and goto fail bugs shows the importance of negative tests). Also we may introduce a bug and misspell the attr name and produce an anchor element with hpef attribute with the properly encoded URL in it, and your parse HTML properly approach would catch it, but is that the kind of breakage under discussion? You hinted at new tests for UTF-8 encoding in the other message in the thread earlier, and I've been assuming that we were talking about the encoding test, not a test to catch s/href/hpef/ kind of breakage. One of tests possible with HTML parser (e.g. WWW::Mechanize::CGI) is to check that all [internal] links leads to 200-OK pages, which accidentally would also be a test against this breakage. -- Jakub Narebski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Jakub Narębski jna...@gmail.com writes: Correct, but is where does it appear the question we are primarily interested in, wrt this breakage and its fix? That of course depends on how we want to test gitweb output. The simplest solution, comparing with known output with perhaps fragile / variable elements masked out could be done quickly... but changes in output (even if they don't change functionality, or don't change visible output) require regenerating test cases (expected output) to test against - which might be source of errors in test suite. I agree with your to test it fully, we need extra dependencies, but my point is that it does not have to be a full HTML-validating, picking the expected attribute via XPATH matching kind of test if what we want is only to add a new test to protect this particular fix from future breakages. For example, I think it is sufficient to grep for 'href=...%xx%xx' in the output after preparing a sample tree with one entry to show. The expected substring either exists (in which case we got it right), or it doesn't (in which case we are showing garbage). Of course that depends on the assumption that its output is not too heavily contaminated with volatile parts outside our control, as I already mentioned in the message you are responding to. But it all depends on if we wanted to add a new test ;-) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Michael Wagner: Decoding the UTF-8 encoded file name (again with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Content-disposition: inline; filename=work/Gütekriterien.txt You should fix the code path that created that URI, though, as it is not what you expected. %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde %C2%BC decodes to U+00BC Vulgar Graction One Quarter The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from looking at which two characters the sequence above yielded, C3 BC, which in a URI is represented as %C3%BC. Your QUERY_STRING should thus be p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD which probably works as expected. What is happening is that whatever is generating the URI us UTF-8-encoding the string twice (i.e., it generates a string with the proper C3 BC in it, and then interprets it as iso-8859-1 data and runs that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence you see above). -- \\// Peter - http://www.softwolves.pp.se/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 7:08 AM, Michael Wagner accou...@mwagner.org wrote: On Thu, May 15, 2014 at 12:25:45AM +0200, Jakub Narębski wrote: On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote: Michael Wagner accou...@mwagner.org writes: Perl has an internal encoding used to store text strings. Currently, trying to view files with UTF-8 encoded names results in an error (either 404 - Cannot find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8 encoded file names into Perl's internal format resolves these errors. Could you give us an example? What is important is whether filename is passed via path_info or via query string. There is a file named Gütekriterien.txt in my repository. Trying to view this file as blob_plain produces an 404 error (displaying the file name with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Status: 404 Not Found You have URI encoding of ü wrong! ü encodes as %C3%BC, not as %C3%83%C2%BC (4 bytes?) http://www.url-encode-decode.com/ You tested with wrong input. BTW. there probably should be test for UTF-8 encoding, similar to the one for XSS in t9502-gitweb-standalone-parse-output -- Jakub Narębski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Peter Krefting pe...@softwolves.pp.se writes: What is happening is that whatever is generating the URI us UTF-8-encoding the string twice (i.e., it generates a string with the proper C3 BC in it, and then interprets it as iso-8859-1 data and runs that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence you see above). Thanks for a quick response. If the input was unnecessarily encoded one extra time, it is no wonder it needed one unnecessary extra decoding. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 10:04:24AM +0100, Peter Krefting wrote: Michael Wagner: Decoding the UTF-8 encoded file name (again with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Content-disposition: inline; filename=work/Gütekriterien.txt You should fix the code path that created that URI, though, as it is not what you expected. %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde %C2%BC decodes to U+00BC Vulgar Graction One Quarter The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from looking at which two characters the sequence above yielded, C3 BC, which in a URI is represented as %C3%BC. Your QUERY_STRING should thus be p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD which probably works as expected. Obviously, you are right, thanks. What is happening is that whatever is generating the URI us UTF-8-encoding the string twice (i.e., it generates a string with the proper C3 BC in it, and then interprets it as iso-8859-1 data and runs that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence you see above). The subroutine git tree generates the tree view. It stores the output of git ls-tree -z ... in an array named @entries. Printing the content of this array yields the following result: 00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt This leads to the doubled encoding. Declaring the encoding in the call to open yields the following result: 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt --- diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..f1414e1 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -7138,7 +7138,7 @@ sub git_tree { my @entries = (); { local $/ = \0; - open my $fd, -|, git_cmd(), ls-tree, '-z', + open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z', ($show_sizes ? '-l' : ()), @extra_options, $hash or die_error(500, Open git-ls-tree failed); @entries = map { chomp; $_ } $fd; -- \\// Peter - http://www.softwolves.pp.se/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 8:48 PM, Michael Wagner accou...@mwagner.org wrote: On Thu, May 15, 2014 at 10:04:24AM +0100, Peter Krefting wrote: Michael Wagner: Decoding the UTF-8 encoded file name (again with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Content-disposition: inline; filename=work/Gütekriterien.txt You should fix the code path that created that URI, though, as it is not what you expected. %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde %C2%BC decodes to U+00BC Vulgar Graction One Quarter The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from looking at which two characters the sequence above yielded, C3 BC, which in a URI is represented as %C3%BC. Your QUERY_STRING should thus be p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD which probably works as expected. What is happening is that whatever is generating the URI us UTF-8-encoding the string twice (i.e., it generates a string with the proper C3 BC in it, and then interprets it as iso-8859-1 data and runs that through a UTF-8 encoder again, yielding the C3 83 C2 BC sequence you see above). The subroutine git tree generates the tree view. It stores the output of git ls-tree -z ... in an array named @entries. Printing the content of this array yields the following result: 00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt This leads to the doubled encoding. Declaring the encoding in the call to open yields the following result: 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt Good catch. Writing test for this would not be easy, and require some HTML parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ... or low level HTML::TreeBuilder, or other low level parser). --- diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..f1414e1 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -7138,7 +7138,7 @@ sub git_tree { my @entries = (); { local $/ = \0; - open my $fd, -|, git_cmd(), ls-tree, '-z', + open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z', ($show_sizes ? '-l' : ()), @extra_options, $hash or die_error(500, Open git-ls-tree failed); Or put binmode $fd, ':utf8'; like in the rest of the code. @entries = map { chomp; $_ } $fd; Even better solution would be to use use open IN = ':encoding(utf-8)'; at the beginning of gitweb.perl, once and for all. Unfortunately the output equivalent requires creating Perl module for gitweb, to be able to use use open OUT = ':encoding(utf-8-with-fallback)'; -- Jakub Narebski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 9:28 PM, Jakub Narębski jna...@gmail.com wrote: On Thu, May 15, 2014 at 8:48 PM, Michael Wagner accou...@mwagner.org wrote: [...] The subroutine git tree generates the tree view. It stores the output of git ls-tree -z ... in an array named @entries. Printing the content of this array yields the following result: 00644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt This leads to the doubled encoding. Declaring the encoding in the call to open yields the following result: 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a 520 Gütekriterien.txt diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..f1414e1 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -7138,7 +7138,7 @@ sub git_tree { my @entries = (); { local $/ = \0; - open my $fd, -|, git_cmd(), ls-tree, '-z', + open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z', ($show_sizes ? '-l' : ()), @extra_options, $hash or die_error(500, Open git-ls-tree failed); Or put binmode $fd, ':utf8'; like in the rest of the code. @entries = map { chomp; $_ } $fd; Though to be exact there isn't any mechanism that ensures that filenames in tree objects use utf-8 encoding, so perhaps a safer solution would be to use to_utf8($file_name) (which respects $fallback_encoding) in appropriate places. -- Jakub Narębski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Jakub Narębski jna...@gmail.com writes: Writing test for this would not be easy, and require some HTML parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ... or low level HTML::TreeBuilder, or other low level parser). Hmph. Is it more than just looking for a specific run of %xx we would expect to see in the output of the tree view for a repository in which there is one tree with non-ASCII name? diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..f1414e1 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -7138,7 +7138,7 @@ sub git_tree { my @entries = (); { local $/ = \0; - open my $fd, -|, git_cmd(), ls-tree, '-z', + open my $fd, -|encoding(UTF-8), git_cmd(), ls-tree, '-z', ($show_sizes ? '-l' : ()), @extra_options, $hash or die_error(500, Open git-ls-tree failed); Or put binmode $fd, ':utf8'; like in the rest of the code. I expect a patch to do so and can forget about this thread myself, then, OK? Thanks all for digging this to the root. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote: Jakub Narębski jna...@gmail.com writes: Writing test for this would not be easy, and require some HTML parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ... or low level HTML::TreeBuilder, or other low level parser). Hmph. Is it more than just looking for a specific run of %xx we would expect to see in the output of the tree view for a repository in which there is one tree with non-ASCII name? There is if we want to check (in non-fragile way) that said specific run is in 'href' *attribute* of 'a' element (link target). -- Jakub Narebski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Jakub Narębski jna...@gmail.com writes: On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano gits...@pobox.com wrote: Jakub Narębski jna...@gmail.com writes: Writing test for this would not be easy, and require some HTML parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ... or low level HTML::TreeBuilder, or other low level parser). Hmph. Is it more than just looking for a specific run of %xx we would expect to see in the output of the tree view for a repository in which there is one tree with non-ASCII name? There is if we want to check (in non-fragile way) that said specific run is in 'href' *attribute* of 'a' element (link target). Correct, but is where does it appear the question we are primarily interested in, wrt this breakage and its fix? If gitweb output has some volatile parts that do not depend on the contents of the Git test repository (e.g. showing contents of /etc/motd, date/time of when the test was run, or the full pathname leading to the trash directory), then preparing a tree whose name is äéìõû and making sure that the properly encoded version of äéìõû appears anywhere in the output may not be sufficient to validate that we got the encoding right, as that string may appear in the parts that are totally unrelated to the contents being shown and not under our control. But is that really the case? Also we may introduce a bug and misspell the attr name and produce an anchor element with hpef attribute with the properly encoded URL in it, and your parse HTML properly approach would catch it, but is that the kind of breakage under discussion? You hinted at new tests for UTF-8 encoding in the other message in the thread earlier, and I've been assuming that we were talking about the encoding test, not a test to catch s/href/hpef/ kind of breakage. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Perl has an internal encoding used to store text strings. Currently, trying to view files with UTF-8 encoded names results in an error (either 404 - Cannot find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8 encoded file names into Perl's internal format resolves these errors. Signed-off-by: Michael Wagner accou...@mwagner.org --- gitweb/gitweb.perl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..6046977 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params { } } - our $file_name = $input_params{'file_name'}; + our $file_name = decode(utf-8, $input_params{'file_name'}); if (defined $file_name) { if (!is_valid_pathname($file_name)) { die_error(400, Invalid file parameter); -- 1.9.0 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
Michael Wagner accou...@mwagner.org writes: Perl has an internal encoding used to store text strings. Currently, trying to view files with UTF-8 encoded names results in an error (either 404 - Cannot find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8 encoded file names into Perl's internal format resolves these errors. Signed-off-by: Michael Wagner accou...@mwagner.org --- Cc'ing Jakub, who have been the area maintainer, for comments. One thing I wonder is that, if there are some additional calls to encode() necessary before we embed $file_name (which are now decoded to the internal string form, not a byte-sequence that happens to be in utf-8) in the generated pages, if we were to do this change. gitweb/gitweb.perl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..6046977 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params { } } - our $file_name = $input_params{'file_name'}; + our $file_name = decode(utf-8, $input_params{'file_name'}); if (defined $file_name) { if (!is_valid_pathname($file_name)) { die_error(400, Invalid file parameter); -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote: Michael Wagner accou...@mwagner.org writes: Perl has an internal encoding used to store text strings. Currently, trying to view files with UTF-8 encoded names results in an error (either 404 - Cannot find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8 encoded file names into Perl's internal format resolves these errors. Could you give us an example? What is important is whether filename is passed via path_info or via query string. Because in evaluate_uri() there is our $path_info = decode_utf8($ENV{PATH_INFO}); and in evaluate_query_params() there is $input_params{$name} = decode_utf8($cgi-param($symbol)); Signed-off-by: Michael Wagner accou...@mwagner.org --- Cc'ing Jakub, who have been the area maintainer, for comments. One thing I wonder is that, if there are some additional calls to encode() necessary before we embed $file_name (which are now decoded to the internal string form, not a byte-sequence that happens to be in utf-8) in the generated pages, if we were to do this change. There should be no problem with output encoding. esc_path(), which should be used for filenames, includes to_utf8, which in turn uses decode($fallback_encoding, $str, Encode::FB_DEFAULT); gitweb/gitweb.perl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..6046977 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params { } } - our $file_name = $input_params{'file_name'}; + our $file_name = decode(utf-8, $input_params{'file_name'}); if (defined $file_name) { if (!is_valid_pathname($file_name)) { die_error(400, Invalid file parameter); Hmm... all %input_params should have been properly decoded already, how it was missed? Also, branchname (hash_base etc.), search query, filename in file_parent, project name can be UTF-8 too, so it is at best partial fix. -- Jakub Narębski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names
On Thu, May 15, 2014 at 12:25:45AM +0200, Jakub Narębski wrote: On Wed, May 14, 2014 at 11:57 PM, Junio C Hamano gits...@pobox.com wrote: Michael Wagner accou...@mwagner.org writes: Perl has an internal encoding used to store text strings. Currently, trying to view files with UTF-8 encoded names results in an error (either 404 - Cannot find file [blob_plain] or XML Parsing Error [blob]). Converting these UTF-8 encoded file names into Perl's internal format resolves these errors. Could you give us an example? What is important is whether filename is passed via path_info or via query string. There is a file named Gütekriterien.txt in my repository. Trying to view this file as blob_plain produces an 404 error (displaying the file name with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Status: 404 Not Found Decoding the UTF-8 encoded file name (again with an additional print statement): $ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi work/Gütekriterien.txt Content-disposition: inline; filename=work/Gütekriterien.txt Because in evaluate_uri() there is our $path_info = decode_utf8($ENV{PATH_INFO}); and in evaluate_query_params() there is $input_params{$name} = decode_utf8($cgi-param($symbol)); Signed-off-by: Michael Wagner accou...@mwagner.org --- Cc'ing Jakub, who have been the area maintainer, for comments. One thing I wonder is that, if there are some additional calls to encode() necessary before we embed $file_name (which are now decoded to the internal string form, not a byte-sequence that happens to be in utf-8) in the generated pages, if we were to do this change. The generated pages show the correct file names. There should be no problem with output encoding. esc_path(), which should be used for filenames, includes to_utf8, which in turn uses decode($fallback_encoding, $str, Encode::FB_DEFAULT); gitweb/gitweb.perl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index a9f57d6..6046977 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -1056,7 +1056,7 @@ sub evaluate_and_validate_params { } } - our $file_name = $input_params{'file_name'}; + our $file_name = decode(utf-8, $input_params{'file_name'}); if (defined $file_name) { if (!is_valid_pathname($file_name)) { die_error(400, Invalid file parameter); Hmm... all %input_params should have been properly decoded already, how it was missed? Also, branchname (hash_base etc.), search query, filename in file_parent, project name can be UTF-8 too, so it is at best partial fix. -- Jakub Narębski -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html