Bug#820119: restores original characters instead of taking care of every time numeric references coming up
On Sat, 14 Jan 2017 13:55:16 +0100
Laura Arjona Reina wrote:

> Victory is right

no, i misinterpreted validate as tidy :p

what the patch does:
if both of
 (charset is "utf-8") and
 ([the last 56 chars of an error] is
  "is not a character number in the document character set\n")
are satisfied,
then the current iteration is skipped
(push(@errors, $_) will not be processed in this case)
and the loop continues with the next error

patch for git:///debwww/cron: scripts/validate below:

@@ -392,10 +392,13 @@ foreach $file (@files) {
         if ($#error < 5) {
             next;
         } elsif ($error[4] eq 'E' || $error[4] eq 'X') {
+            next if($charset eq "utf-8" &&
+                    substr($error[5],-56) eq
+                    "is not a character number in the document character set\n");
             push(@errors, $_);
             # If the DOCTYPE is bad, bail out
             last if ($error[5] eq " unrecognized {{DOCTYPE}}; unable to check document\n");

-- 
victory
no need to CC me :-)
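The filtering rule described in this message can be tried in isolation. The following is only a minimal sketch: `skip_error` is a hypothetical helper that does not exist in scripts/validate, the sample messages are made up for illustration, and the suffix length is taken from the string itself instead of the hard-coded 56 (the two agree, since the suffix including its trailing newline is 56 characters long).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of scripts/validate): returns true when an
# error message should be skipped under the rule described in the message.
my $SUFFIX = "is not a character number in the document character set\n";

sub skip_error {
    my ($charset, $message) = @_;
    return $charset eq "utf-8"
        && length($message) >= length($SUFFIX)
        && substr($message, -length($SUFFIX)) eq $SUFFIX;
}

# Made-up sample messages, for illustration only.
my @samples = (
    [ "utf-8",      qq{"128513" is not a character number in the document character set\n} ],
    [ "iso-8859-1", qq{"128513" is not a character number in the document character set\n} ],
    [ "utf-8",      qq{ unrecognized {{DOCTYPE}}; unable to check document\n} ],
);

for my $sample (@samples) {
    my ($charset, $message) = @$sample;
    my $verdict = skip_error($charset, $message) ? "skip" : "keep";
    printf "%-10s %s: %s", $charset, $verdict, $message;
}
```

Only the first sample meets both conditions. The other two would still reach push(@errors, $_), so pages in other charsets and unrelated errors keep being reported.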
Bug#820119: restores original characters instead of taking care of every time numeric references coming up
Hi again.

I think my conclusion is silly; I was only considering encoding the whole string.
But we can encode the &000130 and leave the emoji as a numeric entity.

Victory is right. I'll try to think more clearly later and merge the patch today.
(Now afk, sorry.)

On 14 January 2017 13:43:24 CET, Laura Arjona Reina wrote:
>Hi
>
>On 13/01/17 at 11:34, victory wrote:
>>
>> first, it is stupid to blame about names which are valid.
>> it is also stupid that taking care of each occurrences coming up.
>> as pages are all utf-8 now, no need to keep such references,
>> this patch restores original characters instead of numeric references
>>
>> patch below:
>> Index: english/international/l10n/scripts/gen-files.pl
>> ===
>> --- english/international/l10n/scripts/gen-files.pl (revision 232)
>> +++ english/international/l10n/scripts/gen-files.pl (working copy)
>> @@ -3,6 +3,7 @@
>>  use strict;
>>  use File::Path;
>>  use Getopt::Long;
>> +use Encode qw(encode);
>>
>>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>>
>> @@ -117,8 +118,7 @@
>>  $name =~ s/\s*<.*//;
>>  $name =~ s/&(?!#)/&amp;/g;
>>  $name =~ s/=\?.*?\?=//g;
>> -# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
>> -$name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
>> +$name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>>  $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>>  $name = '' if $name =~ m/\@/;
>>  return $name;
>>
>>
>
>Thanks for all the work on these and other validation/tidy issues in
>the website.
>
>I've done some tests and I'm afraid I cannot merge the patch yet.
>
>Using perl to encode to UTF8 as you propose makes tidy happy, but
>there is another script run on the files, "validate", that produces
>these errors:
>
>Line 10, character 12: non SGML character number 130
>
>If we use numeric entities, tidy complains about &#130; unless we
>suppress the character as we do now.
>
>For the emoji in the translator name, "validate" complains in any case:
>
>* Using numeric entities, with the message currently received:
>
>"128513" is not a character number in the document character set
>
>* Encoding to UTF8 as in the proposed patch:
>
>Line 10, character 29: non SGML character number 65533
>
>I've produced two small files:
>
>https://cosas.larjona.net/validate.utf8.html
>https://cosas.larjona.net/validate.ncr.html
>
>and passed them through the online validator at https://validator.w3.org/
>
>I'll try to see if we can use https://validator.w3.org/source/ and get
>better "tidy" and "validate" tools from there.
>
>For now, I've fixed the comment in gen-files.pl:
>
>--- english/international/l10n/scripts/gen-files.pl 20 May 2016 21:15:45 - 1.97
>+++ english/international/l10n/scripts/gen-files.pl 14 Jan 2017 12:41:06 -
>@@ -117,7 +117,10 @@
> $name =~ s/\s*<.*//;
> $name =~ s/&(?!#)/&amp;/g;
> $name =~ s/=\?.*?\?=//g;
>-# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
>+# BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
>+# but the "tidy" tool that we use complains about them,
>+# so we just remove those characters for now, until better solution
>+# see Bug #820119
> $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> $name = 'DDTP' if $name eq 'Debian Description Translation Project';
> $name = '' if $name =~ m/\@/;
>
>Best regards

Laura Arjona Reina
https://wiki.debian.org/LauraArjona
Bug#820119: restores original characters instead of taking care of every time numeric references coming up
Hi

On 13/01/17 at 11:34, victory wrote:
>
> first, it is stupid to blame about names which are valid.
> it is also stupid that taking care of each occurrences coming up.
> as pages are all utf-8 now, no need to keep such references,
> this patch restores original characters instead of numeric references
>
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===
> --- english/international/l10n/scripts/gen-files.pl (revision 232)
> +++ english/international/l10n/scripts/gen-files.pl (working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>
> @@ -117,8 +118,7 @@
>  $name =~ s/\s*<.*//;
>  $name =~ s/&(?!#)/&amp;/g;
>  $name =~ s/=\?.*?\?=//g;
> -# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -$name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +$name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>  $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>  $name = '' if $name =~ m/\@/;
>  return $name;
>
>

Thanks for all the work on these and other validation/tidy issues in the website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using perl to encode to UTF8 as you propose makes tidy happy, but there is
another script run on the files, "validate", that produces these errors:

Line 10, character 12: non SGML character number 130

If we use numeric entities, tidy complains about &#130; unless we suppress the
character as we do now.

For the emoji in the translator name, "validate" complains in any case:

* Using numeric entities, with the message currently received:

"128513" is not a character number in the document character set

* Encoding to UTF8 as in the proposed patch:

Line 10, character 29: non SGML character number 65533

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and passed them through the online validator at https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get better
"tidy" and "validate" tools from there.

For now, I've fixed the comment in gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl 20 May 2016 21:15:45 - 1.97
+++ english/international/l10n/scripts/gen-files.pl 14 Jan 2017 12:41:06 -
@@ -117,7 +117,10 @@
 $name =~ s/\s*<.*//;
 $name =~ s/&(?!#)/&amp;/g;
 $name =~ s/=\?.*?\?=//g;
-# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+# BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+# but the "tidy" tool that we use complains about them,
+# so we just remove those characters for now, until better solution
+# see Bug #820119
 $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
 $name = 'DDTP' if $name eq 'Debian Description Translation Project';
 $name = '' if $name =~ m/\@/;

Best regards
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona
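For readers following the numbers in the error messages above, here is a small sketch that prints the three code points involved. It does not reproduce what "validate" does internally. `utf8_len` is a hypothetical helper introduced only for this illustration. The character names come from Perl's own Unicode tables, so the exact name printed for the control character may depend on the Perl version.

```perl
use strict;
use warnings;
use Encode qw(encode);
use charnames ();

# Hypothetical helper: number of bytes a code point occupies in UTF-8.
sub utf8_len {
    my ($codepoint) = @_;
    return length(encode("UTF-8", chr($codepoint)));
}

# 130 and 128513 are the references discussed in the thread;
# 65533 is the number "validate" reported for the UTF-8 encoded emoji.
for my $n (130, 128513, 65533) {
    my $name = charnames::viacode($n) // "(no name available)";
    printf "%6d = U+%04X  %-32s %d UTF-8 byte(s)\n", $n, $n, $name, utf8_len($n);
}
```

65533 is U+FFFD, the Unicode replacement character. One possible reading, which the thread does not confirm, is that the parser behind "validate" did not decode the emoji's UTF-8 bytes and substituted U+FFFD before checking the character against the document character set.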
Bug#820119: restores original characters instead of taking care of every time numeric references coming up
first, it is stupid to blame about names which are valid.
it is also stupid that taking care of each occurrences coming up.
as pages are all utf-8 now, no need to keep such references,
this patch restores original characters instead of numeric references

patch below:
Index: english/international/l10n/scripts/gen-files.pl
===
--- english/international/l10n/scripts/gen-files.pl (revision 232)
+++ english/international/l10n/scripts/gen-files.pl (working copy)
@@ -3,6 +3,7 @@
 use strict;
 use File::Path;
 use Getopt::Long;
+use Encode qw(encode);

 use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";

@@ -117,8 +118,7 @@
 $name =~ s/\s*<.*//;
 $name =~ s/&(?!#)/&amp;/g;
 $name =~ s/=\?.*?\?=//g;
-# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
-$name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
+$name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
 $name = 'DDTP' if $name eq 'Debian Description Translation Project';
 $name = '' if $name =~ m/\@/;
 return $name;

-- 
victory
no need to CC me :-)
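The substitution proposed in this patch can be exercised on its own. The sketch below assumes the pattern is meant to match decimal numeric character references of the form `&#NNN;`. `decode_ncr` is a hypothetical wrapper and the sample name is made up, so neither comes from gen-files.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper around the substitution from the patch.
# Assumption: the pattern targets decimal references such as &#233;.
sub decode_ncr {
    my ($name) = @_;
    # /e evaluates the replacement as Perl code; /g handles every reference.
    $name =~ s/&#(\d+);/encode("UTF-8", chr($1))/ge;
    return $name;
}

# Made-up translator name containing two decimal references.
my $decoded = decode_ncr("Jos&#233; &#128513;");
printf "%s (%d bytes)\n", $decoded, length($decoded);
```

The result is a byte string holding UTF-8 octets, so it can be written to a file without an output encoding layer. Hexadecimal references such as `&#x82;` do not match `\d+` and would pass through unchanged.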