Bug#820119: restores original characters instead of taking care of every time numeric references coming up

2017-01-21 Thread victory
On Sat, 14 Jan 2017 13:55:16 +0100
Laura Arjona Reina wrote:

> Victory is right

no, i misinterpreted validate as tidy :p

what the patch does:
  if both of
   (charset is "utf-8")
  and
   ([the last 56 chars of an error message] is
    "is not a character number in the document character set\n")
  are satisfied,
  then the current loop iteration is skipped
   (push(@errors, $_) will not be processed in this case)
  and processing continues with the next error
patch for git:///debwww/cron: scripts/validate below:

@@ -392,10 +392,13 @@ foreach $file (@files) {
 if ($#error < 5) {
 
 next;
 
 } elsif ($error[4] eq 'E' || $error[4] eq 'X') {
+   next if($charset eq "utf-8" &&
+   substr($error[5],-56) eq
+   "is not a character number in the document character set\n");
 
 push(@errors, $_);
 
 # If the DOCTYPE is bad, bail out
 last if ($error[5] eq " unrecognized {{DOCTYPE}}; unable to check document\n");
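
for illustration only, the condition added by the patch can be tried standalone. this is a rough sketch, not code from scripts/validate: the helper name should_skip_error and the two sample messages are made up, and only the charset test and the 56-character suffix comparison come from the patch.

```perl
use strict;
use warnings;

# Suffix compared by the patch: 55 visible characters plus the newline = 56.
my $SUFFIX = "is not a character number in the document character set\n";

# Hypothetical helper (not part of scripts/validate) wrapping the
# condition that the patch puts in front of push(@errors, $_).
sub should_skip_error {
    my ($charset, $message) = @_;
    return $charset eq "utf-8"
        && substr($message, -56) eq $SUFFIX;
}

my @errors;
for my $message (
    qq{ "128513" is not a character number in the document character set\n},
    qq{ unrecognized {{DOCTYPE}}; unable to check document\n},
) {
    # Mirrors the patched loop: skip this message, go on with the next one.
    next if should_skip_error("utf-8", $message);
    push @errors, $message;
}

print scalar(@errors), " error(s) kept\n";
```

with a charset other than "utf-8" the helper returns false, so the message would still be reported as before.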


-- 
victory
no need to CC me :-)



Bug#820119: restores original characters instead of taking care of every time numeric references coming up

2017-01-14 Thread Laura Arjona Reina
Hi again.

I think my conclusion was silly: I was only considering encoding the whole
string.
But we can encode the &000130 and leave the emoji as a numeric entity.
Victory is right; I'll try to think more clearly later and merge the patch today.
(Now afk, sorry).



Laura Arjona Reina
https://wiki.debian.org/LauraArjona



Bug#820119: restores original characters instead of taking care of every time numeric references coming up

2017-01-14 Thread Laura Arjona Reina
Hi

On 13/01/17 at 11:34, victory wrote:
> 
> first, it is stupid to complain about names which are valid.
> it is also stupid to handle each occurrence as it comes up.
> as pages are all utf-8 now, there is no need to keep such references;
> this patch restores the original characters instead of numeric references
> 
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===
> --- english/international/l10n/scripts/gen-files.pl   (revision 232)
> +++ english/international/l10n/scripts/gen-files.pl   (working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>  
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>  
> @@ -117,8 +118,7 @@
>  $name =~ s/\s*<.*//;
>  $name =~ s/&(?!#)/&amp;/g;
>  $name =~ s/=\?.*?\?=//g;
> -# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -$name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +$name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>  $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>  $name = '' if $name =~ m/\@/;
>  return $name;
> 
> 

Thanks for all the work in these and other validation/tidy issues in
the website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using Perl to encode to UTF-8 as you propose makes tidy happy, but
there is another script run on the files, "validate", which then
produces this error:

Line 10, character 12:  non SGML character number 130

If we use numeric entities, tidy complains about &#130; unless we
suppress the character as we do now.

For the emoji in a translator's name, "validate" complains in either case:

* Using numeric entities, we get the message currently reported:

"128513" is not a character number in the document character set

* Encoding to UTF-8, as in the proposed patch:

Line 10, character 29:  non SGML character number 65533
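
The numbers in these messages are decimal code points. As a small aid for reading them, here is a sketch that prints them in U+XXXX notation (the helper name to_u_notation is made up for this example). 130 is U+0082 (BREAK PERMITTED HERE), 128513 is U+1F601 (the emoji), and 65533 is U+FFFD, the REPLACEMENT CHARACTER, which usually appears when a tool could not decode the bytes it was given.

```perl
use strict;
use warnings;

# Decimal character numbers quoted in the validator messages above.
my @numbers = (130, 128513, 65533);

# Hypothetical helper: format a decimal code point as U+XXXX,
# padded to at least four hexadecimal digits.
sub to_u_notation {
    my ($n) = @_;
    return sprintf("U+%04X", $n);
}

print "$_ => ", to_u_notation($_), "\n" for @numbers;
```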

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and passed the online validator in https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get
better "tidy" and "validate" tools from there.

For now, I've fixed the comment in the gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl 20 May 2016 21:15:45 -  1.97
+++ english/international/l10n/scripts/gen-files.pl 14 Jan 2017 12:41:06 -
@@ -117,7 +117,10 @@
 $name =~ s/\s*<.*//;
 $name =~ s/&(?!#)/&amp;/g;
 $name =~ s/=\?.*?\?=//g;
-# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+# BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+# but the "tidy" tool that we use complains about them,
+# so we just remove those characters for now, until better solution
+# see Bug #820119
 $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
 $name = 'DDTP' if $name eq 'Debian Description Translation Project';
 $name = '' if $name =~ m/\@/;

Best regards
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona



Bug#820119: restores original characters instead of taking care of every time numeric references coming up

2017-01-13 Thread victory

first, it is stupid to complain about names which are valid.
it is also stupid to handle each occurrence as it comes up.
as pages are all utf-8 now, there is no need to keep such references;
this patch restores the original characters instead of numeric references

patch below:
Index: english/international/l10n/scripts/gen-files.pl
===
--- english/international/l10n/scripts/gen-files.pl (revision 232)
+++ english/international/l10n/scripts/gen-files.pl (working copy)
@@ -3,6 +3,7 @@
 use strict;
 use File::Path;
 use Getopt::Long;
+use Encode qw(encode);
 
 use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
 
@@ -117,8 +118,7 @@
 $name =~ s/\s*<.*//;
 $name =~ s/&(?!#)/&amp;/g;
 $name =~ s/=\?.*?\?=//g;
-# BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
-$name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
+$name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
 $name = 'DDTP' if $name eq 'Debian Description Translation Project';
 $name = '' if $name =~ m/\@/;
 return $name;
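
for illustration, the new substitution can be tried on its own. this is a rough sketch: the sample strings are made up, and only the s///ge expression is taken from the patch. it also shows that the pattern matches decimal references only, so a hexadecimal reference such as &#x82; is left untouched.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample name: "&#233;" is the decimal reference for U+00E9,
# "&#128513;" is the emoji mentioned in this bug report.
my $name = "Jos&#233; &#128513;";

# Same substitution as in the patch: each decimal reference is replaced
# by the UTF-8 encoded bytes of that code point.
(my $decoded = $name) =~ s/&#(\d+);/encode("UTF-8", chr($1))/ge;

# A hexadecimal reference does not match \d+ and is left as it is.
(my $hex = "a&#x82;b") =~ s/&#(\d+);/encode("UTF-8", chr($1))/ge;

printf "%d bytes after decoding\n", length($decoded);
print "hex reference untouched: $hex\n";
```

the result is a byte string (2 bytes for U+00E9, 4 bytes for the emoji), so it is meant to be written out as-is to a utf-8 page.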


-- 
victory
no need to CC me :-)