It is fixable. I’ve dealt with similar transformation issues in the past (with a rather different cause). You can’t use iconv to convert what the web server outputs, because a few of the bytes that show up aren’t valid cp1252 code points. We’ll see if the bit of Perl I’ve attached makes it through; it works on the page content I pasted in, and it doesn’t even mangle the post_title, which is already proper UTF-8. It doesn’t include any of the DBD::mysql bits that you would use to SELECT and UPDATE post_content in the wp_posts table (remember to take a full backup first).
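For anyone following along, here is a minimal sketch of a single layer of this kind of mangling and its reversal, using only the Encode module. It is not taken from the attached script; the variable names and the hex_bytes helper are made up for illustration, and the sample character was picked because both of its UTF-8 bytes are defined in cp1252.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

# "é" (U+00E9) as correct UTF-8 bytes.
my $good = "\xC3\xA9";

# Mangle: the UTF-8 bytes are misread as cp1252, then written out as UTF-8 again.
my $mangled = encode('UTF-8', decode('cp1252', $good));

# Unmangle: read the UTF-8, then map each character back to its cp1252 byte.
my $restored = encode('cp1252', decode('UTF-8', $mangled));

print hex_bytes($good), "\n";      # C3 A9
print hex_bytes($mangled), "\n";   # C3 83 C2 A9
print hex_bytes($restored), "\n";  # C3 A9
```

The clean round trip only works when every byte involved has a cp1252 mapping. The Unicode mapping table for cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, which is presumably where a strict converter such as iconv gives up.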
> On Dec 2, 2021, at 09:11, Stewart C. Russell via talk <[email protected]> wrote:
>
> On 2021-12-01 21:53, Jamon Camisso via talk wrote:
>> Do any of the casting suggestions on that link that I sent fix it?
>
> I haven't had a chance to try them yet, but your note about the
> transformation being reversible gives me hope that it can be fixed.

Seneca
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(encode decode);
binmode STDIN;
while (my $line = <STDIN>) {
    my (@lc) = unpack "C*", $line; # let's work on bytes
    my $new = '';
    my $chr = '';
    my $grp = '';
    my $crem = 0;
    my $grem = 0;
    for my $byte (@lc) {
        if ($byte < 128) { # no high bit, same in both encodings
            if ($grp) {
                $new .= $grp;
                $grp = '';
            } elsif ($chr) {
                my $y = decode('utf-8', $chr, sub {return $_[-1]});
                my $z;
                if ($y ne '' && length($y) > 1) {
                    $z = encode('cp1252', $y, sub {return $_[-1]});
                } else {
                    $z = $chr;
                }
                $new .= $z;
            }
            $new .= chr($byte);
            $chr = '';
        } elsif (length($chr) == 0) { # first byte in a character
            if (($byte & 0xf0) == 0xf0) {
                $crem = 3;
            } elsif (($byte & 0xe0) == 0xe0) {
                $crem = 2;
            } elsif (($byte & 0xc0) == 0xc0) {
                $crem = 1;
            }
            $chr .= chr($byte);
        } elsif ($crem > 1) { # middle of a UTF-8 sequence
            $chr .= chr($byte);
            $crem--;
        } else { # end of UTF-8 sequence
            $chr .= chr($byte);
            my $y = decode('utf-8', $chr, sub {return $_[-1]});
            my $z;
            if (length($y) > 1) {
                $z = encode('cp1252', $y, sub {return $_[-1]});
            } else {
                $z = $chr;
            }
            if (length($grp) > 0) { # middle of the second layer of text encoding
                $grp .= $z;
                $grem--;
                if ($grem == 0) { # end of second layer
                    my $y = decode('utf-8', $grp, sub {return $_[-1]});
                    my $z = encode('cp1252', $y, sub {return chr($_[-1])});
                    $grp = '';
                    $grem = 0;
                    $new .= $z;
                }
            } else {
                if (ord($y) >= 0x00f0) {
                    $grem = 3;
                } elsif (ord($y) >= 0x00e0) {
                    $grem = 2;
                } elsif (ord($y) >= 0x00c0) {
                    $grem = 1;
                }
                $grp .= $z;
            }
            $chr = '';
            $crem = 0;
        }
    }
    print $new;
}
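The database side that the script leaves out could look roughly like the sketch below. Treat it as an outline under assumptions, not a tested migration: repair_posts is a made-up name, $fix stands for whatever code ref ends up wrapping the byte-fixing logic, and the handle is assumed to be a DBI connection with AutoCommit off and RaiseError on. Which bytes actually come back from MySQL depends on the connection character set, so it is worth comparing one row against what the web server shows before updating anything.

```perl
use strict;
use warnings;

# Hypothetical helper: $dbh is an already connected DBI handle,
# $fix is a code ref that takes the stored bytes and returns repaired bytes.
sub repair_posts {
    my ($dbh, $fix) = @_;
    my $select = $dbh->prepare('SELECT ID, post_content FROM wp_posts');
    my $update = $dbh->prepare('UPDATE wp_posts SET post_content = ? WHERE ID = ?');

    # Read all rows up front so the UPDATEs don't interleave with an open SELECT.
    $select->execute;
    my $rows = $select->fetchall_arrayref;

    my $changed = 0;
    for my $row (@$rows) {
        my ($id, $content) = @$row;
        next unless defined $content;
        my $fixed = $fix->($content);
        next if $fixed eq $content;    # leave untouched rows alone
        $update->execute($fixed, $id);
        $changed++;
    }
    $dbh->commit;
    return $changed;
}

# Usage sketch (DSN, user and password are placeholders, fix_text is hypothetical):
# use DBI;
# my $dbh = DBI->connect('DBI:mysql:database=wordpress;host=localhost',
#                        'user', 'password', { RaiseError => 1, AutoCommit => 0 });
# my $n = repair_posts($dbh, \&fix_text);
# print "updated $n posts\n";
```

Committing once at the end means a failure partway through (with RaiseError set) leaves the table as it was, which is a second line of defence behind the full backup.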
