Re: [PHP] PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

2009-02-18 Thread Addmissins Development
mike wrote:
> On Tue, Feb 17, 2009 at 4:26 PM, mike  wrote:
>> i tried that kind of stuff - it did not seem to work.
>>
>> i will try again... if anyone has any ideas i.e. "use iconv to convert
>> to A, then use DOM stuff, then use iconv to move it back to UTF8..."
>> etc. i am all ears.
> 
> Nope - for example this is the input text (apologies if your reader
> isn't utf-8) - simplified chinese
> 
> 足以概括英特尔为此所付出的努力。谈及移动设备,英特尔公司自诩在该领域的创新犹如其户友好性设计及能效等一样出类拔萃。同时,英特尔也一直表示要帮助构建能够
> 
> Output is this:
> 
> 一句“英特尔热衷于移åŠ&u
> 
> What is funny is I don't care about altering the actual content, only
> the content of the "href" and "src" attributes, which are all standard
> latin-based URLs, too.
> 
> Here's the simplest code to create the behavior
> 
> $q = db_query("SELECT id,old FROM testing", "redirects");
> while(list($id, $doc) = db_rows($q)) {
> $new = fix_document($doc);
> $new = db_escape($new);
> db_query("UPDATE testing SET new='$new' WHERE id=$id",
> "redirects");
> }
> db_free($q);
> 
> function fix_document($string) {
> $dom = new DomDocument('1.0', 'UTF-8');
> @$dom->loadHTML($string);
> $dom->preserveWhiteSpace = false;
> return $dom->saveHTML();
> }
> 
> (Note: it is not the db functions, if I do this:
> 
> function fix_document($string) {
> return $string;
> }
> 
> The content is unaltered.
> 
> Anyone with any ideas? Any options to feed to the DOM stuff? It's
> translating the stuff to htmlentities, which I don't want either.
> 

As I understand it, all non-ASCII characters will be converted to HTML entities.

Try this:

function fix_document($string) {
    $dom = new DomDocument('1.0', 'UTF-8');
    @$dom->loadHTML($string);
    $dom->preserveWhiteSpace = false;
    return html_entity_decode($dom->saveHTML(), ENT_QUOTES, "UTF-8");
}

header("Content-Type: text/html; charset=UTF-8");
echo fix_document('data here');
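
One caveat with decoding the whole document: html_entity_decode() also turns &amp;amp; and &amp;lt; back into bare & and <, which can break the markup, and it cannot repair bytes the parser has already misread. A possible alternative, sketched below, is to give loadHTML() an encoding hint so the input is read as UTF-8 in the first place (without a charset declaration, libxml's HTML parser assumes ISO-8859-1). The function name fix_document_utf8() and the sample input are assumptions for illustration, not code from this thread.

```php
<?php
// Sketch only: fix_document_utf8() is a hypothetical name, not from the thread.
function fix_document_utf8(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings quietly instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    // The XML-style prolog acts as an encoding hint for the HTML parser,
    // so the bytes that follow are interpreted as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize from the root element; the hint sits outside <html>,
    // so it does not appear in the result.
    return $dom->saveHTML($dom->documentElement);
}

header('Content-Type: text/html; charset=UTF-8');
echo fix_document_utf8('<p>英特尔 <a href="http://example.com/">link</a></p>');
```

Whether the serializer writes the characters raw or as numeric entities may depend on the PHP and libxml versions, but either form should display correctly in a browser.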

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

2009-02-17 Thread mike
On Tue, Feb 17, 2009 at 4:26 PM, mike  wrote:
> i tried that kind of stuff - it did not seem to work.
>
> i will try again... if anyone has any ideas i.e. "use iconv to convert
> to A, then use DOM stuff, then use iconv to move it back to UTF8..."
> etc. i am all ears.

Nope - for example this is the input text (apologies if your reader
isn't utf-8) - simplified chinese

足以概括英特尔为此所付出的努力。谈及移动设备,英特尔公司自诩在该领域的创新犹如其户友好性设计及能效等一样出类拔萃。同时,英特尔也一直表示要帮助构建能够

Output is this:

一句“英特尔热衷于移åŠ&u

What is funny is I don't care about altering the actual content, only
the content of the "href" and "src" attributes, which are all standard
latin-based URLs, too.

Here's the simplest code that reproduces the behavior:

$q = db_query("SELECT id,old FROM testing", "redirects");
while(list($id, $doc) = db_rows($q)) {
    $new = fix_document($doc);
    $new = db_escape($new);
    db_query("UPDATE testing SET new='$new' WHERE id=$id", "redirects");
}
db_free($q);

function fix_document($string) {
    $dom = new DomDocument('1.0', 'UTF-8');
    @$dom->loadHTML($string);
    $dom->preserveWhiteSpace = false;
    return $dom->saveHTML();
}

(Note: it is not the db functions. If I do this:

function fix_document($string) {
    return $string;
}

the content is unaltered.)

Anyone with any ideas? Any options to feed to the DOM stuff? It's
translating the stuff to htmlentities, which I don't want either.




Re: [PHP] PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

2009-02-17 Thread mike
I tried that kind of stuff - it did not seem to work.

I will try again... if anyone has any ideas, e.g. "use iconv to convert
to A, then use DOM stuff, then use iconv to move it back to UTF8..."
etc., I am all ears.
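
For what it's worth, the "convert, run DOM, convert back" idea could look roughly like the sketch below. It assumes the mbstring extension is available and uses numeric character references as the intermediate form rather than iconv, so the markup handed to the parser is plain ASCII and its encoding guess no longer matters. The helper name dom_round_trip() is made up for illustration.

```php
<?php
// Sketch only: dom_round_trip() is a hypothetical helper, not code from this thread.
// Assumes the mbstring extension is loaded.
function dom_round_trip(string $utf8Html): string
{
    // Conversion map: every code point from U+0080 to U+10FFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    // Step 1: replace non-ASCII characters with numeric entities,
    // so the parser only ever sees ASCII bytes.
    $ascii = mb_encode_numericentity($utf8Html, $map, 'UTF-8');

    // Step 2: parse; any DOM manipulation would happen after this call.
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: serialize and turn numeric entities above U+007F back into
    // UTF-8 characters. Entities such as &amp; and &lt; are left alone.
    return mb_decode_numericentity($dom->saveHTML(), $map, 'UTF-8');
}

echo dom_round_trip('<p>英特尔 &amp; Intel</p>');
```

Any named entities the serializer chooses to emit (for some Latin characters, for example) would stay as entities, which is still valid HTML.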


On Tue, Feb 17, 2009 at 12:46 PM, Nathan Nobbe  wrote:
> On Tue, Feb 17, 2009 at 12:40 PM, mike  wrote:
>>
>> Pardon the messy code, but I got this working like a charm. Then I
>> went to try it on some Russian content and it broke. The inbound was
>> utf-8 encoded Russian characters, output was something else
>> unintelligible.
>>
>> I found a PHP bug from years ago that sounded related but the user had
>> a workaround.
>>
>> Note that it does not appear that any of the functions break the
>> encoding - it is the ->saveHTML() that doesn't seem to work (I also
>> tried saveXML() and it did not work either?
>>
>> I am totally up for changing out using php's DOM and using another
>> library, basically I just want to traverse the DOM and pick out all
>> <a href> and <img src> and possibly any other external references in the
>> documents so I can run them through some link examination and such. I
>> figured I may have to fall back to a regexp, but PHP's DOM was so good
>> with even partial and malformed HTML, I was excited at how easy this
>> was...
>>
>>$dom = new domDocument;
>>@$dom->loadHTML($string);
>>$dom->preserveWhiteSpace = false;
>>$links = $dom->getElementsByTagName('a');
>>foreach($links as $tag) {
>>$before = $tag->getAttribute('href');
>>$after = strip_chars($before);
>>$after = map_url($after);
>>$after = fix_link($after);
>>if($after != false) {
>>echo "\tBEFORE: $before\n";
>>echo "\tAFTER : $after\n\n";
>>$tag->removeAttribute('href');
>>$tag->setAttribute('href', $after);
>>}
>>}
>>return $dom->saveHTML();
>> }
>>
>> I tried things like this:
>>
>> new DomDocument('1.0', 'UTF-8');
>>
>> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
>> something (I tried so many variations I cannot remember anymore)
>>
>> Anyone have any ideas?
>>
>> As long as it can read in the string (which is and should always be
>> UTF-8) and spit out UTF-8, I can make sure any of my functions are
>> UTF-8 safe that handle the data...
>
> from the manual on DOM,
>
> Note: DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode()
> to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
>
> -nathan
>
>




Re: [PHP] PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

2009-02-17 Thread Nathan Nobbe
On Tue, Feb 17, 2009 at 12:40 PM, mike  wrote:

> Pardon the messy code, but I got this working like a charm. Then I
> went to try it on some Russian content and it broke. The inbound was
> utf-8 encoded Russian characters, output was something else
> unintelligible.
>
> I found a PHP bug from years ago that sounded related but the user had
> a workaround.
>
> Note that it does not appear that any of the functions break the
> encoding - it is the ->saveHTML() that doesn't seem to work (I also
> tried saveXML() and it did not work either?
>
> I am totally up for changing out using php's DOM and using another
> library, basically I just want to traverse the DOM and pick out all
> <a href> and <img src> and possibly any other external references in the
> documents so I can run them through some link examination and such. I
> figured I may have to fall back to a regexp, but PHP's DOM was so good
> with even partial and malformed HTML, I was excited at how easy this
> was...
>
>$dom = new domDocument;
>@$dom->loadHTML($string);
>$dom->preserveWhiteSpace = false;
>$links = $dom->getElementsByTagName('a');
>foreach($links as $tag) {
>$before = $tag->getAttribute('href');
>$after = strip_chars($before);
>$after = map_url($after);
>$after = fix_link($after);
>if($after != false) {
>echo "\tBEFORE: $before\n";
>echo "\tAFTER : $after\n\n";
>$tag->removeAttribute('href');
>$tag->setAttribute('href', $after);
>}
>}
>return $dom->saveHTML();
> }
>
> I tried things like this:
>
> new DomDocument('1.0', 'UTF-8');
>
> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
> something (I tried so many variations I cannot remember anymore)
>
> Anyone have any ideas?
>
> As long as it can read in the string (which is and should always be
> UTF-8) and spit out UTF-8, I can make sure any of my functions are
> UTF-8 safe that handle the data...


from the manual on DOM,

*Note*: DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode()
to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
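
A minimal sketch of what that note describes, assuming the source text really is in a known non-UTF-8 encoding. The helper name to_utf8() is made up for illustration. Input that is already UTF-8 should not be converted again, and utf8_encode()/utf8_decode() are deprecated as of PHP 8.2, so iconv() is used here.

```php
<?php
// Sketch only: to_utf8() is a hypothetical helper, not part of the DOM extension.
function to_utf8(string $text, string $sourceEncoding): string
{
    // iconv() returns false when the input is not valid in the source encoding.
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding to UTF-8");
    }
    return $converted;
}

// "café" stored as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";
$utf8 = to_utf8($latin1, 'ISO-8859-1');

// The UTF-8 string can now be handed to the DOM functions.
echo bin2hex($utf8), "\n";
```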

-nathan


[PHP] PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

2009-02-17 Thread mike
Pardon the messy code, but I got this working like a charm. Then I
went to try it on some Russian content and it broke. The inbound was
utf-8 encoded Russian characters, output was something else
unintelligible.

I found a PHP bug from years ago that sounded related but the user had
a workaround.

Note that it does not appear that any of the functions break the
encoding - it is the ->saveHTML() that doesn't seem to work (I also
tried saveXML() and it did not work either).

I am totally up for changing out using php's DOM and using another
library, basically I just want to traverse the DOM and pick out all
<a href> and <img src> and possibly any other external references in the
documents so I can run them through some link examination and such. I
figured I may have to fall back to a regexp, but PHP's DOM was so good
with even partial and malformed HTML, I was excited at how easy this
was...

// The declaration line was missing from the post; fix_links is a placeholder name.
function fix_links($string) {
    $dom = new domDocument;
    @$dom->loadHTML($string);
    $dom->preserveWhiteSpace = false;
    $links = $dom->getElementsByTagName('a');
    foreach($links as $tag) {
        $before = $tag->getAttribute('href');
        // strip_chars(), map_url() and fix_link() are helpers not shown in this post
        $after = strip_chars($before);
        $after = map_url($after);
        $after = fix_link($after);
        if($after != false) {
            echo "\tBEFORE: $before\n";
            echo "\tAFTER : $after\n\n";
            $tag->removeAttribute('href');
            $tag->setAttribute('href', $after);
        }
    }
    return $dom->saveHTML();
}

I tried things like this:

new DomDocument('1.0', 'UTF-8');

as well as encoding options for $dom like $dom->encoding = 'utf-8' or
something (I tried so many variations I cannot remember anymore)

Anyone have any ideas?

As long as it can read in the string (which is and should always be
UTF-8) and spit out UTF-8, I can make sure any of my functions are
UTF-8 safe that handle the data...

Thanks
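
For the "pick out all external references" part, a rough sketch of one way to gather every href and src attribute in a single pass with DOMXPath. The function name collect_external_refs() and the shape of the returned array are assumptions for illustration. The encoding prolog is only a hint so that the parser reads the input as UTF-8.

```php
<?php
// Sketch only: collect_external_refs() is a hypothetical helper, not from this thread.
function collect_external_refs(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Encoding hint so the HTML parser treats the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $refs = [];

    // Select every href and src attribute node, whatever element carries it.
    foreach ($xpath->query('//@href | //@src') as $attr) {
        $refs[] = [
            'tag'       => $attr->ownerElement->nodeName,
            'attribute' => $attr->nodeName,
            'url'       => $attr->value,
        ];
    }
    return $refs;
}

print_r(collect_external_refs('<a href="/one">x</a><img src="/two.png">'));
```

Because each result is a live attribute node, the same loop could also rewrite the value in place before the document is serialized.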
