Re: [PHP] Cleaning pasted Word text
Errr...I'm not sure how this is applicable to my situation. I'm concerned, above all, with converting

curly double quotes
curly single quotes
em and en dashes
inverted exclamation points
inverted question marks
ellipses
non-breaking spaces
registered trademark symbols
bullets
left and right guillemets

Many of these characters do not exist in the ISO Latin 1 character set, but can nonetheless be inserted by a browser which defaults to MacRoman or Windows Latin 1 (1252) character sets.

The big questions, I suppose, are:

1) What character/ASCII code does PHP interpret a left curly quote as, when pasted into a form?
2) Does it interpret it the same way pasted in on a Mac as on a Windows box?
3) What influence does the page charset meta tag have on such a submission?
4) What influence does the form ACCEPT-CHARSET parameter have?
5) What influence does the browser encoding setting have on such submissions?

and finally,

6) If all of these factors can influence the final interpretation of a character, what's the best way to approach handling all possible combinations?

All of this would be so much easier if I could just get my hands on a Windows box for testing. Guess I'll have to do that.

I'm just a bit surprised that no one seems to have tackled this problem already...it can't be that uncommon. Then again, I've seen any number of CMS-driven web sites that obviously haven't done this sort of conversion, including large news corporation sites. And given the paucity of Mac-friendly programming on the web, it's not too surprising that so few sites attempt to accommodate Mac users. (Testing for Mac compatibility tends to be on par with testing for Netscape 3.0 compatibility...not usually a very high priority, despite IE 5 for the Mac supposedly being more standards-compliant than the Windows version.)

spud.
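On question 1: PHP treats submitted form data as a raw byte string, so what it "sees" depends on the encoding the browser used when submitting. A minimal sketch of how one might inspect what actually arrived follows. hex_bytes() is a hypothetical helper, not something from this thread, and the three sample strings are the encodings of a left curly double quote in Windows-1252, MacRoman, and UTF-8.

```php
<?php
// Hypothetical helper: render the raw bytes of a string as space-separated hex.
function hex_bytes(string $text): string
{
    return implode(' ', str_split(bin2hex($text), 2));
}

// The same left curly double quote as three different clients might submit it.
$samples = [
    'Windows-1252' => "\x93",
    'MacRoman'     => "\xD2",
    'UTF-8'        => "\xE2\x80\x9C",
];

foreach ($samples as $charset => $bytes) {
    echo str_pad($charset, 14), hex_bytes($bytes), "\n";
}
```

In a form handler the argument would be the submitted field instead, for example hex_bytes($_POST['body']), where 'body' is an assumed field name. Dumping the bytes this way from a Mac and from a Windows browser is one way to answer question 2 empirically.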
On Tuesday, October 29, 2002, at 08:49 PM, Jimmy Brake wrote:

> for file maker pro (windows/mac) -- word (windows/mac)
>
> function make_safe($text) {
>     $text = preg_replace("/(\cM)/", " ", $text);
>     $text = preg_replace("/(\c])/", " ", $text);
>     $text = str_replace("\r\n", " ", $text);
>     $text = str_replace("\x0B", " ", $text);
>     $text = str_replace('"', " ", $text);
>     $text = explode("\n", $text);
>     $text = implode(" ", $text);
>     $text = addslashes(trim($text));
>     return($text);
> }
>
> function make_safe2($text) {
>     $text = str_replace("\r\n", "\n", $text);
>     $text = preg_replace("/(\cM)/", "\n", $text);
>     $text = preg_replace("/(\c])/", "\n", $text);
>     $text = str_replace("\x0B", "\n", $text);
>     $text = addslashes($text);
>     return($text);
> }
>
> I cannot remember why I put in two functions ... but anyhow, have fun.
>
> You will probably not need the implode / explode either.

---
a.h.s. boy
spud(at)nothingness.org
"as yes is to if, love is to yes"
http://www.nothingness.org/
---

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Cleaning pasted Word text
for file maker pro (windows/mac) -- word (windows/mac)

    // Flatten the text to a single line: every kind of line break becomes a space.
    function make_safe($text) {
        $text = str_replace("\r\n", " ", $text);      // CRLF first, so it becomes one space rather than two
        $text = preg_replace("/(\cM)/", " ", $text);  // any remaining carriage return (Ctrl-M)
        $text = preg_replace("/(\c])/", " ", $text);  // Ctrl-] (0x1D)
        $text = str_replace("\x0B", " ", $text);      // vertical tab (0x0B)
        $text = str_replace('"', " ", $text);         // double quotes
        $text = explode("\n", $text);                 // remaining line feeds
        $text = implode(" ", $text);
        $text = addslashes(trim($text));
        return($text);
    }

    // Keep the line structure, but normalize every kind of line break to "\n".
    function make_safe2($text) {
        $text = str_replace("\r\n", "\n", $text);
        $text = preg_replace("/(\cM)/", "\n", $text);
        $text = preg_replace("/(\c])/", "\n", $text);
        $text = str_replace("\x0B", "\n", $text);
        $text = addslashes($text);
        return($text);
    }

I cannot remember why I put in two functions ... but anyhow, have fun.

You will probably not need the implode / explode either.

On Tue, 2002-10-29 at 16:39, Daniel Guerrier wrote:
> Paste into notepad, then copy the text from notepad.
> Notepad should remove the high ASCII text.
Re: [PHP] Cleaning pasted Word text
Paste into notepad, then copy the text from notepad. Notepad should remove the high ASCII text.

--- Brent Baisley <[EMAIL PROTECTED]> wrote:
> I think you have posted before and probably didn't get an answer. I'm
> not going to give you an answer (because I don't have one), but perhaps
> I can point you in the right direction.
> Look at http://www.w3.org/TR/REC-html40/charset.html and see if that
> helps you. Below is a paragraph I pulled from it.
>
> The document character set, however, does not suffice to allow user
> agents to correctly interpret HTML documents as they are typically
> exchanged -- encoded as a sequence of bytes in a file or during a
> network transmission. User agents must also know the specific character
> encoding that was used to transform the document character stream into a
> byte stream.
Re: [PHP] Cleaning pasted Word text
Brent --

Thanks for the pointer, but it doesn't really address the problem. I am specifying the character set for the page (ISO-8859-1), and I'm inserting an ACCEPT-CHARSET parameter into the FORM element, but it specifies acceptable charsets as UTF-8, ISO-8859-1, and Windows 1252. The problem isn't accepting or displaying the characters correctly; the problem is figuring out what characters PHP thinks it's looking at!

After further investigation, I find that ISO-8859-1 doesn't even assign printable characters to codes 128-159, so when a user types in a smart quote, it can't _really_ be using Latin 1 (but could be Windows Latin 1). Oddly enough, I've set the page charset to "ISO-8859-1" (which doesn't have a smart quote), and my browser is set to "Use character set specified by server", and it displays a smart quote just fine with chr(147). If I manually change my browser to use "Latin 1", it displays a ? (unknown character symbol).

So between browsers, character sets, meta tags, and operating systems, I'm beginning to think that interpreting high-ASCII input is an art rather than a science...

spud.

On Tuesday, October 29, 2002, at 02:51 PM, Brent Baisley wrote:

> I think you have posted before and probably didn't get an answer. I'm
> not going to give you an answer (because I don't have one), but perhaps
> I can point you in the right direction.
> Look at http://www.w3.org/TR/REC-html40/charset.html and see if that
> helps you. Below is a paragraph I pulled from it.

---
a.h.s. boy
spud(at)nothingness.org
"as yes is to if, love is to yes"
http://www.nothingness.org/
---
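The chr(147) observation fits Windows-1252, where byte 147 (0x93) is the left curly double quote. A minimal sketch of a conversion routine under that assumption follows: it maps a handful of Windows-1252 bytes from the 128-159 range to numeric HTML entities, which display the same whatever charset the page declares. The function name cp1252_to_entities() is hypothetical, the map covers only some of the characters listed earlier in the thread, and input submitted as MacRoman or UTF-8 would need a different table.

```php
<?php
// Assumption: $text is Windows-1252. Bytes 0x80-0x9F have no printable
// meaning in ISO-8859-1, so replace the common ones with numeric entities.
function cp1252_to_entities(string $text): string
{
    static $map = [
        "\x85" => '&#8230;', // horizontal ellipsis
        "\x91" => '&#8216;', // left curly single quote
        "\x92" => '&#8217;', // right curly single quote
        "\x93" => '&#8220;', // left curly double quote, chr(147)
        "\x94" => '&#8221;', // right curly double quote
        "\x95" => '&#8226;', // bullet
        "\x96" => '&#8211;', // en dash
        "\x97" => '&#8212;', // em dash
    ];

    // strtr() with an array replaces each key byte in a single pass.
    return strtr($text, $map);
}

echo cp1252_to_entities("\x93Smart quotes\x94 \x96 and dashes\x85"), "\n";
```

The remaining characters on the list (inverted punctuation, non-breaking space, registered trademark, guillemets) sit in the 160-255 range, where Windows-1252 and ISO-8859-1 agree, so they need no translation on a Latin 1 page.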
Re: [PHP] Cleaning pasted Word text
I think you have posted before and probably didn't get an answer. I'm not going to give you an answer (because I don't have one), but perhaps I can point you in the right direction.

Look at http://www.w3.org/TR/REC-html40/charset.html and see if that helps you. Below is a paragraph I pulled from it.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

On Tuesday, October 29, 2002, at 02:20 PM, a.h.s. boy wrote:

> I'm working on a PHP-based CMS that allows users to post lengthy
> article texts by submitting through a form. The short version of my
> quandary is this: How can I create a conversion routine that reliably
> substitutes HTML-acceptable output for high-ASCII characters pasted
> into the form (from a variety of operating systems)?

--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search & Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577
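The quoted W3C paragraph applies to form submissions as well: the bytes alone do not say which encoding produced them. One common heuristic, sketched here under stated assumptions, is to test whether the input is pure ASCII, then whether it is valid UTF-8, and otherwise treat it as some single-byte encoding. guess_encoding() is a hypothetical name. Telling Windows-1252 from MacRoman would still need outside information such as the client platform, and a single-byte string can occasionally happen to be valid UTF-8, so this is a guess rather than a guarantee.

```php
<?php
// Hypothetical helper: classify a submitted string by its byte pattern.
function guess_encoding(string $text): string
{
    // No byte above 0x7F: plain ASCII, identical in every encoding discussed here.
    if (preg_match('/[\x80-\xFF]/', $text) === 0) {
        return 'ASCII';
    }

    // With the /u modifier, PCRE refuses subjects that are not valid UTF-8,
    // so an empty pattern matches only when the whole string is well formed.
    if (preg_match('//u', $text) === 1) {
        return 'UTF-8';
    }

    // High bytes that are not valid UTF-8: some single-byte encoding
    // (Windows-1252, MacRoman, ...); the bytes cannot say which one.
    return 'single-byte';
}

foreach (["plain text", "\xE2\x80\x9Cquoted\xE2\x80\x9D", "\x93quoted\x94"] as $sample) {
    echo guess_encoding($sample), "\n";
}
```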