Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
2011/2/13 Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
> I also did a test relating to my previous suggestion about a way to
> preserve invalid input intact at output, later referred to as UTF-8B
> by Andy Hefner, and it seems possible.

There seems to be scarce support for these encodings, and even less literature about them. I found a couple of references in the Unicode mailing lists and a few blog entries, for example:

http://bsittler.livejournal.com/10381.html

Searching for DC80 also reveals similar entries, as DC80-DCFF seems to be the favorite range of characters for encoding invalid sequences.

I think I could easily code this, but there should be some consensus on its utility.

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com

-- 
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: Pinpoint memory and threading errors before they happen. Find and fix more than 250 security defects in the development cycle. Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb
___
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list
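The DC80-DCFF convention mentioned above can be sketched in a few lines. This is only an illustration, assuming that each undecodable octet in the range #x80-#xFF is mapped to the code point #xDC00 plus the octet; the function names are hypothetical and are not part of ECL. The sketch works on integers rather than characters, since implementations differ in how CODE-CHAR treats surrogate code points.

```lisp
;; Hypothetical helpers illustrating the UTF-8B escape range.
;; They are not part of ECL or of any library.

(defun octet->utf-8b-code (octet)
  "Map an undecodable OCTET (#x80-#xFF) to a code point in U+DC80-U+DCFF."
  (check-type octet (integer #x80 #xFF))
  (logior #xDC00 octet))

(defun utf-8b-code->octet (code)
  "Return the original octet if CODE lies in U+DC80-U+DCFF, else NIL."
  (when (<= #xDC80 code #xDCFF)
    (logand code #xFF)))

(octet->utf-8b-code #xE9)    ; => 56553, i.e. #xDCE9
(utf-8b-code->octet #xDCE9)  ; => 233, i.e. #xE9
(utf-8b-code->octet #x41)    ; => NIL, an ordinary character
```

Because the mapping is reversible, a writer that recognizes the range can restore the original octet exactly.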
Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
On Sun, 13 Feb 2011 09:59:37 +0100
Juan Jose Garcia-Ripoll <juanjose.garciarip...@googlemail.com> wrote:

Yes, I think that supporting that encoding would be very easy too. The only possibly tricky part is that users of that encoding must, when necessary, output a more conventional UTF-8 stream to some streams, such as for display, possibly with bad sequences converted to LATIN-1. But code could read data from a UTF-8B external-format stream and write it back to another UTF-8B stream, and be sure that the original data was transparently copied as-is, without being bothered by decoding/encoding errors on streams with that external format.

I'm not sure whether ECL should itself transparently treat those invalid octets as LATIN-1 when writing to a UTF-8 external-format stream, however. Without this, problems may occur in the debugger, SLIME, etc., which would be presented with invalid UTF-8 characters in the UTF-16 surrogate range.

According to Luis Oliveira on the Babel mailing list and to the previous blog entry, the informal specification of UTF-8B is here:

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

but I cannot seem to reach this page. It also seems down from here. Although archive.org has some archives, I couldn't find that document there either. I could find various implementation notes, however, such as an implementation for iconv:

http://www.mail-archive.com/linux-utf8@nl.linux.org/msg05256.html

Also of interest:

http://hyperreal.org/~est/utf-8b/

In a previous post on this list I also posted example macros with some documentation:

http://sourceforge.net/mailarchive/attachment.php?list_name=ecls-listmessage_id=201101241340.p0ODek54021632%40ginseng.pulsar-zone.netcounter=1

Thanks again,
-- 
Matt
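The transparent copy described above depends on the encoder writing escaped octets back unchanged. Here is a minimal sketch of that output side, assuming the U+DC80-U+DCFF escape range discussed in this thread. The function names are hypothetical and this is not ECL's encoder; it operates on lists of integer code points to stay independent of how an implementation represents surrogate characters.

```lisp
;; Hypothetical UTF-8B encoder sketch; not ECL's implementation.

(defun utf-8b-encode-code (code)
  "Return a fresh list of octets for the code point CODE.
Codes in U+DC80-U+DCFF are written as the single raw octet they escape."
  (cond ((<= #xDC80 code #xDCFF)
         (list (logand code #xFF)))
        ((< code #x80)
         (list code))
        ((< code #x800)
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        ((< code #x10000)
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))
        (t
         (list (logior #xF0 (ash code -18))
               (logior #x80 (logand (ash code -12) #x3F))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))

(defun utf-8b-encode (codes)
  "Encode the list of code points CODES into a list of octets."
  (mapcan #'utf-8b-encode-code codes))

;; "a", a valid U+00E9, then an escaped stray octet #xE9:
(utf-8b-encode '(#x61 #xE9 #xDCE9))  ; => (97 195 169 233)
```

A plain UTF-8 stream would instead have to reject or convert the escaped code points, which is the display problem raised above.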
Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
Thanks for the detailed report. I made some changes.

* The exported symbols come from the EXT package. They are
    character-coding-error
    character-coding-error-external-format
    character-decoding-error
    character-decoding-error-octets
    character-encoding-error
    character-encoding-error-code
    stream-encoding-error
    stream-decoding-error

* Two restarts are provided, USE-VALUE and CONTINUE. They can be invoked via the ANSI functions of the same name (I think you missed that point regarding USE-VALUE).

* Encoding errors are now signaled as well. Before, the function had not been plugged into the engine.

* I am not likely to provide multi-character restarts, for a simple reason: ECL's streams are too simple and do not provide arbitrary push-back buffers for bytes. A USE-VALUE restart that returns more than one character may lead to unexpected problems with unread-char and other functions. I do not mean it is impossible, but it complicates the interface and right now I have no clear idea how to do it.

I attached a modified version of your code.

Best,

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com

(defun custom-read-line (stream &key (max 512) (replace-char))
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (flet ((add-char (c)
             (declare (type character c))
             (vector-push c line))
           (finalize-line ()
             (let ((len (length line)))
               (when (and (> len 0)
                          (char= #\Return (aref line (1- len))))
                 (vector-pop line)))
             line))
      (loop do
        (let (;; No way to determine invalid octet values with old ECL,
              ;; Return an unknown character code
              (c
                #+old-ecl
                (handler-case (read-char stream)
                  (simple-error () #\UFFFD))
                ;; SBCL provides invalid octets which we can import and
                ;; then issue an ATTEMPT-RESYNC restart to resume
                #+sbcl
                (handler-bind
                    ((sb-int:stream-decoding-error
                       #'(lambda (e)
                           ;; Treat invalid UTF-8 octets as
                           ;; ISO-8859 characters.
                           (mapcar #'(lambda (c)
                                       (when (> c 127)
                                         (add-char (code-char c))))
                                   (sb-int:character-decoding-error-octets e))
                           (invoke-restart 'sb-int:attempt-resync))))
                  (read-char stream))
                ;; Test with new ECL
                #+ecl
                (handler-bind
                    ((ext:character-decoding-error ; Internal
                       #'(lambda (e)
                           (mapcar #'(lambda (c)
                                       (format t "~%Code: ~A" c)
                                       (when (> c 127) ; Never happens
                                         (add-char (code-char c))))
                                   ;; Not advertized interface?
                                   (ext:character-decoding-error-octets e))
                           ;; Either replace the character or ignore
                           (if replace-char
                               (use-value #\?)
                               (continue)))))
                  (read-char stream))))
          (when (char= #\Newline c)
            (return (values (finalize-line) t)))
          (add-char c))))))

(defun test (&rest args)
  (with-open-file (stream "InvalidUTF8.txt")
    (loop do
      (let ((line (handler-case (apply #'custom-read-line stream args)
                    (end-of-file () (loop-finish)))))
        (format t "~A~%" line)))))

(test)
#+ecl (test :replace-char #\?)
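The way a handler picks between the two restarts can be tried without ECL. The following is a portable sketch using a toy condition and a toy ASCII-only decoder as stand-ins for ECL's decoding machinery; every name prefixed with TOY- is hypothetical. Only the ANSI functions USE-VALUE and CONTINUE are the real interface referred to above.

```lisp
;; Toy stand-ins for a decoder that offers USE-VALUE and CONTINUE.
;; Nothing here is ECL API except the ANSI restart functions.

(define-condition toy-decoding-error (error)
  ((octets :initarg :octets :reader toy-decoding-error-octets)))

(defun toy-decode-octet (octet)
  "Return the character for an ASCII OCTET.  Otherwise signal an error
offering CONTINUE (skip the octet) and USE-VALUE (substitute one character)."
  (if (< octet 128)
      (code-char octet)
      (restart-case (error 'toy-decoding-error :octets (list octet))
        (continue () nil)
        (use-value (c) c))))

(defun toy-decode (octets &key replace-char)
  "Decode the list OCTETS to a string, replacing or dropping bad octets."
  (handler-bind ((toy-decoding-error
                   (lambda (e)
                     (declare (ignorable e))
                     (if replace-char
                         (use-value replace-char) ; substitute one character
                         (continue)))))           ; or skip the bad input
    (coerce (remove nil (mapcar #'toy-decode-octet octets)) 'string)))

(toy-decode '(97 233 98))                    ; => "ab"
(toy-decode '(97 233 98) :replace-char #\?)  ; => "a?b"
```

Both restarts yield at most one character per error, which matches the single-character limitation explained in the message.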
Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
On Sat, 12 Feb 2011 19:07:43 +0100
Juan Jose Garcia-Ripoll <juanjose.garciarip...@googlemail.com> wrote:

> Thanks for the detailed report. I made some changes.
>
> * The exported symbols come from the EXT package. They are

Indeed, SI and EXT appear to be aliases; however, when a condition type is printed, SI appears to take precedence over EXT. For instance:

decoding error on stream #<input stream /tmp/InvalidUTF8.txt> (:EXTERNAL-FORMAT (:UTF-8 :LF)): the octet sequence (233 99) cannot be decoded.
   [Condition of type SI:STREAM-DECODING-ERROR]

Of course that's a detail, though. I see that the symbol is now external, nice.

> * Two restarts are provided USE-VALUE and CONTINUE. They can be used via
> the ANSI functions with the same name (I think you missed that point
> regarding USE-VALUE)

Indeed, I hadn't realized about ANSI USE-VALUE until my second post. I now see a CONTINUE restart as well.

> * I am not likely to provide multi-character restarts for a simple
> reason: ECL's streams are too simple, not providing arbitrary push-back
> buffers for bytes. Having a USE-VALUE restart that returns more than one
> character may lead to unexpected problems with unread-char and other
> functions -- I do not mean it is impossible but it simply complicates
> the interface and right now I have no clear idea how to do that.

I agree that it's unnecessary; as long as the code can obtain the invalid sequences and resume reading at that point, it should be fine.

So I gave the new changes a quick try. It's much better, although a character is still getting lost after the CONTINUE restart, even if I consume all bytes from the invalid octets supplied. New test code attached.

Also, in theory, there is only a single invalid byte in a row in that stream, while two invalid octets are supplied per occurrence, but that's a detail as long as the CONTINUE restart doesn't lose bytes.

Thanks,
-- 
Matt

Ceci est une phrase écrite en Français utilisant l'encodage UTF-8.
Ceci est une phrase écrite en Français utilisant l'encodage ISO-8859-15.

(defun custom-read-line (stream &key (max 512))
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (macrolet ((add-char (c)
                 `(vector-push ,c line)))
      (flet ((finalize-line ()
               (loop for c = (vector-pop line)
                     while (member c '(#\Return #\Newline))
                     finally (vector-push c line))
               line))
        (loop do
          (let (;; No way to determine invalid octet values with old ECL,
                ;; Return an unknown character code
                (c
                  #+old-ecl
                  (handler-case (read-char stream)
                    (simple-error () #\UFFFD))
                  ;; SBCL provides invalid octets which we can import and
                  ;; then issue an ATTEMPT-RESYNC restart to resume
                  #+sbcl
                  (handler-bind
                      ((sb-int:stream-decoding-error
                         #'(lambda (e)
                             ;; Treat invalid UTF-8 octets as
                             ;; ISO-8859 characters.
                             (mapcar #'(lambda (c)
                                         (when (> c 127)
                                           (add-char (code-char c))))
                                     (sb-int:character-decoding-error-octets e))
                             (invoke-restart 'sb-int:attempt-resync))))
                    (read-char stream))
                  ;; Test with new ECL
                  #+ecl
                  (handler-bind
                      ((ext:stream-decoding-error
                         #'(lambda (e)
                             (format t "~A~%"
                                     (mapcar #'restart-name
                                             (compute-restarts e)))
                             ;; Treat invalid UTF-8 octets as
                             ;; ISO-8859 characters.
                             (mapcar #'(lambda (c)
                                         (format t "~%~A~%" c)
                                         (add-char (code-char c)))
                                     (ext:character-decoding-error-octets e))
                             (invoke-restart 'continue))))
                    (read-char stream))))
            (when (char= #\Newline c)
              (return (values (finalize-line) t)))
            (add-char c)))))))

(defun test ()
  (with-open-file (stream "/tmp/InvalidUTF8.txt")
    (loop do
      (let ((line (handler-case
Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
2011/2/12 Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
> a character is still getting lost after the CONTINUE restart, even if I
> consume all bytes from the invalid octets supplied. New test code
> attached.

I did not realize that from your previous email. This is fixed now (trivial typo in utf_8_decoder).

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com
Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
On Sat, 12 Feb 2011 23:49:14 +0100
Juan Jose Garcia-Ripoll <juanjose.garciarip...@googlemail.com> wrote:

> I did not realize that from your previous email. This is fixed now
> (trivial typo in utf_8_decoder)

I tested it and it works fine for converting invalid UTF-8 bytes to LATIN-1.

I also did a test relating to my previous suggestion about a way to preserve invalid input intact at output, later referred to as UTF-8B by Andy Hefner, and it seems possible.

The remaining problem with UTF-8B is that it requires support from the UTF-8 encoder, because those bytes should be output as-is. Moreover, things may break if the UTF-8 decoder does not transparently support the same conversion on input, because the implementation would then be unable to read what it can write.

I got bitten by a stuck SLIME several times when decoding/encoding errors occurred and I was not careful enough. When I wrote invalid characters to a stream in UTF-8 mode (i.e. printed #xDCxx range characters without passing them through the LATIN-1 conversion function), it seems that SLIME could not catch the decoding error :)

But UTF-8B could in fact be considered another encoding, and if I really need it I might eventually send patches to make it optionally available. The advantages of this mode were previously mentioned on this list in the earlier "Unicode: uncomfortable situation" thread.

Attached is the attempt nevertheless. It works fine except for literal output.

Thanks a lot. To me it seems that ECL is now on par with SBCL for UTF-8 input handling.
-- 
Matt

(defun custom-read-line (stream &key (max 512) (utf-8b nil))
  "Reads a line from STREAM.  Uses #\\Newline as the line terminator, but
strips any trailing #\\Return or #\\Newline characters.  Will read up to MAX
characters (defaults to 512).  By default, invalid UTF-8 sequences
encountered will be imported as LATIN-1 characters, unless UTF-8B is true,
in which case invalid octets will be imported as-is in a special Unicode
range suitable for later literal output."
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (macrolet ((add-char (c)
                 `(vector-push ,c line))
               (add-utf-8b-octet (o)
                 `(vector-push (code-char (logior #xDC00 (logand ,o #xFF)))
                               line)))
      (flet ((finalize-line ()
               (loop for c = (vector-pop line)
                     while (member c '(#\Return #\Newline))
                     finally (vector-push c line))
               line))
        (loop do
          (let (;; No way to determine invalid octet values with old ECL,
                ;; Return an unknown character code
                (c
                  #+old-ecl
                  (handler-case (read-char stream)
                    (simple-error () #\UFFFD))
                  ;; SBCL provides invalid octets which we can import and
                  ;; then issue an ATTEMPT-RESYNC restart to resume
                  #+sbcl
                  (handler-bind
                      ((sb-int:stream-decoding-error
                         #'(lambda (e)
                             ;; Treat invalid UTF-8 octets as
                             ;; ISO-8859 characters.
                             (mapcar #'(lambda (c)
                                         (when (> c 127)
                                           (add-char (code-char c))))
                                     (sb-int:character-decoding-error-octets e))
                             (invoke-restart 'sb-int:attempt-resync))))
                    (read-char stream))
                  ;; Test with new ECL
                  #+ecl
                  (handler-bind
                      ((ext:stream-decoding-error
                         #'(lambda (e)
                             (mapcar #'(lambda (o)
                                         (if (and utf-8b (> o 127))
                                             (add-utf-8b-octet o)
                                             (add-char (code-char o))))
                                     (ext:character-decoding-error-octets e))
                             (invoke-restart 'continue))))
                    (read-char stream))))
            (when (char= #\Newline c)
              (return (values (finalize-line) t)))
            (add-char c)))))))

(defun display-utf8b-latin1 (string)
  "Formats a string which may contain UTF8-B encoded invalid sequence octets
for screen display, treating those as ISO-8859/LATIN-1 characters.  Note that
if using this function to output