Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-13 Thread Juan Jose Garcia-Ripoll
2011/2/13 Matthew Mondor mm_li...@pulsar-zone.net

> I also did a test relating to my previous suggestions about a way to
> preserve invalid input intact at output, later referred to as UTF-8B
> by Andy Hefner, and it seems possible.


Support for these encodings seems scarce, and literature about them
scarcer still. I found a couple of references in the Unicode mailing
lists and a few blog entries, such as
http://bsittler.livejournal.com/10381.html
Searching for DC80 also turns up similar entries, as DC80-DCFF seems to be
the favorite range of characters for encoding invalid sequences. I think I
could easily code this, but there should be some consensus on its utility.
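
The DC80-DCFF mapping can be sketched as follows. This is only an illustration, assuming the escape is computed as #xDC00 plus the octet value; the function names are hypothetical and not part of ECL. It works on integer code points so that it does not depend on whether CODE-CHAR accepts surrogates.

```lisp
;; Sketch only: these names are hypothetical, not an ECL interface.
(defun utf-8b-escape (octet)
  "Map an undecodable octet (#x80-#xFF) to a code point in U+DC80-U+DCFF."
  (check-type octet (integer #x80 #xFF))
  (logior #xDC00 octet))

(defun utf-8b-escaped-p (code)
  "True if the integer CODE lies in the escape range U+DC80-U+DCFF."
  (<= #xDC80 code #xDCFF))

(defun utf-8b-unescape (code)
  "Recover the original octet from an escaped code point."
  (assert (utf-8b-escaped-p code))
  (logand code #xFF))
```

For instance, (utf-8b-escape #xE9) yields #xDCE9, and (utf-8b-unescape #xDCE9) gives back #xE9.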

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com
--
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
Pinpoint memory and threading errors before they happen.
Find and fix more than 250 security defects in the development cycle.
Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb
___
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list


Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-13 Thread Matthew Mondor
On Sun, 13 Feb 2011 09:59:37 +0100
Juan Jose Garcia-Ripoll juanjose.garciarip...@googlemail.com wrote:

Yes, I think that supporting that encoding would be very easy too.  The
only possibly tricky part is that users of that encoding must, when
necessary, write a more conventional UTF-8 stream to some outputs, such
as for display, possibly with bad sequences converted to LATIN-1.  But a
program could read data from a UTF-8B external-format stream, write it
back to another UTF-8B stream, and be sure that the original data was
copied transparently as-is, without being bothered by decoding/encoding
errors on streams with that external format.

I'm not sure, however, whether ECL should itself transparently treat
those invalid octets as LATIN-1 when writing to a UTF-8 external-format
stream.  Without this, problems may occur in the debugger, slime, etc.,
which would be presented with invalid UTF-8 sequences encoding
characters in the UTF-16 surrogate range.
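
As a rough sketch of the display conversion discussed here, the hypothetical helper below takes a sequence of integer code points and shows each escaped octet (U+DC80-U+DCFF) as the LATIN-1 character with the same octet value. It assumes CODE-CHAR returns a character for every other code point it is given.

```lisp
;; Sketch only: DISPLAY-STRING-FROM-CODES is a hypothetical helper.
(defun display-string-from-codes (codes)
  "Build a displayable string from a sequence of integer code points,
showing UTF-8B escapes (U+DC80-U+DCFF) as the LATIN-1 character of the
original octet."
  (map 'string
       (lambda (code)
         (code-char (if (<= #xDC80 code #xDCFF)
                        (logand code #xFF) ; escaped octet -> LATIN-1
                        code)))            ; ordinary code point
       codes))
```

With this, the code points (99 #xDCE9 100) would be shown as the three characters c, LATIN-1 character 233, and d.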

> Seems that, according to Luis Oliveira in the Babel mailing list and to the
> previous blog entry, the informal specification of UTF-8B is here
> http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
> but I can not seem to reach this page.

It also seems down from here.

Although archive.org has some archives I also couldn't find that
document there.

I could find various implementation notes however, such as an
implementation for iconv:
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg05256.html

Also seems of interest:
http://hyperreal.org/~est/utf-8b/

In a previous post on this list I also posted example macros with some
documentation:
http://sourceforge.net/mailarchive/attachment.php?list_name=ecls-list&message_id=201101241340.p0ODek54021632%40ginseng.pulsar-zone.net&counter=1

Thanks again,
-- 
Matt



Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-12 Thread Juan Jose Garcia-Ripoll
Thanks for the detailed report. I made some changes.

* The exported symbols come from the EXT package. They are:

character-coding-error
character-coding-error-external-format
character-decoding-error
character-decoding-error-octets
character-encoding-error
character-encoding-error-code
stream-encoding-error
stream-decoding-error

* Two restarts are provided: USE-VALUE and CONTINUE. They can be invoked via
the ANSI functions of the same name (I think you missed that point regarding
USE-VALUE).

* Encoding errors are now signaled as well. Previously the function had not
been plugged into the engine.

* I am not likely to provide multi-character restarts, for a simple reason:
ECL's streams are too simple and do not provide arbitrary push-back buffers
for bytes. A USE-VALUE restart that returns more than one character may
lead to unexpected problems with unread-char and other functions. I do not
mean it is impossible, but it complicates the interface, and right now
I have no clear idea how to do that.
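
The restart protocol described above can be sketched in portable Common Lisp. This is only an illustration: TOY-DECODING-ERROR, DECODE-OCTET and DECODE-OCTETS are hypothetical stand-ins for the real decoder, used to show how a handler picks between the ANSI functions USE-VALUE and CONTINUE.

```lisp
;; Sketch only: a toy decoder, not ECL's implementation.
(define-condition toy-decoding-error (error)
  ((octets :initarg :octets :reader toy-decoding-error-octets)))

(defun decode-octet (octet)
  "Return the character for an ASCII OCTET.  For any other octet, signal
TOY-DECODING-ERROR with USE-VALUE and CONTINUE restarts available."
  (if (< octet 128)
      (code-char octet)
      (restart-case (error 'toy-decoding-error :octets (list octet))
        (use-value (char)
          :report "Use a replacement character."
          char)
        (continue ()
          :report "Skip the invalid octet."
          nil))))

(defun decode-octets (octets &key replacement)
  "Decode OCTETS to a string, replacing invalid octets with REPLACEMENT
when it is non-NIL and dropping them otherwise."
  (handler-bind ((toy-decoding-error
                   (lambda (e)
                     (declare (ignore e))
                     (if replacement
                         (use-value replacement) ; ANSI function USE-VALUE
                         (continue)))))          ; ANSI function CONTINUE
    (coerce (remove nil (mapcar #'decode-octet octets)) 'string)))
```

Each restart supplies at most one character per error, which matches the single-character limitation explained in the last point.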

I attached a modified version of your code.

Best,

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com

(defun custom-read-line (stream &key (max 512) (replace-char))
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (flet ((add-char (c)
             (declare (type character c))
             (vector-push c line))
           (finalize-line ()
             (let ((len (length line)))
               (when (and (> len 0)
                          (char= #\Return (aref line (1- len))))
                 (vector-pop line)))
             line))
      (loop
         do
           (let (
                 ;; No way to determine invalid octet values with old ECL,
                 ;; Return an unknown character code
                 (c #+old-ecl(handler-case
                                 (read-char stream)
                               (simple-error ()
                                 #\UFFFD))
                    ;; SBCL provides invalid octets which we can import and
                    ;; then issue an ATTEMPT-RESYNC restart to resume
                    #+sbcl(handler-bind
                              ((sb-int:stream-decoding-error
                                #'(lambda (e)
                                    ;; Treat invalid UTF-8 octets as
                                    ;; ISO-8859 characters.
                                    (mapcar #'(lambda (c)
                                                (when (> c 127)
                                                  (add-char (code-char c))))
                                            (sb-int:character-decoding-error-octets e))
                                    (invoke-restart 'sb-int:attempt-resync))))
                            (read-char stream))
                    ;; Test with new ECL
                    #+ecl(handler-bind
                             ((ext:character-decoding-error ; Internal
                               #'(lambda (e)
                                   (mapcar #'(lambda (c)
                                               (format t "~%Code: ~A" c)
                                               (when (> c 127)
                                                 ;; Never happens
                                                 (add-char (code-char c))))
                                           ;; Not advertized interface?
                                           (ext:character-decoding-error-octets e))
                                   ;; Either replace the character or ignore
                                   (if replace-char
                                       (use-value #\?)
                                       (continue)))))
                           (read-char stream))))
             (when (char= #\Newline c)
               (return (values (finalize-line) t)))
             (add-char c))))))

(defun test (&rest args)
  (with-open-file (stream "InvalidUTF8.txt")
    (loop
       do
         (let ((line (handler-case
                         (apply #'custom-read-line stream args)
                       (end-of-file ()
                         (loop-finish)))))
           (format t "~A~%" line)))))

(test)
#+ecl
(test :replace-char #\?)


Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-12 Thread Matthew Mondor
On Sat, 12 Feb 2011 19:07:43 +0100
Juan Jose Garcia-Ripoll juanjose.garciarip...@googlemail.com wrote:

> Thanks for the detailed report. I made some changes.
>
> * The exported symbols come from the EXT package. They are

Indeed, SI and EXT appear to be aliases; however when a condition type
is printed, SI appears to take precedence over EXT, so for instance:

decoding error on stream #<input stream /tmp/InvalidUTF8.txt>
(:EXTERNAL-FORMAT (:UTF-8 :LF)):
  the octet sequence (233 99) cannot be decoded.
   [Condition of type SI:STREAM-DECODING-ERROR]

Of course that's a detail, though.  I see that the symbol is now
external, nice.

> * Two restarts are provided USE-VALUE and CONTINUE. They can be used via the
> ANSI functions with the same name (I think you missed that point regarding
> USE-VALUE)

Indeed, I hadn't noticed the ANSI USE-VALUE function until my second
post.  I now see a CONTINUE restart as well.

> * I am not likely to provide multi-character restarts for a simple reason:
> ECL's streams are too simple, not providing arbitrary push-back buffers for
> bytes. Having a USE-VALUE restart that returns more than one character may
> lead to unexpected problems with unread-char and other functions -- I do not
> mean it is impossible but it simply complicates the interface and right now
> I have no clear idea how to do that.

I agree that it's unnecessary; as long as the code can obtain the
invalid sequences and resume reading at that point, it should be fine.

So I gave the new changes a quick try; it's much better, although a
character is still getting lost after the CONTINUE restart, even if I
consume all bytes from the invalid octets supplied.  New test code
attached.  Also, in theory, each invalid sequence in that stream is a
single byte, while two invalid octets are supplied per occurrence, but
that's a detail as long as the CONTINUE restart doesn't lose bytes.
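
One possible reading of the two-octet report (233 99) shown above, offered as an assumption about how a typical decoder behaves rather than a description of ECL's code: in UTF-8, octet 233 (#xE9) is a lead byte announcing a three-byte sequence, so a decoder would normally fetch the next octet before noticing that 99 (#x63, the letter c) is not a continuation byte. The hypothetical helpers below sketch that check.

```lisp
;; Sketch only: hypothetical helpers, not ECL's utf_8_decoder.
(defun utf-8-sequence-length (lead)
  "Return the sequence length announced by the lead octet LEAD, or NIL
if LEAD cannot start a well-formed UTF-8 sequence."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun continuation-octet-p (octet)
  "True if OCTET has the form 10xxxxxx."
  (= (logand octet #xC0) #x80))
```

Here (utf-8-sequence-length 233) returns 3, while (continuation-octet-p 99) is false, so both octets end up in the failed sequence.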

Thanks,
-- 
Matt
Ceci est une phrase écrite en Français utilisant l'encodage UTF-8.
Ceci est une phrase écrite en Français utilisant l'encodage ISO-8859-15.

(defun custom-read-line (stream &key (max 512))
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (macrolet ((add-char (c)
                 `(vector-push ,c line)))
      (flet ((finalize-line ()
               ;; Strip trailing #\Return/#\Newline characters; the length
               ;; check avoids calling VECTOR-POP on an empty line.
               (loop
                  while (and (plusp (length line))
                             (member (aref line (1- (length line)))
                                     '(#\Return #\Newline)))
                  do (vector-pop line))
               line))
        (loop
           do
             (let (
                   ;; No way to determine invalid octet values with old ECL,
                   ;; Return an unknown character code
                   (c #+old-ecl(handler-case
                                   (read-char stream)
                                 (simple-error ()
                                   #\UFFFD))
                      ;; SBCL provides invalid octets which we can import and
                      ;; then issue an ATTEMPT-RESYNC restart to resume
                      #+sbcl(handler-bind
                                ((sb-int:stream-decoding-error
                                  #'(lambda (e)
                                      ;; Treat invalid UTF-8 octets as
                                      ;; ISO-8859 characters.
                                      (mapcar #'(lambda (c)
                                                  (when (> c 127)
                                                    (add-char (code-char c))))
                                              (sb-int:character-decoding-error-octets e))
                                      (invoke-restart 'sb-int:attempt-resync))))
                              (read-char stream))
                      ;; Test with new ECL
                      #+ecl(handler-bind
                               ((ext:stream-decoding-error
                                 #'(lambda (e)
                                     (format t "~A~%"
                                             (mapcar #'restart-name
                                                     (compute-restarts e)))
                                     ;; Treat invalid UTF-8 octets as
                                     ;; ISO-8859 characters.
                                     (mapcar #'(lambda (c)
                                                 (format t "~%~A~%" c)
                                                 (add-char (code-char c)))
                                             (ext:character-decoding-error-octets e))
                                     (invoke-restart 'continue))))
                             (read-char stream))))
               (when (char= #\Newline c)
                 (return (values (finalize-line) t)))
               (add-char c)))))))

(defun test ()
  (with-open-file (stream "/tmp/InvalidUTF8.txt")
    (loop
       do
         (let ((line (handler-case
  

Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-12 Thread Juan Jose Garcia-Ripoll
2011/2/12 Matthew Mondor mm_li...@pulsar-zone.net

> a character is still getting lost after the CONTINUE restart, even if I
> consume all bytes from the invalid octets supplied.  New test code
> attached.


I did not realize that from your previous email. This is fixed now (a
trivial typo in utf_8_decoder).

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com


Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

2011-02-12 Thread Matthew Mondor
On Sat, 12 Feb 2011 23:49:14 +0100
Juan Jose Garcia-Ripoll juanjose.garciarip...@googlemail.com wrote:

> I did not realize that from your previous email. This is fixed now (trivial
> typo in utf_8_decoder)

I tested and it works fine for invalid UTF-8 bytes to LATIN-1
conversion.

I also did a test relating to my previous suggestions about a way to
preserve invalid input intact at output, later referred to as UTF-8B
by Andy Hefner, and it seems possible.

The remaining problem with UTF-8B is that it requires support from the
UTF-8 encoder, because those bytes should be output as-is.  Moreover,
this may break things if the UTF-8 decoder does not transparently
support the corresponding input conversion, because the implementation
would otherwise be unable to read what it can write.  I got bitten by a
stuck slime several times when decoding/encoding errors occurred and I
was not careful enough: when outputting a stream with invalid characters
in UTF-8 mode (i.e. printing #xDCxx range characters without passing
them through the latin-1 conversion function), it seems that slime
could not catch the decoding error :)
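
The encoder support this would need can be sketched as follows. It is a minimal illustration, not a patch for ECL: UTF-8B-ENCODE-CODE is a hypothetical name, it works on integer code points, and it assumes the escape range is U+DC80-U+DCFF as produced by the attached test code.

```lisp
;; Sketch only: UTF-8B-ENCODE-CODE is a hypothetical name.
(defun utf-8b-encode-code (code)
  "Return the list of octets encoding the integer code point CODE.
Escaped code points (U+DC80-U+DCFF) yield their original octet unchanged;
everything else is encoded as ordinary UTF-8."
  (flet ((tail (shift)                  ; continuation byte 10xxxxxx
           (logior #x80 (ldb (byte 6 shift) code))))
    (cond ((<= #xDC80 code #xDCFF) (list (logand code #xFF)))
          ((< code #x80) (list code))
          ((< code #x800)
           (list (logior #xC0 (ash code -6)) (tail 0)))
          ((< code #x10000)
           (list (logior #xE0 (ash code -12)) (tail 6) (tail 0)))
          (t
           (list (logior #xF0 (ash code -18))
                 (tail 12) (tail 6) (tail 0))))))
```

Under these assumptions, U+00E9 is written as the two octets #xC3 #xA9, while the escape #xDCE9 is written as the single original octet #xE9, which is what lets a UTF-8B stream be copied byte for byte.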

But UTF-8B could in fact be considered another encoding, and if I
really need it I might eventually send patches to make it optionally
available.  The advantages of this mode were previously mentioned
on this list in the earlier "Unicode: uncomfortable situation" thread.
The attempt is attached nevertheless.  It works fine except for
literal output.

Thanks a lot.  To me it seems that ECL is now on par with SBCL for
UTF-8 input handling.
-- 
Matt

(defun custom-read-line (stream &key (max 512) (utf-8b nil))
  "Reads a line from STREAM.  Uses #\\Newline as the line terminator, and
strips any trailing #\\Return or #\\Newline characters.
Will read up to MAX characters (defaults to 512).
By default, invalid UTF-8 sequences encountered will be imported as
LATIN-1 characters, unless UTF-8B is true, in which case invalid octets
will be imported as-is in a special Unicode range suitable for later
literal output."
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (macrolet ((add-char (c)
                 `(vector-push ,c line))
               (add-utf-8b-octet (o)
                 `(vector-push (code-char (logior #xDC00 (logand ,o #xFF)))
                               line)))
      (flet ((finalize-line ()
               ;; Strip trailing #\Return/#\Newline characters; the length
               ;; check avoids calling VECTOR-POP on an empty line.
               (loop
                  while (and (plusp (length line))
                             (member (aref line (1- (length line)))
                                     '(#\Return #\Newline)))
                  do (vector-pop line))
               line))
        (loop
           do
             (let (
                   ;; No way to determine invalid octet values with old ECL,
                   ;; Return an unknown character code
                   (c #+old-ecl(handler-case
                                   (read-char stream)
                                 (simple-error ()
                                   #\UFFFD))
                      ;; SBCL provides invalid octets which we can import and
                      ;; then issue an ATTEMPT-RESYNC restart to resume
                      #+sbcl(handler-bind
                                ((sb-int:stream-decoding-error
                                  #'(lambda (e)
                                      ;; Treat invalid UTF-8 octets as
                                      ;; ISO-8859 characters.
                                      (mapcar #'(lambda (c)
                                                  (when (> c 127)
                                                    (add-char (code-char c))))
                                              (sb-int:character-decoding-error-octets e))
                                      (invoke-restart 'sb-int:attempt-resync))))
                              (read-char stream))
                      ;; Test with new ECL
                      #+ecl(handler-bind
                               ((ext:stream-decoding-error
                                 #'(lambda (e)
                                     (mapcar #'(lambda (o)
                                                 (if (and utf-8b (> o 127))
                                                     (add-utf-8b-octet o)
                                                     (add-char
                                                      (code-char o))))
                                             (ext:character-decoding-error-octets e))
                                     (invoke-restart 'continue))))
                             (read-char stream))))
               (when (char= #\Newline c)
                 (return (values (finalize-line) t)))
               (add-char c)))))))

(defun display-utf8b-latin1 (string)
  "Formats a string which may contain UTF-8B encoded invalid sequence
octets for screen display, treating those as ISO-8859/LATIN-1 characters.
Note that if using this function to output