[AUCTeX-devel] preview-latex coding system problem with Japanese LaTeX

Ikumi Keita Fri, 30 Sep 2016 06:18:55 -0700

Dear AUCTeX developers,

I have some problems with preview-latex with regard to the coding system
when I use Japanese LaTeX.  Since recent TeXLive contains Japanese
LaTeX by default, I suppose that non-Japanese users can reproduce the
problems if sample files are provided.  So I have organized this email
into the following 3 parts:


A. The problems are described with the attached sample files so
   that anyone can actually experience the situation and examine
   what's going on in detail.
B. The causes of the problems are explained and tentative fixes are
   proposed in the attached patches.
C. The patches in B. fix the problems only partially.  The remaining
   problem is described, with a call for help.

A. There are two problems.  I will describe them in order.
A-1. How to reproduce:
(1) Start a new emacs session with
env LC_ALL=ja_JP.SJIS emacs &
    and enable preview-latex.
(2) Open the attached file "preview-error-test.tex", which has many
    \section lines.  They are all commented out initially.
(3) Uncomment any one of them and start preview-latex with C-c C-p C-d.
    Answer n to the "Cache preamble?" question.  Then the error or bad
    result described on the line following the uncommented \section
    will occur, e.g.
Invalid regexp: "Unmatched ( or \\("
(4) Comment that \section line out again, uncomment another \section
    line, and try C-c C-p C-d again.  A different error will appear.
(5) Repeat the procedure described in (4).

Step (3) will not work if your TeX distribution lacks the Japanese
LaTeX command binary "platex".  In that case, please check the
following list.
o Be sure to install TeXLive.  Other TeX distributions usually lack
  Japanese TeX engines.
o If you (or the package manager you are using) didn't select a large
  enough scheme when installing TeXLive, the Japanese LaTeX suite is
  not present on your machine.
o Japanese TeX was first included in TeXLive several years ago.  Thus if
  your TeXLive is older than that, Japanese LaTeX is not available.
o If your ghostscript is not configured to handle PS files with Japanese
  fonts, the characters in the preview image may be garbled.  However,
  that is not the point I'm making here.  Rather, it is the error in
  the regexp match, which prevents preview-latex from doing its job,
  that I'd like you to look at.
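
Whether "platex" is reachable from Emacs can be checked before trying
the steps above.  The following is only a minimal sketch; the function
name `my-check-platex' is made up for illustration and is not part of
AUCTeX.

```elisp
;; Hypothetical helper, not part of AUCTeX.
(defun my-check-platex ()
  "Report whether the `platex' binary can be found via `exec-path'."
  ;; `executable-find' returns the full path of the program, or nil.
  (let ((path (executable-find "platex")))
    (if path
        (message "platex found: %s" path)
      (message "platex not found; the reproduction steps need it"))))

(my-check-platex)
```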

A-2. How to reproduce:
(1) This time, start a new emacs session with another locale
env LC_ALL=ja_JP.eucJP emacs &
    and enable preview-latex.
(2) Open the attached file "preview-error-test2.tex" and type C-c C-p
    C-d.  This time, answer y, not n, to the "Cache preamble?"
    question.
(3) The preview image will then appear at the wrong position.

This example requires the `platex' binary, too.

B. The reasons and tentative fixes to the problems.
B-1. Shift-JIS encoding problem.
The bad results demonstrated in A-1 are caused by the nature of the
coding system `japanese-shift-jis' (SJIS for short).  SJIS is one of the
major encodings for Japanese text and, for historical reasons, the
standard encoding in the Japanese edition of Windows.  Basically, SJIS
represents one Japanese character by two bytes.  Examples of such
two-byte sequences are, in hexadecimal form,

8E 82

and

81 5B

While the first byte of the sequence is always 8-bit (MSB on), the
second is not necessarily so.  In the above two examples, the second
byte of the first sequence (82) is 8-bit, but that of the second
sequence (5B) is 7-bit (MSB off).  It is this 7-bit byte that causes the
problems in A-1 above.  Unfortunately, this 7-bit byte sometimes
coincides with a regexp meta character, and then `regexp-quote' in the
function `preview-error-quote' interferes with it.  Roughly speaking,
`preview-error-quote' works as follows:
1. Encodes the string in the given coding system (i.e., SJIS in this
   example).
2. Replaces each text that begins with "^^" with the corresponding byte.
3. Builds a regular expression, used later to locate the position in
   the buffer where the preview image is put, escaping the meta
   characters in the original text with `regexp-quote'.
4. Decodes the resulting string from that coding system again.
However, when `regexp-quote' in step 3 escapes the 7-bit byte of an
SJIS character, the decoding in step 4 fails to recover the original
character.

The following example illustrates what is going on:
(let* ((s1 (char-to-string (make-char 'japanese-jisx0208 37 63)))
       ;; s1 is multibyte Japanese string.
       ;; Encode s1 in SJIS.
       (s2 (encode-coding-string s1 'shift_jis))
       ;; At this point s2 is "\203^".
       (s3 (regexp-quote s2))
       ;; Now s3 is "\203\\^".
       ;; Then decode back assuming SJIS encoding.
       (s4 (decode-coding-string s3 'shift_jis)))
  (string-equal s1 s4))
=> nil ;; no longer goes back to the original string s1.
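
The same check can be applied to the characters used in the attached
sample file.  The sketch below is illustrative only; the helper name
`my-sjis-trail-byte-info' is hypothetical and not part of AUCTeX.  It
returns the SJIS bytes of a character and tells whether the second
byte is an ASCII character that `regexp-quote' would escape.

```elisp
;; Hypothetical helper, not part of AUCTeX.
(defun my-sjis-trail-byte-info (ch)
  "Return (BYTES-AS-HEX QUOTED-P) for character CH encoded in SJIS.
QUOTED-P is non-nil when the second byte is an ASCII character that
`regexp-quote' would prefix with a backslash."
  (let* ((bytes (string-to-list
                 (encode-coding-string (string ch) 'japanese-shift-jis)))
         (trail (nth 1 bytes)))
    (list (mapconcat (lambda (b) (format "%02X" b)) bytes " ")
          (and trail
               (< trail #x80)
               ;; `regexp-quote' changes the string only when it has
               ;; to escape a meta character.
               (not (string= (regexp-quote (string trail))
                             (string trail)))))))

;; Characters taken from "preview-error-test.tex", plus あ for contrast.
(mapcar #'my-sjis-trail-byte-info '(?表 ?ー ?型 ?あ))
```

For 表 (95 5C) and ー (81 5B) the second element is t, whereas for あ
(82 A0), whose second byte is 8-bit, it is nil.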

The attached patch "preview-latex-fix" is my approach to fixing this
problem.  It avoids handling the encoded string and performs the
relevant operations consistently on the decoded string.  (In addition,
it fixes a problem that `char-to-string' in the original code does not
do the expected job in unicode-based emacs for chars of #x80 through
#xFF.  I changed it to use `byte-to-string' instead when that function
is available.)
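
The difference between the two functions can be seen in a small
sketch.  It assumes a unicode-based emacs (23 or later) in which
`byte-to-string' is available; the function name
`my-byte-vs-char-demo' is made up for illustration.

```elisp
;; Hypothetical demo, not part of AUCTeX.
(defun my-byte-vs-char-demo ()
  "Compare `char-to-string' and `byte-to-string' for the value #xAB."
  (let ((via-char (char-to-string #xAB))   ; multibyte string, U+00AB
        (via-byte (byte-to-string #xAB)))  ; unibyte string, raw byte AB
    (list (multibyte-string-p via-char)
          (multibyte-string-p via-byte)
          ;; Raw bytes 95 5C decode to the single SJIS character 表.
          (decode-coding-string
           (concat (byte-to-string #x95) (byte-to-string #x5C))
           'japanese-shift-jis))))

(my-byte-vs-char-demo)
;; => (t nil "表")
```

Only the unibyte form is suitable as input to `decode-coding-string',
which is why the raw bytes have to be built with `byte-to-string'.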

B-2. preview-latex drops the necessary command option.
The Japanese TeX command sometimes needs the "-kanji" option to know the
coding system of the given TeX file.  In AUCTeX, this requirement is
usually covered by the "%(kanjiopt)" construct in the following lines
quoted from tex-jp.el:

(setq TeX-engine-alist-builtin
      (append TeX-engine-alist-builtin
             '((ptex "pTeX" "ptex %(kanjiopt)" "platex %(kanjiopt)" "eptex")
               (jtex "jTeX" "jtex" "jlatex" nil)
               (uptex "upTeX" "euptex" "uplatex" "euptex"))))

This "%(kanjiopt)" is replaced by a suitable option string such as
"-kanji XXX" when necessary.  However, if the answer to the question
"Cache preamble?" is y, preview-latex drops this option, which leads to
the result described in A-2 above.

The option "-kanji XXX" goes missing because
`TeX-inline-preview-internal' transforms the command line passed to the
OS shell by `(preview-do-replacements command
preview-undump-replacements)' when caching of the preamble is enabled.
The regular expression in `preview-undump-replacements' is designed to
pick up only the very first word of the value of the variable `command',
leaving behind the option "-kanji XXX".
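
The effect can be sketched with a much simplified stand-in for that
replacement.  The function `my-undump-command-sketch', its regular
expression and the sample command line are all assumptions made for
illustration; they are not the actual contents of
`preview-undump-replacements'.

```elisp
;; Hypothetical stand-in, not the real `preview-undump-replacements'.
(defun my-undump-command-sketch (command format-name)
  "Rebuild COMMAND from its first word, FORMAT-NAME and its last word.
Everything between the first and the last word is discarded."
  (if (string-match "\\`\\([^ ]+\\) .*? \\([^ ]+\\)\\'" command)
      (concat (match-string 1 command)
              " \"&" format-name "\" "
              (match-string 2 command))
    command))

(my-undump-command-sketch "platex -kanji euc test.tex" "prv_test")
;; => "platex \"&prv_test\" test.tex"
```

Because only the first word is captured as the command, the option
between it and the file name does not survive the replacement.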

The attached patch "preview-latex-fix2" aims to resolve this problem.
It restores the latex command options provided in the entry that
`(TeX-engine-alist)' returns so that the command runs correctly.

C. Call for help
Some problems still remain.  I think we should have an integrated
framework that serves both preview-latex and tex-jp.el in determining
the suitable process coding system.

The coding systems used to communicate with the Japanese TeX command are
not constant but vary with the environment.  In fact they can only be
determined at run time.  Currently that situation is handled during
normal runs by the function `japanese-TeX-set-process-coding-system' in
tex-jp.el.  That function is set as the value of
`TeX-after-start-process-function' and is called after the TeX process
starts.  In that way, the process coding systems are set to values
suitable for the environment at that point in time.  However, the way
preview-latex handles process coding systems sometimes conflicts with
this setting.  For example, `TeX-inline-preview-internal' overwrites the
process coding system after `japanese-TeX-set-process-coding-system'
has done its job.  (Current preview-latex uses the value of
`TeX-japanese-process-output-coding-system', but relying on such a
constant value is not sufficient.  In fact the default value of
`TeX-japanese-process-output-coding-system' was changed to nil
recently.)  Even my patch "preview-latex-fix" is not sufficient in
this respect: the coding-system argument supplied to
`decode-coding-string' should not simply be `buffer-file-coding-system'.

I would appreciate it if anyone who has deeper knowledge of AUCTeX could
help to resolve all these coding system issues in preview-latex.

Best regards,
Ikumi Keita

P.S. I subscribed to auctex-devel ML temporarily, so it is not necessary
to put me on CC: when replying.  I will stay on the ML until the
discussion about this issue is settled.

\documentclass{jarticle}

\begin{document}
% How to see the errors or unexpected results:
% Uncomment one \section line at a time and start
% preview-latex with C-c C-p C-d.  Answer n to the "Cache preamble?"
% question.  Then the error or unexpected result described on the
% following line will occur.

% The chars 表, 予 and 能 contain 0x5c backslash in the shift jis encoding.
%\section{表(1)}
% error in process sentinel: Invalid regexp: "Unmatched ( or \\("

%\section{予{a}}
% error in process sentinel: Invalid regexp: "Invalid content of \\{\\}"

%\section{(能)}
% error in process sentinel: Invalid regexp: "Unmatched ) or \\)"

%\section{能\|}
% No error, but the image covers the text only partially.

%\section{あ} %表
% error in process sentinel: Invalid regexp: "Trailing backslash"

% The char ー contains 0x5b [ in the shift jis encoding.
%\section{アース}
% error in process sentinel: Invalid regexp: "Unmatched [ or [^"

% The char 型 contains 0x5e ^ in the shift jis encoding.
%\section{型}
% No error, but the text is misplaced far rightward to the image.

\end{document}

%%% Local Variables:
%%% coding: japanese-shift-jis
%%% mode: japanese-latex
%%% TeX-master: t
%%% TeX-engine: ptex
%%% End:

\documentclass{jarticle}

\begin{document}
% Start preview-latex with C-c C-p C-d.  Answer y to the "Cache preamble?"
% question.  Then you will see that the image is placed at the wrong
% position in the buffer.
preview-latex で \(a^{2}=b^{2}+c^{2}\) のような数式を日本語 LaTeX でも
preview したい。
\end{document}

%%% Local Variables:
%%% coding: euc-jp
%%% mode: japanese-latex
%%% TeX-master: t
%%% TeX-engine: ptex
%%% End:

diff --git a/preview.el.in b/preview.el.in
--- a/preview.el.in
+++ b/preview.el.in
@@ -2613,35 +2613,96 @@
 so the character represented by ^^^ preceding extended characters
 will not get matched, usually."
   (let (output case-fold-search)
-    (when (featurep 'mule)
-      (setq string (encode-coding-string string run-coding-system)))
-    (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
-			 string)
+    ;; Some coding systems (e.g. japanese-shift-jis) use regexp meta
+    ;; characters on encoding.  Such meta characters would be
+    ;; interfered with `regexp-quote' below.  Thus the idea of
+    ;; "encoding entire string beforehand and decoding it at the last
+    ;; stage" does not work for such coding systems.
+    ;; (when (featurep 'mule)
+    ;;   (setq string (encode-coding-string string run-coding-system)))
+    ;; Rather, we work consistently with decoded text.
+    (if (and (featurep 'xemacs) (featurep 'mule)
+	     (eq 'raw-text (coding-system-name
+			    (coding-system-base run-coding-system))))
+	(setq string
+	      (decode-coding-string string
+				    (or (and (featurep 'tex-jp)
+					     japanese-TeX-mode
+					     TeX-japanese-process-output-coding-system)
+					buffer-file-coding-system))))
+
+    ;; Next, bytes with value 0x80 to 0xFF represented with ^^ form
+    ;; are converted to byte sequence, and decoded by suitable coding
+    ;; system.
+    (setq string
+	  (preview--decode-^^ab string
+				(if (featurep 'mule)
+				    buffer-file-coding-system nil)))
+
+    ;; Then, control characters are taken into account.
+    (while (string-match "\\^\\{2,\\}\\([@-_?]\\)" string)
       (setq output
 	    (concat output
 		    (regexp-quote (substring string
 					     0
 					     (- (match-beginning 1) 2)))
-		    (if (match-beginning 2)
-			(concat
-			 "\\(?:" (regexp-quote
-				  (substring string
-					     (- (match-beginning 1) 2)
-					     (match-end 0)))
-			 "\\|"
-			 (char-to-string
-			  (logxor (aref string (match-beginning 2)) 64))
-			 "\\)")
-		      (char-to-string
-		       (string-to-number (match-string 1 string) 16))))
+		    (concat
+		     "\\(?:" (regexp-quote
+			      (substring string
+					 (- (match-beginning 1) 2)
+					 (match-end 0)))
+		     "\\|"
+		     (char-to-string
+		      (logxor (aref string (match-beginning 1)) 64))
+		     "\\)"))
 	    string (substring string (match-end 0))))
     (setq output (concat output (regexp-quote string)))
-    (if (featurep 'mule)
-	(decode-coding-string output
-			      (or (and (boundp 'TeX-japanese-process-output-coding-system)
-				       TeX-japanese-process-output-coding-system)
-				  buffer-file-coding-system))
-      output)))
+    output))
+
+(defun preview--decode-^^ab (string coding-system)
+  "Decode ^^ sequences in STRING with CODING-SYSTEM.
+Sequences of control characters such as ^^I are left untouched.
+
+Return a new string."
+  ;; Since the given string can contain multibyte characters, decoding
+  ;; should be performed seperately on each segment made up entirely
+  ;; with ASCII characters.
+  (let ((result ""))
+    (while (string-match "[\x00-\x7F]+" string)
+      (setq result
+	    (concat result
+		    (substring string 0 (match-beginning 0))
+		    (let ((text (preview--convert-^^ab
+				 (match-string 0 string))))
+		      (if (featurep 'mule)
+			  (decode-coding-string text coding-system)
+			text)))
+	    string (substring string (match-end 0))))
+    (setq result (concat result string))
+    result))
+
+(defun preview--convert-^^ab (string)
+  "Convert ^^ sequences in STRING to raw 8bit.
+Sequences of control characters such as ^^I are left untouched.
+
+Return a new string."
+  (save-match-data
+    (let ((result ""))
+      (while (string-match "\\^\\^[8-9a-f][0-9a-f]" string)
+	(setq result
+	      (concat result
+		      (substring string 0 (match-beginning 0))
+		      (let ((byte (string-to-number
+				   (substring (match-string 0 string) 2) 16)))
+			;; `char-to-string' is not appropriate in
+			;; Emacs >= 23 because it converts #xAB into
+			;; "\u00AB" (multibyte string), not "\xAB"
+			;; (raw 8bit unibyte string).
+			(if (fboundp 'byte-to-string)
+			    (byte-to-string byte) (char-to-string byte))))
+	      string (substring string (match-end 0))))
+      (setq result (concat result string))
+      result)))
 
 (defun preview-parse-messages (open-closure)
   "Turn all preview snippets into overlays.
@@ -3496,9 +3557,10 @@
 	  (setq TeX-sentinel-function 'preview-TeX-inline-sentinel)
 	  (when (featurep 'mule)
 	    (setq preview-coding-system
-		  (or (and (boundp 'TeX-japanese-process-output-coding-system)
-			   TeX-japanese-process-output-coding-system)
-		      (with-current-buffer commandbuff
+		  (with-current-buffer commandbuff
+		    (or (and (featurep 'tex-jp)
+			     japanese-TeX-mode
+			     TeX-japanese-process-output-coding-system)
 			buffer-file-coding-system)))
 	    (when preview-coding-system
 	      (setq preview-coding-system

diff --git a/preview.el.in b/preview.el.in
--- a/preview.el.in
+++ b/preview.el.in
@@ -3542,7 +3542,13 @@
 	 "Preview-LaTeX"
 	 (if (consp (cdr dumped-cons))
 	     (preview-do-replacements
-	      command preview-undump-replacements)
+	      command
+	      (append preview-undump-replacements
+		      ;; Since the command options provided in
+		      ;; (TeX-engine-alist) are dropped, give them
+		      ;; back.
+		      (list (list "\\`\\([^ ]+\\)"
+			    (TeX-command-expand "%(latex)" nil)))))
 	   command) file)))
     (condition-case err
 	(progn

_______________________________________________
auctex-devel mailing list
auctex-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/auctex-devel
