Re: [O] Orgmode → ODT: Certain chars break export

2015-02-14 Thread Vaidheeswaran

On Saturday 14 February 2015 02:20 PM, Vaidheeswaran wrote:


Specifically, in the pdftotext case above, I believe the best action
would be to M-x flush-lines that match ^L so that page headers are
stripped.


I was writing from memory.  I should have said this instead:

The best action would be to flush page headers 'surrounding' ^L and to
'splice' the paragraph lines (that are split apart) at the pagebreaks.

Essentially, for a proper repair, human intervention is the rule rather
than the exception.




Re: [O] Orgmode → ODT: Certain chars break export

2015-02-14 Thread Vaidheeswaran


On Friday 13 February 2015 04:15 PM, Tory S. Anderson wrote:

While we're on the topic of ODT export problems: I was in the process of converting PDF 
to Text to Org to ODT/DocX and discovered that certain characters seem to break exported 
odt documents, which fail with a line and col number. So far the only one I know for sure 
is the ^L (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle all 
such cases.

You probably don't need it, but I verified with the following file:
http://toryanderson.com/files/breakorg.org

Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)




I assume that you are using pdftotext.  In that case, you can use the
following argument.

  -nopgbrk  : don't insert page breaks between pages

That said, it is very difficult to say what the right action should be
when encountering ^L or other problematic characters.  Much depends on
the context.  Neither outright removal nor replacement with a single
SPC, a NEWLINE, or a double NEWLINE may be satisfactory.
Specifically, in the pdftotext case above, I believe the
best action would be to M-x flush-lines that match ^L so that page
headers are stripped.
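
A rough, non-interactive equivalent of that M-x flush-lines step is sketched here. The helper name my/flush-form-feed-lines and the toy page layout are assumptions for illustration only; real pdftotext output will differ, and this does not splice paragraphs that were split at the page break.

```elisp
;; Hypothetical helper (not part of Emacs or Org): delete every line
;; of TEXT that contains a form feed (^L).
(defun my/flush-form-feed-lines (text)
  "Return TEXT with all lines containing a form feed removed."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Called from Lisp, `flush-lines' works from point to end of buffer.
    (flush-lines "\f")
    (buffer-string)))

;; Assumed layout: the ^L sits on the same line as the page header.
(my/flush-form-feed-lines
 "end of page one\n\fPage 2 header\nstart of page two\n")
;; => "end of page one\nstart of page two\n"
```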



From the exporter's side of things, the best that one could do is to
catch such exceptional cases and report them to the user for further
repair, i.e., instead of waiting for LibreOffice to catch this exception
and leave the user in utter confusion, the export backend could catch
the error early in the export process and report a useful message.

A variation of the following snippet can be used to catch the error
early.

(add-hook
 'org-export-before-parsing-hook
 (lambda (backend)
   (when (eq backend 'odt)
     (goto-char (point-min))
     ;; Stop at the first character that is not legal in XML 1.0.
     (when (re-search-forward
            (rx-to-string '(or (in (#x0 . #x8))
                               (in (#xB . #xC))
                               (in (#xE . #x1F))
                               (in (#xD800 . #xDFFF))
                               (in (#xFFFE . #xFFFF))
                               (in (#x110000 . #x3FFFFF))))
            nil t)
       (user-error "Input file has a problematic char [%s]."
                   (format "#x%x" (string-to-char (match-string 0))))))))

The following snippet could be used for outright removal of
problematic characters.

(add-hook
 'org-export-before-parsing-hook
 (lambda (backend)
   (when (eq backend 'odt)
     (goto-char (point-min))
     ;; Delete every run of problematic characters.
     (while (re-search-forward
             (rx-to-string '(one-or-more
                             (or (in (#x0 . #x8))
                                 (in (#xB . #xC))
                                 (in (#xE . #x1F))
                                 (in (#xD800 . #xDFFF))
                                 (in (#xFFFE . #xFFFF))
                                 (in (#x110000 . #x3FFFFF)))))
             nil t)
       (replace-match "" t t)))))



Note to the developers:

1. xmltok.el has `xmltok-valid-char-p'.
2. From http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]



Document authors are encouraged to avoid compatibility characters,
as defined in section 2.3 of [Unicode]. The characters defined in the
following ranges are also discouraged. They are either control
characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
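
As a rough illustration, the Char production quoted above can be written as a small predicate. This is only a sketch: the name my/xml-1.0-char-p is hypothetical, it is not part of Org or Emacs, and it does not use `xmltok-valid-char-p' from note 1. It also ignores the "discouraged" ranges, which are legal XML.

```elisp
;; Hypothetical helper (not part of Org or Emacs): return non-nil if
;; CODE is allowed by the XML 1.0 "Char" production.
(defun my/xml-1.0-char-p (code)
  "Return non-nil if character CODE is a legal XML 1.0 character."
  (or (memq code '(#x9 #xA #xD))
      (<= #x20 code #xD7FF)
      (<= #xE000 code #xFFFD)
      (<= #x10000 code #x10FFFF)))

;; Form feed (^L, #xC) is rejected; TAB and ordinary letters pass.
(my/xml-1.0-char-p #xC)   ; => nil
(my/xml-1.0-char-p ?\t)   ; => non-nil
(my/xml-1.0-char-p ?a)    ; => non-nil
```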







[O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Tory S. Anderson
While we're on the topic of ODT export problems: I was in the process of 
converting PDF to Text to Org to ODT/DocX and discovered that certain 
characters seem to break exported odt documents, which fail with a line and col 
number. So far the only one I know for sure is the ^L (Char: C-l (12, #o14, 
#xc)). Hopefully a single fix can handle all such cases. 

You probably don't need it, but I verified with the following file:
http://toryanderson.com/files/breakorg.org

Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)



Re: [O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Rasmus
torys.ander...@gmail.com (Tory S. Anderson) writes:

 While we're on the topic of ODT export problems: I was in the process
 of converting PDF to Text to Org to ODT/DocX and discovered that
 certain characters seem to break exported odt documents, which fail
 with a line and col number. So far the only one I know for sure is the
 ^L (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
 all such cases.

 You probably don't need it, but I verified with the following file:
 http://toryanderson.com/files/breakorg.org

The export is fine, but the produced XML is invalid since it contains an
illegal character.  But how to resolve this?  Should ox strip illegal
characters (if so, what are they)?  If so, could they be used for entities?

—Rasmus

-- 
I hear there's rumors on the, uh, Internets. . .




Re: [O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Tory S. Anderson
From a user perspective, just stripping the characters seems best to me, but 
finding out what the characters are seems obnoxious. Neither a quick search nor 
skimming the ODT doc specification[1][2] seems to give any insight into the set 
of illegal characters. Does elisp have anything similar to Java's 
isWhitespace[3] that could be used to check character features? 

Rasmus ras...@gmx.us writes:

 torys.ander...@gmail.com (Tory S. Anderson) writes:

 While we're on the topic of ODT export problems: I was in the process
 of converting PDF to Text to Org to ODT/DocX and discovered that
 certain characters seem to break exported odt documents, which fail
 with a line and col number. So far the only one I know for sure is the
 ^L (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
 all such cases.

 You probably don't need it, but I verified with the following file:
 http://toryanderson.com/files/breakorg.org

 The export is fine, but the produced XML is invalid since it contains an
 illegal character.  But how to resolve this?  Should ox strip illegal
 characters (if so, what are they)?  If so, could they be used for entities?

 —Rasmus

Footnotes: 
[1]  https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
[2]  http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415196_253892949
[3]  http://www.fileformat.info/info/unicode/char/000c/index.htm




Re: [O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Rasmus
torys.ander...@gmail.com (Tory S. Anderson) writes:

 From a user perspective just stripping the characters seems best to
 me, but finding out what the characters are seems obnoxious.

But maybe there is a valid way to represent such characters in XML?  At
the very least entities must be replaced before stripping these...

 Neither a quick search nor skimming the ODT doc specification[1][2] seems
 to give any insight into a set of illegal characters. Does elisp have
 anything similar to Java's isWhitespace[3] that could be used to check
 character features?

It's an XML thing.  When I tried to open content.xml with Firefox it
also says broken XML.  But I also don't know which are the characters that
are not supported by XML.

—Rasmus

-- 
This space is left intentionally blank




Re: [O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Tory S. Anderson
Now that you have traced this to XML, there is a helpful wiki page; it even 
mentions my specific character.[1] The main source seems to be the w3.org spec.[2]

Rasmus ras...@gmx.us writes:

 torys.ander...@gmail.com (Tory S. Anderson) writes:

 From a user perspective just stripping the characters seems best to
 me, but finding out what the characters are seems obnoxious.

 But maybe there is a valid way to represent such characters in XML?  At
 the very least entities must be replaced before stripping these...

 Neither a quick search nor skimming the ODT doc specification[1][2] seems
 to give any insight into a set of illegal characters. Does elisp have
 anything similar to Java's isWhitespace[3] that could be used to check
 character features?

 It's an XML thing.  When I tried to open content.xml with Firefox it
 also says broken XML.  But I also don't know which are the characters that
 are not supported by XML.

 —Rasmus

Footnotes: 
[1]  https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.1

[2]  http://www.w3.org/TR/xml11/#charsets




Re: [O] Orgmode → ODT: Certain chars break export

2015-02-13 Thread Rasmus
torys.ander...@gmail.com (Tory S. Anderson) writes:

 Now that you have traced this to XML, there is a helpful wiki page; it
 even mentions my specific character.[1] The main source seems to be
 the w3.org spec.[2]

I don't understand Unicode well enough to propose a solution.

For now you could use org-export-before-parsing-hook,
org-export-filter-final-output-functions, or maybe
org-export-filter-body-functions to solve the issue locally.
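
A minimal sketch of the filter route, for illustration only. It assumes that the ODT backend's XML passes through org-export-filter-final-output-functions, which has not been verified here. The names my/xml-invalid-char-regexp and my/org-odt-strip-invalid-xml-chars are hypothetical. The character set is the XML 1.0 Char production, and anything outside it is simply dropped.

```elisp
;; Hypothetical regexp: one or more characters that fall outside the
;; XML 1.0 "Char" production.
(defconst my/xml-invalid-char-regexp
  (rx (one-or-more
       (not (any ?\t ?\n ?\r
                 (#x20 . #xD7FF)
                 (#xE000 . #xFFFD)
                 (#x10000 . #x10FFFF))))))

;; Hypothetical filter.  Export filters receive the text, the back-end
;; (as a symbol) and the communication channel.
(defun my/org-odt-strip-invalid-xml-chars (text backend _info)
  "Remove characters illegal in XML 1.0 from TEXT when BACKEND is `odt'."
  (if (eq backend 'odt)
      (replace-regexp-in-string my/xml-invalid-char-regexp "" text t t)
    text))

;; Register the filter once the export framework is loaded.
(with-eval-after-load 'ox
  (add-to-list 'org-export-filter-final-output-functions
               #'my/org-odt-strip-invalid-xml-chars))

(my/org-odt-strip-invalid-xml-chars "page one\fpage two" 'odt nil)
;; => "page onepage two"
```

Note that silently deleting ^L joins the surrounding text, as the last line shows, so reporting the character to the user may still be the safer default.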

—Rasmus

-- 
Governments should be afraid of their people