Hi everybody,

It seems to me that the getLineSeparator method of PDF2XHTML (package org.apache.tika.parser.pdf) could be improved.
I changed it from:

    public String getLineSeparator() {
        try {
            handler.characters("\n");
        } catch (SAXException e) {
        }
        return super.getLineSeparator();
    }

to:

    public String getLineSeparator() {
        try {
            handler.element("br", "");
        } catch (SAXException e) {
        }
        return super.getLineSeparator();
    }

The resulting HTML is prettier: each line break from the PDF becomes a br element instead of a newline character, which a browser would collapse into ordinary whitespace.

I hope this post helps someone.

See you,
Giunad.

-- 
If we have learned one thing from the history of invention and discovery, it is that in the long run - and often in the short one - the most daring prophecies seem laughably conservative.
Arthur C. Clarke, The Exploration of Space
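For anyone who wants to see the difference without building Tika, here is a rough standalone sketch. It does not use PDF2XHTML or Tika's handler at all; it only uses the JDK's own SAX serializer to write the same two kinds of events (a "\n" text node versus an empty br element). The class name LineSeparatorSketch and the render helper are made up for this illustration.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of Tika.
public class LineSeparatorSketch {

    // Serializes two lines of text inside a <p>, separated either by an
    // empty <br> element (useBr == true) or by a "\n" text node.
    static String render(boolean useBr) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                   .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            AttributesImpl noAttrs = new AttributesImpl();
            handler.startDocument();
            handler.startElement("", "p", "p", noAttrs);
            text(handler, "first line");
            if (useBr) {
                // Same idea as handler.element("br", "") in the post.
                handler.startElement("", "br", "br", noAttrs);
                handler.endElement("", "br", "br");
            } else {
                // Same idea as handler.characters("\n") in the post.
                text(handler, "\n");
            }
            text(handler, "second line");
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Small helper: send a String as a SAX characters event.
    private static void text(TransformerHandler handler, String s) throws Exception {
        char[] chars = s.toCharArray();
        handler.characters(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        System.out.println("with newline: " + render(false));
        System.out.println("with br:      " + render(true));
    }
}
```

Opened in a browser, the first variant shows both lines run together on one row, while the second shows them on separate rows.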