[ https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-171.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2
         Assignee: Jukka Zitting

Since we are going to re-roll the 0.2 release from the current trunk, I
think it makes sense to resolve this issue as fixed for 0.2 in the current
state and perhaps create new issues for the proposed improvements.

Resolving as Fixed for 0.2.

> New ContentHandler for plain text output that has no problem with missing
> white space after XHTML block tags
> -------------------------------------------------------------------------
>
>                 Key: TIKA-171
>                 URL: https://issues.apache.org/jira/browse/TIKA-171
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.2
>            Reporter: Uwe Schindler
>            Assignee: Jukka Zitting
>             Fix For: 0.2
>
>         Attachments: TIKA-171.patch
>
>
> One problem with mapping document content to plain text is incorrect
> whitespace handling:
> The normal way to parse documents to plain text is to instantiate a parser
> and pass the SAX events from the parser to a
> BodyContentHandler(TextContentHandler(Writer)). This appends all output to
> a writer (see example on web site).
> This works well for dumb parsers that just create a single <p> tag in the
> XHTML output with all content of the document in it (including newlines).
> As soon as a more intelligent parser is used (e.g. the HTML parser) that
> creates multiple nodes and a feature-rich XHTML document, the problems
> begin. The TextContentHandler just strips all tags away, and only
> characters() events are forwarded to the Writer. The original document
> (e.g. an HTML document) does not necessarily contain additional whitespace
> and linefeeds (e.g. it is correct and possible to create an XHTML document
> with all content in one text line, but consisting of several paragraphs).
> In this case the </p><p> events between paragraphs are stripped and there
> is no whitespace anymore between the two paragraphs.
> My patch contains a new XHTMLToTextContentHandler that checks the elements
> and inserts whitespace into the output depending on the XHTML tag type.
> HTML block tags like <p/> get a newline at the end, but HTML inline tags
> do not add whitespace. This mapping is done by a simple Set<String> of tag
> names extracted from the XHTML 1.0 spec. To make it even better, tables
> are printed out with white space and tabs between cells.
> With this patch, I am able to correctly index a lot of documents with
> Lucene.
> The patch also changes some tests to correctly check for the '\n' at the
> end of plain text streams (which is included because of the single
> <p> paragraph around plain text).
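
The whitespace rule described in the issue can be sketched roughly as follows. This is not the attached TIKA-171.patch: it is a minimal illustration that uses only the JDK's SAX API, and the class name BlockAwareTextHandler, the toText helper, and the abbreviated tag set are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards character data and adds whitespace
// after block-level elements so adjacent paragraphs do not run together.
class BlockAwareTextHandler extends DefaultHandler {

    // Illustrative subset of XHTML 1.0 block-level element names (assumption:
    // the real patch derives a fuller set from the spec).
    private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "pre", "blockquote", "table", "tr"));

    // Table cells are separated by tabs rather than newlines.
    private static final Set<String> CELL_TAGS =
            new HashSet<>(Arrays.asList("td", "th"));

    private final Writer out;

    BlockAwareTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // localName is empty when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        try {
            if (BLOCK_TAGS.contains(name)) {
                out.write('\n');   // block element: end the line
            } else if (CELL_TAGS.contains(name)) {
                out.write('\t');   // table cell: separate with a tab
            }
            // inline elements (b, i, span, ...) add no whitespace
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Convenience helper for the example: parse well-formed XHTML to text.
    static String toText(String xhtml) {
        try {
            StringWriter writer = new StringWriter();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xhtml)),
                    new BlockAwareTextHandler(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // All content on one line, several paragraphs, no whitespace between them.
        System.out.print(toText(
                "<html><body><p>first</p><p>second <b>bold</b></p></body></html>"));
    }
}
```

With a handler that only forwards characters() events, the two paragraphs in the example input would be written as "firstsecond bold"; here each closing block tag ends the line, while the inline <b> element leaves the text untouched.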