[ 
https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651501#action_12651501
 ] 

Uwe Schindler commented on TIKA-171:
------------------------------------

Thanks for accepting the patch. Next time, I will try to add braces { } even 
around single-statement if bodies. Normally I do this as soon as the body 
starts on a new line, but in my eyes simple one-statement ifs are more readable 
without them.

I am sorry for emitting tabs; I use Notepad++, and tabs are enabled there for 
my other projects. I hope I find a way to quickly change this setting.

bq. Instead of having XHTMLToTextContentHandler pass through only character 
events, how about if it passed all SAX events and simply inserted extra 
ignorableWhitespace events where appropriate? This would make the different 
features more orthogonal and thus easier to combine in new ways. The current 
functionality could still be achieved by combining the class with 
TextContentHandler or WriteOutContentHandler. 

This is a good idea. Using this approach we could also include this code 
directly in XHTMLContentHandler and drop XHTMLToTextContentHandler. If 
BodyContentHandler forwards ignorable whitespace, then WriteOutContentHandler 
could use it. The ignorable whitespace would then be generated automatically by 
all parsers (provided they use XHTMLContentHandler for output, which I think 
they do).
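
A minimal sketch of the pass-through idea, using only the SAX classes that ship 
with the JDK. The names BlockWhitespaceFilter and TextCollector and the 
abbreviated block-tag set are assumptions made for illustration; they are not 
Tika classes and not the code from the patch. XMLFilterImpl is used here only 
because it already forwards every event to the downstream handler.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Forwards all SAX events unchanged and adds an ignorableWhitespace event
// after the end tag of each block element.
class BlockWhitespaceFilter extends XMLFilterImpl {
    // Hypothetical, abbreviated set of block-level tag names.
    private static final Set<String> BLOCK =
            new HashSet<>(Arrays.asList("p", "div", "h1", "li", "tr"));
    private static final char[] NEWLINE = {'\n'};

    BlockWhitespaceFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        if (BLOCK.contains(localName)) {
            super.ignorableWhitespace(NEWLINE, 0, NEWLINE.length);
        }
    }
}

// Stand-in for a text-writing handler: keeps characters and ignorable
// whitespace, drops everything else.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}
```

Because the filter does not drop any events, it can be chained in front of any 
other handler; the text-only behaviour comes from whichever handler sits 
downstream.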

bq. * Emit a double newline at the end of block elements (but only a single 
newline after </tr> or <br/>) to produce an empty line to separate paragraphs 
in text output. This makes the output easier for manual inspection and might 
even help some post-processors (that for some reason don't know how to use 
XHTML) to better detect structure in the text output. 

This can be done easily. I think, just do it!
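
Roughly, the choice of separator could look like the sketch below. The class 
name BlockSeparators, the method afterEndTag and both tag sets are hypothetical 
and only illustrate the rule from the quote; the real list would come from the 
XHTML 1.0 spec.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BlockSeparators {
    // Elements that end a line but not a paragraph (assumed from the quote).
    private static final Set<String> SINGLE_NEWLINE =
            new HashSet<>(Arrays.asList("tr", "br"));
    // Hypothetical, abbreviated set of other block elements.
    private static final Set<String> DOUBLE_NEWLINE =
            new HashSet<>(Arrays.asList("p", "div", "h1", "h2", "ul", "ol",
                    "table", "pre", "blockquote"));

    // Returns the whitespace to emit after the end tag of the given element.
    static String afterEndTag(String localName) {
        if (SINGLE_NEWLINE.contains(localName)) {
            return "\n";
        }
        if (DOUBLE_NEWLINE.contains(localName)) {
            return "\n\n";
        }
        return ""; // inline elements add no whitespace
    }
}
```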

bq. * Detect if the incoming XHTML document already has such extra whitespace 
and either (partially) replace it with the emitted whitespace or keep it and 
avoid emitting extra whitespace. 

If all parsers emitted this whitespace as ignorable whitespace and we used the 
approach noted above, this would automatically be correct.

bq. * Avoid emitting extra whitespace for empty elements. This way we can keep 
the nicely symmetric property that an empty input stream results in an empty 
text output stream. 

We need a simple flag that is set to true whenever a non-empty characters 
event is emitted. On each start element we reset it to false. This approach 
would take care of simple, empty tags. In end element we would make the 
block-tag check also depend on this flag. The problem is tags nested inside 
tags without text around them, so the flag must be stacked, with one flag per 
element depth (use a BitSet for it, like in my other patch? A LinkedList is too 
heavy for booleans). I will think about some code; maybe I am making this too 
complicated! ;-)
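
A rough sketch of such a stacked flag, with one BitSet bit per element depth. 
The class name ContentFlagStack and its methods are made up for this example. 
Propagating the flag to the parent on end element is an assumption added here 
so that an outer element counts as non-empty when any nested element had text; 
the sketch also assumes balanced start/end events.

```java
import java.util.BitSet;

class ContentFlagStack {
    // Bit i is set when the element open at depth i has seen text.
    private final BitSet flags = new BitSet();
    private int depth = 0;

    // Call from startElement: enter a new depth with a fresh (false) flag.
    void startElement() {
        depth++;
        flags.clear(depth);
    }

    // Call from characters: any non-empty event marks the current element.
    void characters(int length) {
        if (length > 0) {
            flags.set(depth);
        }
    }

    // Call from endElement: returns true if this element (or a descendant)
    // contained text, i.e. whether whitespace should be emitted for it.
    boolean endElement() {
        boolean hadContent = flags.get(depth);
        flags.clear(depth);
        depth--;
        if (hadContent) {
            flags.set(depth); // assumption: content propagates to the parent
        }
        return hadContent;
    }
}
```

In end element the handler would then emit whitespace only if the tag is a 
block tag and endElement() returned true.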

Nevertheless, I wanted to keep the handler as simple as possible (there is even 
too much logic for tables in it). To make the text indexable, just declaring all 
such elements as block and emitting '\n' would be enough. For nice output, the 
current approach has one more problem:

HTML tables with block tags in them (e.g. <td><div>...</div></td>) would produce 
"\n\t". In principle this should also be handled. For my other patch 
(TIKA-172) this case does occur, as all OpenDocument tables contain paragraphs.


> New ContentHandler for plain text output that has no problem with missing 
> white space after XHTML block tags
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-171
>                 URL: https://issues.apache.org/jira/browse/TIKA-171
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.2
>            Reporter: Uwe Schindler
>         Attachments: TIKA-171.patch
>
>
> One problem with mapping document content to plain text is incorrect 
> whitespace handling:
> The normal way to parse documents to plain text is to instantiate a parser 
> and pass the SAX events from the parser to a 
> BodyContentHandler(TextContentHandler(Writer)). This appends all output to a 
> writer (see example on web site).
> This works well for dumb parsers that just create a single <p> tag in XHTML 
> output with all content of the document in it (including newlines).
> As soon as a more intelligent parser is used (e.g. the HTML parser) that 
> creates multiple nodes and a feature-rich XHTML document, the problems begin. 
> The TextContentHandler just strips all tags away, and only characters() 
> events are forwarded to the Writer. The original document (e.g. an HTML 
> document) may not contain additional whitespace and linefeeds (it is correct 
> and possible to create an XHTML document with all content in one text line, 
> but consisting of several paragraphs). In this case the </p><p> events 
> between paragraphs are stripped and there is no whitespace anymore between 
> the two paragraphs.
> My patch contains a new XHTMLToTextContentHandler, that checks the elements 
> and inserts whitespace to the output depending on the XHTML tag type. HTML 
> block tags like <p/> get a newline at the end, but HTML inline tags do not 
> add whitespace. This mapping is done by a simple Set<String> of tag names 
> extracted from the XHTML 1.0 spec. To make it even better, tables are printed 
> out with white space and tabs between cells.
> With this patch, I am able to correctly index a lot of documents with Lucene.
> The patch also changes some tests to correctly check for the '\n' at the end 
> of plain text streams (which are included because of the single <p>-paragraph 
> around plain text).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
