Author: jukka
Date: Wed Jan 7 07:41:38 2009
New Revision: 732370
URL: http://svn.apache.org/viewvc?rev=732370&view=rev
Log:
TIKA-180: XHTMLContentHandler unable to extract text from MSWord file
Use the SafeContentHandler class in XHTMLContentHandler to prevent all current
Tika parsers from outputting invalid XML characters.
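
For readers unfamiliar with the idea, the following is a minimal sketch of how a
filtering handler can keep invalid XML characters out of SAX output. It is not
the SafeContentHandler source: the names XmlCharFilterSketch, isValidXmlChar,
sanitize and FilteringHandler are hypothetical, and replacing invalid characters
with a space is an assumption made for illustration. The valid ranges follow the
Char production of XML 1.0.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration only; not the actual Tika SafeContentHandler.
class XmlCharFilterSketch {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every invalid code point with a space (assumed policy).
    // Unpaired surrogates surface as invalid code points and are replaced too.
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                out.appendCodePoint(isValidXmlChar(cp) ? cp : ' '));
        return out.toString();
    }

    // SAX filter that cleans character data before forwarding it downstream.
    static class FilteringHandler extends XMLFilterImpl {
        FilteringHandler(ContentHandler delegate) {
            setContentHandler(delegate);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] clean = sanitize(new String(ch, start, length)).toCharArray();
            super.characters(clean, 0, clean.length);
        }
    }
}
```

Placing the filter in a base class, as this commit does by changing the
superclass of XHTMLContentHandler, means every parser that writes through that
handler gets the filtering without per-parser changes.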
Modified:
lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
Modified:
lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
URL: http://svn.apache.org/viewvc/lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java?rev=732370&r1=732369&r2=732370&view=diff
==============================================================================
--- lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java (original)
+++ lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java Wed Jan 7 07:41:38 2009
@@ -26,7 +26,7 @@
  * Content handler decorator that simplifies the task of producing XHTML
  * events for Tika content parsers.
  */
-public class XHTMLContentHandler extends ContentHandlerDecorator {
+public class XHTMLContentHandler extends SafeContentHandler {
     /**
      * The XHTML namespace URI