> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware
> and standard interface to parsing libraries, metadata representation,
> mime extraction frameworks and content analysis mechanisms.
I am OK with this, but I would wish for a simple way to configure, plug in, and plug out parsers together with their complete dependencies. If you write a project using only the PDF and HTML parsers, it makes no sense to pollute your classpath with all the other libraries. If it were possible to correctly determine which library and parser is needed for which document type, there should also be a way of switching the other parsers off completely (so no ClassNotFoundExceptions are thrown when the auto-detect parser hits an unsupported document type). A minimal sketch of what I mean is at the end of this mail.

My problem with these highly sophisticated parser libraries outside of Tika is the classpath pollution, especially since Tika ships one standalone JAR: most people will use it (because they do not know how to do it any other way), but they do not know what is in it. If they knew the correct Maven command, they could use the Tika-only JAR together with the external parser JARs; I had to ask Jukka how to get the original dependency JARs into my classpath. Without a binary Tika release that contains all JARs as separate files, it is not simple to use Tika without the danger of polluting your classpath.

And because it often takes a long time until Tika replaces old dependencies with newer ones (widely used dependencies like XML parsers, JDOM, ...) and versions conflict, it is a horror to resolve them (especially with the standalone JAR). For me it was a problem to work with the old NekoHTML, because my projects needed a newer version than the one supplied with Tika. The biggest horror for me was Tomcat before version 6, which shipped with very old versions of XML parsers and other Commons tools in one big JAR file. Since then I have switched completely to the lightweight Jetty with only three separate JAR files.

Because of this, I wrote the OpenDocument parser. The external parser library Jukka proposed in one of the issues was not usable, because it was only a blown-up DOM-tree representation of the XML structure of OpenDocument files, not usable for any parser without much work. Another good approach would be this "library" for Tika: http://xml.openoffice.org/sx2ml/ (for Tika it would only add an XSLT dependency to the classpath, which is already shipped with Java 1.5). The parsing would be a pipeline: sx2ml -(SAX events)-> (X)HTML parser -(SAX events)-> text (the extra XHTML stage is there to clean up the heavily style-annotated HTML). The XSLs could be put into one JAR and loaded with Class.getResourceAsStream(). A rough sketch of such a pipeline follows at the end of this mail. If you like that better, I can try to rewrite the OpenDocument parser using it, because it is an officially supported part of the OpenOffice.org community.

Hope you understand me.

Uwe
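P.S.: To make the plug in/plug out point concrete, here is a minimal sketch, assuming the Parser API as in current Tika (parse() taking a ParseContext; older releases use a three-argument variant) and assuming only tika-core, the PDF parser, and its PDFBox dependencies are on the classpath. Instead of the auto-detect parser, the application wires up exactly the parser it needs:

    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfOnlyExtractor {

        /** Extracts plain text from a PDF using only the PDF parser,
         *  so no other parser libraries have to be on the classpath. */
        public static String extract(InputStream pdf) throws Exception {
            Parser parser = new PDFParser();            // only PDFBox needed
            BodyContentHandler handler = new BodyContentHandler();
            parser.parse(pdf, handler, new Metadata(), new ParseContext());
            return handler.toString();
        }
    }

If a document type is not wired up this way, the application can simply skip it instead of running into a ClassNotFoundException from the auto-detect parser.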
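P.P.S.: And a rough sketch of the proposed sx2ml pipeline, using only the XSLT/SAX machinery shipped with Java 1.5 (javax.xml.transform). The stylesheet path and name below are placeholders; the real stylesheets would be the ones from http://xml.openoffice.org/sx2ml/, packed into a JAR. The intermediate (X)HTML cleanup stage is left out to keep the example short; its ContentHandler would simply be chained in between:

    import java.io.InputStream;

    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamSource;

    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class Sx2mlPipeline {

        /** Parses the content.xml of an OpenDocument file and forwards the
         *  transformed (X)HTML as SAX events to the given handler. */
        public static void parse(InputStream contentXml, ContentHandler textHandler)
                throws Exception {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();

            // Load the XSL from the classpath; the path is a placeholder.
            InputStream xsl =
                Sx2mlPipeline.class.getResourceAsStream("/sx2ml/export.xsl");
            TransformerHandler toXhtml =
                factory.newTransformerHandler(new StreamSource(xsl));

            // The XSLT output (XHTML) goes on as SAX events to the next stage.
            toXhtml.setResult(new SAXResult(textHandler));

            // Drive the pipeline: parse content.xml into the XSLT handler.
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(toXhtml);
            reader.parse(new InputSource(contentXml));
        }
    }

In Tika, textHandler could then be the same XHTML-to-text handler that the other parsers already emit their events to.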