> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware
> and standard interface to parsing libraries, metadata representation,
> mime extraction frameworks and content analysis mechanisms.

I am OK with this, but I would like a simple way to configure parsers and
to plug them in or out together with their complete dependencies. If you
write a project that uses only the PDF and HTML parsers, it makes no sense
to pollute your classpath with all the other libraries. If it were
possible to correctly determine which library and parser is needed for
which document type, there should be a way of switching the other parsers
completely off (so that no ClassNotFoundExceptions are thrown when the
auto-detect parser hits an unsupported document type).
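Something like this is what I have in mind. It is only a rough sketch
against a newer Tika API (AutoDetectParser taking an explicit parser list,
Parser.parse taking a ParseContext), so the exact classes and signatures
are assumptions on my side, not a description of the current release:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfHtmlOnly {
    public static void main(String[] args) throws Exception {
        // Register only the two parsers this project actually depends on,
        // so no other parser class (and none of its libraries) is loaded.
        Parser parser = new AutoDetectParser(new PDFParser(), new HtmlParser());

        InputStream in = new FileInputStream(args[0]);
        try {
            BodyContentHandler text = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // The point: an unsupported type must not end in a
            // ClassNotFoundException for a parser whose library is missing.
            parser.parse(in, text, metadata, new ParseContext());
            System.out.println(text.toString());
        } finally {
            in.close();
        }
    }
}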

My problem with these highly sophisticated parser libraries outside of
Tika is classpath pollution, especially because Tika ships one
standalone.jar: most people will use it (because they do not know how to
do it any other way), but they do not know what is inside it. If they knew
the correct Maven command, they could use the Tika-only JAR together with
the external parser JARs. I had to ask Jukka how to get the original
dependency JARs onto my classpath. Without a binary Tika release that
provides all JARs as separate files, it is not simple to use Tika without
the danger of polluting your classpath. And because it often takes a long
time until Tika replaces old dependencies with newer ones (like the
ubiquitous dependencies on XML parsers, JDOM, ...) and the versions
conflict, resolving them is a horror (especially with the standalone.jar).
For me it was a problem to work with the old NekoHTML, because my projects
needed a newer version than the one supplied with Tika.

For me the biggest horror was Tomcat before version 6, which shipped with
very old versions of XML parsers and other Commons tools in one big JAR
file. Since then I have switched completely to the lightweight Jetty,
which comes as only three separate JAR files.

Because of this, I wrote the OpenDocument parser myself. The external
parser library Jukka proposed in one of the issues was not usable, because
it only offered a blown-up DOM-tree representation of the XML structure of
OpenDocument files, which no parser could use without a lot of extra work.

Another good approach would be this "library" for Tika:
http://xml.openoffice.org/sx2ml/ [for Tika it would only add an XSLT
dependency to the classpath (already shipped with Java 1.5)]. The parsing
would be a pipeline: sx2ml -(SAX events)-> (X)HTML parser -(SAX events)->
text (the extra XHTML stage is there to clean up the heavily
style-annotated HTML). The XSLs could be put into one JAR and loaded with
Class.getResourceAsStream(). If you like that better, I can try to rewrite
the OpenDocument parser using it, because it is an officially supported
part of the OpenOffice community.
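To make the pipeline concrete, here is a minimal sketch using only
javax.xml.transform and SAX from the JDK. The stylesheet path
/stylesheets/sx2ml.xsl and the tiny text-collecting handler are just
assumptions of mine; in Tika the second stage would of course be the real
(X)HTML parser instead of this handler:

import java.io.InputStream;

import javax.xml.transform.Transformer;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;

import org.xml.sax.helpers.DefaultHandler;

public class Sx2mlPipeline {

    /** Stand-in for the second pipeline stage: keeps only character data. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    public static String extractText(InputStream contentXml) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();

        // Load the sx2ml stylesheet from a JAR on the classpath
        // (the resource path is an assumption).
        InputStream xsl =
                Sx2mlPipeline.class.getResourceAsStream("/stylesheets/sx2ml.xsl");
        TransformerHandler sx2ml =
                factory.newTransformerHandler(new StreamSource(xsl));

        // Wire the generated (X)HTML SAX events into the next stage.
        TextOnlyHandler textHandler = new TextOnlyHandler();
        sx2ml.setResult(new SAXResult(textHandler));

        // Push the OpenDocument content.xml through the pipeline.
        Transformer identity = factory.newTransformer();
        identity.transform(new StreamSource(contentXml), new SAXResult(sx2ml));

        return textHandler.text.toString();
    }
}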

I hope this makes my point clear.

Uwe
