Just to add my lurker's thoughts to this thread, for what it's worth... Nearly all of the issues raised in this thread (and in the other one I've been following on Dublin Core) are to me appropriate to a "middlewarish" metadata framework. Some of the corners that folks are pushing are what I hoped to find when I started playing with Tika, and I'm glad to see the discussion, but I'm a little pessimistic at the same time.
In general I feel a metadata framework should support two things: a handy default configuration with ready parsers for folks just doing a quick-and-dirty CM system, and a very granular, resource-efficient configuration of exactly the parsers you need. I'd even go further and say there should be a common interface for configuring individual parsers to extract only the metadata keys you need. I think the framework should have its own set of project-managed parsers, to be extended as the project matures, plus facilities to wrap external parsers; it isn't complete if it excludes either, but any user's configuration should be able to include or exclude whatever parsers that user likes, easily.
The framework I need should negotiate, report and help avoid collisions in both standard (e.g. Dublin Core) and vendor-defined metadata keys. It should support use of a downstream parser to "override" as well as extend the behavior of an upstream parser. It should support user-defined (not just parser-defined) namespaces, so that more than one parser can process the same file without overwriting each other's data for commonly named keys. It should also support "synthetic" parsers that take upstream metadata and synthesize new metadata, or perhaps simply inject user-defined metadata such as keywords, a processing date, an expiration date or similar. All of these requirements are implemented in a DAM product line I've worked on and are driven by modifiability use cases from the real world.
The middlewarish metadata framework should also have a story for establishing multiple distinct metadata-parsing "engine" instances, so that a single CM or DAM system could supply instances specific to different organizational departments or workflows. For example, Creative might need a completely different set of metadata to search on for a parsed PDF than Legal would, even though the assets are stored in the same ECM system.
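To make the namespace, override and synthetic-parser requirements a bit more concrete, here is a minimal sketch of what I have in mind. To be clear, none of this is Tika API: the `Stage` interface, the `run` method and the "namespace:key" naming convention are all hypothetical names invented for illustration, and a real design would need collision reporting and typed values on top of this.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch only; these types are assumptions, not part of Tika. */
final class NamespacedMetadataSketch {

    /** One step in the chain: a real parser, a wrapped external parser, or a synthetic one. */
    public interface Stage {
        /** Namespace assigned by the user's configuration, not by the parser itself. */
        String namespace();

        /** Receives a read-only view of upstream metadata and returns the keys this stage adds. */
        Map<String, String> contribute(Map<String, String> upstream);
    }

    /** Convenience factory so that stages can be written as lambdas. */
    public static Stage stage(String namespace,
                              Function<Map<String, String>, Map<String, String>> fn) {
        return new Stage() {
            @Override public String namespace() { return namespace; }
            @Override public Map<String, String> contribute(Map<String, String> upstream) {
                return fn.apply(upstream);
            }
        };
    }

    /**
     * Runs the stages in order. Keys are stored as "namespace:key", so two stages in
     * different namespaces never overwrite each other, while a later stage in the
     * same namespace deliberately overrides an earlier value.
     */
    public static Map<String, String> run(List<Stage> stages) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Stage s : stages) {
            Map<String, String> produced = s.contribute(Collections.unmodifiableMap(merged));
            for (Map.Entry<String, String> e : produced.entrySet()) {
                merged.put(s.namespace() + ":" + e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    /** Two "parsers" emitting the same key, a synthetic stage, and a downstream override. */
    public static Map<String, String> demo() {
        Stage creative = stage("creative", up -> Map.of("title", "Spring Campaign"));
        Stage legal = stage("legal", up -> Map.of("title", "Contract 12"));
        // Synthetic stage: builds new metadata purely from what upstream stages produced.
        Stage synthetic = stage("user", up ->
                Map.of("summary", up.get("creative:title") + " / " + up.get("legal:title")));
        // Downstream stage in the same namespace overrides the upstream value.
        Stage legalOverride = stage("legal", up -> Map.of("title", "Contract 12 (final)"));
        return run(List.of(creative, legal, synthetic, legalOverride));
    }

    public static void main(String[] args) {
        demo().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Both "parsers" here emit a `title` key, and because the namespaces come from the user's configuration, neither clobbers the other. Separate stage lists could likewise be held per department or per batch, which is roughly the "multiple engine instances" idea.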
It's also not uncommon for a customer in a DAM workflow to set up a specific set of parsing preferences for a -single- batch of files to be processed. Finally, it should be an object-oriented interface that hands your Java code an Object (some simple Map of key-values) which can then readily be converted to XML (preferably via JAXB) or whatever else is needed, possibly with some optional framework-supplied transformations downstream from this purely structural XML to other formats. In other words, XHTML should be an option for transformed output for those who need XHTML; it should not be the default output of the framework.
All of this to me points to basic design and architecture issues, not to incremental improvement or enhancement. Unfortunately, at this stage in Tika I'm not sure that fundamental changes in basic design are possible. As stated in How the ASF works<http://www.apache.org/foundation/how-it-works.html#incubator>, "the friction that is developed during the initial design stage is likely to fragment the community." That's probably also true if one were to propose a major redesign after the initial stage. So I'm not sure it's possible to radically change Tika's design at this point to meet my needs; more likely I'll use it opportunistically to find parsers I don't know about, or just to steal a bit of code here and there.
Perhaps this is the distinction between a "toolkit" and a "framework" - Tika definitely seems more like the former than the latter to me. But maybe others have a clearer vision of how to do things like this with an evolving Tika. Also, perhaps others on the list are happy with the use cases that Tika currently satisfies; I don't mean to slight the project - I'm sure it's meeting the needs of many. Hopefully this feedback is of some constructive use to the community; I've been keeping a lid on these concerns for a while, but the current threads lured me out.
- Chris