On Mon, Dec 8, 2008 at 9:58 PM, Christopher Corbell <[EMAIL PROTECTED]> wrote:
> Just to add my lurker's thoughts to this thread, for what it's worth...
>
> Nearly all of the issues raised in this thread (and in the other one I've
> been following on Dublin Core) are to me appropriate to a "middlewarish"
> metadata framework. Some of the corners that folks are pushing are what I
> hoped to find when I started playing with Tika, and I'm glad to see the
> discussion, but I'm a little pessimistic at the same time.
>
> In general I feel a metadata framework should support both a handy default
> configuration with ready parsers for folks just doing a quick-and-dirty CM
> system, and it should support very granular and resource-efficient
> configuration of exactly the parsers you need - and I'd even go further and
> say there should be a common interface to configure individual parsers to
> only get the individual metadata keys that you need.
>
> I think the framework should have its own set of project-managed parsers to
> be extended as the project matures, and facilities to wrap external parsers;
> it isn't complete if it excludes either, but any user's configuration should
> be able to include or exclude whatever parsers you like easily.
>
> The framework I need should negotiate, report and help avoid collisions in
> both standard (e.g. Dublin Core) and vendor-defined metadata keys. It
> should support use of a downstream parser to "override" as well as extend
> behavior of an upstream parser. It should support user-defined (not just
> parser-defined) namespaces to permit more than one parser to process the
> same file without overwriting each other's data for commonly named keys. It
> should also support "synthetic" parsers that take upstream metadata and
> synthesize new metadata, or perhaps simply inject user-defined metadata such
> as keywords, processing date, an expiration date or similar. All of these
> requirements are implemented in a DAM product line I've worked on and are
> driven by modifiability use cases from the real world.
>
> The middlewarish metadata framework should have a story for establishing
> multiple distinct metadata-parsing "engine" instances so that for example a
> single CM or DAM system could supply instances specific to different
> organizational departments or workflows; for example Creative might need a
> completely different set of metadata to search on for a parsed PDF than
> Legal would need, but the assets are being stored in the same ECM system.
> It's also not uncommon for a customer in a DAM workflow to set up a specific
> set of parsing preferences for a -single- batch of files to be processed.
>
> Finally, it should be an object-oriented interface which hands your Java
> code an Object (some simple Map of key-values) that can then readily be
> converted to XML (preferably via JAXB) or whatever else is needed, possibly
> with some optional framework-supplied transformations downstream from this
> purely structural XML to other formats. In other words, XHTML should be an
> option for transformed output for those who need XHTML; it should not be the
> default output of the framework.
>
> All of this to me points to basic design and architecture issues, not to
> incremental improvement or enhancement. Unfortunately at this stage in Tika
> I'm not sure that fundamental changes in basic design are possible. As
> stated in How the ASF works
> <http://www.apache.org/foundation/how-it-works.html#incubator>,
> "the friction that is developed during the initial design stage is likely to
> fragment the community." That's probably also true if one were to propose a
> major non-initial redesign stage.
>
> So I'm not sure it's possible to radically change Tika design at this point
> to meet my needs; more likely I'll use it opportunistically to find parsers
> I don't know about or just to steal a bit of code here and there. Perhaps
> this is the distinction between a "toolkit" and a "framework" - Tika
> definitely seems more like the former than the latter to me. But maybe
> others have a clearer vision of how to do things like this with an evolving
> Tika.
>
> Also perhaps others on the list are happy with the use cases that Tika
> currently satisfies; I don't mean to slight the project - I'm sure it's
> meeting the needs of many. Hopefully this feedback has some constructive
> use to the community; I've been keeping a lid on these concerns for a while
> but current threads lured me out.

Nothing ever changes without people stepping forward to first propose/discuss and then following up with actual contributions - so if you give up without trying, then your pessimistic outcome is assured. Perhaps you're right and your proposals won't be accepted - but give it a try at least. I would suggest picking one concrete proposal - discuss it first, but be prepared to back it up with code/patches - and see how that goes.

Niall (fellow lurker)

> - Chris