Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Christopher Corbell Mon, 08 Dec 2008 13:58:58 -0800

Just to add my lurker's thoughts to this thread, for what it's worth...

Nearly all of the issues raised in this thread (and in the other one I've
been following on Dublin Core) are to me appropriate to a "middlewarish"
metadata framework.  Some of the corners that folks are pushing are what I
hoped to find when I started playing with Tika, and I'm glad to see the
discussion, but I'm a little pessimistic at the same time.


In general I feel a metadata framework should support both a handy default
configuration with ready parsers for folks just doing a quick-and-dirty CM
system and a very granular, resource-efficient configuration of exactly the
parsers you need. I'd even go further and say there should be a common
interface to configure individual parsers to only get the individual
metadata keys that you need.
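
A rough sketch of the per-key configuration idea, to make it concrete.
Everything here is hypothetical: FilteringParser and ToyParser are made-up
names, not Tika API, and the three "metadata keys" are stand-ins for whatever
a real parser would extract.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical interface (not part of Tika): the caller names the keys it wants.
interface FilteringParser {
    Map<String, String> parse(String content, Set<String> wantedKeys);
}

// Toy parser that knows three keys but only computes the ones requested.
final class ToyParser implements FilteringParser {
    @Override
    public Map<String, String> parse(String content, Set<String> wantedKeys) {
        Map<String, String> out = new LinkedHashMap<>();
        if (wantedKeys.contains("title")) {
            // Treat the first line as the title.
            out.put("title", content.split("\n", 2)[0]);
        }
        if (wantedKeys.contains("length")) {
            out.put("length", Integer.toString(content.length()));
        }
        if (wantedKeys.contains("lineCount")) {
            out.put("lineCount", Integer.toString(content.split("\n", -1).length));
        }
        return out;
    }
}
```

The point of passing the key set into parse() rather than filtering the
result afterwards is that an expensive extraction can be skipped entirely
when nobody asked for it.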

I think the framework should have its own set of project-managed parsers to
be extended as the project matures, and facilities to wrap external parsers;
it isn't complete if it excludes either, but any user's configuration should
be able to easily include or exclude whatever parsers that user likes.

The framework I need should negotiate, report and help avoid collisions in
both standard (e.g. Dublin Core) and vendor-defined metadata keys.  It
should support use of a downstream parser to "override" as well as extend
behavior of an upstream parser.  It should support user-defined (not just
parser-defined) namespaces to permit more than one parser to process the
same file without overwriting each other's data for commonly named keys. It
should also support "synthetic" parsers that take upstream metadata and
synthesize new metadata, or perhaps simply inject user-defined metadata such
as keywords, processing date, an expiration date or similar.  All of these
requirements are implemented in a DAM product line I've worked on and are
driven by modifiability use cases from the real world.
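
To illustrate the namespace and "synthetic" parser points, here is a minimal
sketch. It is not how the DAM product mentioned above does it, and none of
these types exist in Tika; NamespacedMetadata, ProcessingDateStamper and the
"namespace:key" storage scheme are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container: keys are qualified by a user-chosen namespace.
final class NamespacedMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    // Adds a value; returns false and keeps the existing value on a collision,
    // so the caller can report it instead of silently overwriting.
    boolean put(String namespace, String key, String value) {
        return values.putIfAbsent(namespace + ":" + key, value) == null;
    }

    // Explicit override, for a downstream parser that is meant to win.
    void override(String namespace, String key, String value) {
        values.put(namespace + ":" + key, value);
    }

    String get(String namespace, String key) {
        return values.get(namespace + ":" + key);
    }
}

// Hypothetical "synthetic" step: it parses nothing, it only injects metadata.
final class ProcessingDateStamper {
    static void apply(NamespacedMetadata metadata, String isoDate) {
        metadata.put("user", "processingDate", isoDate);
    }
}
```

With this shape, two parsers can both report a "title" for the same file
under different namespaces, and a collision within one namespace shows up as
a false return value rather than as lost data.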

The middlewarish metadata framework should have a story for establishing
multiple distinct metadata-parsing "engine" instances so that for example a
single CM or DAM system could supply instances specific to different
organizational departments or workflows; for example Creative might need a
completely different set of metadata to search on for a parsed PDF than
Legal would need, but the assets are being stored in the same ECM system.
It's also not uncommon for a customer in a DAM workflow to set up a specific
set of parsing preferences for a -single- batch of files to be processed.
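
One way the multiple-engine idea could look, as a sketch only. MetadataEngine
is a hypothetical class, and modelling a parser as a plain function from
content to a key-value map is a simplifying assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical engine: an independent instance owning its own extractor list.
final class MetadataEngine {
    private final List<Function<String, Map<String, String>>> extractors;

    MetadataEngine(List<Function<String, Map<String, String>>> extractors) {
        this.extractors = extractors;
    }

    // Runs the extractors in order; a later extractor overwrites an earlier
    // one that produced the same key.
    Map<String, String> process(String content) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Function<String, Map<String, String>> extractor : extractors) {
            merged.putAll(extractor.apply(content));
        }
        return merged;
    }
}
```

A Creative engine and a Legal engine would then just be two MetadataEngine
instances built from different lists, and a one-off batch could get a
throwaway instance of its own.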

Finally, it should be an object-oriented interface which hands your Java
code an Object (some simple Map of key-values) that can then readily be
converted to XML (preferably via JAXB) or whatever else is needed, possibly
with some optional framework-supplied transformations downstream from this
purely structural XML to other formats.  In other words, XHTML should be an
option for transformed output for those who need XHTML; it should not be the
default output of the framework.
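
As a small illustration of "Map first, XML as a downstream step": the sketch
below hand-rolls the serialization instead of using JAXB so that it stays
self-contained, and the MetadataXml class and the metadata/entry element
names are invented for the example.

```java
import java.util.Map;

// Hypothetical helper: turns a plain key-value map into purely structural XML.
final class MetadataXml {
    // Escapes the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toXml(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("<metadata>");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<entry key=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</entry>");
        }
        return sb.append("</metadata>").toString();
    }
}
```

The caller's code only ever sees the Map; XML, and anything derived from it
such as XHTML, is one optional transformation among several.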

All of this to me points to basic design and architecture issues, not to
incremental improvement or enhancement.  Unfortunately at this stage in Tika
I'm not sure that fundamental changes in basic design are possible. As
stated in "How the ASF works"
<http://www.apache.org/foundation/how-it-works.html#incubator>,
"the friction that is developed during the initial design stage is likely to
fragment the community."  That's probably also true if one were to propose a
major non-initial redesign stage.

So I'm not sure it's possible to radically change Tika design at this point
to meet my needs; more likely I'll use it opportunistically to find parsers
I don't know about or just to steal a bit of code here and there.  Perhaps
this is the distinction between a "toolkit" and a "framework" - Tika
definitely seems more like the former than the latter to me.  But maybe
others have a clearer vision of how to do things like this with an evolving
Tika.

Also perhaps others on the list are happy with the use cases that Tika
currently satisfies; I don't mean to slight the project - I'm sure it's
meeting the needs of many.  Hopefully this feedback has some constructive
use to the community; I've been keeping a lid on these concerns for a while
but current threads lured me out.

- Chris
