On Mon, Dec 8, 2008 at 9:58 PM, Christopher Corbell
<[EMAIL PROTECTED]> wrote:
> Just to add my lurker's thoughts to this thread, for what it's worth...
>
> Nearly all of the issues raised in this thread (and in the other one I've
> been following on Dublin Core) are to me appropriate to a "middlewarish"
> metadata framework.  Some of the corners that folks are pushing are what I
> hoped to find when I started playing with Tika, and I'm glad to see the
> discussion, but I'm a little pessimistic at the same time.
>
> In general I feel a metadata framework should support both a handy default
> configuration with ready parsers for folks just doing a quick-and-dirty CM
> system, and it should support very granular and resource-efficient
> configuration of exactly the parsers you need - and I'd even go further and
> say there should be a common interface to configure individual parsers to
> only get the individual metadata keys that you need.
>
> I think the framework should have its own set of project-managed parsers to
> be extended as the project matures, and facilities to wrap external parsers;
> it isn't complete if it excludes either, but any user's configuration should
> be able to include or exclude whatever parsers you like easily.
>
> The framework I need should negotiate, report and help avoid collisions in
> both standard (e.g. Dublin Core) and vendor-defined metadata keys.  It
> should support use of a downstream parser to "override" as well as extend
> behavior of an upstream parser.  It should support user-defined (not just
> parser-defined) namespaces to permit more than one parser to process the
> same file without overwriting each other's data for commonly named keys. It
> should also support "synthetic" parsers that take upstream metadata and
> synthesize new metadata, or perhaps simply inject user-defined metadata such
> as keywords, processing date, an expiration date or similar.  All of these
> requirements are implemented in a DAM product line I've worked on and are
> driven by modifiability use cases from the real world.
>
> The middlewarish metadata framework should have a story for establishing
> multiple distinct metadata-parsing "engine" instances so that for example a
> single CM or DAM system could supply instances specific to different
> organizational departments or workflows; for example Creative might need a
> completely different set of metadata to search on for a parsed PDF than
> Legal would need, but the assets are being stored in the same ECM system.
> It's also not uncommon for a customer in a DAM workflow to set up a specific
> set of parsing preferences for a -single- batch of files to be processed.
>
> Finally, it should be an object-oriented interface which hands your Java
> code an Object (some simple Map of key-values) that can then readily be
> converted to XML (preferably via JAXB) or whatever else is needed, possibly
> with some optional framework-supplied transformations downstream from this
> purely structural XML to other formats.  In other words, XHTML should be an
> option for transformed output for those who need XHTML, it should not be the
> default output of the framework.
>
> All of this to me points to basic design and architecture issues, not to
> incremental improvement or enhancement.  Unfortunately at this stage in Tika
> I'm not sure that fundamental changes in basic design are possible. As
> stated in How the ASF
> works<http://www.apache.org/foundation/how-it-works.html#incubator>,
> "the friction that is developed during the initial design stage is likely to
> fragment the community."  That's probably also true if one were to propose a
> major non-initial redesign stage.
>
> So I'm not sure it's possible to radically change Tika design at this point
> to meet my needs; more likely I'll use it opportunistically to find parsers
> I don't know about or just to steal a bit of code here and there.  Perhaps
> this is the distinction between a "toolkit" and a "framework" - Tika
> definitely seems more like the former than the latter to me.  But maybe
> others have a clearer vision of how to do things like this with an evolving
> Tika.
>
> Also perhaps others on the list are happy with the use cases that Tika
> currently satisfies; I don't mean to slight the project - I'm sure it's
> meeting the needs of many.  Hopefully this feedback has some constructive
> use to the community; I've been keeping a lid on these concerns for a while
> but current threads lured me out.

Nothing ever changes without people stepping forward to first
propose/discuss and then following up with actual contributions - so
if you give up without trying, then your pessimistic outcome is
assured. Perhaps you're right and your proposals won't be accepted -
but give it a try at least. I would suggest picking one concrete
proposal - discuss it first, but be prepared to back it up with
code/patches - and see how that goes.

Niall (fellow lurker)

> - Chris
>
