Thanks for all the replies; they clarified a number of points for me.

> You are not reinventing the wheel. We only pull out what users have requested.

This is the most important confirmation for me, thanks.

> I’ve toyed w pulling out more than we do, but haven’t found enough interest to pursue it.

It makes sense. After the first version of my code, following John's advice, I refactored it and ended up with everything in a couple of classes: the first is a plain SAX handler, the second a simple rule evaluator that reads a config file in which XPath expressions are mapped to metadata item names. In this way I can fill a Metadata with whatever I need, from the text content of the XMP elements or attributes. Now, Tika's contribution in this part is basically only the Metadata structure, so I focused on it. It's "flattened", so I thought about whether this feature is a plus, neutral, or a problem.
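A minimal sketch of the rule-evaluator idea, not the actual code: for brevity it uses DOM plus javax.xml.xpath instead of a SAX handler, a plain Map stands in for the config file, and a Map<String, List<String>> stands in for Tika's Metadata. The class name, the rule names ("lens", "focalLength") and the sample packet are assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class: maps XPath expressions to flat metadata item names.
class XmpRuleEvaluator {
    // Prefix -> namespace URI bindings that the XPath rules may use.
    private static final Map<String, String> NAMESPACES = Map.of(
        "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "exif", "http://ns.adobe.com/exif/1.0/",
        "aux", "http://ns.adobe.com/exif/1.0/aux/");

    // Made-up sample packet, only for the demo in main().
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
        + "<rdf:Description rdf:about=''"
        + " xmlns:aux='http://ns.adobe.com/exif/1.0/aux/'"
        + " xmlns:exif='http://ns.adobe.com/exif/1.0/'"
        + " aux:Lens='50mm f/1.8'>"
        + "<exif:FocalLength>50/1</exif:FocalLength>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    // rules: XPath expression -> metadata item name.
    static Map<String, List<String>> extract(String xmp, Map<String, String> rules)
            throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed XPath steps
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xmp)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }
            // Reverse lookups are not needed for evaluating expressions here.
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) { return null; }
        });

        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            NodeList nodes = (NodeList) xpath.evaluate(
                rule.getKey(), doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // Works for both elements (text content) and attributes (value).
                String value = nodes.item(i).getTextContent().trim();
                if (!value.isEmpty()) {
                    metadata.computeIfAbsent(rule.getValue(), k -> new ArrayList<>())
                            .add(value);
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("//rdf:Description/@aux:Lens", "lens");
        rules.put("//rdf:Description/exif:FocalLength", "focalLength");
        System.out.println(extract(SAMPLE, rules));
    }
}
```

With Tika on the classpath, the inner loop would presumably call Metadata.add(name, value) instead of filling the map; the result is flat either way.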

My first need was to write a simple tool to perform a consistency check on my photo metadata, in which I had messed up the lens names and then fixed them manually - basically I needed to check whether the focal length was compatible with the lens name. For this task the flattened structure of Tika Metadata was a plus, allowing me to accomplish the task with a few lines of code.
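The kind of check described above could look roughly like this. It is only a sketch under assumptions: lens names are assumed to contain their focal range as "50mm" or "18-55mm", the focal length is assumed to be already available as a number in millimetres, and the class name and the 0.5 mm tolerance are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks a focal length against the range in a lens name.
class LensConsistencyCheck {
    // Matches "50mm", "18-55mm" or "18 - 55 mm" (assumed naming convention).
    private static final Pattern FOCAL_RANGE = Pattern.compile(
        "(\\d+(?:\\.\\d+)?)(?:\\s*-\\s*(\\d+(?:\\.\\d+)?))?\\s*mm");

    // Arbitrary slack for rounding in the recorded focal length.
    private static final double TOLERANCE_MM = 0.5;

    // Returns empty when no focal range can be read from the lens name.
    static Optional<Boolean> isConsistent(String lensName, double focalLengthMm) {
        Matcher m = FOCAL_RANGE.matcher(lensName);
        if (!m.find()) {
            return Optional.empty();
        }
        double min = Double.parseDouble(m.group(1));
        // A prime lens has a single focal length, so max equals min.
        double max = m.group(2) != null ? Double.parseDouble(m.group(2)) : min;
        return Optional.of(focalLengthMm >= min - TOLERANCE_MM
                        && focalLengthMm <= max + TOLERANCE_MM);
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("Nikkor 18-55mm f/3.5-5.6", 35.0));
        System.out.println(isConsistent("50mm f/1.8", 85.0));
    }
}
```

Because the metadata is flat, the two inputs are just two lookups by name, which is why this use case fits Tika's Metadata so well.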

But my next step is to store XMP metadata in a semantic triple store... Given that RDF is the common ground between XMP and triple stores, passing through Tika Metadata doesn't make sense. It can't even support metadata that is structured by nature: Jempbox has support for things such as history, but e.g. the Photo Supreme DAM uses a specific schema for its hierarchical keywords that also includes attributes (e.g. you can have a keyword that refers to a mountain and have its GPS coordinates too; you can even have relationships between keywords, so it's more of a graph than a simple tree).
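To illustrate what gets lost by flattening, here is a small sketch of the mountain keyword as triples. Everything in it is hypothetical: the "kw:" and "ex:" names are placeholders and not the Photo Supreme schema, the coordinates are approximate, and a real triple store would use an RDF library rather than a record. It needs Java 16 or later.

```java
import java.util.List;

// Hypothetical sketch: a keyword with attributes and relationships as triples.
class KeywordGraphSketch {
    record Triple(String subject, String predicate, String object) {}

    // Placeholder identifiers and approximate values, for illustration only.
    static List<Triple> exampleGraph() {
        return List.of(
            new Triple("kw:MontBlanc", "ex:label", "Mont Blanc"),
            new Triple("kw:MontBlanc", "ex:latitude", "45.83"),
            new Triple("kw:MontBlanc", "ex:longitude", "6.86"),
            new Triple("kw:MontBlanc", "ex:partOf", "kw:Alps"),
            new Triple("kw:Alps", "ex:label", "Alps"));
    }

    // All objects for a given subject and predicate.
    static List<String> objectsOf(List<Triple> graph, String subject, String predicate) {
        return graph.stream()
            .filter(t -> t.subject().equals(subject) && t.predicate().equals(predicate))
            .map(Triple::object)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(objectsOf(exampleGraph(), "kw:MontBlanc", "ex:partOf"));
    }
}
```

In a flat name/value structure the two labels and the coordinates would end up as unrelated entries, with nothing saying which keyword the coordinates belong to; the subject of each triple is exactly the information that is lost.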

OTOH Tika satisfies my requirements for JPEGs, so I will incorporate it in another project. I think I'll also use it for music and video, though I'll test that later.

In the end this is consistent with the other users' expectations about XMP, as you said. Rather than its being textual, what makes XMP so different is the possible complexity of the data structure _and_ the kind of use you might want to make of it...

As far as JPEG is concerned, Tika fits my needs perfectly. Music and video: I'll test later, but I think it will work well too.

On 22/08/21 21:25, Tim Allison wrote:
Other point on xmps and Tika… xmp can contain jpegs and other binary formats. So it makes sense to handle these in the Tika framework.

On Sun, Aug 22, 2021 at 3:15 PM Tim Allison wrote:

    You are not reinventing the wheel. We only pull out what users
    have requested. I’ve toyed w pulling out more than we do, but
    haven’t found enough interest to pursue it.

    I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll
    try to dig up a link when I’m back to a keyboard.

    As for an earlier point on this thread (not made by you) that Tika
    is only for binary formats, I strongly disagree at least for XMP.
    XMP is integral to pdf and psd and as standalone sidecar. We
    should normalize and extract what we can. Obv if you have custom
    needs, yes, break out your own xml parser, but we should do better
    in Tika.

--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog
