Re: Tika for parsing raw XMP

Fabrizio Giudici Tue, 24 Aug 2021 04:34:25 -0700

Thanks for all the replies, which clarified a number of points for me.

> You are not reinventing the wheel. We only pull out what users have requested.


This is the most important confirmation for me, thanks.

> I’ve toyed w pulling out more than we do, but haven’t found enough interest to pursue it.

It makes sense. After the first version of my code, following John's advice I refactored it and ended up with everything in a couple of classes: the first a plain SAX handler, the second a simple rule evaluator that reads a config file where XPath expressions are mapped to metadata item names. This way I can fill a Metadata with whatever I need, from the text content of XMP elements or from attributes. Now, Tika's contribution in this part is basically only the Metadata structure, so I focused on it. It's "flattened", so I thought about whether this feature is a plus, neutral, or a problem.
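For illustration only, here is a minimal sketch of the rule-evaluator idea described above. It is not the code from this thread: it evaluates the XPath expressions with the JDK's `javax.xml.xpath` over a DOM instead of a SAX handler, and it fills a plain `Map` as a stand-in for Tika's `Metadata`. The class name, the rule file layout (`itemName = xpath`), the item names, and the sample XMP packet are all assumptions.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmpRuleSketch {
    // Prefix bindings that the rules may use (assumed set; extend as needed).
    static final Map<String, String> NAMESPACES = Map.of(
        "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "exif", "http://ns.adobe.com/exif/1.0/",
        "aux", "http://ns.adobe.com/exif/1.0/aux/");

    // Hypothetical rule file: metadata item name on the left, XPath on the right.
    static final String SAMPLE_RULES =
        "focalLength = //rdf:Description/@exif:FocalLength\n"
        + "lens = //rdf:Description/aux:Lens\n";

    // Hypothetical XMP packet, reduced to two properties.
    static final String SAMPLE_XMP =
        "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
        + "<rdf:Description rdf:about=''"
        + " xmlns:exif='http://ns.adobe.com/exif/1.0/'"
        + " xmlns:aux='http://ns.adobe.com/exif/1.0/aux/'"
        + " exif:FocalLength='3000/10'>"
        + "<aux:Lens>300.0 mm f/4.0</aux:Lens>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    /** Applies every rule to the XMP text and returns a flat name-to-value map. */
    static Map<String, String> extract(String xmp, String ruleText) {
        try {
            Properties rules = new Properties();
            rules.load(new StringReader(ruleText));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xmp)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return NAMESPACES.getOrDefault(prefix, "");
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });

            Map<String, String> flat = new TreeMap<>();
            for (String name : rules.stringPropertyNames()) {
                // String value of the first matching node, or "" when nothing matches.
                String value = xpath.evaluate(rules.getProperty(name), doc).trim();
                if (!value.isEmpty()) {
                    flat.put(name, value);
                }
            }
            return flat;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot evaluate XMP rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE_XMP, SAMPLE_RULES));
    }
}
```

Because each rule yields a single string, the result is flat by construction, which is the same property of Tika's Metadata discussed here.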

My first need was to write a simple tool to perform a consistency check on my photo metadata, where I had messed up the lens names and fixed them manually - basically I needed to check whether the focal length was compatible with the lens name. For this task the flattened structure of Tika Metadata was a plus, allowing me to accomplish the task with a few lines of code.
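As a rough illustration of that kind of check (not the actual tool), the sketch below works on a flat map of strings standing in for Tika's Metadata. The keys `lens` and `focalLength`, the assumption that the lens name contains its focal range in millimetres (e.g. "70-200mm"), and the 0.5 mm tolerance are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LensConsistencySketch {
    // Matches "300 mm", "300.0mm" or "70-200mm" inside a lens name.
    private static final Pattern FOCAL_RANGE = Pattern.compile(
        "(\\d+(?:\\.\\d+)?)(?:\\s*-\\s*(\\d+(?:\\.\\d+)?))?\\s*mm",
        Pattern.CASE_INSENSITIVE);

    private static final double TOLERANCE_MM = 0.5; // assumed rounding slack

    /** Parses a focal length written as a decimal ("85") or as a rational ("850/10"). */
    static double parseFocalLength(String raw) {
        String[] parts = raw.trim().split("/");
        double numerator = Double.parseDouble(parts[0]);
        return parts.length == 2 ? numerator / Double.parseDouble(parts[1]) : numerator;
    }

    /**
     * Returns false only when both values are present, the lens name carries a
     * focal range, and the focal length falls outside that range.
     */
    static boolean isConsistent(Map<String, String> metadata) {
        String lens = metadata.get("lens");
        String focal = metadata.get("focalLength");
        if (lens == null || focal == null) {
            return true; // nothing to compare
        }
        Matcher m = FOCAL_RANGE.matcher(lens);
        if (!m.find()) {
            return true; // lens name does not state a focal range
        }
        double min = Double.parseDouble(m.group(1));
        double max = m.group(2) != null ? Double.parseDouble(m.group(2)) : min;
        double actual = parseFocalLength(focal);
        return actual >= min - TOLERANCE_MM && actual <= max + TOLERANCE_MM;
    }

    public static void main(String[] args) {
        Map<String, String> photo = Map.of("lens", "AF-S Nikkor 70-200mm f/2.8", "focalLength", "240/10");
        System.out.println(isConsistent(photo) ? "consistent" : "mismatch");
    }
}
```

With a flat key/value structure the whole check reduces to two lookups and a comparison, which is why flattening helps for this task.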

But my next step is to store XMP metadata in a semantic triple store... Given that RDF is the common ground between XMP and triple stores, passing through Tika Metadata doesn't make sense. It can't even support metadata that is structured by nature. Jempbox has support for things such as history, but e.g. the Photo Supreme DAM uses a specific schema for its hierarchical keywords that also includes attributes (e.g. you can have a keyword that refers to a mountain and have its GPS coordinates too; you can even have relationships between keywords, so it's more of a graph than a simple tree).

OTOH Tika satisfies my requirements for JPEGs, so I will incorporate it in another project. I think I'll use it also for music and video, even though I'll test it later.

In the end this is consistent with the other users' expectations about XMP, as you said. Rather than the fact that it is textual, what makes XMP so different is the possible complexity of the data structure _and_ the kind of use you might want to make of it...

As for JPEG, Tika fits my needs perfectly. Music and video: I'll test later, but I think it will be good as well.


On 22/08/21 21:25, Tim Allison wrote:

Other point on xmps and Tika… xmp can contain jpegs and other binary formats. So it makes sense to handle these in the Tika framework.

On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <[email protected]> wrote:


    You are not reinventing the wheel. We only pull out what users
    have requested. I’ve toyed w pulling out more than we do, but
    haven’t found enough interest to pursue it.

    I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll
    try to dig up a link when I’m back to a keyboard.

    As for an earlier point on this thread (not made by you) that Tika
    is only for binary formats, I strongly disagree at least for XMP.
    XMP is integral to pdf and psd and as standalone sidecar. We
    should normalize and extract what we can. Obv if you have custom
    needs, yes, break out your own xml parser, but we should do better
    in Tika.

--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - [email protected]
