Thanks for all the replies; they clarified a number of points for me.
> You are not reinventing the wheel. We only pull out what users have
> requested.
This is the most important confirmation for me, thanks.
> I’ve toyed w pulling out more than we do, but haven’t found enough
> interest to pursue it.
It makes sense. After the first version of my code, following John's
advice, I refactored it and ended up with everything in a couple of
classes: the first is a plain SAX handler, the second a simple rule
evaluator that reads a config file in which XPath expressions are
mapped to metadata item names. This way I can fill a Metadata with
whatever I need, taken from the text content of the XMP elements or
from their attributes. Now, Tika's contribution in this part is
basically only the Metadata structure, so I focused on it. It's
"flattened", so I wondered whether this feature is an advantage,
neutral, or a problem.
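To give an idea of the two classes, here is a minimal sketch. It is not
the actual code: the class names, the metadata item names, the sample
XMP and the rule format are invented for illustration. It assumes the
rules are plain absolute paths (a tiny subset of XPath), the config
file is replaced by an in-memory Map, and a plain Map stands in for
Tika's Metadata so that the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical rule evaluator: maps path expressions to metadata item names. */
class RuleEvaluator {
    private final Map<String, String> rules; // path expression -> metadata item name
    // Stand-in for Tika's Metadata (assumption, to avoid external dependencies).
    final Map<String, String> metadata = new LinkedHashMap<>();

    RuleEvaluator(Map<String, String> rules) { this.rules = rules; }

    /** Stores the value if some rule matches the given path exactly. */
    void offer(String path, String value) {
        String name = rules.get(path);
        if (name != null && !value.trim().isEmpty()) {
            metadata.put(name, value.trim());
        }
    }
}

/** Plain SAX handler: tracks the current element path, reports text and attributes. */
class XmpHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final RuleEvaluator evaluator;

    XmpHandler(RuleEvaluator evaluator) { this.evaluator = evaluator; }

    /** Builds "/root/child/..." from the element stack, root first. */
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = path.descendingIterator(); it.hasNext();) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.push(qName);
        text.setLength(0);
        String p = currentPath();
        for (int i = 0; i < atts.getLength(); i++) {
            evaluator.offer(p + "/@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        evaluator.offer(currentPath(), text.toString());
        text.setLength(0);
        path.pop();
    }
}

class XmpSketch {
    /** Parses the XMP packet and returns the items selected by the rules. */
    static Map<String, String> extract(String xmp, Map<String, String> rules) throws Exception {
        RuleEvaluator evaluator = new RuleEvaluator(rules);
        // The default factory is not namespace-aware, so qName keeps the prefix.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xmp)), new XmpHandler(evaluator));
        return evaluator.metadata;
    }

    public static void main(String[] args) throws Exception {
        String xmp = "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
            + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
            + "<rdf:Description xmlns:aux='http://ns.adobe.com/exif/1.0/aux/'"
            + " xmlns:exif='http://ns.adobe.com/exif/1.0/' aux:Lens='70-200mm f/4'>"
            + "<exif:FocalLength>135/1</exif:FocalLength>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        String base = "/x:xmpmeta/rdf:RDF/rdf:Description";
        Map<String, String> rules = Map.of(
            base + "/@aux:Lens", "lens",
            base + "/exif:FocalLength", "focalLength");
        System.out.println(extract(xmp, rules));
    }
}
```

Matching on the prefixed name is the simplest thing that works for a
sketch; a more robust version would be namespace-aware and match on
namespace URIs, since prefixes are not guaranteed to be stable.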
My first need was to write a simple tool to run a consistency check
on my photo metadata, where I had messed up the lens names and fixed
them manually: basically I needed to check whether the focal length was
compatible with the lens name. For this task the flattened structure of
Tika Metadata was a plus, allowing me to get the job done with a few
lines of code.
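The check itself can be sketched roughly as follows. This is an
illustration under assumptions, not the actual tool: it assumes the
lens name embeds its focal range as "50mm" or "70-200mm", that the
focal length arrives as a decimal or a rational such as "135/1", and
that a tolerance of 0.5 mm is acceptable. The class and method names
are hypothetical, and the two values are passed as plain strings
rather than read from a Tika Metadata.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LensCheck {
    // Matches "50mm", "70-200 mm", "70.0-200.0 mm" inside a lens name.
    private static final Pattern RANGE = Pattern.compile(
        "(\\d+(?:\\.\\d+)?)(?:\\s*-\\s*(\\d+(?:\\.\\d+)?))?\\s*mm",
        Pattern.CASE_INSENSITIVE);

    private static final double TOLERANCE_MM = 0.5; // assumed rounding slack

    /** Parses "135", "135.0 mm" or a rational such as "135/1". */
    static double parseFocal(String value) {
        String v = value.replace("mm", "").trim();
        int slash = v.indexOf('/');
        if (slash < 0) {
            return Double.parseDouble(v);
        }
        return Double.parseDouble(v.substring(0, slash).trim())
             / Double.parseDouble(v.substring(slash + 1).trim());
    }

    /**
     * Returns empty when the lens name carries no focal range,
     * otherwise whether the focal length falls inside that range.
     */
    static Optional<Boolean> isConsistent(String lensName, String focalLength) {
        Matcher m = RANGE.matcher(lensName);
        if (!m.find()) {
            return Optional.empty();
        }
        double min = Double.parseDouble(m.group(1));
        double max = m.group(2) != null ? Double.parseDouble(m.group(2)) : min;
        double f = parseFocal(focalLength);
        return Optional.of(f >= min - TOLERANCE_MM && f <= max + TOLERANCE_MM);
    }
}
```

With a flat key/value structure this boils down to two lookups followed
by one call such as LensCheck.isConsistent(lens, focalLength), which is
why the flattening helps here.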
But my next step is to store XMP metadata in a semantic triple store...
Given that RDF is the common ground between XMP and triple stores,
passing through Tika Metadata doesn't make sense. Tika Metadata can't
even represent metadata that is structured by nature. Jempbox has
support for things such as history, but e.g. the Photo Supreme DAM uses
a specific schema for its hierarchical keywords that also includes
attributes (e.g. you can have a keyword that refers to a mountain and
have its GPS coordinates too; you can even have relationships between
keywords, so it's more a graph thing than a simple tree).
OTOH Tika satisfies my requirements for JPEGs, so I will incorporate it
in another project. I think I'll also use it for music and video,
though I'll test that later.
In the end this is consistent with the other users' expectations about
XMP, as you said. Rather than the fact that it is textual, what makes
XMP so different is the possible complexity of the data structure _and_
the kind of use you might want to make of it...
To sum up: for JPEG, Tika fits my needs perfectly. Music and video I'll
test later, but I expect it to work well there too.
On 22/08/21 21:25, Tim Allison wrote:
Other point on xmps and Tika… xmp can contain jpegs and other binary
formats. So it makes sense to handle these in the Tika framework.
On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <[email protected]> wrote:
You are not reinventing the wheel. We only pull out what users
have requested. I’ve toyed w pulling out more than we do, but
haven’t found enough interest to pursue it.
I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll
try to dig up a link when I’m back to a keyboard.
As for an earlier point on this thread (not made by you) that Tika
is only for binary formats, I strongly disagree at least for XMP.
XMP is integral to pdf and psd and as standalone sidecar. We
should normalize and extract what we can. Obv if you have custom
needs, yes, break out your own xml parser, but we should do better
in Tika.
--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - [email protected]