John, thanks for your comment. Understanding the scope of Tika correctly
is one of the things I have to do, so I'm waiting for other Tika people
to confirm.
In my understanding Tika also supports textual files (there are XML
parsers inside, XMP is at least partially supported when embedded e.g.
in a JPEG file, etc...), but I could be wrong.
I know XMP is XML, but the schema is not trivial when it comes to
representing certain structured properties (see below), so a
specific Java data model is required and it would be nice to find one
available in a library. I know there are other libraries supporting it
(such as metadata-extractor, which is used by Tika) which I've already
used in the past. Given that I have to deal with multiple file formats
(photo, music, etc.), it would be nice to have a single "umbrella" API.
It also helps that this is an Apache project with the usual governance
model, so you get ample advance warning when it is going to reach
end of life, while many projects out there simply stop without any
warning.
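
To illustrate the point about plain XML parsing, here is a minimal sketch
that uses only the JDK's DOM parser. It is not Tika code: the class name,
method names and the sample packet are made up for this example. It
assumes properties are serialized as child elements; XMP also allows
simple properties to be written as attributes of rdf:Description, which
this sketch does not handle.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only; not part of Tika or any XMP library.
class XmpPlainXmlDemo {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String XMP = "http://ns.adobe.com/xap/1.0/";

    // Made-up sample packet: one simple property, one language alternative.
    static final String SAMPLE =
          "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='" + RDF + "'>"
        + "<rdf:Description rdf:about='' xmlns:dc='" + DC + "' xmlns:xmp='" + XMP + "'>"
        + "<xmp:Rating>4</xmp:Rating>"
        + "<dc:title><rdf:Alt>"
        + "<rdf:li xml:lang='x-default'>Sunset</rdf:li>"
        + "<rdf:li xml:lang='it'>Tramonto</rdf:li>"
        + "</rdf:Alt></dc:title>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    // Parses the packet with a namespace-aware DOM parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XMP packet", e);
        }
    }

    // Text of the first element with the given namespace and local name, or null.
    static String firstText(String xml, String ns, String local) {
        NodeList nodes = parse(xml).getElementsByTagNameNS(ns, local);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Values of a language-alternative property, keyed by xml:lang.
    static Map<String, String> langAlternatives(String xml, String ns, String local) {
        Map<String, String> out = new LinkedHashMap<>();
        NodeList props = parse(xml).getElementsByTagNameNS(ns, local);
        if (props.getLength() == 0) {
            return out;
        }
        NodeList items = ((Element) props.item(0)).getElementsByTagNameNS(RDF, "li");
        for (int i = 0; i < items.getLength(); i++) {
            Element li = (Element) items.item(i);
            String lang = li.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            out.put(lang, li.getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE, XMP, "Rating"));
        System.out.println(langAlternatives(SAMPLE, DC, "title"));
    }
}
```

The simple property is a one-liner, but dc:title already needs dedicated
handling: calling firstText on it would just concatenate the two
alternatives. That is the kind of thing a proper data model would hide.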
Back to the original topic...
So far I have been able to write a custom parser by extending
AbstractParser and taking advantage of XMPContentHandler. It's quite
rough, but it retrieves most of the obvious tags (including the ones I
need now for a specific task). I need to understand whether I've just
duplicated stuff that is already inside Tika, or whether I have properly
extended Tika with a missing feature, or whether I'm stretching it too far.
A potential problem, which is not urgent now, is that I don't know how
Tika should deal with complex XMP properties such as hierarchical
properties, given that it tends to flatten everything.
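
As a sketch of what flattening could look like, the JDK-only example
below maps array items and struct fields to path-like keys such as
"dc:subject[1]" and "ex:location/ex:city". The key syntax is only one
possible convention, chosen for this example; I am not claiming it is
what Tika does. The "ex" namespace, the class name and the sample data
are hypothetical, and properties written as attributes are ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, JDK only; not part of Tika or any XMP library.
class XmpFlattenDemo {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Made-up sample: an unordered array and a struct in a hypothetical "ex" namespace.
    static final String SAMPLE =
          "<rdf:RDF xmlns:rdf='" + RDF + "'"
        + " xmlns:dc='http://purl.org/dc/elements/1.1/'"
        + " xmlns:ex='http://example.com/ns/'>"
        + "<rdf:Description rdf:about=''>"
        + "<dc:subject><rdf:Bag><rdf:li>sea</rdf:li><rdf:li>sunset</rdf:li></rdf:Bag></dc:subject>"
        + "<ex:location rdf:parseType='Resource'><ex:city>Genoa</ex:city></ex:location>"
        + "</rdf:Description></rdf:RDF>";

    // Flattens every property of the top-level rdf:Description elements.
    static Map<String, String> flatten(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XMP packet", e);
        }
        Map<String, String> out = new LinkedHashMap<>();
        NodeList roots = doc.getElementsByTagNameNS(RDF, "RDF");
        for (int i = 0; i < roots.getLength(); i++) {
            for (Element desc : childElements((Element) roots.item(i))) {
                if (!RDF.equals(desc.getNamespaceURI())
                        || !"Description".equals(desc.getLocalName())) {
                    continue;
                }
                for (Element prop : childElements(desc)) {
                    walk(prop, prop.getNodeName(), out);
                }
            }
        }
        return out;
    }

    // Recursively maps one property (or array item, or struct field) to flat keys.
    private static void walk(Element el, String path, Map<String, String> out) {
        List<Element> kids = childElements(el);
        if (kids.isEmpty()) {
            out.put(path, el.getTextContent().trim()); // leaf: simple value
            return;
        }
        for (Element kid : kids) {
            if (!RDF.equals(kid.getNamespaceURI())) {
                walk(kid, path + "/" + kid.getNodeName(), out); // struct field
                continue;
            }
            String name = kid.getLocalName();
            if (name.equals("Bag") || name.equals("Seq") || name.equals("Alt")) {
                int index = 1; // array items are numbered from 1 in this sketch
                for (Element li : childElements(kid)) {
                    walk(li, path + "[" + index++ + "]", out);
                }
            } else {
                walk(kid, path, out); // e.g. a nested rdf:Description wrapping a struct
            }
        }
    }

    private static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                out.add((Element) n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        flatten(SAMPLE).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The structure survives only inside the key names, so any consumer has to
parse the keys again to rebuild arrays and structs. Whether that
trade-off is acceptable is exactly the open question here.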
On 20/08/21 19:29, John Ulric wrote:
Fabrizio:
I'm not a specialist in Tika, but XMP files are plain XML, and pretty
well standardised, so you probably wouldn't need Tika to read these.
Just use any old XML parser (from the JDK's standard library, Saxon …) and
filter out the values you need. I don't know if the Tika team agrees,
but I see Tika as a tool to extract information from binary data
primarily.
Cheers
John
--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog