On 21/08/21 15:48, Tim Allison wrote:
As you saw, we’re currently parsing embedded xmp w
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
I think I added hooks for custom xmp parsing when embedded in a pdf.
Is your primary issue that Tika is treating unembedded xmp as regular xml?
I think it would be great if we were pulling more info out of xmp
embedded or not and would be happy to review your code.
Thanks. So let me recap, also with the help of some code that I've just
committed.
This is a test XMP that I'm using as a data source. It has been produced
by the DAM app Photo Supreme and, as a typical XMP sidecar, contains
both info that the application has extracted from the original file (a
Sony ARW) and data that I've manually entered:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
This is what I've been able to extract (in the form of a textual dump)
with - spoiler alert - a quick and dirty custom parser. It's only a
subset of the metadata items in the original XMP, given the roughness of
the parser, but it's a good start for me (and in any case it has already
solved a problem of mine).
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
First approach I've tried:
    metadata.set(Metadata.CONTENT_TYPE, "application/xml");
    final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
    ime.parseRawXMP(bytes);
But this only gave me a handful of DC items:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
Second attempt:
    try (final InputStream is = new ByteArrayInputStream(bytes))
      {
        new JempboxExtractor(metadata).parse(is);
      }
with the trick of wrapping the byte content inside xpacket markers.
Basically the same results as above:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
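For reference, the xpacket wrapping could look roughly like the sketch
below. It is only an illustration: the class and method names are made
up, and it assumes a UTF-8 sidecar. The header and trailer are the
standard xpacket processing instructions from the XMP specification.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class XpacketWrapper
  {
    // Standard xpacket header and trailer; U+FEFF is the byte order mark.
    private static final String HEADER =
        "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    private static final String TRAILER = "\n<?xpacket end=\"w\"?>";

    /** Wraps raw sidecar bytes in xpacket markers, unless they are already there. */
    static byte[] wrap (final byte[] xmp)
      {
        // Assumption: the sidecar is encoded in UTF-8.
        String body = new String(xmp, StandardCharsets.UTF_8);

        if (body.contains("<?xpacket begin="))
          {
            return xmp; // already a packet, nothing to do
          }

        if (body.startsWith("\uFEFF"))
          {
            body = body.substring(1); // drop a leading byte order mark
          }

        body = body.trim();

        if (body.startsWith("<?xml "))
          {
            // An XML declaration is only legal at the very start of a
            // document, so it must not end up after the xpacket header.
            body = body.substring(body.indexOf("?>") + 2).trim();
          }

        return (HEADER + body + TRAILER).getBytes(StandardCharsets.UTF_8);
      }
  }
```

The result of XpacketWrapper.wrap(bytes) can then be fed to the
ByteArrayInputStream in place of the raw bytes.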
If I understand the Tika code correctly, Jempbox is basically used to
create a DOM that is later processed, but only DC and MM are copied to
the metadata. I see handlers whose names seem to suggest that they copy
all tags, but they are not used by parse().
So in the end I tried a quick and dirty custom parser that copies all
the attributes of the elements in the XMP; this is the relevant code in
the handler:
    public void startElement (String uri, String localName, String qName, Attributes attributes)
      {
        for (int i = 0; i < attributes.getLength(); i++)
          {
            // FIXME: this assumes QName is using the standard prefix (e.g. 'exif').
            // More robust code should instead read the namespace and translate to a prefix.
            final String key = attributes.getQName(i);
            final String value = attributes.getValue(i);

            try
              {
                metadata.add(key, value);
              }
            catch (PropertyTypeException e)
              {
                log.error("{}: {}", e.toString(), key);
              }
          }
      }
Full code here:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
(Forgive me for the long URLs with the commit id, but in this way I can
make further work on my source repo without jeopardizing the references
of this email.)
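As for the FIXME in the handler above, one possible way to address it
is to make the SAX parser namespace-aware and map each attribute's
namespace URI to a conventional prefix. The following is just a sketch
under assumptions, not what is in my repo: the class name and the small
prefix table are invented for illustration, and it collects the values
in a plain Map rather than a Tika Metadata object so that it runs with
the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, for illustration only.
final class NamespaceAwareXmpDump
  {
    // A few well-known XMP namespaces; a real parser would need a longer table.
    private static final Map<String, String> STANDARD_PREFIXES = Map.of(
        "http://purl.org/dc/elements/1.1/", "dc",
        "http://ns.adobe.com/xap/1.0/", "xmp",
        "http://ns.adobe.com/exif/1.0/", "exif",
        "http://ns.adobe.com/tiff/1.0/", "tiff",
        "http://ns.adobe.com/photoshop/1.0/", "photoshop");

    /** Returns all element attributes, keyed by standard prefix where known. */
    static Map<String, String> parse (final byte[] xmp)
      {
        final Map<String, String> result = new LinkedHashMap<>();
        final SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getURI() / getLocalName()

        try
          {
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), new DefaultHandler()
              {
                @Override
                public void startElement (String uri, String localName, String qName,
                                          Attributes attributes)
                  {
                    for (int i = 0; i < attributes.getLength(); i++)
                      {
                        final String prefix = STANDARD_PREFIXES.get(attributes.getURI(i));
                        // Fall back to the qualified name for unknown namespaces.
                        final String key = (prefix != null)
                            ? prefix + ":" + attributes.getLocalName(i)
                            : attributes.getQName(i);
                        result.put(key, attributes.getValue(i));
                      }
                  }
              });
          }
        catch (ParserConfigurationException | SAXException | IOException e)
          {
            throw new IllegalArgumentException("Cannot parse XMP", e);
          }

        return result;
      }
  }
```

With this approach an attribute written as e:FNumber, where 'e' is bound
to the exif namespace, would still be stored under the key exif:FNumber.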
Now the basic thing that I'd like to confirm is that I'm not reinventing
the wheel; in other words, that there's no code inside Tika that already
extracts this information from an XMP sidecar. If this is confirmed, I
can proceed on this path.
Thanks.
--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - [email protected]