Re: Tika for parsing raw XMP

Fabrizio Giudici Sat, 21 Aug 2021 14:17:35 -0700

On 21/08/21 15:48, Tim Allison wrote:

As you saw, we’re currently parsing embedded xmp w
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
I think I added hooks for custom xmp parsing when embedded in a pdf.

Is your primary issue that Tika is treating unembedded xmp as regular xml?
I think it would be great if we were pulling more info out of xmp, embedded or not, and would be happy to review your code.

Thanks. So let me recap, also with the help of some code that I've just committed.

This is a test XMP that I'm using as a data source. It has been produced by the DAM app Photo Supreme and, as a typical XMP sidecar, contains both info that the application has extracted from the original file (a Sony ARW) and data that I've manually entered:


https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp

This is what I've been able to extract (in the form of a textual dump) with (spoiler alert) a quick and dirty custom parser. It's only a subset of the metadata items in the original XMP, given the roughness of the parser, but it's a good start for me (and in any case it has already solved a problem of mine).


https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt

First approach I've tried:

metadata.set(Metadata.CONTENT_TYPE, "application/xml");
final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
ime.parseRawXMP(bytes);

But this only gave me a handful of DC items:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt

Second attempt:

try (final InputStream is = new ByteArrayInputStream(bytes))
  {
    new JempboxExtractor(metadata).parse(is);
  }

with the trick of wrapping the bytes content inside an xpacket marker. Basically same results as above:


https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
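For readers unfamiliar with the xpacket trick mentioned above, here is a minimal sketch of how a bare sidecar could be wrapped. It is not the code from my repository: the class name XPacketWrapper and the method wrap() are hypothetical, the input is assumed to be UTF-8, and the header and trailer follow the packet wrapper convention of the XMP specification. Whether Tika's packet scanner needs exactly this form is an assumption on my side. The leading XML declaration is dropped because it would no longer sit at the start of the document once the header is prepended.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps a bare XMP sidecar in an xpacket header and trailer.
class XPacketWrapper
  {
    // Processing instructions from the XMP packet wrapper convention.
    private static final String HEADER = "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    private static final String TRAILER = "\n<?xpacket end=\"w\"?>";

    static byte[] wrap (final byte[] sidecar)
      {
        // Assumption: the sidecar is encoded in UTF-8.
        String xml = new String(sidecar, StandardCharsets.UTF_8);

        if (xml.contains("<?xpacket begin="))
          {
            return sidecar; // already wrapped, nothing to do
          }

        if (xml.startsWith("\uFEFF"))
          {
            xml = xml.substring(1); // drop a leading byte order mark
          }

        if (xml.startsWith("<?xml "))
          {
            // Drop the XML declaration: it is only legal at the very start of a document.
            xml = xml.substring(xml.indexOf("?>") + 2);
          }

        return (HEADER + xml.trim() + TRAILER).getBytes(StandardCharsets.UTF_8);
      }
  }
```

The wrapped bytes could then be fed to the JempboxExtractor call shown above through a ByteArrayInputStream.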

If I understand the Tika code correctly, Jempbox is basically used to create a DOM that is later processed, but only DC and MM are copied to metadata. I see handlers whose names seem to suggest that they copy all tags, but they are not used by parse().

So in the end what I tried is a quick and dirty custom parser that copies all attributes of the elements in the XMP; this is the relevant code in the handler:


public void startElement (String uri, String localName, String qName, Attributes attributes)
  {
    for (int i = 0; i < attributes.getLength(); i++)
      {
        // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
        // should instead read the namespace and translate to a prefix.
        final String key = attributes.getQName(i);
        final String value = attributes.getValue(i);

        try
          {
            metadata.add(key, value);
          }
        catch (PropertyTypeException e)
          {
            log.error("{}: {}", e.toString(), key);
          }
      }
  }

Full code here:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
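As for the FIXME in the handler above, a rough sketch of the more robust variant might look like the following. It is only an illustration under assumptions, not the code in my repository: the class name NamespaceAwareXmpHandler and its parse() helper are hypothetical, a plain Map stands in for Tika's Metadata so that the snippet runs with the JDK alone, and the prefix table covers just a few well-known XMP namespaces. The idea is to turn on namespace awareness in the SAX parser and derive the key from the attribute's namespace URI rather than from whatever prefix the file happens to use.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keys are built from the namespace URI, not from the prefix used in the file.
class NamespaceAwareXmpHandler extends DefaultHandler
  {
    // Canonical prefixes for a few well-known XMP namespaces (illustrative subset only).
    private static final Map<String, String> PREFIXES = Map.of(
        "http://purl.org/dc/elements/1.1/", "dc",
        "http://ns.adobe.com/exif/1.0/", "exif",
        "http://ns.adobe.com/tiff/1.0/", "tiff",
        "http://ns.adobe.com/xap/1.0/", "xmp",
        "http://ns.adobe.com/photoshop/1.0/", "photoshop");

    // Stand-in for Tika's Metadata; a later value replaces an earlier one with the same key.
    final Map<String, String> items = new LinkedHashMap<>();

    @Override
    public void startElement (String uri, String localName, String qName, Attributes attributes)
      {
        for (int i = 0; i < attributes.getLength(); i++)
          {
            final String prefix = PREFIXES.get(attributes.getURI(i));

            if (prefix != null) // attributes in unknown namespaces are skipped
              {
                items.put(prefix + ":" + attributes.getLocalName(i), attributes.getValue(i));
              }
          }
      }

    static Map<String, String> parse (final byte[] xmp)
      {
        try
          {
            final SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for getURI() and getLocalName() to be populated
            final NamespaceAwareXmpHandler handler = new NamespaceAwareXmpHandler();
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), handler);
            return handler.items;
          }
        catch (Exception e)
          {
            throw new IllegalArgumentException("Cannot parse XMP", e);
          }
      }
  }
```

With this approach an attribute written as e:FNumber, where e is bound to the exif namespace, would still end up under the key exif:FNumber. Skipping unknown namespaces is a design choice of the sketch; a real parser might prefer to keep them under their original QName.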

(Forgive me for the long URLs with the commit id, but in this way I can do further work on my source repo without jeopardizing the references in this email.)

Now the basic thing that I'd like to know is whether I'm reinventing the wheel; in other words, whether there's really no code inside Tika that extracts this information from an XMP sidecar. If this is confirmed, I can proceed on this path.


Thanks.

--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - [email protected]
