Another point on XMPs and Tika: XMP can contain JPEGs and other binary formats, so it makes sense to handle these in the Tika framework.

On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <[email protected]> wrote:

> You are not reinventing the wheel. We only pull out what users have
> requested. I’ve toyed with pulling out more than we do, but haven’t found
> enough interest to pursue it.
>
> I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll try to
> dig up a link when I’m back at a keyboard.
>
> As for an earlier point on this thread (not made by you) that Tika is only
> for binary formats, I strongly disagree, at least for XMP. XMP is integral
> to pdf and psd and as a standalone sidecar. We should normalize and extract
> what we can. Obviously if you have custom needs, yes, break out your own xml
> parser, but we should do better in Tika.
>
> On Sat, Aug 21, 2021 at 5:17 PM Fabrizio Giudici <
> [email protected]> wrote:
>
>> On 21/08/21 15:48, Tim Allison wrote:
>>
>> As you saw, we’re currently parsing embedded xmp with
>>
>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
>>
>> I think I added hooks for custom xmp parsing when embedded in a pdf.
>>
>> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>>
>> I think it would be great if we were pulling more info out of xmp
>> embedded or not and would be happy to review your code.
>>
>> Thanks. So let me recap, also with the help of some code that I've just
>> committed.
>>
>> This is a test XMP that I'm using as a data source. It has been produced
>> by the DAM app Photo Supreme and, as a typical XMP sidecar, contains both
>> info that the application has extracted from the original file (a Sony ARW)
>> and data that I've manually entered:
>>
>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>>
>> This is what I've been able to extract (in the form of a textual dump) with -
>> spoiler alert - a quick and dirty custom parser. It's only a subset of the
>> metadata items in the original XMP, given the roughness of the parser, but
>> it's a good start for me (and in any case it already resolved a problem of
>> mine).
>>
>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
>>
>> First approach I've tried:
>>
>>     metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>>     final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
>>     ime.parseRawXMP(bytes);
>>
>> But this just made me get a small bunch of DC items:
>>
>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
>>
>> Second attempt:
>>
>>     try (final InputStream is = new ByteArrayInputStream(bytes))
>>       {
>>         new JempboxExtractor(metadata).parse(is);
>>       }
>>
>> with the trick of wrapping the bytes content inside an xpacket marker.
>> Basically same results as above:
>>
>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
>>
>> If I understand the Tika code correctly, basically Jempbox is used to create
>> a DOM that is later processed, but only DC and MM are copied to metadata. I
>> see handlers whose names seem to suggest that they copy all tags, but they
>> are not used by parse().
>>
>> So in the end what I tried is a quick and dirty custom parser that copies all
>> attributes of the elements in the XMP; this is the relevant code in the
>> handler:
>>
>>     public void startElement (String uri, String localName, String qName, Attributes attributes)
>>       {
>>         for (int i = 0; i < attributes.getLength(); i++)
>>           {
>>             // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
>>             // should instead read the namespace and translate to a prefix.
>>             final String key = attributes.getQName(i);
>>             final String value = attributes.getValue(i);
>>
>>             try
>>               {
>>                 metadata.add(key, value);
>>               }
>>             catch (PropertyTypeException e)
>>               {
>>                 log.error("{}: {}", e.toString(), key);
>>               }
>>           }
>>       }
>>
>> Full code here:
>>
>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
>>
>> (Forgive me for the long URLs with the commit id, but in this way I can
>> do further work on my source repo without jeopardizing the references of
>> this email.)
>>
>> Now the basic thing that I'd like to know is that I'm not reinventing the
>> wheel; in other words, there's no code inside Tika that is extracting this
>> information from an XMP sidecar. If this is confirmed, I can proceed on this
>> path.
>>
>> Thanks.
>>
>> --
>> Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
>> "We make Java work. Everywhere."
>> http://tidalwave.it/fabrizio/blog - [email protected]
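
For what it's worth, the FIXME in the quoted handler (look at each attribute's namespace instead of trusting its prefix) could be sketched with plain JDK SAX roughly as follows. This is only an illustration under stated assumptions, not Tika code: the class name XmpAttributeDump, the parse() helper and the small prefix table are made up for the example, and a plain Map stands in for Tika's Metadata so that the snippet runs without any Tika dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects every attribute of every element, keyed by a
// canonical prefix derived from the attribute's namespace URI.
final class XmpAttributeDump extends DefaultHandler {

    // Assumption: a hand-picked subset of well-known XMP namespaces; a real
    // parser would need a much longer table.
    private static final Map<String, String> CANONICAL_PREFIXES = Map.of(
            "http://purl.org/dc/elements/1.1/", "dc",
            "http://ns.adobe.com/exif/1.0/", "exif",
            "http://ns.adobe.com/tiff/1.0/", "tiff",
            "http://ns.adobe.com/xap/1.0/", "xmp",
            "http://ns.adobe.com/photoshop/1.0/", "photoshop");

    // Stand-in for Tika's Metadata: key -> all values seen, in document order.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            final String namespace = attributes.getURI(i);
            final String name = attributes.getLocalName(i);
            final String prefix = CANONICAL_PREFIXES.get(namespace);
            final String key;

            if (prefix != null) {
                key = prefix + ":" + name;           // known namespace: canonical prefix
            } else if (namespace.isEmpty()) {
                key = name;                          // attribute without a namespace
            } else {
                key = "{" + namespace + "}" + name;  // unknown namespace: keep the URI
            }

            values.computeIfAbsent(key, k -> new ArrayList<>()).add(attributes.getValue(i));
        }
    }

    // Parses an XMP document and returns the collected attributes.
    static Map<String, List<String>> parse(byte[] xmp) {
        try {
            final SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for getURI()/getLocalName()
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            final XmpAttributeDump handler = new XmpAttributeDump();
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), handler);
            return handler.values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse XMP", e);
        }
    }
}
```

With this, a sidecar that binds the exif namespace to some unusual prefix would still end up under an "exif:" key. Properties written as child elements rather than attributes (rdf:Bag, rdf:Seq, rdf:Alt and so on) are not covered by this sketch, just as in the quoted handler.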
