If you want a bunch of XMPs to work with: https://corpora.tika.apache.org/base/xmps/
On Sun, Aug 22, 2021 at 3:25 PM Tim Allison <[email protected]> wrote:
>
> Other point on xmps and Tika… xmp can contain jpegs and other binary formats.
> So it makes sense to handle these in the Tika framework.
>
> On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <[email protected]> wrote:
>>
>> You are not reinventing the wheel. We only pull out what users have
>> requested. I’ve toyed w pulling out more than we do, but haven’t found
>> enough interest to pursue it.
>>
>> I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll try to dig
>> up a link when I’m back to a keyboard.
>>
>> As for an earlier point on this thread (not made by you) that Tika is only
>> for binary formats, I strongly disagree at least for XMP. XMP is integral to
>> pdf and psd and as standalone sidecar. We should normalize and extract what
>> we can. Obv if you have custom needs, yes, break out your own xml parser,
>> but we should do better in Tika.
>>
>> On Sat, Aug 21, 2021 at 5:17 PM Fabrizio Giudici
>> <[email protected]> wrote:
>>>
>>> On 21/08/21 15:48, Tim Allison wrote:
>>>
>>> As you saw, we’re currently parsing embedded xmp w
>>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
>>>
>>> I think I added hooks for custom xmp parsing when embedded in a pdf.
>>>
>>> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>>>
>>> I think it would be great if we were pulling more info out of xmp embedded
>>> or not and would be happy to review your code.
>>>
>>> Thanks. So let me recap, also with the help of some code that I've just
>>> committed.
>>>
>>> This is a test XMP that I'm using as a data source. It has been produced by
>>> the DAM app Photo Supreme and, as a typical XMP sidecar, contains both info
>>> that the application has extracted from the original file (a Sony ARW) and
>>> data that I've manually entered:
>>>
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>>>
>>> This is what I've been able to extract (in form of textual dump) with -
>>> spoiler alert - a quick and dirty custom parser. It's only a subset of the
>>> metadata items in the original XMP, given the roughness of the parser, but
>>> it's a good start for me (and in any case it already resolved a problem of
>>> mine).
>>>
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
>>>
>>> First approach I've tried:
>>>
>>>     metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>>>     final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
>>>     ime.parseRawXMP(bytes);
>>>
>>> But this just made me get a small bunch of DC items:
>>>
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
>>>
>>> Second attempt:
>>>
>>>     try (final InputStream is = new ByteArrayInputStream(bytes))
>>>       {
>>>         new JempboxExtractor(metadata).parse(is);
>>>       }
>>>
>>> with the trick of wrapping the bytes content inside an xpacket marker.
>>> Basically same results as above:
>>>
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
>>>
>>> If I correctly understand Tika code, basically Jempbox is used to create a
>>> DOM that is later processed, but only DC and MM are copied to metadata. I
>>> see handlers whose names seem to suggest that they copy all tags, but they
>>> are not used by parse().
>>>
>>> So in the end what I tried is a quick and dirty custom parser that copies all
>>> attributes of the elements in the XMP; this is the relevant code in the
>>> handler:
>>>
>>>     public void startElement (String uri, String localName, String qName,
>>>                               Attributes attributes)
>>>       {
>>>         for (int i = 0; i < attributes.getLength(); i++)
>>>           {
>>>             // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
>>>             // should instead read the namespace and translate to a prefix.
>>>             final String key = attributes.getQName(i);
>>>             final String value = attributes.getValue(i);
>>>
>>>             try
>>>               {
>>>                 metadata.add(key, value);
>>>               }
>>>             catch (PropertyTypeException e)
>>>               {
>>>                 log.error("{}: {}", e.toString(), key);
>>>               }
>>>           }
>>>       }
>>>
>>> Full code here:
>>>
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
>>>
>>> (Forgive me for the long URLs with the commit id, but in this way I can
>>> make further work on my source repo without jeopardizing the references of
>>> this email.)
>>>
>>> Now the basic thing that I'd like to know is that I'm not reinventing the
>>> wheel; in other words, there's no code inside Tika that is extracting this
>>> information from an XMP sidecar. If this is confirmed, I can proceed on this
>>> path.
>>>
>>> Thanks.
>>>
>>>
>>> --
>>> Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
>>> "We make Java work. Everywhere."
>>> http://tidalwave.it/fabrizio/blog - [email protected]
