Re: Tika for parsing raw XMP

Tim Allison Tue, 24 Aug 2021 04:24:57 -0700

If you want a bunch of XMPs to work with:
https://corpora.tika.apache.org/base/xmps/


On Sun, Aug 22, 2021 at 3:25 PM Tim Allison <[email protected]> wrote:
>
> Other point on xmps and Tika… xmp can contain jpegs and other binary formats. 
> So it makes sense to handle these in the Tika framework.
>
> On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <[email protected]> wrote:
>>
>> You are not reinventing the wheel. We only pull out what users have 
>> requested. I’ve toyed with pulling out more than we do, but haven’t found 
>> enough interest to pursue it.
>>
>> I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll try to dig 
>> up a link when I’m back to a keyboard.
>>
>> As for an earlier point on this thread (not made by you) that Tika is only 
>> for binary formats, I strongly disagree, at least for XMP. XMP is integral to 
>> pdf and psd, and also exists as a standalone sidecar. We should normalize and 
>> extract what we can. Obviously, if you have custom needs, yes, break out your 
>> own xml parser, but we should do better in Tika.
>>
>> On Sat, Aug 21, 2021 at 5:17 PM Fabrizio Giudici 
>> <[email protected]> wrote:
>>>
>>> On 21/08/21 15:48, Tim Allison wrote:
>>>
>>> As you saw, we’re currently parsing embedded xmp with
>>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
>>>
>>> I think I added hooks for custom xmp parsing when embedded in a pdf.
>>>
>>> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>>>
>>> I think it would be great if we were pulling more info out of xmp embedded 
>>> or not and would be happy to review your code.
>>>
>>> Thanks. So let me recap, also with the help of some code that I've just 
>>> committed.
>>>
>>> This is a test XMP that I'm using as a data source. It has been produced by 
>>> the DAM app Photo Supreme and, as a typical XMP sidecar, contains both info 
>>> that the application has extracted from the original file (a Sony ARW) and 
>>> data that I've manually entered:
>>>
>>>     https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>>>
>>> This is what I've been able to extract (in the form of a textual dump) with 
>>> - spoiler alert - a quick and dirty custom parser. Given the roughness of 
>>> the parser, it's only a subset of the metadata items in the original XMP, 
>>> but it's a good start for me (and in any case it has already solved a 
>>> problem of mine).
>>>
>>>     https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
>>>
>>> First approach I've tried:
>>>
>>> metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>>> final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
>>> ime.parseRawXMP(bytes);
>>>
>>> But this only gave me a handful of DC items:
>>>
>>>     https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
>>>
>>> Second attempt:
>>>
>>> try (final InputStream is = new ByteArrayInputStream(bytes))
>>>   {
>>>     new JempboxExtractor(metadata).parse(is);
>>>   }
>>>
>>> with the trick of wrapping the byte content inside an xpacket marker. 
>>> Basically the same results as above:
>>>
>>>     
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
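
A minimal sketch of what the xpacket wrapping trick could look like, using only the JDK. The class name XPacketWrapper and the method wrap are hypothetical, the sidecar is assumed to be UTF-8, and the header uses the fixed id value from the XMP specification. This has not been checked against how the Tika code locates the packet, so treat it as an illustration rather than the code actually used in the thread.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: wraps a raw XMP sidecar in an xpacket header and trailer. */
public class XPacketWrapper {
    // Packet header as defined by the XMP specification: begin holds U+FEFF,
    // id is a fixed constant.
    private static final String HEADER =
        "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    private static final String TRAILER = "\n<?xpacket end=\"w\"?>";

    /** Returns the sidecar bytes wrapped in an xpacket; assumes UTF-8 input. */
    public static byte[] wrap(byte[] sidecar) {
        String xml = new String(sidecar, StandardCharsets.UTF_8);
        if (xml.contains("<?xpacket begin=")) {
            return sidecar; // already wrapped, leave untouched
        }
        if (xml.startsWith("\uFEFF")) {
            xml = xml.substring(1); // drop a leading byte order mark
        }
        if (xml.startsWith("<?xml ")) {
            // An XML declaration is only legal at the very start of a document,
            // so it cannot stay once the xpacket header is placed before it.
            int end = xml.indexOf("?>");
            if (end >= 0) {
                xml = xml.substring(end + 2);
            }
        }
        return (HEADER + xml.trim() + TRAILER).getBytes(StandardCharsets.UTF_8);
    }
}
```

The wrapped bytes would then take the place of bytes in the ByteArrayInputStream above.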
>>>
>>> If I understand the Tika code correctly, Jempbox is basically used to 
>>> create a DOM that is later processed, but only DC and MM are copied to 
>>> metadata. I see handlers whose names seem to suggest that they copy all 
>>> tags, but they are not used by parse().
>>>
>>> So in the end I tried a quick and dirty custom parser that copies all 
>>> attributes of the elements in the XMP; this is the relevant code in the 
>>> handler:
>>>
>>> public void startElement (String uri, String localName, String qName, Attributes attributes)
>>>   {
>>>     for (int i = 0; i < attributes.getLength(); i++)
>>>       {
>>>         // FIXME: this assumes QName is using the standard prefix (e.g. 'exif').
>>>         // More robust code should instead read the namespace and translate to a prefix.
>>>         final String key = attributes.getQName(i);
>>>         final String value = attributes.getValue(i);
>>>
>>>         try
>>>           {
>>>             metadata.add(key, value);
>>>           }
>>>         catch (PropertyTypeException e)
>>>           {
>>>             log.error("{}: {}", e.toString(), key);
>>>           }
>>>       }
>>>   }
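
One way to address the FIXME above, as a sketch only: parse with namespace awareness and map each attribute's namespace URI to a conventional prefix, whatever prefix the file itself uses. The class name NamespaceAwareXmpDump and the prefix table are hypothetical. To stay self-contained, values are collected into a plain Map instead of Tika's Metadata, and only a few well-known XMP namespaces are listed.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keys attributes by a conventional prefix looked up from the namespace URI. */
public class NamespaceAwareXmpDump extends DefaultHandler {
    // Partial table of well-known XMP namespaces; extend as needed.
    private static final Map<String, String> CANONICAL_PREFIXES = Map.of(
        "http://purl.org/dc/elements/1.1/", "dc",
        "http://ns.adobe.com/xap/1.0/", "xmp",
        "http://ns.adobe.com/xap/1.0/mm/", "xmpMM",
        "http://ns.adobe.com/exif/1.0/", "exif",
        "http://ns.adobe.com/exif/1.0/aux/", "aux",
        "http://ns.adobe.com/tiff/1.0/", "tiff",
        "http://ns.adobe.com/photoshop/1.0/", "photoshop");

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String prefix = CANONICAL_PREFIXES.get(attributes.getURI(i));
            // Unknown namespaces fall back to the qualified name found in the file.
            String key = (prefix == null)
                ? attributes.getQName(i)
                : prefix + ":" + attributes.getLocalName(i);
            values.computeIfAbsent(key, k -> new ArrayList<>()).add(attributes.getValue(i));
        }
    }

    /** Parses the XMP bytes and returns all attribute values keyed by normalized name. */
    public static Map<String, List<String>> parse(byte[] xmp) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for getURI/getLocalName to be populated
            NamespaceAwareXmpDump handler = new NamespaceAwareXmpDump();
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not parseable as XML", e);
        }
    }
}
```

With this, an attribute written as e:FNumber in a file that binds e to the exif namespace would be stored under exif:FNumber.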
>>>
>>> Full code here:
>>>
>>>     https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
>>>
>>> (Forgive me for the long URLs with the commit id, but this way I can 
>>> keep working on my source repo without breaking the references in this 
>>> email.)
>>>
>>> Now the basic thing that I'd like to confirm is that I'm not reinventing 
>>> the wheel; in other words, that there's no code inside Tika that extracts 
>>> this information from an XMP sidecar. If this is confirmed, I can proceed 
>>> on this path.
>>>
>>> Thanks.
>>>
>>>
>>> --
>>> Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
>>> "We make Java work. Everywhere."
>>> http://tidalwave.it/fabrizio/blog - [email protected]
