Re: TIKA-1678 PDF metadata extraction and UTF-16 encodings in the xmp

Tilman Hausherr Sat, 18 Jul 2015 05:36:46 -0700

On 15.07.2015 at 13:46, Allison, Timothy B. wrote:

All,


   Andrew Jackson recently opened TIKA-1678.  Tika tries to use Dublin Core items from 
the XMP, and if those don't exist, it falls back to whatever it can find in the 
"regular" metadata.

Andrew found that for ~200k out of 21 million files, the UTF-16 is incorrectly 
(doubly?) encoded in the XMP:
\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P

   Should we add a handler at the Tika level to deal with obvious BOM-marked 
strings we're getting from the XMP, or should that be handled by PDFBox?  We're 
still using jempbox...will XMPBox handle these correctly?

XMPBox has a rather strict interpretation of the rules... this makes me wonder (again?) whether it should support a lenient mode.


Can you name a file where that happens?

Tilman


   Thank you!

               Best,

                            Tim



