All,

  Andrew Jackson recently opened TIKA-1678.  Tika tries to use Dublin Core 
items from the XMP, and if those don't exist, it takes what it can find from 
the "regular" metadata.

Andrew found that for ~200k out of 21 million files, the UTF-16 is incorrectly 
(doubly?) encoded in the XMP: 
\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P

  Should we add a handler at the Tika level to deal with obvious BOM-marked 
strings we're getting from the XMP, or should that be handled by PDFBox?  We're 
still using JempBox... will XMPBox handle these correctly?
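
  To make the first option concrete, here is a rough sketch of what such a 
handler might look like. It assumes the bad values reach us as a Java String 
in which each char holds one raw byte of the original UTF-16 data (so the BOM 
shows up as the two chars U+00FE U+00FF); I haven't confirmed that this is how 
they actually arrive. BomRepair and repairBomMarked are made-up names, not 
existing Tika or PDFBox API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika, PDFBox, JempBox or XMPBox.
final class BomRepair {

    // Returns the decoded text if s looks like UTF-16 bytes that were
    // stored one byte per char; otherwise returns s unchanged.
    static String repairBomMarked(String s) {
        if (s == null || s.length() < 2 || s.length() % 2 != 0) {
            return s;
        }
        char c0 = s.charAt(0);
        char c1 = s.charAt(1);
        Charset charset;
        if (c0 == '\u00FE' && c1 == '\u00FF') {
            charset = StandardCharsets.UTF_16BE;   // big-endian BOM
        } else if (c0 == '\u00FF' && c1 == '\u00FE') {
            charset = StandardCharsets.UTF_16LE;   // little-endian BOM
        } else {
            return s;                              // no BOM, leave alone
        }
        // Assumption: every char must fit in one byte, otherwise this is
        // not the byte-per-char pattern and we should not touch it.
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s;
            }
        }
        // ISO-8859-1 maps chars 0x00-0xFF straight back to the same bytes.
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        // Skip the two BOM bytes and decode the remainder.
        return new String(raw, 2, raw.length - 2, charset);
    }

    public static void main(String[] args) {
        // The sample value from above, written with Java escapes.
        String mangled = "\u00FE\u00FF\u0000B\u0000u\u0000l\u0000l"
                + "\u0000z\u0000i\u0000p\u0000 \u0000P";
        System.out.println(repairBomMarked(mangled));
    }
}
```

  Strings without a leading BOM, or with any char above 0xFF, pass through 
untouched, so this should be fairly safe to run over every XMP value. Whether 
it belongs in Tika or in PDFBox is still the open question.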

  Thank you!

              Best,

                           Tim

