I have a PDF whose XMP metadata contains two rdf:Description elements
with different namespaces. The first namespace is Dublin Core, the other
is XMPSchemaBasic. I can confirm that jempbox reads the XMP metadata
correctly and identifies both namespaces. However, the PDFParser in Tika
does not appear to add the XMPSchemaBasic metadata, specifically the
CreateDate, to the extracted metadata. Is this expected behaviour?
Ideally, the PDFParser would set TikaCoreProperties.CREATED to the value
in the XMP metadata when PDDocumentInformation has no creation date, or
at least expose it under a property such as "xmp:CreateDate". I've
attached the XMP packet and a PDF containing that XMP metadata. I'm using
Tika 1.24.1. Any help or guidance would be greatly appreciated.
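
To make the request concrete, here is a rough sketch of the fallback I
have in mind. It uses only the JDK's XML parser rather than Tika or
jempbox, and the class and method names (XmpCreateDateFallback,
xmpCreateDate, resolveCreated) are made up for illustration. It is not
how PDFParser is implemented, only the behaviour I would expect: prefer
the creation date from PDDocumentInformation and fall back to
xmp:CreateDate when that is missing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmpCreateDateFallback {
    // Namespace URI of the XMP Basic schema, as declared in the packet.
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";

    // The packet from this message, with the x:xmptk attribute omitted for brevity.
    static final String SAMPLE_PACKET =
        "<?xpacket begin=\" \" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
        + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<rdf:Description rdf:about=\"\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:format>application/pdf</dc:format></rdf:Description>"
        + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"http://ns.adobe.com/xap/1.0/\">"
        + "<xmp:CreateDate>1998-08-29T13:53:15-01:00</xmp:CreateDate>"
        + "<xmp:CreatorTool>Hewlett-Packard MFP</xmp:CreatorTool></rdf:Description>"
        + "</rdf:RDF></x:xmpmeta><?xpacket end=\"w\"?>";

    /** Returns the text of the first xmp:CreateDate element, or null if absent. */
    static String xmpCreateDate(String packet) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the namespace URI
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(packet)));
            NodeList nodes = doc.getElementsByTagNameNS(XMP_NS, "CreateDate");
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    /** Hypothetical fallback rule: document info date first, XMP date second. */
    static String resolveCreated(String docInfoCreated, String packet) {
        return docInfoCreated != null ? docInfoCreated : xmpCreateDate(packet);
    }

    public static void main(String[] args) {
        // No creation date in the document information, so the XMP value is used.
        System.out.println(resolveCreated(null, SAMPLE_PACKET));
    }
}
```

Matching on the namespace URI rather than the "xmp" prefix means the
lookup does not depend on which rdf:Description block carries the
property.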

Also, I noticed that the XMP packet id is "W5M0MpCehiHzreSzNTczkc9d",
which base64-decodes to 18 bytes whose printable ASCII characters spell
"[42!573]". I'm curious whether anyone knows the significance of this.
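
For anyone who wants to check the decoding, this is roughly how I did
it, using java.util.Base64 from the JDK. The class and method names
(XpacketIdDecode, printableAscii) are just for illustration. The id does
not decode to clean text, so the sketch drops every byte outside the
printable ASCII range and keeps the rest.

```java
import java.util.Base64;

public class XpacketIdDecode {
    /** Base64-decodes the input and keeps only printable ASCII bytes (0x20 to 0x7E). */
    static String printableAscii(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v <= 0x7E) {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String id = "W5M0MpCehiHzreSzNTczkc9d";
        System.out.println(Base64.getDecoder().decode(id).length + " bytes");
        System.out.println(printableAscii(id));
    }
}
```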
<?xpacket begin=" " id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 1998/08/29-13:53:15        " xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>1998-08-29T13:53:15-01:00</xmp:CreateDate>
      <xmp:CreatorTool>Hewlett-Packard MFP</xmp:CreatorTool>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

Attachment: testPDF_withmetadata.pdf
Description: Adobe PDF document
