I have a PDF whose XMP metadata contains two rdf:Description elements with different namespaces. The first namespace is Dublin Core; the other is XMPSchemaBasic. I can confirm that jempbox reads the XMP metadata and identifies both namespaces correctly. However, the PDFParser in Tika does not appear to add the XMPSchemaBasic metadata to the extracted metadata, specifically the CreateDate. Is this expected behaviour? Ideally, the PDFParser would set TikaCoreProperties.CREATED to the value in the XMP metadata when PDDocumentInformation has no creation date, or at least expose it under a property such as "xmp:CreateDate". I've attached the XMP packet and a PDF containing the XMP metadata. I'm using Tika 1.24.1. Any help or guidance would be greatly appreciated.
Also, I noticed that the XMP packet id is "W5M0MpCehiHzreSzNTczkc9d". Base64-decoding it gives 18 bytes, and the printable ASCII characters among them spell "[42!573]". Does anyone know the significance of this?
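For anyone who wants to reproduce the decoding, here is a small sketch using only java.util.Base64 from the JDK. The class and method names (PacketIdDecode, printableAscii) are made up for illustration. It decodes the id and keeps only the bytes in the printable ASCII range, since the remaining bytes are not readable text.

```java
import java.util.Base64;

public class PacketIdDecode {

    // Decode a Base64 string and keep only printable ASCII bytes (0x20..0x7E).
    // Hypothetical helper, written for this post only.
    static String printableAscii(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v <= 0x7E) {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String id = "W5M0MpCehiHzreSzNTczkc9d";
        System.out.println(Base64.getDecoder().decode(id).length + " bytes decoded");
        System.out.println("printable part: " + printableAscii(id));
    }
}
```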
<?xpacket begin=" " id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 1998/08/29-13:53:15 " xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>1998-08-29T13:53:15-01:00</xmp:CreateDate>
      <xmp:CreatorTool>Hewlett-Packard MFP</xmp:CreatorTool>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
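To illustrate the fallback I have in mind, here is a rough sketch that reads xmp:CreateDate from the packet above with the JDK's own DOM parser and uses it only when no creation date is available from the document information. This is not Tika's or jempbox's code; the class and method names (XmpCreateDateSketch, readCreateDate, resolveCreated) are hypothetical, and it assumes the date is a full ISO 8601 timestamp with an offset, as in my packet. XMP also allows shorter date forms, which this sketch does not handle.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmpCreateDateSketch {

    // Namespace URI bound to the "xmp" prefix in the packet above.
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";

    // Look up xmp:CreateDate by namespace URI, not by prefix.
    static Optional<OffsetDateTime> readCreateDate(String packet) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(packet.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(XMP_NS, "CreateDate");
        if (nodes.getLength() == 0) {
            return Optional.empty();
        }
        // Assumes a complete timestamp with offset, e.g. 1998-08-29T13:53:15-01:00.
        return Optional.of(OffsetDateTime.parse(nodes.item(0).getTextContent().trim()));
    }

    // Prefer the document-information date; fall back to XMP when it is missing.
    static OffsetDateTime resolveCreated(OffsetDateTime docInfoCreated, String packet)
            throws Exception {
        if (docInfoCreated != null) {
            return docInfoCreated;
        }
        return readCreateDate(packet).orElse(null);
    }

    public static void main(String[] args) throws Exception {
        String packet =
            "<?xpacket begin=\" \" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"http://ns.adobe.com/xap/1.0/\">"
            + "<xmp:CreateDate>1998-08-29T13:53:15-01:00</xmp:CreateDate>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta><?xpacket end=\"w\"?>";
        // No document-information date here, so the XMP value is used.
        System.out.println(resolveCreated(null, packet));
    }
}
```

The lookup goes through the namespace URI so that it still works if a producer binds the same schema to a different prefix.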
testPDF_withmetadata.pdf