Yep, that's a problem. Thank you! https://issues.apache.org/jira/browse/TIKA-3101
On Mon, May 11, 2020 at 2:24 PM Tim Allison <talli...@apache.org> wrote:

> Thank you for letting us know about this and sharing a file. My belief is
> that we should be trusting the XMP metadata over the PDFInfo for DC
> metadata keys like TikaCoreProperties.CREATED. I'll take a look.
>
> On Mon, May 11, 2020 at 11:40 AM Tucker B <barb...@gmail.com> wrote:
>
>> I have a PDF with XMP metadata with two rdf:Description tags with
>> different namespaces. The first namespace is DublinCore the other is
>> XMPSchemaBasic. I can confirm jempbox is able to read the XMP metadata
>> properly and properly identify the namespaces. However, it appears the
>> PDFParser in Tika is not adding XMPSchemaBasic metadata to the extracted
>> metadata, specifically the CreateDate. I'm curious if this is expected
>> behaviour. Ideally, the PDFParser would set the TikaCoreProperties.CREATED
>> to the value in the XMP metadata absent the presence of a created date in
>> the PDDocumentInformation. Or at least a Property such as "xmp:CreateDate".
>> I've attached the XMP packet and a PDF with the XMP metadata. I'm using
>> Tika 1.24.1. Any help or guidance would be greatly appreciated.
>>
>> Also, I noticed the XMP packet id is "W5M0MpCehiHzreSzNTczkc9d" which is
>> base64 encoded string "[42!573]". Curious if anyone knows the
>> significance of this.
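
On the packet id question: as far as I know, "W5M0MpCehiHzreSzNTczkc9d" is the fixed id value that the XMP packet wrapper uses in every xpacket header, so it is not specific to this file. For anyone who wants to check the base64 observation, here is a minimal sketch using only the JDK. The class and method names (PacketIdDecoder, printableAscii, hex) are made up for illustration and are not part of Tika or jempbox. It decodes the id and shows that the 18 decoded bytes are not plain text: "[42!573]" is what remains once the non-printable bytes are dropped.

```java
import java.util.Base64;

// Hypothetical helper for illustration only; not part of Tika or jempbox.
class PacketIdDecoder {

    // Decode a Base64 string and keep only the printable ASCII bytes (0x20..0x7E).
    public static String printableAscii(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v <= 0x7E) {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    // Render every decoded byte as two hex digits, so the non-printable ones are visible too.
    public static String hex(String base64) {
        StringBuilder sb = new StringBuilder();
        for (byte b : Base64.getDecoder().decode(base64)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String id = "W5M0MpCehiHzreSzNTczkc9d";
        // prints: 5B 93 34 32 90 9E 86 21 F3 AD E4 B3 35 37 33 91 CF 5D
        System.out.println(hex(id));
        // prints: [42!573]
        System.out.println(printableAscii(id));
    }
}
```

So the "[42!573]" reading holds only for the printable subset of the decoded bytes; the remaining ten bytes fall outside the ASCII range.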