Re: Xmpbox metadata parsing issue

Tilman Hausherr Tue, 29 Aug 2023 11:58:50 -0700

Fixed in
https://issues.apache.org/jira/browse/PDFBOX-5668
(hopefully)


Please try a snapshot at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.30-SNAPSHOT/

in an hour

Tilman


On 11.07.2023 11:25, Sylvere Babin wrote:

Hello,
We use PDFBox to read the XMP metadata of PDF documents in the Factur-X standard, a Franco-German e-invoicing standard.
The XML schema corresponding to this metadata is quite simple, and retrieving the values works perfectly with the org.apache.xmpbox.XMPMetadata.getSchema(String) method.
By default, the prefix is fx:
<rdf:Description xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#" rdf:about="">
      <fx:DocumentType>INVOICE</fx:DocumentType>
      <fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
      <fx:Version>1.0</fx:Version>
      <fx:ConformanceLevel>BASIC</fx:ConformanceLevel>
</rdf:Description>
In one case, there was a document with two schemas with the same namespace URI but different prefixes (fx and zf).
I tried the org.apache.xmpbox.XMPMetadata.getSchema(String, String) method, which according to the documentation seems to handle this case by filtering by prefix.
I got a NullPointerException from this method (line 268), because the prefix of the Factur-X schema in the org.apache.xmpbox.XMPMetadata.schemas collection was null.
So I ran tests with a hundred example files provided by the Factur-X consortium, and it seems that for any file, the schema with the Factur-X URI always gets a null prefix, regardless of whether one or more schemas exist with this namespace.
This raises two points:

 1. If the prefix can be null, the getSchema(String, String) method
    should handle it.
 2. Is the Factur-X metadata specification a correct XMP standard, or
    is there a bug in the prefix parsing?
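
To illustrate point 1, here is a minimal sketch of a lookup that tolerates a null prefix. It is plain Java and does not use xmpbox: SchemaEntry and find are hypothetical names invented for this example, and this is not how PDFBox implements getSchema(String, String). The only idea shown is comparing with Objects.equals instead of calling equals on a prefix that may be null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class NullSafeSchemaLookup {

    /** Hypothetical stand-in for a parsed schema; NOT the xmpbox XMPSchema class. */
    public static final class SchemaEntry {
        public final String prefix;     // may be null, as observed in the report
        public final String namespace;

        public SchemaEntry(String prefix, String namespace) {
            this.prefix = prefix;
            this.namespace = namespace;
        }
    }

    /**
     * Returns the first entry matching both prefix and namespace, or null.
     * Objects.equals never throws when either argument is null.
     */
    public static SchemaEntry find(List<SchemaEntry> schemas, String prefix, String namespace) {
        for (SchemaEntry s : schemas) {
            if (Objects.equals(namespace, s.namespace) && Objects.equals(prefix, s.prefix)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String ns = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
        List<SchemaEntry> schemas = Arrays.asList(
                new SchemaEntry(null, ns),   // schema whose prefix was lost
                new SchemaEntry("zf", ns));  // same namespace, explicit prefix

        // Neither call throws, even though one stored prefix is null.
        System.out.println(find(schemas, "fx", ns) == null);
        System.out.println(find(schemas, "zf", ns).prefix);
    }
}
```

With a check of this kind, a schema whose prefix was not recorded simply does not match a non-null prefix, so the caller can fall back to a namespace-only lookup instead of hitting an exception.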
Here’s the PDF document: pdfExemple.pdf <https://cegidgroup-my.sharepoint.com/:b:/g/personal/sbabin_cegid_com/EVN8vpGbR1pEvaOuoIjyvfQBuhV1ZWFlYfAIKMfuAhd6Aw?e=cahEv2>
Here’s the code I use to retrieve the Factur-X metadata values:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.XMPSchema;
import org.apache.xmpbox.xml.DomXmpParser;
import org.apache.xmpbox.xml.XmpParsingException;

public class FacturX {
    public static void main(String[] args) throws XmpParsingException, IOException {
        File finputFile = new File(args[0]);
        // try-with-resources closes the document when done
        try (PDDocument doc = PDDocument.load(finputFile)) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();
            PDMetadata m = catalog.getMetadata();
            InputStream xmlInputStream = m.createInputStream();

            DomXmpParser p = new DomXmpParser();
            p.setStrictParsing(false);
            XMPMetadata metadata = p.parse(xmlInputStream);

            // Getting the factur-x schema with the default "fx" prefix
            // (case of two factur-x schemas with different prefixes)
            XMPSchema fx = metadata.getSchema("fx", "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#");
            // If there is no schema with the fx prefix, search for the schema
            // using only the namespace URI
            if (fx == null) {
                fx = metadata.getSchema("urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#");
            }

            if (fx == null) {
                System.out.println("This PDF document is not a valid Factur-X file");
            } else {
                String conformanceLevel = fx.getUnqualifiedTextPropertyValue("ConformanceLevel");
                String documentType = fx.getUnqualifiedTextPropertyValue("DocumentType");
                String version = fx.getUnqualifiedTextPropertyValue("Version");
                String documentFileName = fx.getUnqualifiedTextPropertyValue("DocumentFileName");
            }
        } catch (XmpParsingException | IOException e) {
            e.printStackTrace();
        }
    }
}

Thanks for your help,

*Sylvère Babin*
Developer
