Hopefully fixed in
https://issues.apache.org/jira/browse/PDFBOX-5668
Please try a snapshot in about an hour at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.30-SNAPSHOT/
Tilman
On 11.07.2023 11:25, Sylvere Babin wrote:
Hello,
We use PDFBox to read the XMP metadata of PDF documents in the
Factur-X standard, a Franco-German e-invoicing standard.
The XML schema corresponding to this metadata is quite simple, and
retrieving the values works perfectly with the
org.apache.xmpbox.XMPMetadata.getSchema(String) method.
By default, the prefix is fx:
<rdf:Description
xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#"
rdf:about="">
<fx:DocumentType>INVOICE</fx:DocumentType>
<fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
<fx:Version>1.0</fx:Version>
<fx:ConformanceLevel>BASIC</fx:ConformanceLevel>
</rdf:Description>
In one case, there was a document with two schemas with the same
namespace URI but different prefixes (fx and zf).
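To make the "same namespace URI, different prefixes" situation concrete, here is a minimal sketch using only the JDK's namespace-aware DOM parser (not xmpbox). The XML packet and the class name PrefixDemo are made up for illustration and are not taken from the attached PDF; the EXTENDED value is only a placeholder. It shows that both elements resolve to the same namespace while keeping their own prefix:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PrefixDemo {
    static final String FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Illustrative packet (an assumption, not the real file): two descriptions
    // bind the prefixes fx and zf to the same namespace URI.
    static final String XML =
        "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:fx=\"" + FX_NS + "\">"
        + "<fx:ConformanceLevel>BASIC</fx:ConformanceLevel>"
        + "</rdf:Description>"
        + "<rdf:Description rdf:about=\"\" xmlns:zf=\"" + FX_NS + "\">"
        + "<zf:ConformanceLevel>EXTENDED</zf:ConformanceLevel>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    // Returns "prefix=value" for every ConformanceLevel element in the
    // Factur-X namespace, in document order.
    static List<String> conformanceLevelsByPrefix(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getPrefix()/getNamespaceURI()
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(FX_NS, "ConformanceLevel");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.add(e.getPrefix() + "=" + e.getTextContent());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse the sample packet", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(conformanceLevelsByPrefix(XML));
    }
}
```

At the plain XML level the prefix is available on each element, so a null prefix on the schema object would have to come from a later processing step.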
I tried the org.apache.xmpbox.XMPMetadata.getSchema(String, String)
method, which according to the documentation seems to handle this case
by filtering by prefix.
I got a NullPointerException from this method (line 268), because the
prefix of the Factur-X schema in the
org.apache.xmpbox.XMPMetadata.schemas collection was null.
So, I've run tests with a hundred example files provided by the
Factur-X consortium, and it seems that for any file, the schema with
the Factur-X URI always gets a null prefix, regardless of whether one
or more schemas exist with this namespace.
This raises two points:
1. If the prefix can be null, the getSchema(String, String) method
should handle it.
2. Is the Factur-X metadata specification valid XMP, or
is there a bug in the prefix parsing?
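For point 1, a null-tolerant comparison could look like the sketch below. SchemaEntry is a hypothetical stand-in that only models the getPrefix() and getNamespace() accessors of XMPSchema, so that the example runs without PDFBox on the classpath; it is not the actual fix applied in PDFBox. As a workaround on an affected version, the same loop could presumably be run over the list returned by XMPMetadata.getAllSchemas().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for org.apache.xmpbox.schema.XMPSchema: only the two
// accessors that matter for the lookup are modelled.
final class SchemaEntry {
    private final String prefix;    // may be null, as observed in the report
    private final String namespace;

    SchemaEntry(String prefix, String namespace) {
        this.prefix = prefix;
        this.namespace = namespace;
    }

    String getPrefix() { return prefix; }
    String getNamespace() { return namespace; }
}

final class NullSafeSchemaLookup {
    static final String FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";

    // Returns the first entry matching both namespace URI and prefix, or null
    // if there is none. Objects.equals tolerates null on either side, so an
    // entry with a null prefix is simply skipped instead of throwing.
    static SchemaEntry find(List<SchemaEntry> schemas, String prefix, String nsURI) {
        for (SchemaEntry s : schemas) {
            if (Objects.equals(s.getNamespace(), nsURI)
                    && Objects.equals(s.getPrefix(), prefix)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SchemaEntry> schemas = Arrays.asList(
            new SchemaEntry(null, FX_NS),   // prefix lost during parsing
            new SchemaEntry("zf", FX_NS));
        System.out.println(find(schemas, "fx", FX_NS) == null);
        System.out.println(find(schemas, "zf", FX_NS).getPrefix());
    }
}
```

With this comparison, a lookup for "fx" returns null rather than failing, so the fallback to getSchema(String) in the code further down would still be reached.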
Here's the PDF document: pdfExemple.pdf
<https://cegidgroup-my.sharepoint.com/:b:/g/personal/sbabin_cegid_com/EVN8vpGbR1pEvaOuoIjyvfQBuhV1ZWFlYfAIKMfuAhd6Aw?e=cahEv2>
Here's the code I use to retrieve the Factur-X metadata values:
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.XMPSchema;
import org.apache.xmpbox.xml.DomXmpParser;
import org.apache.xmpbox.xml.XmpParsingException;

public class FacturX {
    private static final String FX_NS =
            "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";

    public static void main(String[] args) {
        File finputFile = new File(args[0]);
        // try-with-resources closes the document and the stream in all cases
        try (PDDocument doc = PDDocument.load(finputFile)) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();
            PDMetadata m = catalog.getMetadata();
            if (m == null) {
                System.out.println("This PDF document has no XMP metadata");
                return;
            }
            XMPMetadata metadata;
            try (InputStream xmlInputStream = m.createInputStream()) {
                DomXmpParser p = new DomXmpParser();
                p.setStrictParsing(false);
                metadata = p.parse(xmlInputStream);
            }
            // Getting the Factur-X schema with the default "fx" prefix
            // (case of two Factur-X schemas with different prefixes).
            // This is the call that throws the NullPointerException.
            XMPSchema fx = metadata.getSchema("fx", FX_NS);
            // If there is no schema with the fx prefix, searching for the
            // schema with the namespace URI only
            if (fx == null) {
                fx = metadata.getSchema(FX_NS);
            }
            if (fx == null) {
                System.out.println("This PDF document is not a valid Factur-X file");
            } else {
                String conformanceLevel =
                        fx.getUnqualifiedTextPropertyValue("ConformanceLevel");
                String documentType =
                        fx.getUnqualifiedTextPropertyValue("DocumentType");
                String version =
                        fx.getUnqualifiedTextPropertyValue("Version");
                String documentFileName =
                        fx.getUnqualifiedTextPropertyValue("DocumentFileName");
            }
        } catch (XmpParsingException | IOException e) {
            e.printStackTrace();
        }
    }
}
Thanks for your help,
*Sylvère Babin*
Developer