Fwd: Splitter does not include structure tree in documents past the first split

THausherr Wed, 14 May 2025 20:38:30 -0700

Please upload the file to a file-sharing service

Tilman


-- Original Message --
From: Alastair Porter <alast...@porter.net.nz>
Subject: Splitter does not include structure tree in documents past the first split
Date: 14 May 2025, 18:37
To: users@pdfbox.apache.org

Hi,

Apologies if my terminology is wrong on some of the following topics; I've 
not worked with PDFs in much detail before.

When using the Splitter to split PDFs, it appears that any split that doesn't 
start on the first page of the input document does not include structure tree 
elements / accessibility tags. I note the recent work in PDFBOX-2725 ([PATCH] 
Split pdf lose accessibility tags) and PDFBOX-5929 (Remove orphan annotations 
in structure tree), which may have affected some of this related code.

I can reproduce this with both the app CLI:

    java -jar pdfbox/app/target/pdfbox-app-4.0.0-SNAPSHOT.jar split -i input.pdf -outputPrefix output-split

and with the API:

    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(20);
    List<PDDocument> documents = splitter.split(inputDocument);

I also checked PDFBox 3.0.3 (the last release before PDFBOX-5929) and the 
behaviour appears to be the same. That is, it doesn't appear that the patch 
broke previously working functionality.

I am evaluating the resulting PDFs using the PAC PDF Accessibility Checker 
(https://pac.pdf-accessibility.org/en) and also the PDFBox debugger. I expect 
to see items in Root/StructTreeRoot/K in the debugger.

In the first file, I correctly see the /K element. What's more, this element 
has correctly been pruned and doesn't include any items from the input document 
that point to pages not in this split. In subsequent split files, I see no /K 
element in the StructTreeRoot at all.

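To make the expected behaviour concrete, here is a minimal toy model of it. It 
does not use PDFBox and is not how KCloner works; the Elem class and the prune 
method are hypothetical names introduced only for illustration. It assumes each 
structure element either references a page or contains kids that do, and shows 
that every split, not just the first, should end up with a non-empty pruned 
tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StructTreePruneSketch {

    /** Hypothetical stand-in for a structure element: a tag, an optional page index, and kids. */
    static final class Elem {
        final String tag;
        final Integer page;          // null when the element itself references no page
        final List<Elem> kids = new ArrayList<>();

        Elem(String tag, Integer page, Elem... kids) {
            this.tag = tag;
            this.page = page;
            this.kids.addAll(List.of(kids));
        }
    }

    /**
     * Returns a pruned copy of src that keeps only elements tied to one of the
     * given pages (directly or through a descendant), or null if nothing survives.
     */
    static Elem prune(Elem src, Set<Integer> pages) {
        Elem copy = new Elem(src.tag, src.page);
        for (Elem kid : src.kids) {
            Elem keptKid = prune(kid, pages);
            if (keptKid != null) {
                copy.kids.add(keptKid);
            }
        }
        boolean onKeptPage = src.page != null && pages.contains(src.page);
        return (onKeptPage || !copy.kids.isEmpty()) ? copy : null;
    }

    public static void main(String[] args) {
        // A four-page document: one paragraph per page under a single Document element.
        Elem root = new Elem("Document", null,
                new Elem("P", 0), new Elem("P", 1), new Elem("P", 2), new Elem("P", 3));

        Elem first = prune(root, Set.of(0, 1));   // first split: pages 0-1
        Elem second = prune(root, Set.of(2, 3));  // second split: pages 2-3

        System.out.println("first split kids:  " + first.kids.size());
        System.out.println("second split kids: " + second.kids.size());
    }
}
```

In this model both splits keep two kids each, which is the outcome I would 
expect to see under Root/StructTreeRoot/K in every output file.
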
I attached a PDF that I've been using for simple testing, which exhibits this 
behaviour.

I had a bit of a look through the existing code, and I see that in 
Splitter.java, in cloneStructureTree:

    COSBase k1 = srcStructureTreeRoot.getK();
    COSBase k2 = new KCloner(dstPageTree).createClone(k1, dstStructureTreeRoot.getCOSObject(), null);
    dstStructureTreeRoot.setK(k2);

k2 is always null after the first split, so it seems it may not be created 
correctly.

Is this a known bug, or perhaps an issue with the way I'm using the API or the 
format of the input documents?

Thanks,
Alastair

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
