I uploaded it to https://porter.net.nz/~alastair/pdfbox-split-missing-tags.pdf
Alastair

On Thu, 15 May 2025 at 05:38, <thaush...@t-online.de> wrote:

> Please upload the file to a file hosting service.
>
> Tilman
>
> -- Original Message --
> From: Alastair Porter <alast...@porter.net.nz>
> Subject: Splitter does not include structure tree in documents past
> the first split
> Date: 14.05.2025, 18:37
> To: users@pdfbox.apache.org
>
> Hi,
>
> Apologies if my terminology is wrong on some of the following topics,
> I've not worked with PDFs in much detail before.
>
> When using the Splitter to split PDFs, it appears that any split that
> doesn't start on the first page of the input document does not include
> structure tree elements / accessibility tags. I note the recent work in
> PDFBOX-2725 ([PATCH] Split pdf lose accessibility tags) and PDFBOX-5929
> (Remove orphan annotations in structure tree), which may have affected
> some of this related code.
>
> I can reproduce this with both the app CLI:
>
>     java -jar pdfbox/app/target/pdfbox-app-4.0.0-SNAPSHOT.jar split -i input.pdf -outputPrefix output-split
>
> and also with the API:
>
>     Splitter splitter = new Splitter();
>     splitter.setSplitAtPage(20);
>     List<PDDocument> documents = splitter.split(inputDocument);
>
> I also checked pdfbox 3.0.3 (last release before PDFBOX-5929) and the
> behaviour appears to be the same. That is, it doesn't appear that the
> patch broke some already existing functionality.
>
> I am evaluating the resulting PDFs using the PAC PDF Accessibility
> Checker (https://pac.pdf-accessibility.org/en) and also the pdfbox
> debugger. I expect to see items in Root/StructTreeRoot/K in the
> debugger. In the first file, I correctly see the /K element. What's
> more, this element has correctly been pruned and doesn't include any
> items from the input document which point to pages that are not in this
> split. In subsequent split files, I see no /K element in the
> StructTreeRoot at all.
>
> I attached a PDF which I've been using for simple testing, which
> exhibits this behaviour.
>
> I had a bit of a look through the existing code, and I see that in
> Splitter.java, in cloneStructureTree:
>
>     COSBase k1 = srcStructureTreeRoot.getK();
>     COSBase k2 = new KCloner(dstPageTree).createClone(k1, dstStructureTreeRoot.getCOSObject(), null);
>     dstStructureTreeRoot.setK(k2);
>
> k2 is always null after the first split, it seems like it may not be
> created correctly.
>
> Is this a known bug, or perhaps an issue with the way I'm using the API
> or the format of the input documents?
>
> Thanks,
> Alastair
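
To make the expectation in the quoted report concrete, here is a minimal, self-contained sketch of the pruning one would hope to see for every split, not only the first. It is a toy model and not PDFBox code: `StructNode` and `StructTreePruner` are hypothetical names, and it assumes structure elements can be tagged with a plain page number, whereas real structure elements refer to page objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a structure element; not a PDFBox class. */
final class StructNode {
    final String type;   // structure type, e.g. "Document", "Sect", "P"
    final Integer page;  // 1-based page holding this element's content, or null for a pure container
    final List<StructNode> kids = new ArrayList<>();

    StructNode(String type, Integer page) {
        this.type = type;
        this.page = page;
    }

    /** Appends a child and returns this node so trees can be built fluently. */
    StructNode add(StructNode kid) {
        kids.add(kid);
        return this;
    }
}

/** Hypothetical helper that models pruning a structure tree to the pages of one split. */
final class StructTreePruner {
    /**
     * Returns a pruned copy of node that keeps only content on keptPages,
     * or null if nothing at or below node refers to a kept page.
     */
    static StructNode prune(StructNode node, Set<Integer> keptPages) {
        // Keep the node's own page reference only if that page is in this split.
        Integer keptPage = (node.page != null && keptPages.contains(node.page)) ? node.page : null;
        StructNode copy = new StructNode(node.type, keptPage);
        for (StructNode kid : node.kids) {
            StructNode prunedKid = prune(kid, keptPages);
            if (prunedKid != null) {
                copy.kids.add(prunedKid);
            }
        }
        // A node survives if it has content on a kept page or at least one surviving child.
        return (keptPage != null || !copy.kids.isEmpty()) ? copy : null;
    }
}
```

In this model, pruning a tree whose content spans pages 1 to 3 down to `Set.of(3)` still yields a root with a surviving child, which corresponds to the non-empty /K one would expect in the later split files. Only a page set that no element refers to gives `null`.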