Re: Splitter does not include structure tree in documents past the first split

Alastair Porter Wed, 14 May 2025 22:52:54 -0700

I uploaded it to
https://porter.net.nz/~alastair/pdfbox-split-missing-tags.pdf


Alastair

On Thu, 15 May 2025 at 05:38, <thaush...@t-online.de> wrote:

> Please upload the file to a file-sharing host
>
> Tilman
>
> -- Original Message --
> From: Alastair Porter <alast...@porter.net.nz>
> Subject: Splitter does not include structure tree in documents past the
> first split
> Date: 14.05.2025, 18:37
> To: users@pdfbox.apache.org
>
> Hi,
>
> Apologies if my terminology is wrong on some of the following topics;
> I've not worked with PDFs in much detail before.
>
> When using the Splitter to split PDFs, it appears that any split that
> doesn't start on the first page of the input document does not include
> structure tree elements / accessibility tags. I note the recent work in
> PDFBOX-2725 ([PATCH] Split pdf lose accessibility tags) and PDFBOX-5929
> (Remove orphan annotations in structure tree), which may have affected
> some of this related code.
>
> I can reproduce this with both the app CLI:
>
>     java -jar pdfbox/app/target/pdfbox-app-4.0.0-SNAPSHOT.jar split -i input.pdf -outputPrefix output-split
>
> and also with the API:
>
>     Splitter splitter = new Splitter();
>     splitter.setSplitAtPage(20);
>     List<PDDocument> documents = splitter.split(inputDocument);
>
> I also checked PDFBox 3.0.3 (last release before PDFBOX-5929) and the
> behaviour appears to be the same; that is, it doesn't appear that the
> patch broke some already existing functionality.
>
> I am evaluating the resulting PDFs using the PAC PDF Accessibility
> Checker (https://pac.pdf-accessibility.org/en) and also the PDFBox
> debugger. I expect to see items in Root/StructTreeRoot/K in the debugger.
>
> In the first file, I correctly see the /K element. What's more, this
> element has correctly been pruned and doesn't include any items from the
> input document which point to pages that are not in this split. In
> subsequent split files, I see no /K element in the StructTreeRoot at all.
>
> I attached a PDF which I've been using for simple testing, which exhibits
> this behaviour.
>
> I had a bit of a look through the existing code, and I see that in
> Splitter.java, in cloneStructureTree:
>
>     COSBase k1 = srcStructureTreeRoot.getK();
>     COSBase k2 = new KCloner(dstPageTree).createClone(k1,
>             dstStructureTreeRoot.getCOSObject(), null);
>     dstStructureTreeRoot.setK(k2);
>
> k2 is always null after the first split; it seems like it may not be
> created correctly.
>
> Is this a known bug, or perhaps an issue with the way I'm using the API
> or the format of the input documents?
>
> Thanks,
> Alastair
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
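
For readers following the thread, here is a minimal sketch of which source
pages end up in each output document when setSplitAtPage(20) is used with the
default start and end pages. It is plain Java and does not call PDFBox; the
class SplitRanges, its ranges method, and the 45-page input are hypothetical
and chosen only for illustration. It assumes the Splitter starts a new output
document every N pages, counting from page 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the 1-based page ranges
// that land in each output document for a given chunk size.
final class SplitRanges {

    // Returns one {firstPage, lastPage} pair per output document.
    static List<int[]> ranges(int pageCount, int splitAtPage) {
        if (pageCount < 0 || splitAtPage < 1) {
            throw new IllegalArgumentException("pageCount must be >= 0 and splitAtPage >= 1");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 1; first <= pageCount; first += splitAtPage) {
            // The last chunk may be shorter than splitAtPage.
            int last = Math.min(first + splitAtPage - 1, pageCount);
            result.add(new int[] {first, last});
        }
        return result;
    }

    public static void main(String[] args) {
        // 45 pages is an arbitrary example size, not the size of the test file.
        List<int[]> r = ranges(45, 20);
        for (int i = 0; i < r.size(); i++) {
            System.out.printf("output %d: pages %d-%d%n", i + 1, r.get(i)[0], r.get(i)[1]);
        }
    }
}
```

With 45 pages and a chunk size of 20, the ranges are 1-20, 21-40 and 41-45.
Only the first output starts on page 1 of the input; the other two are the
outputs in which the report above finds no /K entry.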
