Ok, thank you for trying to replicate it. I'll try to create a full working example. 🤔
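
Until that example is ready, the size anomaly from the listings quoted below can be checked mechanically. This is only a minimal sketch in plain Java with no PDFBox dependency. The class name `SplitSizeCheck`, the helpers `findOversizedParts` and `total`, and the 0.9 threshold are all hypothetical names and values picked for illustration. The sizes are the rounded MB figures from the two directory listings.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeCheck
{
    /**
     * Returns the indices of the parts whose size is at least
     * ratio * originalSize. A part covering only 20 pages should
     * normally be much smaller than the whole document.
     */
    public static List<Integer> findOversizedParts(double originalSize,
            List<Double> partSizes, double ratio)
    {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i < partSizes.size(); i++)
        {
            if (partSizes.get(i) >= originalSize * ratio)
            {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    /** Sums all part sizes (same unit as the input values). */
    public static double total(List<Double> sizes)
    {
        double sum = 0;
        for (double size : sizes)
        {
            sum += size;
        }
        return sum;
    }

    public static void main(String[] args)
    {
        double original = 64.0; // approximate size of the full document in MB

        // Rounded sizes in MB, copied from the two listings in the thread
        List<Double> keepingAnnotations = List.of(10.0, 7.9, 6.9, 6.2, 3.1,
                6.5, 6.8, 4.3, 5.0, 2.8, 5.4, 4.7, 3.5, 3.4, 0.815);
        List<Double> removingAnnotations = List.of(10.0, 64.0, 64.0, 64.0, 64.0,
                64.0, 64.0, 64.0, 64.0, 64.0, 64.0, 4.7, 3.5, 3.4, 0.833);

        // Totals come out at roughly 77 MB and roughly 660 MB
        System.out.println("Total keeping annotations:  " + total(keepingAnnotations));
        System.out.println("Total removing annotations: " + total(removingAnnotations));

        // 0.9 is an arbitrary threshold chosen for this sketch
        System.out.println("Oversized keeping:  "
                + findOversizedParts(original, keepingAnnotations, 0.9));
        System.out.println("Oversized removing: "
                + findOversizedParts(original, removingAnnotations, 0.9));
    }
}
```

With the figures from the listings, only parts 1 to 10 of the run that removes annotations reach the threshold. A check like this could run after each split to catch the problem early, but it says nothing about the cause.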

On Mon, 23 Dec 2024 at 18:06, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi Joan,
>
> I wasn't able to reproduce it, the files didn't have 64MB.
>
> Your code wasn't working (what is the value of splitAtPage ?) so I used this:
>
> public class JoanFishbeinSplit
> {
>     public static void main(String[] args) throws IOException
>     {
>         splitPdfByCleanAnnotations(Paths.get("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING.pdf"));
>     }
>
>     static void splitPdfByCleanAnnotations(Path fileToSplit) throws IOException
>     {
>         Splitter splitter = new Splitter();
>         try (PDDocument document = Loader.loadPDF(fileToSplit.toFile()))
>         {
>             clearAnnotations(document);
>             List<PDDocument> docs = splitter.split(document);
>             docs.get(1).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-2.pdf");
>             docs.get(2).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-3.pdf");
>             docs.get(3).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-4.pdf");
>             docs.get(4).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-5.pdf");
>             docs.get(5).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-6.pdf");
>         }
>     }
>
>     private static void clearAnnotations(PDDocument document) throws IOException
>     {
>         for (int i = 0; i < document.getNumberOfPages(); i++)
>         {
>             document.getPage(i).getAnnotations().clear();
>         }
>     }
> }
>
> Tilman
>
> On 23.12.2024 17:11, Joan Fisbein wrote:
>
> I'm splitting a document into groups of 20 pages using the Splitter (PDFBox 3.0.3).
> It works as expected, the sum of group sizes (~77MB) is similar to the full
> document size (~64MB).
> *But if I remove the annotations from each page before splitting,* the
> result is a group of pages of 64MB, and the sum of sizes (~660MB) is huge
> compared to the original document (~64MB).
>
> *Result without removing annotations:*
> Permissions Size User Date Modified Name
> .rw-rw-r--   10M joan 23 dic 16:00  'test 0.pdf'
> .rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
> .rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
> .rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
> .rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
> .rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
> .rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
> .rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
> .rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
> .rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
> .rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
> .rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
> .rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
> .rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
> .rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'
>
> *Result removing annotations:*
> Permissions Size User Date Modified Name
> .rw-rw-r--   10M joan 23 dic 16:53  'test 0.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 1.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 2.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 3.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 4.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 5.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 6.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 7.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 8.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 9.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 10.pdf'
> .rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
> .rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
> .rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
> .rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
>
> *Related code:*
>
> private static List<Path> splitPdfByCleanAnnotations(Path fileToSplit,
>         Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
>     Splitter splitter = new Splitter();
>     splitter.setSplitAtPage(splitAtPage);
>     try (var document = Loader.loadPDF(fileToSplit.toFile())) {
>         *clearAnnotations(document);*
>         return splitAndSave(pathSupplier, splitter, document);
>     }
> }
>
> private static void clearAnnotations(PDDocument document) throws IOException {
>     for (int i = 0; i < document.getNumberOfPages(); i++) {
>         document.getPage(i).getAnnotations().clear();
>     }
> }
>
> private static List<Path> splitAndSave(Supplier<Path> pathSupplier,
>         Splitter splitter, PDDocument document) throws IOException {
>     return splitter.split(document).stream()
>         .map(d ->
>             callOrLog(() -> {
>                 try (d) {
>                     Path path = pathSupplier.get();
>                     d.save(path.toFile());
>                     return path;
>                 }
>             })
>         ).toList();
> }
>
> Here is the link to the PDF: https://file.io/KI2CFBB87H4c
>
> Any idea why this is happening with this PDF?
>
> Thanks!
>
> P.S: We split 100's of PDFs each day and this is the first time we see this
> issue.

--
Joan Fisbein | Engineering Manager
joan.fisb...@clarity.ai
www.clarity.ai <https://clarity.ai/>
<https://clarity.ai/in-the-news/>