BTW, I used *20* as the value of splitAtPage.

On Mon, 23 Dec 2024 at 19:03, Joan Fisbein <joan.fisb...@clarity.ai> wrote:
> Ok, thank you for trying to replicate it. I'll try to create a full
> working example. 🤔
>
> On Mon, 23 Dec 2024 at 18:06, Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi Joan,
>>
>> I wasn't able to reproduce it, the files didn't have 64MB.
>>
>> Your code wasn't working (what is the value of splitAtPage ?) so I used
>> this:
>>
>> public class JoanFishbeinSplit
>> {
>>     public static void main(String[] args) throws IOException
>>     {
>>         splitPdfByCleanAnnotations(Paths.get("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING.pdf"));
>>     }
>>
>>     static void splitPdfByCleanAnnotations(Path fileToSplit) throws IOException
>>     {
>>         Splitter splitter = new Splitter();
>>         try (PDDocument document = Loader.loadPDF(fileToSplit.toFile()))
>>         {
>>             clearAnnotations(document);
>>             List<PDDocument> docs = splitter.split(document);
>>             docs.get(1).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-2.pdf");
>>             docs.get(2).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-3.pdf");
>>             docs.get(3).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-4.pdf");
>>             docs.get(4).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-5.pdf");
>>             docs.get(5).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-6.pdf");
>>         }
>>     }
>>
>>     private static void clearAnnotations(PDDocument document) throws IOException
>>     {
>>         for (int i = 0; i < document.getNumberOfPages(); i++)
>>         {
>>             document.getPage(i).getAnnotations().clear();
>>         }
>>     }
>> }
>>
>> Tilman
>>
>> On 23.12.2024 17:11, Joan Fisbein wrote:
>>
>> I'm splitting a document into groups of 20 pages using the Splitter (PDFBox
>> 3.0.3).
>> It works as expected, the sum of group sizes (~77MB) is similar to the full
>> document size (~64MB).
>> *But if I remove the annotations from each page before splitting,* the
>> result is a group of pages of 64MB, and the sum of sizes (~660MB) is huge
>> compared to the original document (~64MB).
>>
>> *Result without removing annotations:*
>> Permissions Size User Date Modified Name
>> .rw-rw-r--  10M  joan 23 dic 16:00  'test 0.pdf'
>> .rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
>> .rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
>> .rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
>> .rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
>> .rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
>> .rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
>> .rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
>> .rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
>> .rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
>> .rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
>> .rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
>> .rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
>> .rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
>> .rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'
>>
>> *Result removing annotations:*
>> Permissions Size User Date Modified Name
>> .rw-rw-r--  10M  joan 23 dic 16:53  'test 0.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 1.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 2.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 3.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 4.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 5.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 6.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 7.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 8.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 9.pdf'
>> .rw-rw-r--  64M  joan 23 dic 16:53  'test 10.pdf'
>> .rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
>> .rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
>> .rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
>> .rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
>>
>> *Related code:*
>>
>> private static List<Path> splitPdfByCleanAnnotations(Path fileToSplit,
>>         Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
>>     Splitter splitter = new Splitter();
>>     splitter.setSplitAtPage(splitAtPage);
>>     try (var document = Loader.loadPDF(fileToSplit.toFile())) {
>>         *clearAnnotations(document);*
>>         return splitAndSave(pathSupplier, splitter, document);
>>     }
>> }
>>
>> private static void clearAnnotations(PDDocument document) throws IOException {
>>     for (int i = 0; i < document.getNumberOfPages(); i++) {
>>         document.getPage(i).getAnnotations().clear();
>>     }
>> }
>>
>> private static List<Path> splitAndSave(Supplier<Path> pathSupplier,
>>         Splitter splitter, PDDocument document) throws IOException {
>>     return splitter.split(document).stream()
>>         .map(d ->
>>             callOrLog(() -> {
>>                 try (d) {
>>                     Path path = pathSupplier.get();
>>                     d.save(path.toFile());
>>                     return path;
>>                 }
>>             })
>>         ).toList();
>> }
>>
>> Here is the link to the PDF: https://file.io/KI2CFBB87H4c
>>
>> Any idea why this is happening with this PDF?
>>
>> Thanks!
>>
>> P.S: We split 100's of PDFs each day and this is the first time we see this
>> issue.
>
> --
>
> Joan Fisbein | Engineering Manager
> joan.fisb...@clarity.ai
> www.clarity.ai <https://clarity.ai/>
> <https://clarity.ai/in-the-news/>
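
For anyone trying to reproduce the size comparison quoted in this thread (sum of the parts ~77MB versus ~660MB, against a ~64MB original), here is a minimal sketch that uses only the Java standard library. It just adds up the sizes of the split files and reports the ratio to the original; it does not use PDFBox and says nothing about the cause. The class name SplitSizeCheck, the method names, and the "test *.pdf" glob (borrowed from the file names in the listings) are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitSizeCheck {

    /** Sums the sizes, in bytes, of the regular files in dir whose names match the glob. */
    public static long totalSize(Path dir, String glob) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, glob)) {
            for (Path part : parts) {
                if (Files.isRegularFile(part)) {
                    total += Files.size(part);
                }
            }
        }
        return total;
    }

    /** Ratio of the summed part sizes to the original file size. */
    public static double ratio(long partsTotal, long originalSize) {
        return (double) partsTotal / originalSize;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SplitSizeCheck <original.pdf> <directory with split files>");
            return;
        }
        long originalSize = Files.size(Path.of(args[0]));
        // Assumption: the split files are named "test N.pdf", as in the listings.
        long partsTotal = totalSize(Path.of(args[1]), "test *.pdf");
        System.out.printf("original: %d bytes, parts: %d bytes, ratio: %.2f%n",
                originalSize, partsTotal, ratio(partsTotal, originalSize));
    }
}
```

With the figures reported above, the ratio would come out at roughly 1.2 without removing annotations and roughly 10 after removing them, so a check like this could flag the problem automatically in a batch job.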