Re: [POSSIBLE INSECURE EMAIL] Re: Rare behaviour splitting a PDF

Joan Fisbein Mon, 23 Dec 2024 10:06:24 -0800

Ok, thank you for trying to replicate it. I'll try to create a full working
example. 🤔


On Mon, 23 Dec 2024 at 18:06, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi Joan,
>
> I wasn't able to reproduce it; none of the output files were 64MB.
>
>
> Your code didn't run as posted (what is the value of splitAtPage?), so I used
> this:
>
> import java.io.IOException;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.util.List;
>
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.multipdf.Splitter;
> import org.apache.pdfbox.pdmodel.PDDocument;
>
> public class JoanFishbeinSplit
> {
>     public static void main(String[] args) throws IOException
>     {
>         splitPdfByCleanAnnotations(
>                 Paths.get("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING.pdf"));
>     }
>
>     static void splitPdfByCleanAnnotations(Path fileToSplit) throws IOException
>     {
>         // Default Splitter settings: one output document per page.
>         Splitter splitter = new Splitter();
>         try (PDDocument document = Loader.loadPDF(fileToSplit.toFile()))
>         {
>             clearAnnotations(document);
>             List<PDDocument> docs = splitter.split(document);
>             // Save a few of the resulting documents to compare their sizes.
>             docs.get(1).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-2.pdf");
>             docs.get(2).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-3.pdf");
>             docs.get(3).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-4.pdf");
>             docs.get(4).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-5.pdf");
>             docs.get(5).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-6.pdf");
>         }
>     }
>
>     // Removes all annotations from every page of the document.
>     private static void clearAnnotations(PDDocument document) throws IOException
>     {
>         for (int i = 0; i < document.getNumberOfPages(); i++)
>         {
>             document.getPage(i).getAnnotations().clear();
>         }
>     }
> }
>
>
> Tilman
>
> On 23.12.2024 17:11, Joan Fisbein wrote:
>
> I'm splitting a document into groups of 20 pages using the Splitter (PDFBox
> 3.0.3).
> It works as expected: the sum of the group sizes (~77MB) is similar to the full
> document size (~64MB).
> *But if I remove the annotations from each page before splitting,* most of
> the groups come out at 64MB each, and the sum of sizes (~660MB) is huge
> compared to the original document (~64MB).
>
> *Result without removing annotations:*
> Permissions Size User Date Modified Name
> .rw-rw-r--   10M joan 23 dic 16:00  'test 0.pdf'
> .rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
> .rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
> .rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
> .rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
> .rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
> .rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
> .rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
> .rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
> .rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
> .rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
> .rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
> .rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
> .rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
> .rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'
>
> *Result removing annotations:*
> Permissions Size User Date Modified Name
> .rw-rw-r--   10M joan 23 dic 16:53  'test 0.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 1.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 2.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 3.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 4.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 5.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 6.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 7.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 8.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 9.pdf'
> .rw-rw-r--   64M joan 23 dic 16:53  'test 10.pdf'
> .rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
> .rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
> .rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
> .rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
>
>
> *Related code:*
>
>   private static List<Path> splitPdfByCleanAnnotations(Path fileToSplit,
>       Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
>     Splitter splitter = new Splitter();
>     splitter.setSplitAtPage(splitAtPage);
>     try (var document = Loader.loadPDF(fileToSplit.toFile())) {
>       clearAnnotations(document); // the call that triggers the size increase
>       return splitAndSave(pathSupplier, splitter, document);
>     }
>   }
>
>   private static void clearAnnotations(PDDocument document) throws IOException {
>     for (int i = 0; i < document.getNumberOfPages(); i++) {
>       document.getPage(i).getAnnotations().clear();
>     }
>   }
>
>   private static List<Path> splitAndSave(Supplier<Path> pathSupplier,
>       Splitter splitter, PDDocument document) throws IOException {
>     return splitter.split(document).stream()
>       .map(d ->
>         callOrLog(() -> {
>           try (d) {
>             Path path = pathSupplier.get();
>             d.save(path.toFile());
>             return path;
>           }
>         })
>       ).toList();
>   }
>
> Here is the link to the PDF: https://file.io/KI2CFBB87H4c
>
> Any idea why this is happening with this PDF?
>
> Thanks!
>
> P.S.: We split hundreds of PDFs each day and this is the first time we have
> seen this issue.
>
>
>
>
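
The size comparison in the report above (sum of the part sizes versus the
size of the source file) can be automated when checking other PDFs. The
following is only a rough sketch using plain java.nio, with no PDFBox calls.
The class name SplitSizeCheck, its method names, and the 0.9 threshold are
assumptions made for illustration; they are not part of the original code or
of PDFBox.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags split parts that are almost as large as the source file.
public class SplitSizeCheck {

    // Returns the indices of parts whose size is at least ratio * originalSize.
    static List<Integer> suspiciousParts(long originalSize, List<Long> partSizes, double ratio) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < partSizes.size(); i++) {
            if (partSizes.get(i) >= originalSize * ratio) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    // Sums the sizes of all parts.
    static long totalSize(List<Long> partSizes) {
        long total = 0;
        for (long size : partSizes) {
            total += size;
        }
        return total;
    }

    // Reads the size in bytes of each file from disk.
    static List<Long> sizesOf(List<Path> files) throws IOException {
        List<Long> sizes = new ArrayList<>();
        for (Path file : files) {
            sizes.add(Files.size(file));
        }
        return sizes;
    }

    // Usage: java SplitSizeCheck original.pdf part0.pdf part1.pdf ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: SplitSizeCheck <original> <part>...");
            return;
        }
        long originalSize = Files.size(Path.of(args[0]));
        List<Path> parts = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            parts.add(Path.of(args[i]));
        }
        List<Long> sizes = sizesOf(parts);
        System.out.println("original bytes: " + originalSize);
        System.out.println("sum of parts:   " + totalSize(sizes));
        // 0.9 is an arbitrary threshold chosen for this sketch.
        for (int index : suspiciousParts(originalSize, sizes, 0.9)) {
            System.out.println("suspicious part: " + parts.get(index));
        }
    }
}
```

Run against the output of a split, it prints the parts that look like they
carry (nearly) the whole source document, which is the symptom seen in the
second listing.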

-- 

Joan Fisbein | Engineering Manager
joan.fisb...@clarity.ai
www.clarity.ai <https://clarity.ai/>
<https://clarity.ai/in-the-news/>
