Re: [POSSIBLE INSECURE EMAIL] Re: Rare behaviour splitting a PDF

Joan Fisbein Mon, 23 Dec 2024 10:06:54 -0800

BTW, I used *20* as the value of splitAtPage.
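
For reference, a minimal sketch of what a split value of 20 implies, assuming the Splitter starts a new output document every splitAtPage pages (which is how its Javadoc describes the setting). The class and method names are hypothetical, and the code only does the page arithmetic; it does not call PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGrouping {

    /** Page counts of the parts produced when a new part starts every splitAtPage pages. */
    static List<Integer> groupSizes(int totalPages, int splitAtPage) {
        if (totalPages < 0 || splitAtPage < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and splitAtPage >= 1");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int first = 0; first < totalPages; first += splitAtPage) {
            // every part is full except possibly the last one
            sizes.add(Math.min(splitAtPage, totalPages - first));
        }
        return sizes;
    }

    /** Smallest and largest page count that yields exactly the given number of parts. */
    static int[] pageCountRange(int parts, int splitAtPage) {
        if (parts < 1 || splitAtPage < 1) {
            throw new IllegalArgumentException("parts and splitAtPage must be >= 1");
        }
        return new int[] {(parts - 1) * splitAtPage + 1, parts * splitAtPage};
    }

    public static void main(String[] args) {
        // hypothetical 45-page document, split at 20
        System.out.println(groupSizes(45, 20)); // prints [20, 20, 5]

        // 15 output files ('test 0.pdf' to 'test 14.pdf') with splitAtPage = 20
        int[] range = pageCountRange(15, 20);
        System.out.println(range[0] + " to " + range[1] + " pages"); // prints 281 to 300 pages
    }
}
```

Under that assumption, the 15 output files in the listings quoted below put the source PDF at somewhere between 281 and 300 pages.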

On Mon, 23 Dec 2024 at 19:03, Joan Fisbein <joan.fisb...@clarity.ai> wrote:


> Ok, thank you for trying to replicate it. I'll try to create a full
> working example. 🤔
>
> On Mon, 23 Dec 2024 at 18:06, Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi Joan,
>>
>> I wasn't able to reproduce it; none of the files I got was 64MB.
>>
>>
>> Your code didn't run as posted (what is the value of splitAtPage?), so I
>> used this:
>>
>> import java.io.IOException;
>> import java.nio.file.Path;
>> import java.nio.file.Paths;
>> import java.util.List;
>>
>> import org.apache.pdfbox.Loader;
>> import org.apache.pdfbox.multipdf.Splitter;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>>
>> public class JoanFishbeinSplit
>> {
>>     public static void main(String[] args) throws IOException
>>     {
>>         splitPdfByCleanAnnotations(Paths.get("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING.pdf"));
>>     }
>>
>>     static void splitPdfByCleanAnnotations(Path fileToSplit) throws IOException
>>     {
>>         // setSplitAtPage() is not called here, so the Splitter default
>>         // of one page per output document applies
>>         Splitter splitter = new Splitter();
>>         try (PDDocument document = Loader.loadPDF(fileToSplit.toFile()))
>>         {
>>             clearAnnotations(document);
>>             List<PDDocument> docs = splitter.split(document);
>>
>>             // save the 2nd to 6th split results
>>             docs.get(1).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-2.pdf");
>>             docs.get(2).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-3.pdf");
>>             docs.get(3).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-4.pdf");
>>             docs.get(4).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-5.pdf");
>>             docs.get(5).save("0967c5b4-d85d-41ca-93ff-165e75d4aa57_ERROR_SPLITTING-6.pdf");
>>         }
>>     }
>>
>>     // remove every annotation from every page before splitting
>>     private static void clearAnnotations(PDDocument document) throws IOException
>>     {
>>         for (int i = 0; i < document.getNumberOfPages(); i++)
>>         {
>>             document.getPage(i).getAnnotations().clear();
>>         }
>>     }
>> }
>>
>>
>> Tilman
>>
>> On 23.12.2024 17:11, Joan Fisbein wrote:
>>
>> I'm splitting a document into groups of 20 pages using the Splitter (PDFBox
>> 3.0.3).
>> Normally it works as expected: the sum of the group sizes (~77MB) is close
>> to the full document size (~64MB).
>> *But if I remove the annotations from each page before splitting,* ten of
>> the groups come out at 64MB each, and the sum of sizes (~660MB) is huge
>> compared to the original document (~64MB).
>>
>> *Result without removing annotations:*
>> Permissions Size User Date Modified Name
>> .rw-rw-r--   10M joan 23 dic 16:00  'test 0.pdf'
>> .rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
>> .rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
>> .rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
>> .rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
>> .rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
>> .rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
>> .rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
>> .rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
>> .rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
>> .rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
>> .rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
>> .rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
>> .rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
>> .rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'
>>
>> *Result removing annotations:*
>> Permissions Size User Date Modified Name
>> .rw-rw-r--   10M joan 23 dic 16:53  'test 0.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 1.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 2.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 3.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 4.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 5.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 6.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 7.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 8.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 9.pdf'
>> .rw-rw-r--   64M joan 23 dic 16:53  'test 10.pdf'
>> .rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
>> .rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
>> .rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
>> .rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
>>
>>
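
The two totals quoted in the report can be checked against the listings above. The sketch below is plain Java with hypothetical class and method names. It assumes the size column uses decimal prefixes (k = 1000 bytes, M = 1000 k) with a comma as the decimal separator, as the output suggests, and it works in MB throughout.

```java
import java.util.List;
import java.util.Locale;

public class SplitSizeCheck {

    // Size column of the listing produced without removing annotations
    static final List<String> WITH_ANNOTATIONS = List.of(
            "10M", "7,9M", "6,9M", "6,2M", "3,1M", "6,5M", "6,8M", "4,3M",
            "5,0M", "2,8M", "5,4M", "4,7M", "3,5M", "3,4M", "815k");

    // Size column of the listing produced after clearing annotations
    static final List<String> ANNOTATIONS_CLEARED = List.of(
            "10M", "64M", "64M", "64M", "64M", "64M", "64M", "64M",
            "64M", "64M", "64M", "4,7M", "3,5M", "3,4M", "833k");

    /** Parses a size such as "7,9M", "815k" or "10M" into megabytes (decimal units assumed). */
    static double toMegabytes(String size) {
        char unit = size.charAt(size.length() - 1);
        String number = size.substring(0, size.length() - 1).replace(',', '.');
        double value = Double.parseDouble(number);
        switch (unit) {
            case 'k':
                return value / 1000.0;
            case 'M':
                return value;
            default:
                throw new IllegalArgumentException("Unexpected unit in: " + size);
        }
    }

    /** Sums a list of size strings, in megabytes. */
    static double totalMegabytes(List<String> sizes) {
        return sizes.stream().mapToDouble(SplitSizeCheck::toMegabytes).sum();
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "annotations kept:    %.1f MB%n",
                totalMegabytes(WITH_ANNOTATIONS));
        System.out.printf(Locale.ROOT, "annotations cleared: %.1f MB%n",
                totalMegabytes(ANNOTATIONS_CLEARED));
    }
}
```

Run as is, it should report about 77.3 MB with annotations kept and about 662.4 MB with annotations cleared, which agrees with the ~77MB and ~660MB figures. Ten of the fifteen parts account for 640 MB of the second total.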
>> *Related code:*
>>
>>   private static List<Path> splitPdfByCleanAnnotations(
>>       Path fileToSplit, Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
>>     Splitter splitter = new Splitter();
>>     // start a new output document every splitAtPage pages
>>     splitter.setSplitAtPage(splitAtPage);
>>     try (var document = Loader.loadPDF(fileToSplit.toFile())) {
>>       // the call that makes the difference: without it the sizes are normal
>>       clearAnnotations(document);
>>       return splitAndSave(pathSupplier, splitter, document);
>>     }
>>   }
>>
>>   // remove every annotation from every page before splitting
>>   private static void clearAnnotations(PDDocument document) throws IOException {
>>     for (int i = 0; i < document.getNumberOfPages(); i++) {
>>       document.getPage(i).getAnnotations().clear();
>>     }
>>   }
>>
>>   // save each split document to a path from pathSupplier, closing it afterwards;
>>   // callOrLog(...) is a helper that is not included in this snippet
>>   private static List<Path> splitAndSave(
>>       Supplier<Path> pathSupplier, Splitter splitter, PDDocument document) throws IOException {
>>     return splitter.split(document).stream()
>>       .map(d ->
>>         callOrLog(() -> {
>>           try (d) {
>>             Path path = pathSupplier.get();
>>             d.save(path.toFile());
>>             return path;
>>           }
>>         })
>>       ).toList();
>>   }
>>
>> Here is the link to the PDF: https://file.io/KI2CFBB87H4c
>>
>> Any idea why this is happening with this PDF?
>>
>> Thanks!
>>
>> P.S.: We split hundreds of PDFs each day and this is the first time we
>> have seen this issue.
>>
>>
>>
>>
>
> --
>
> Joan Fisbein | Engineering Manager
> joan.fisb...@clarity.ai
> www.clarity.ai <https://clarity.ai/>
> <https://clarity.ai/in-the-news/>
>


