I'm splitting a document into groups of 20 pages using the Splitter (PDFBox
3.0.3).
Normally this works as expected: the sum of the group sizes (~77MB) is
similar to the full document size (~64MB).
*But if I remove the annotations from each page before splitting,* ten of
the resulting groups are 64MB each (the size of the whole document), and the
sum of sizes (~660MB) is huge compared to the original document (~64MB).

*Result without removing annotations:*
Permissions Size User Date Modified Name
.rw-rw-r--   10M joan 23 dic 16:00  'test 0.pdf'
.rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
.rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
.rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
.rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
.rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
.rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
.rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
.rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
.rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
.rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
.rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
.rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
.rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
.rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'

*Result removing annotations:*
Permissions Size User Date Modified Name
.rw-rw-r--   10M joan 23 dic 16:53  'test 0.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 1.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 2.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 3.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 4.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 5.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 6.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 7.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 8.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 9.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 10.pdf'
.rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
.rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
.rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
.rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
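
For reference, the totals quoted above (~77MB versus ~660MB) can be checked with a small helper along these lines. This is only a sketch: the class name SplitSizeCheck, the totalBytes helper and the "test *.pdf" glob are illustrative assumptions based on the file names in the listings, not part of the code in question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitSizeCheck {

  // Sums the sizes (in bytes) of all regular files in dir whose names match the glob.
  static long totalBytes(Path dir, String glob) throws IOException {
    long total = 0;
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
      for (Path p : stream) {
        if (Files.isRegularFile(p)) {
          total += Files.size(p);
        }
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical usage: java SplitSizeCheck <outputDir> <original.pdf>
    Path outputDir = Path.of(args[0]);
    Path original = Path.of(args[1]);
    long parts = totalBytes(outputDir, "test *.pdf"); // assumed naming of the split files
    long whole = Files.size(original);
    System.out.printf("parts=%d bytes, original=%d bytes, ratio=%.2f%n",
        parts, whole, (double) parts / whole);
  }
}
```

A ratio close to 1 matches the first listing; a ratio near 10 matches the second one.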


*Related code:*

  private static List<Path> splitPdfByCleanAnnotations(Path fileToSplit, Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(splitAtPage);
    try (var document = Loader.loadPDF(fileToSplit.toFile())) {
      clearAnnotations(document); // <-- the call that triggers the size increase
      return splitAndSave(pathSupplier, splitter, document);
    }
  }

  private static void clearAnnotations(PDDocument document) throws IOException {
    for (int i = 0; i < document.getNumberOfPages(); i++) {
      document.getPage(i).getAnnotations().clear();
    }
  }

  private static List<Path> splitAndSave(Supplier<Path> pathSupplier, Splitter splitter, PDDocument document) throws IOException {
    return splitter.split(document).stream()
      .map(d ->
        callOrLog(() -> {
          try (d) {
            Path path = pathSupplier.get();
            d.save(path.toFile());
            return path;
          }
        })
      ).toList();
  }

Here is the link to the PDF: https://file.io/KI2CFBB87H4c

Any idea why this is happening with this PDF?

Thanks!

P.S. We split hundreds of PDFs each day and this is the first time we have
seen this issue.
