It's a bit more complicated than that. I have a small set of very large
files with different pages matching different people. I need to match those
pages based on some identifying code, and then extract them into either
individual files (one per person) or a single merged file with those pages
sorted by person. But yes, I do close the input files after scanning them,
and then open them later on to extract the relevant pages from them, if
needed. This is actually the reason I opted not to use PDFMergerUtility, as
it would require me to extract all the individual pages as separate files,
so I could merge them later on (as it's not possible to use it to only
merge parts of files).
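
To make the two-pass approach above more concrete, here is a rough sketch of
the bookkeeping only, not my actual code. It assumes the scanning pass yields,
for every input file, the identifying code found on each page (null when a page
matches nobody). The names PageIndexSketch, PageRef, indexByPerson and
filesNeeded are made up for illustration, and the PDFBox calls themselves
(loading, importPage, saving, closing) are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: keeps only lightweight references to matched pages,
// so every input document can be closed right after it has been scanned.
class PageIndexSketch {

    /** One matched page: its source file and zero-based page index. */
    static final class PageRef {
        final String sourceFile;
        final int pageIndex;

        PageRef(String sourceFile, int pageIndex) {
            this.sourceFile = sourceFile;
            this.pageIndex = pageIndex;
        }
    }

    /**
     * Pass 1: turn "file -> code per page" into "person code -> pages".
     * A TreeMap keeps the result sorted by person code.
     */
    static Map<String, List<PageRef>> indexByPerson(Map<String, List<String>> codesPerFile) {
        Map<String, List<PageRef>> byPerson = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : codesPerFile.entrySet()) {
            List<String> codes = file.getValue();
            for (int page = 0; page < codes.size(); page++) {
                String code = codes.get(page);
                if (code == null) {
                    continue; // this page matched nobody
                }
                byPerson.computeIfAbsent(code, k -> new ArrayList<>())
                        .add(new PageRef(file.getKey(), page));
            }
        }
        return byPerson;
    }

    /** Pass 2 planning: the distinct files to reopen for one person, in first-seen order. */
    static List<String> filesNeeded(List<PageRef> pages) {
        List<String> files = new ArrayList<>();
        for (PageRef ref : pages) {
            if (!files.contains(ref.sourceFile)) {
                files.add(ref.sourceFile);
            }
        }
        return files;
    }
}
```

With an index like this, only the files returned by filesNeeded have to be
open while one person's output is written, and they can be closed again once
that output has been saved.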

On Wed, Mar 15, 2023 at 5:28 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Your text sounded like you're not picking stuff from all documents. Are
> you closing the documents where nothing is found at the earliest possible
> time?
> Tilman
>
> On 15.03.2023 17:21, Gilad Denneboom wrote:
> >> The question is, do you close the input files properly?
> > Yes, I do, but only at the very end of the operation, as I was merging all
> > these individual files into one large one, so I had to keep the originals
> > open until I save this merged file for the last time, or it would throw an
> > exception about the PDDocument being closed.
> > I know this is not the best way of merging documents, by the way. I might
> > try to switch to using PDFMergerUtility, instead.
> >
> > On Wed, Mar 15, 2023 at 8:30 AM Andreas Lehmkuehler <andr...@lehmi.de>
> > wrote:
> >
> >> Hi Gilad,
> >>
> >> PDFBox is using a scratch file per document as long as you are using
> >> setupTempFileOnly. Handling thousands of documents ends up in thousands of
> >> scratch files. Those scratch files should be closed once the corresponding
> >> documents are closed.
> >>
> >> The question is, do you close the input files properly?
> >>
> >> Andreas
> >>
> >> On 14.03.23 at 19:16, Gilad Denneboom wrote:
> >>> Hi Maruan,
> >>>
> >>> Yes, I saw that, but it would be nice if this issue can be solved within
> >>> PDFBox, too.
> >>>
> >>> Gilad
> >>>
> >>> On Tue, Mar 14, 2023 at 4:52 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> >>> wrote:
> >>>
> >>>> You can raise the ulimit on Linux; the default is 1024 open files.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> On 14.03.2023 at 16:05, Gilad Denneboom <gilad.denneb...@gmail.com> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I created an application that opens many files (I'm talking thousands),
> >>>>> searches them for specific pages and then merges those pages into new PDF
> >>>>> files. The way I do it is by using the importPage command from the original
> >>>>> files into the split ones.
> >>>>> However, I'm getting an IOException ("Too many open files") from
> >>>>> ScratchFile after several thousand files were processed. I had a look at
> >>>>> the source code for that class and I think it might have to do with a
> >>>>> RandomAccessFile variable ("raf") not being properly closed.
> >>>>> All of the documents are opened using MemoryUsageSetting set to
> >>>>> setupTempFileOnly, by the way.
> >>>>> Could someone confirm this is the issue, and maybe help solve it? I'm using
> >>>>> PDFBox 2.0.26, by the way, and the app runs on a Mac.
> >>>>>
> >>>>> The stack-trace:
> >>>>> Exception in thread "main" java.io.IOException: Too many open files
> >>>>> at java.base/java.io.UnixFileSystem.createFileExclusively0(Native Method)
> >>>>> at java.base/java.io.UnixFileSystem.createFileExclusively(UnixFileSystem.java:356)
> >>>>> at java.base/java.io.File.createTempFile(File.java:2179)
> >>>>> at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:217)
> >>>>> at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
> >>>>> at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
> >>>>> at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
> >>>>> at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
> >>>>> at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1140)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:929)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:888)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:800)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:760)
> >>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
> >>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1014)
> >>>>> at MergeStudentRecords_2021.main(MergeStudentRecords_2021.java:324)
> >>>>>
> >>>>> Thanks in advance!
> >>>>>
> >>>>> Gilad
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>>>
> >>>>
