It's a bit more complicated than that. I have a small set of very large files, in which different pages belong to different people. I need to match those pages by an identifying code and then extract them into either individual files (one per person) or a single merged file with the pages sorted by person. But yes, I do close the input files after scanning them, and reopen them later to extract the relevant pages, if needed. This is actually why I opted not to use PDFMergerUtility: it can't be used to merge only parts of files, so I would first have to extract all the individual pages as separate files and then merge those.
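To make the scan-then-extract bookkeeping concrete, here is a minimal sketch of the kind of index that lets the inputs be closed right after scanning. It uses only the Java standard library. The class name PageIndex, the PageRef type, and the sample file names and codes are all hypothetical, and the actual PDF opening, text matching and page import are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: remembers where each person's pages live without keeping any PDF open. */
public class PageIndex {

    /** A page location: source file path plus zero-based page index. */
    public static final class PageRef {
        public final String sourcePath;
        public final int pageIndex;

        public PageRef(String sourcePath, int pageIndex) {
            this.sourcePath = sourcePath;
            this.pageIndex = pageIndex;
        }

        @Override
        public String toString() {
            return sourcePath + "#" + pageIndex;
        }
    }

    // TreeMap keeps the person codes sorted, which gives the "sorted by person" order.
    private final Map<String, List<PageRef>> pagesByPerson = new TreeMap<>();

    /** Called during the scan pass, once per matching page. */
    public void add(String personCode, String sourcePath, int pageIndex) {
        pagesByPerson.computeIfAbsent(personCode, k -> new ArrayList<>())
                     .add(new PageRef(sourcePath, pageIndex));
    }

    /** Pages for one person, in the order they were found (one output file per person). */
    public List<PageRef> pagesFor(String personCode) {
        return Collections.unmodifiableList(
                pagesByPerson.getOrDefault(personCode, Collections.emptyList()));
    }

    /** All pages ordered by person code (single merged output). */
    public List<PageRef> mergedOrder() {
        List<PageRef> all = new ArrayList<>();
        for (List<PageRef> refs : pagesByPerson.values()) {
            all.addAll(refs);
        }
        return all;
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.add("S002", "batch-a.pdf", 0);
        index.add("S001", "batch-a.pdf", 1);
        index.add("S001", "batch-b.pdf", 4);
        System.out.println(index.mergedOrder());
    }
}
```

Since the index holds only paths and page numbers, the extraction pass can reopen one source at a time and close it again before moving on to the next.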

On Wed, Mar 15, 2023 at 5:28 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Your text sounded like you're not picking stuff from all documents. Are
> you closing the documents where nothing is found at the earliest possible
> time?
>
> Tilman
>
> On 15.03.2023 17:21, Gilad Denneboom wrote:
> >> The question is, do you close the input files properly?
> >
> > Yes, I do, but only at the very end of the operation, as I was merging all
> > these individual files into one large one, so I had to keep the originals
> > open until I save this merged file for the last time, or it would throw an
> > exception about the PDDocument being closed.
> > I know this is not the best way of merging documents, by the way. I might
> > try to switch to using PDFMergerUtility, instead.
> >
> > On Wed, Mar 15, 2023 at 8:30 AM Andreas Lehmkuehler <andr...@lehmi.de>
> > wrote:
> >
> >> Hi Gilad,
> >>
> >> PDFBox is using a scratch file per document as long as you are using
> >> setupTempFileOnly. Handling thousands of documents ends up in thousands of
> >> scratch files. Those scratch files should be closed once the corresponding
> >> documents are closed.
> >>
> >> The question is, do you close the input files properly?
> >>
> >> Andreas
> >>
> >> On 14.03.23 at 19:16, Gilad Denneboom wrote:
> >>> Hi Maruan,
> >>>
> >>> Yes, I saw that, but it would be nice if this issue can be solved within
> >>> PDFBox, too.
> >>>
> >>> Gilad
> >>>
> >>> On Tue, Mar 14, 2023 at 4:52 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> >>> wrote:
> >>>
> >>>> You can set the ulimit on Linux - Standard is 1024 open files.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> On 14.03.2023 at 16:05, Gilad Denneboom <gilad.denneb...@gmail.com> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I created an application that opens many files (I'm talking thousands),
> >>>>> searching them for specific pages and then merges those pages into new PDF
> >>>>> files. The way I do it is by using the importPage command from the original
> >>>>> files into the split ones.
> >>>>> However, I'm getting an IOException ("Too many open files") from
> >>>>> ScratchFile after several thousand files were processed. I had a look at
> >>>>> the source code for that class and I think it might have to do with a
> >>>>> RandomAccessFile variable ("raf") not being properly closed.
> >>>>> All of the documents are opened using MemoryUsageSetting set to
> >>>>> setupTempFileOnly, by the way.
> >>>>> Could someone confirm this is the issue, and maybe help solve it? I'm using
> >>>>> PDFBox 2.0.26, by the way, and the app runs on a Mac.
> >>>>>
> >>>>> The stack-trace:
> >>>>> Exception in thread "main" java.io.IOException: Too many open files
> >>>>> at java.base/java.io.UnixFileSystem.createFileExclusively0(Native Method)
> >>>>> at java.base/java.io.UnixFileSystem.createFileExclusively(UnixFileSystem.java:356)
> >>>>> at java.base/java.io.File.createTempFile(File.java:2179)
> >>>>> at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:217)
> >>>>> at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
> >>>>> at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
> >>>>> at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
> >>>>> at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
> >>>>> at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1140)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:929)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:888)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:800)
> >>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:760)
> >>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
> >>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
> >>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1014)
> >>>>> at MergeStudentRecords_2021.main(MergeStudentRecords_2021.java:324)
> >>>>>
> >>>>> Thanks in advance!
> >>>>>
> >>>>> Gilad
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-h...@pdfbox.apache.org