Yeah, I thought of doing that, too... OK, thanks for the help, anyway!

On Fri, Mar 17, 2023 at 7:51 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> On 15.03.23 at 17:51, Gilad Denneboom wrote:
> > It's a bit more complicated than that. I have a small set of very large
> > files with different pages matching different people. I need to match
> > those pages based on some identifying code, and then extract them into
> > either individual files (one per person) or a single merged file with
> > those pages sorted by person. But yes, I do close the input files after
> > scanning them, and then open them later on to extract the relevant pages
> > from them, if needed. This is actually the reason I opted not to use
> > PDFMergerUtility, as it would require me to extract all the individual
> > pages as separate files, so I could merge them later on (as it's not
> > possible to use it to only merge parts of files).
> How about extracting those pages using the splitter? This will produce the
> file per person you are looking for. Use the merger to get the summary
> file. If there are too many files, use several steps to do the merge.
>
> Andreas
>
> >
> > On Wed, Mar 15, 2023 at 5:28 PM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> Your text sounded like you're not picking stuff from all documents. Are
> >> you closing the documents where nothing is found at the earliest
> >> possible time?
> >> Tilman
> >>
> >> On 15.03.2023 17:21, Gilad Denneboom wrote:
> >>>> The question is, do you close the input files properly?
> >>> Yes, I do, but only at the very end of the operation, as I was merging
> >>> all these individual files into one large one, so I had to keep the
> >>> originals open until I save this merged file for the last time, or it
> >>> would throw an exception about the PDDocument being closed.
> >>> I know this is not the best way of merging documents, by the way. I
> >>> might try to switch to using PDFMergerUtility, instead.
> >>>
> >>> On Wed, Mar 15, 2023 at 8:30 AM Andreas Lehmkuehler <andr...@lehmi.de>
> >>> wrote:
> >>>
> >>>> Hi Gilad,
> >>>>
> >>>> PDFBox is using a scratch file per document as long as you are using
> >>>> setupTempFileOnly. Handling thousands of documents ends up in
> >>>> thousands of scratch files. Those scratch files should be closed once
> >>>> the corresponding documents are closed.
> >>>>
> >>>> The question is, do you close the input files properly?
> >>>>
> >>>> Andreas
> >>>>
> >>>> On 14.03.23 at 19:16, Gilad Denneboom wrote:
> >>>>> Hi Maruan,
> >>>>>
> >>>>> Yes, I saw that, but it would be nice if this issue can be solved
> >>>>> within PDFBox, too.
> >>>>>
> >>>>> Gilad
> >>>>>
> >>>>> On Tue, Mar 14, 2023 at 4:52 PM Maruan Sahyoun
> >>>>> <sahy...@fileaffairs.de> wrote:
> >>>>>
> >>>>>> You can set the ulimit on Linux - the default is 1024 open files.
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>> On 14.03.2023 at 16:05, Gilad Denneboom
> >>>>>>> <gilad.denneb...@gmail.com> wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I created an application that opens many files (I'm talking
> >>>>>>> thousands), searching them for specific pages and then merges
> >>>>>>> those pages into new PDF files. The way I do it is by using the
> >>>>>>> importPage command from the original files into the split ones.
> >>>>>>> However, I'm getting an IOException ("Too many open files") from
> >>>>>>> ScratchFile after several thousand files were processed. I had a
> >>>>>>> look at the source code for that class and I think it might have
> >>>>>>> to do with a RandomAccessFile variable ("raf") not being properly
> >>>>>>> closed.
> >>>>>>> All of the documents are opened using MemoryUsageSetting set to
> >>>>>>> setupTempFileOnly, by the way.
> >>>>>>> Could someone confirm this is the issue, and maybe help solve it?
> >>>>>>> I'm using PDFBox 2.0.26, by the way, and the app runs on a Mac.
> >>>>>>>
> >>>>>>> The stack-trace:
> >>>>>>> Exception in thread "main" java.io.IOException: Too many open files
> >>>>>>> at java.base/java.io.UnixFileSystem.createFileExclusively0(Native Method)
> >>>>>>> at java.base/java.io.UnixFileSystem.createFileExclusively(UnixFileSystem.java:356)
> >>>>>>> at java.base/java.io.File.createTempFile(File.java:2179)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:217)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
> >>>>>>> at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
> >>>>>>> at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
> >>>>>>> at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1140)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:929)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:888)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:800)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:760)
> >>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
> >>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1014)
> >>>>>>> at MergeStudentRecords_2021.main(MergeStudentRecords_2021.java:324)
> >>>>>>>
> >>>>>>> Thanks in advance!
> >>>>>>>
> >>>>>>> Gilad
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
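
A note on the "use several steps to do the merge" suggestion in this thread: the idea can be sketched in plain Java, independent of PDFBox. The names below (StagedMerge, mergeInStages, batchSize, mergeStep) are hypothetical and chosen only for illustration. The mergeStep callback stands in for whatever actually merges one group of files, for example a PDFMergerUtility run that writes an intermediate file and closes everything before the next group starts. This is a rough sketch under those assumptions, not how PDFBox does it internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: merges many inputs in rounds so that no single
// merge step ever has to handle more than batchSize inputs at once.
class StagedMerge {

    // mergeStep is a placeholder for the real merge of one group
    // (assumed to open its inputs, write one output, and close everything).
    static <T> T mergeInStages(List<T> inputs, int batchSize,
                               Function<List<T>, T> mergeStep) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        if (batchSize < 2) {
            throw new IllegalArgumentException("batchSize must be at least 2");
        }
        List<T> current = new ArrayList<>(inputs);
        // Each round shrinks the list, so the loop ends with one result.
        while (current.size() > 1) {
            List<T> next = new ArrayList<>();
            for (int start = 0; start < current.size(); start += batchSize) {
                int end = Math.min(start + batchSize, current.size());
                List<T> group = new ArrayList<>(current.subList(start, end));
                // A group of one needs no merge; carry it forward unchanged.
                next.add(group.size() == 1 ? group.get(0) : mergeStep.apply(group));
            }
            current = next;
        }
        return current.get(0);
    }

    public static void main(String[] args) {
        // Strings stand in for file names; the lambda only records the grouping.
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        String plan = mergeInStages(files, 2,
                group -> "(" + String.join("+", group) + ")");
        System.out.println(plan);
    }
}
```

Input order is preserved across rounds, so pages sorted by person before the merge stay sorted afterwards. The batch size would be picked well below the open-file limit mentioned earlier in the thread.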