Yeah, I thought of doing that, too... OK, thanks for help, anyway!
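
For reference, the "several steps" merge suggested below could be planned roughly like this. It is only a sketch: MergePlanSketch, Step, plan and the "merged-r..." file names are made up for illustration and are not part of PDFBox. Each planned step would be handed to something like PDFMergerUtility (addSource for each input, then mergeDocuments), so no single step has more than maxInputs files open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: plans a multi-step merge so that
// no single merge step combines more than maxInputs source files.
class MergePlanSketch {

    // One merge step: the inputs to combine and the file they are written to.
    static final class Step {
        final List<String> inputs;
        final String output;

        Step(List<String> inputs, String output) {
            this.inputs = inputs;
            this.output = output;
        }
    }

    static List<Step> plan(List<String> files, int maxInputs) {
        if (maxInputs < 2) {
            throw new IllegalArgumentException("maxInputs must be at least 2");
        }
        List<Step> steps = new ArrayList<>();
        List<String> current = new ArrayList<>(files);
        int round = 0;
        // Each round shrinks the list, so the loop ends with one file left.
        while (current.size() > 1) {
            List<String> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxInputs) {
                int end = Math.min(i + maxInputs, current.size());
                List<String> batch = new ArrayList<>(current.subList(i, end));
                if (batch.size() == 1) {
                    // Lone leftover: carry it into the next round unchanged.
                    next.add(batch.get(0));
                    continue;
                }
                // Illustrative name for the intermediate file of this step.
                String out = "merged-r" + round + "-" + (i / maxInputs) + ".pdf";
                steps.add(new Step(batch, out));
                next.add(out);
            }
            current = next;
            round++;
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        for (Step s : plan(files, 2)) {
            System.out.println(s.inputs + " -> " + s.output);
        }
    }
}
```

The output of the last step is the summary file; the intermediate files can be deleted once the step that consumes them has finished.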

On Fri, Mar 17, 2023 at 7:51 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

> On 15.03.23 at 17:51, Gilad Denneboom wrote:
> > It's a bit more complicated than that. I have a small set of very large
> > files with different pages matching different people. I need to match those
> > pages based on some identifying code, and then extract them into either
> > individual files (one per person) or a single merged file with those pages
> > sorted by person. But yes, I do close the input files after scanning them,
> > and then open them later on to extract the relevant pages from them, if
> > needed. This is actually the reason I opted not to use PDFMergerUtility, as
> > it would require me to extract all the individual pages as separate files,
> > so I could merge them later on (as it's not possible to use it to only
> > merge parts of files).
> How about extracting those pages using the splitter? This will produce the
> file per person you are looking for. Use the merger to get the summary file.
> If there are too many files, use several steps to do the merge.
>
> Andreas
>
> >
> > On Wed, Mar 15, 2023 at 5:28 PM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> Your text sounded like you're not picking stuff from all documents. Are
> >> you closing the documents where nothing is found at the earliest possible
> >> time?
> >> Tilman
> >>
> >> On 15.03.2023 17:21, Gilad Denneboom wrote:
> >>>> The question is, do you close the input files properly?
> >>> Yes, I do, but only at the very end of the operation, as I was merging all
> >>> these individual files into one large one, so I had to keep the originals
> >>> open until I save this merged file for the last time, or it would throw an
> >>> exception about the PDDocument being closed.
> >>> I know this is not the best way of merging documents, by the way. I might
> >>> try to switch to using PDFMergerUtility, instead.
> >>>
> >>> On Wed, Mar 15, 2023 at 8:30 AM Andreas Lehmkuehler <andr...@lehmi.de>
> >>> wrote:
> >>>
> >>>> Hi Gilad,
> >>>>
> >>>> PDFBox is using a scratch file per document as long as you are using
> >>>> setupTempFileOnly. Handling thousands of documents ends up in thousands of
> >>>> scratch files. Those scratch files should be closed once the corresponding
> >>>> documents are closed.
> >>>>
> >>>> The question is, do you close the input files properly?
> >>>>
> >>>> Andreas
> >>>>
> >>>> On 14.03.23 at 19:16, Gilad Denneboom wrote:
> >>>>> Hi Maruan,
> >>>>>
> >>>>> Yes, I saw that, but it would be nice if this issue can be solved within
> >>>>> PDFBox, too.
> >>>>>
> >>>>> Gilad
> >>>>>
> >>>>> On Tue, Mar 14, 2023 at 4:52 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> >>>>> wrote:
> >>>>>
> >>>>>> You can raise the ulimit on Linux - the default is 1024 open files.
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>> On 14.03.2023 at 16:05, Gilad Denneboom <gilad.denneb...@gmail.com> wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I created an application that opens many files (I'm talking thousands),
> >>>>>>> searches them for specific pages and then merges those pages into new PDF
> >>>>>>> files. The way I do it is by using the importPage command from the original
> >>>>>>> files into the split ones.
> >>>>>>> However, I'm getting an IOException ("Too many open files") from
> >>>>>>> ScratchFile after several thousand files were processed. I had a look at
> >>>>>>> the source code for that class and I think it might have to do with a
> >>>>>>> RandomAccessFile variable ("raf") not being properly closed.
> >>>>>>> All of the documents are opened using MemoryUsageSetting set to
> >>>>>>> setupTempFileOnly, by the way.
> >>>>>>> Could someone confirm this is the issue, and maybe help solve it? I'm using
> >>>>>>> PDFBox 2.0.26, by the way, and the app runs on a Mac.
> >>>>>>>
> >>>>>>> The stack-trace:
> >>>>>>> Exception in thread "main" java.io.IOException: Too many open files
> >>>>>>> at java.base/java.io.UnixFileSystem.createFileExclusively0(Native Method)
> >>>>>>> at java.base/java.io.UnixFileSystem.createFileExclusively(UnixFileSystem.java:356)
> >>>>>>> at java.base/java.io.File.createTempFile(File.java:2179)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.enlarge(ScratchFile.java:217)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.getNewPage(ScratchFile.java:167)
> >>>>>>> at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:126)
> >>>>>>> at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
> >>>>>>> at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
> >>>>>>> at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1140)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:929)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:888)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:800)
> >>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:760)
> >>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
> >>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
> >>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1014)
> >>>>>>> at MergeStudentRecords_2021.main(MergeStudentRecords_2021.java:324)
> >>>>>>>
> >>>>>>> Thanks in advance!
> >>>>>>>
> >>>>>>> Gilad
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org

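The "close the documents at the earliest possible time" advice from the thread can be sketched as a scan phase that records only where the matching pages are and closes each input before opening the next. This is an illustration under assumptions: FakeDoc is a stand-in for an opened PDF (with PDFBox 2.0.x it would be a PDDocument, which is Closeable, loaded with MemoryUsageSetting.setupTempFileOnly()), and PageRef, index and the per-page codes are hypothetical names, not the original application's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: index pages per person while keeping at most one
// input open at a time. FakeDoc stands in for a real PDF document.
class ScanThenCloseSketch {

    // Stand-in for an opened PDF; counts how many instances are open at once.
    static final class FakeDoc implements AutoCloseable {
        static int open = 0;
        static int maxOpen = 0;
        final String[] pageCodes; // one identifying code per page

        FakeDoc(String[] pageCodes) {
            this.pageCodes = pageCodes;
            open++;
            maxOpen = Math.max(maxOpen, open);
        }

        @Override
        public void close() {
            open--;
        }
    }

    // Remembers where a page lives so its file can be reopened later.
    static final class PageRef {
        final String file;
        final int pageIndex;

        PageRef(String file, int pageIndex) {
            this.file = file;
            this.pageIndex = pageIndex;
        }
    }

    // Scan phase: maps each person code to the pages that belong to it.
    static Map<String, List<PageRef>> index(Map<String, String[]> inputs) {
        Map<String, List<PageRef>> byPerson = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : inputs.entrySet()) {
            // try-with-resources closes the document even if scanning throws.
            try (FakeDoc doc = new FakeDoc(e.getValue())) {
                for (int i = 0; i < doc.pageCodes.length; i++) {
                    byPerson.computeIfAbsent(doc.pageCodes[i], k -> new ArrayList<>())
                            .add(new PageRef(e.getKey(), i));
                }
            } // the input is closed here, before the next one is opened
        }
        return byPerson;
    }

    public static void main(String[] args) {
        Map<String, String[]> inputs = new LinkedHashMap<>();
        inputs.put("x.pdf", new String[] {"A", "B", "A"});
        inputs.put("y.pdf", new String[] {"B"});
        Map<String, List<PageRef>> byPerson = index(inputs);
        System.out.println("people: " + byPerson.keySet()
                + ", max open at once: " + FakeDoc.maxOpen);
    }
}
```

The extraction phase would then reopen each file only while its pages are being copied, again inside a try-with-resources block, keeping in mind the point made in the thread that a source document has to stay open until the target containing its imported pages has been saved.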