Re: Stream parsing huge PDF document in order to prevent memory issues

Stefan Magnus Landrø Fri, 07 Mar 2014 05:27:32 -0800

Here it is: https://issues.apache.org/jira/browse/PDFBOX-1965


Thanks

Stefan


2014-03-07 12:47 GMT+01:00 Maruan Sahyoun <[email protected]>:

> Hi Stefan,
>
> unfortunately this seems to be a bug. When the parseMinimal property is
> set to true, indirect objects are not followed when the PDF is parsed. May I
> ask you to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/]
> and attach the PDF file in question?
>
> BR
> Maruan Sahyoun
>
> On 07.03.2014 at 07:11, Maruan Sahyoun <[email protected]> wrote:
>
> > Hi Stefan,
> >
> > just fine. If I need more information I’ll let you know.
> >
> > BR
> > Maruan Sahyoun
> >
> > On 06.03.2014 at 23:53, Stefan Magnus Landrø <[email protected]> wrote:
> >
> >> Hi Maruan,
> >>
> >> So I created a small Maven project containing a PDF file I just generated
> >> on my Mac, and pushed it to https://github.com/landro/pdfboxbug
> >> I could create a zip and upload it to your bug tracker, but that feels kinda
> >> awkward.
> >> What do you prefer?
> >>
> >> Stefan
> >>
> >>
> >>
> >> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >>
> >>> Yes please, file a bug report together with a sample PDF and sample code
> >>> to reproduce the issue. Which PDFBox version are you using?
> >>>
> >>> BR
> >>> Maruan Sahyoun
> >>>
> >>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
> >>>
> >>>> Hi there,
> >>>>
> >>>> So I tried using the NonSequentialParser, setting the
> >>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
> >>>> to true.
> >>>>
> >>>> The memory footprint looks much better; however, I can't get the
> >>>> individual pages due to an NPE in the getPage code.
> >>>>
> >>>> It turns out the resDict below is mostly null, which in turn causes an NPE
> >>>> in parseDictObjects.
> >>>>
> >>>> Should I file a bug?
> >>>>
> >>>> Stefan
> >>>>
> >>>>
> >>>>  public PDPage getPage(int pageNr) throws IOException
> >>>>  {
> >>>>      getPagesObject();
> >>>>
> >>>>      // ---- get list of top level pages
> >>>>      COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
> >>>>
> >>>>      if (kids == null)
> >>>>      {
> >>>>          throw new IOException("Missing 'Kids' entry in pages dictionary.");
> >>>>      }
> >>>>
> >>>>      // ---- get page we are looking for (possibly going recursively into
> >>>>      // subpages)
> >>>>      COSObject pageObj = getPageObject(pageNr, kids, 0);
> >>>>
> >>>>      if (pageObj == null)
> >>>>      {
> >>>>          throw new IOException("Page " + pageNr + " not found.");
> >>>>      }
> >>>>
> >>>>      // ---- parse all objects necessary to load page.
> >>>>      COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >>>>
> >>>>      if (parseMinimalCatalog && (!allPagesParsed))
> >>>>      {
> >>>>          // parse page resources since we did not do this on start
> >>>>          COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
> >>>>          parseDictObjects(resDict);
> >>>>      }
> >>>>
> >>>>      return new PDPage(pageDict);
> >>>>  }
> >>>>
> >>>>
> >>>>
> >>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> PDF is a random access format: key information (the cross-reference
> >>>>> table, which records where to find the objects) sits at the end of the
> >>>>> file, and the PDF objects are spread around the file.
> >>>>>
> >>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >>>>> instead of PDDocument.load and set the system property
> >>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which
> >>>>> does a minimal parsing of the PDF. That could reduce the memory
> >>>>> consumption a little bit. Unfortunately, once an object has been parsed
> >>>>> its content stays in memory, so you would need to do low-level parsing
> >>>>> yourself with the information available from the initial parsing stage.
> >>>>>
> >>>>> Maruan Sahyoun
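
To make the random-access layout described above a little more concrete: a PDF normally ends with a `startxref` keyword, followed by the byte offset of the cross-reference section and an `%%EOF` marker. The sketch below uses only the JDK (not PDFBox) and reads the tail of a file to find that offset. The class name `StartXrefFinder`, the 1024-byte scan window, and the stand-in file contents are assumptions made for this illustration, not anything taken from PDFBox.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StartXrefFinder {

    /** Number of trailing bytes to scan; an assumed value, not a limit from the PDF specification. */
    private static final int TAIL_SIZE = 1024;

    /**
     * Returns the byte offset recorded after the last "startxref" keyword,
     * or -1 if no such keyword (with digits after it) is found in the scanned tail.
     */
    static long findStartXref(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            long length = raf.length();
            int toRead = (int) Math.min(TAIL_SIZE, length);
            byte[] tail = new byte[toRead];
            raf.seek(length - toRead);   // jump straight to the end; nothing before it is read
            raf.readFully(tail);

            // ISO-8859-1 maps every byte to exactly one char, so indices match byte positions.
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int keyword = text.lastIndexOf("startxref");
            if (keyword < 0) {
                return -1;
            }
            // Skip the keyword and any whitespace, then collect the decimal digits.
            int i = keyword + "startxref".length();
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
                i++;
            }
            return i > start ? Long.parseLong(text.substring(start, i)) : -1;
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny stand-in file: not a valid PDF, it only carries the trailer keywords.
        Path tmp = Files.createTempFile("sketch", ".pdf");
        Files.write(tmp, "%PDF-1.4\n...objects...\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(findStartXref(tmp)); // prints 1234
        Files.delete(tmp);
    }
}
```

Because the lookup starts from the end of the file and the objects it points to can sit anywhere, a forward-only reader in the style of StAX does not map naturally onto the format.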
> >>>>>
> >>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
> >>>>>
> >>>>>> Hi there,
> >>>>>>
> >>>>>> I'm trying to validate arbitrary PDFs (potentially huge, hundreds of MB)
> >>>>>> against the following rule set:
> >>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>>>>> - There should be no content within a certain rectangular area of a page
> >>>>>>   (left margin where the print shop inserts a bar code)
> >>>>>> - Number of pages should be less than N
> >>>>>> - PDF version used
> >>>>>>
> >>>>>> So far we've been using PDDocument.load with a scratch file, but with
> >>>>>> huge documents (e.g. product catalogues), things explode.
> >>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
> >>>>>> document (e.g. using StAX) and validate one page at a time?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> Stefan
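
The first two rules in the list above come down to simple geometry once a page's size and the bounding box of its content are known. The sketch below covers only that arithmetic, in plain Java, assuming the default PDF user space of 72 units per inch. Obtaining the page size and content bounds from a real document is left to a PDF library and is not shown. The class `PageRuleSketch`, the `Box` type, the 1-point tolerance, and the 20 mm reserved strip are all hypothetical values chosen for illustration.

```java
public class PageRuleSketch {

    /** Default PDF user space has 72 units per inch; one inch is 25.4 mm. */
    static final double POINTS_PER_MM = 72.0 / 25.4;

    /** Assumed tolerance in points to absorb rounding by PDF producers. */
    static final double TOLERANCE_PT = 1.0;

    /** Minimal axis-aligned rectangle in points (hypothetical stand-in type). */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True if the interiors of the two rectangles overlap. */
        boolean overlaps(Box o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    static boolean near(double a, double b) {
        return Math.abs(a - b) <= TOLERANCE_PT;
    }

    /** True if the page is 210 mm x 297 mm in either orientation, within tolerance. */
    static boolean isA4(double widthPt, double heightPt) {
        double shortSide = mmToPoints(210);
        double longSide = mmToPoints(297);
        return (near(widthPt, shortSide) && near(heightPt, longSide))
            || (near(widthPt, longSide) && near(heightPt, shortSide));
    }

    public static void main(String[] args) {
        // Hypothetical reserved strip: 20 mm wide along the left edge of a portrait A4 page.
        Box reserved = new Box(0, 0, mmToPoints(20), mmToPoints(297));
        Box content = new Box(100, 100, 300, 500);

        System.out.println(isA4(595.28, 841.89));        // true
        System.out.println(isA4(612, 792));              // false (US Letter)
        System.out.println(content.overlaps(reserved));  // false
    }
}
```

Whether landscape pages should pass the A4 check depends on the print shop's requirements; the sketch accepts both orientations only to keep the example short.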
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> BEKK Open
> >>>> http://open.bekk.no
> >>>>
> >>>> TesTcl - a unit test framework for iRules
> >>>> http://testcl.com
> >>>
> >>>
> >>
> >>
> >> --
> >> BEKK Open
> >> http://open.bekk.no
> >>
> >> TesTcl - a unit test framework for iRules
> >> http://testcl.com
> >
>
>


-- 
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
