On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
> On 29.11.2018 at 09:49, Nicolas Paris wrote:
> > Hi
> >
> > > It could be an XFA forms pdf... then you'd have to analyze the XML
> > > content.
> > I opened the pdf in a text editor, and I can say the boxes are in a
> > stream xml entity, in binary format. (By removing some binary, I have
> > been able to remove the boxes.)
> > Does it exclude the XFA form pdf nature ?
>
>
> Sorry, "nature" looks like a bad translation, and sadly I don't know what
> you meant... please write that part in French, which I understand too.
I meant, "does the above information prove it is *not* an XFA form?" I
mean, the boxes aren't in XML but in the binary part.

>
> PDFBox doesn't have an API for the XFA form.
>
> You can also upload the PDF to a sharehoster (no mail attachments). Or look
> at the PDF in PDFDebugger.

I cannot share any copy of the pdf. Thanks for the suggestion, which
would help a lot.

> >
> > > It could be ordinary text, then the text stripper would do the job.
> > The regular textstripper does not extract them. Does it exclude the text
> > nature ?
>
>
> Same problem with "nature". PDFBox cannot extract XFA forms. It can detect
> glyphs that are used for forms, e.g. squares.

I meant, "if the built-in pdfbox text stripper does not extract the
check-boxes, does that prove they are not ordinary text?"

How could I determine the kind of checkbox I have? Is there a way to
list all the objects within the pdf?

> >
> >
> > On Thu, Nov 29, 2018 at 08:04:51AM +0100, Tilman Hausherr wrote:
> > > It could be an XFA forms pdf... then you'd have to analyze the XML
> > > content.
> > >
> > > It could be widgets annotations without acroform, then you'd have to
> > > analyse these.
> > >
> > > It could be ordinary text, then the text stripper would do the job.
> > >
> > > It could be vector graphics, then it gets really difficult.
> > >
> > > Tilman
> > >
> > > On 28.11.2018 at 23:05, Nicolas Paris wrote:
> > > > Hi
> > > >
> > > > I have several pdf created with PDFCreator 2.0.1.0 and I want to extract
> > > > the content as text, including the checkboxes values in it.
> > > >
> > > > The pdf looks like a regular form pdf with checkboxes. However it is not
> > > > an acro form based pdf, and the regular pdfbox code I use in this case
> > > > does not apply: the acroform is null!
> > > >
> > > > I wonder how I can iterate on those checkboxes (or visually equivalent)
> > > > objects or symbols.
> > > >
> > > > If someone can give me a starter to list all objects in that pdf, that
> > > > might be helpful to begin with.
> > > >
> > > > Thanks in advance,
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >

-- 
nicolas
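Since the file cannot be shared, one rough way to tell apart the cases
listed in the thread (XFA, widget annotations without an AcroForm, plain
text or vector graphics) is to count a few raw PDF name tokens in the
file, much like searching for them in a text editor. The sketch below
uses only the Java standard library; PdfMarkerScan and scan() are
hypothetical names, not part of PDFBox. It is a heuristic under one big
assumption: the relevant dictionaries are stored uncompressed. If they
sit inside object streams (/ObjStm), the scan will not see them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class): counts raw PDF name tokens in a file. */
final class PdfMarkerScan {

    // Label -> pattern. The lookahead stops "/Annots" from matching a longer name.
    private static final Map<String, Pattern> MARKERS = new LinkedHashMap<>();
    static {
        String end = "(?![A-Za-z0-9])";
        MARKERS.put("/AcroForm", Pattern.compile("/AcroForm" + end));
        MARKERS.put("/XFA", Pattern.compile("/XFA" + end));
        MARKERS.put("/Annots", Pattern.compile("/Annots" + end));
        // Whitespace between the key and the name value is optional in PDF syntax.
        MARKERS.put("/Subtype /Widget", Pattern.compile("/Subtype\\s*/Widget" + end));
        MARKERS.put("/ObjStm", Pattern.compile("/ObjStm" + end));
    }

    /** Returns how often each marker occurs in the raw, undecoded bytes. */
    static Map<String, Integer> scan(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream
        // data cannot cause decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : MARKERS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(raw);
            int n = 0;
            while (matcher.find()) {
                n++;
            }
            counts.put(entry.getKey(), n);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfMarkerScan.java file.pdf");
            return;
        }
        scan(Files.readAllBytes(Paths.get(args[0])))
                .forEach((marker, count) -> System.out.println(marker + ": " + count));
    }
}
```

Reading the counts, with the same caveat: "/Subtype /Widget" above zero
while "/AcroForm" is zero would point toward widget annotations without
an AcroForm. As far as I know the /XFA entry lives inside the AcroForm
dictionary, so a zero there would be consistent with the acroform being
null. If everything is zero and "/ObjStm" is above zero, the scan proves
nothing and PDFDebugger is the better tool. If everything is zero and
there are no object streams, the boxes are more likely glyphs or vector
graphics in the page content.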

