Re: How to extract the object id from a form field?

Maruan Sahyoun Wed, 19 Aug 2015 13:31:37 -0700

Hi,

> On 19.08.2015 at 21:16, Roberto Nibali <[email protected]> wrote:
> 
> Hi Tilman
> 
> Thanks for your reply ... I did not really succeed. We'll probably end up 
> looking at how the PDFDebugger code does it ;).
> 
> On Tue, Aug 18, 2015 at 9:08 PM, Tilman Hausherr <[email protected]> wrote:
> On 18.08.2015 at 20:50, Roberto Nibali wrote:
>> Hi
>> 
>> I'd like to print out the corresponding object id given a specific form
>> field. How would I do that with PDFBox programmatically?
>> 
>> Let's assume, for the sake of argument, that the form field is
>> represented by the following obj:
>> 
>> obj 218 0
>>   <<
>>     /DA <2B94B0298F2FD7F81F32C6E22043>
>>     /F 4
>>     /FT /Tx
>>     /Ff 4194304
>>     /MK
>>     /P 28 0 R
>>     /Parent 46 0 R
>>     /Rect [159.781 764.53 347.142 777.195]
>>     /Subtype /Widget
>>     /T <5EB6B730886188AB3D3194B9654C18094C>
>>     /Type /Annot
>>     /V <45BBBA249C618BBD3974A4BE61501E57181D>
>>     /AP 666 0 R
>>   >>
>> 
>> If I am going over all PDField entries of a PDF, how would I get to the
>> underlying obj number (in the above case 218) from a PDField object?
> 
> I haven't tried this myself, but I think you could "synchronise" the 
> getChildren() results with the getCOSObject().getItem(COSName.KIDS) array, 
> i.e. sort out which indirect type is which item returned from getChildren(). 
> The Kids COSArray has indirect objects (= COSObject type), as seen here:
> 
> 
> 
> COSObject.getObject() returns the dereferenced object.
> 
> The reason I asked about this is that while migrating some documents, we 
> found out that the originating PDFs do not only contain textual changes 
> (mostly legal changes in the fixed text); in certain cases the client also 
> modified the PDFs by adding borders or other graphical elements. Those 
> obviously do not show up in the template PDF. 
> 
> My (maybe stupid) idea was to simply print out the obj id or even the whole 
> object, insert it into the template for the final PDF during the form field 
> migration, and update all references to the new obj id.
> 
> At least for simple geometric shapes, like rectangles, this should be 
> feasible, no? Anyway, after constantly getting "null" from the 
> getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from a 
> given PDField, I kind of gave up.
> 
> Imagine you had the following code, and wanted to additionally dump out the 
> underlying object id and the referencing ids of the PDField:
> 
> @Test
> public void executeDumpFields() throws IOException {
>     // try-with-resources closes the document exactly once, even on failure
>     try (PDDocument srcDoc = PDDocument.load(new File(srcDocName))) {
>         PDAcroForm acroForm = srcDoc.getDocumentCatalog().getAcroForm();
>         if (acroForm == null) {
>             return; // the document has no form
>         }
>         List<PDField> fields = acroForm.getFields();
>         for (PDField field : fields) {
>             dumpField(srcDoc, field);
>         }
>     } catch (Exception e) {
>         logerr(e.getMessage());
>     }
> }
> 
> private void dumpField(PDDocument srcDoc, PDField srcField) throws IOException {
>     if (srcField instanceof PDNonTerminalField) {
>         // Recurse into the children of a non-terminal field
>         for (PDField child : ((PDNonTerminalField) srcField).getChildren()) {
>             dumpField(srcDoc, child);
>         }
>     } else if (!(srcField instanceof PDSignatureField)) {
>         String fqName = srcField.getFullyQualifiedName();
>         // getSimpleName() yields the class name without its package
>         System.out.printf("fqName=%s type=%s%n", fqName, srcField.getClass().getSimpleName());
>     }
> }
> 
> I have made it a habit to dump the objects with pdf-parser 
> (http://blog.didierstevens.com/programs/pdf-tools/) as follows to further 
> investigate issues (excerpt showing the dump of object 228):
> 
> $ python pdf-parser.py -o 228 ../../ccmig2.pdf
> 
> obj 228 0
>  Type: /Annot
>  Referencing: 685 0 R, 28 0 R, 46 0 R, 686 0 R
> 
>   <<
>     /AA
>       <<
>         /K 685 0 R
>       >>
>     /DA <92F8913CB200CF3C13A363C2D20D>
>     /F 4
>     /FT /Tx
>     /Ff 12582912
>     /MK
>     /MaxLen 1
>     /P 28 0 R
>     /Parent 46 0 R
>     /Q 1
>     /Rect [454.437 769.504 465.482 782.169]
>     /Subtype /Widget
>     /T <8C8A>
>     /Type /Annot
>     /V ()
>     /AP 686 0 R
>   >>
> 
> And to get the objects referencing object 228:
> 
> $ python pdf-parser.py -r 228 ../../ccmig2.pdf
> 
> obj 28 0
>  Type: /Page
>  Referencing: 101 0 R, 217 0 R, 218 0 R, 219 0 R, 220 0 R, 221 0 R, 222 0 R, 
> 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 230 0 R, 231 0 
> R, 232 0 R, 61 0 R, 60 0 R, 62 0 R, 63 0 R, 64 0 R, 65 0 R, 66 0 R, 67 0 R, 
> 69 0 R, 68 0 R, 70 0 R, 71 0 R, 72 0 R, 73 0 R, 74 0 R, 75 0 R, 76 0 R, 77 0 
> R, 78 0 R, 79 0 R, 80 0 R, 81 0 R, 82 0 R, 83 0 R, 84 0 R, 86 0 R, 87 0 R, 88 
> 0 R, 89 0 R, 90 0 R, 91 0 R, 92 0 R, 93 0 R, 94 0 R, 95 0 R, 96 0 R, 97 0 R, 
> 85 0 R, 233 0 R, 234 0 R, 235 0 R, 236 0 R, 237 0 R, 238 0 R, 239 0 R, 22 0 
> R, 240 0 R, 241 0 R, 242 0 R, 243 0 R, 244 0 R, 245 0 R, 246 0 R, 247 0 R, 
> 103 0 R, 248 0 R, 6 0 R, 205 0 R, 206 0 R, 207 0 R, 208 0 R, 209 0 R, 210 0 
> R, 211 0 R, 213 0 R, 212 0 R
> 
>   <<
>     /Annots '[101 0 R 217 0 R 218 0 R 219 0 R 220 0 R 221 0 R 222 0 R 223 0 R 
> 224 0 R 225 0 R\n226 0 R 227 0 R 228 0 R 229 0 R 230 0 R 231 0 R 232 0 R 61 0 
> R 60 0 R 62 0 R\n63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 69 0 R 68 0 R 70 0 R 71 0 
> R 72 0 R\n73 0 R 74 0 R 75 0 R 76 0 R 77 0 R 78 0 R 79 0 R 80 0 R 81 0 R 82 0 
> R\n83 0 R 84 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R 93 0 R\n94 
> 0 R 95 0 R 96 0 R 97 0 R 85 0 R 233 0 R 234 0 R 235 0 R 236 0 R 237 0 R\n238 
> 0 R 239 0 R 22 0 R 240 0 R 241 0 R 242 0 R 243 0 R 244 0 R 245 0 R 246 0 
> R\n247 0 R 103 0 R]'
>     /BleedBox [0.0 0.0 595.276 841.89]
>     /Contents 248 0 R
>     /CropBox [0.0 0.0 595.276 841.89]
>     /MediaBox [0.0 0.0 595.276 841.89]
>     /Parent 6 0 R
>     /Resources
>       <<
>         /ExtGState
>           <<
>             /GS0 205 0 R
>             /GS1 206 0 R
>             /GS2 207 0 R
>             /GS3 208 0 R
>           >>
>         /Font
>           <<
>             /C2_0 209 0 R
>             /C2_1 210 0 R
>             /TT0 211 0 R
>             /TT1 213 0 R
>             /TT2 212 0 R
>           >>
>         /ProcSet [/PDF /Text]
>       >>
>     /Rotate 0
>     /Tabs /W
>     /TrimBox [0.0 0.0 595.276 841.89]
>     /Type /Page
>   >>
> 
> 
> obj 46 0
>  Type:
>  Referencing: 218 0 R, 230 0 R, 231 0 R, 232 0 R, 219 0 R, 217 0 R, 220 0 R, 
> 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 
> R, 17 0 R
> 
>   <<
>     /Kids '[218 0 R 230 0 R 231 0 R 232 0 R 219 0 R 217 0 R 220 0 R 221 0 R 
> 222 0 R 223 0 R\n224 0 R 225 0 R 226 0 R 227 0 R 228 0 R 229 0 R]'
>     /Parent 17 0 R
>     /T <32AB37>
>   >>
> 
> It would be tremendous if I could get at least the proper object id out of 
> the PDFields using PDFBox.


A PDField is uniquely identified by its fully qualified name, which can also 
be used to find it in the template. If someone added a border to a field in 
the source document and you want to carry it over to the template field, that 
border is part of the field's widget definition, e.g. the /MK entry. Acrobat 
also applies some defaults: when a border color is defined, a thin border is 
drawn around the field even if no border width is defined.
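
A minimal sketch of that idea, assuming a PDFBox 2.0.x release (for 
getFieldTree() and getWidgets()) and that source and template use the same 
field names and widget order. The class and method names are made up for 
illustration, and only /MK is copied here; border width or style may live in 
other widget entries that this sketch does not touch.

```java
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Hypothetical helper, not part of PDFBox.
public class WidgetStyleCopier {

    /**
     * Copies the /MK dictionary of each source widget to the widget at the
     * same position of the template field with the same fully qualified name.
     * Returns the number of widgets that received an /MK entry.
     */
    public static int copyAppearanceCharacteristics(PDAcroForm source, PDAcroForm template) {
        int copied = 0;
        for (PDField srcField : source.getFieldTree()) {
            // Match fields by name, not by object id
            PDField dstField = template.getField(srcField.getFullyQualifiedName());
            if (dstField == null) {
                continue; // no counterpart in the template
            }
            List<PDAnnotationWidget> srcWidgets = srcField.getWidgets();
            List<PDAnnotationWidget> dstWidgets = dstField.getWidgets();
            int count = Math.min(srcWidgets.size(), dstWidgets.size());
            for (int i = 0; i < count; i++) {
                COSBase mk = srcWidgets.get(i).getCOSObject().getDictionaryObject(COSName.MK);
                if (mk instanceof COSDictionary) {
                    // Shallow copy: nested values are shared with the source,
                    // so keep the source document open until the template is saved
                    dstWidgets.get(i).getCOSObject()
                            .setItem(COSName.MK, new COSDictionary((COSDictionary) mk));
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

Non-terminal fields have no widgets of their own, so they are skipped 
implicitly by the inner loop.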

If I understood your use case correctly, knowing the object id of the field 
wouldn't help in this case.
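
If the object number is still wanted for debugging, here is a rough sketch of 
the "synchronise with Kids" idea quoted above, assuming PDFBox 2.0.x. The 
reference to a field is held by the array that lists it (the parent's /Kids, 
or the AcroForm's /Fields for a top-level field), not by the field's own 
dictionary, which may be why asking a terminal field for its own Kids 
returned null. The class and method names are made up for illustration.

```java
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Hypothetical helper, not part of PDFBox.
public class FieldObjectNumbers {

    /**
     * Returns the object and generation number of the field, e.g. "218 0",
     * or null if the field is stored as a direct object or cannot be found.
     */
    public static String objectIdOf(PDAcroForm acroForm, PDField field) {
        // The array that lists this field: /Kids of the parent, else /Fields
        COSBase holder = field.getParent() != null
                ? field.getParent().getCOSObject().getDictionaryObject(COSName.KIDS)
                : acroForm.getCOSObject().getDictionaryObject(COSName.FIELDS);
        if (!(holder instanceof COSArray)) {
            return null;
        }
        COSArray refs = (COSArray) holder;
        for (int i = 0; i < refs.size(); i++) {
            // get(), unlike getObject(), does not dereference the entry
            COSBase item = refs.get(i);
            if (item instanceof COSObject
                    && ((COSObject) item).getObject() == field.getCOSObject()) {
                COSObject ref = (COSObject) item;
                return ref.getObjectNumber() + " " + ref.getGenerationNumber();
            }
        }
        return null;
    }
}
```

Called from dumpField() as objectIdOf(acroForm, srcField), this should print 
the same number that pdf-parser shows after "obj", as long as the document 
was loaded from a file; a form built in memory has no object numbers yet.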

BR
Maruan


> 
> Take care
> Roberto
> 
>
