Re: Merging freetext annotations with missing spans

Kai Keggenhoff Wed, 17 Oct 2018 04:54:42 -0700

It just occurred to me that issue 3646, which I reported a while back, is 
likely caused by this method as well, since it puts "<" and "&" from 
Node.getNodeValue() directly into the rich contents, when it should actually 
put their entities ("&lt;" and "&amp;") there.
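
For illustration, the escaping described above could look like the following minimal sketch. The class and the escapeXml helper are hypothetical names and not part of PDFBox; the only assumption is that text obtained from Node.getNodeValue() has to be re-escaped before it is written into the rich contents string.

```java
public class XmlEscapeSketch {

    // Hypothetical helper (not a PDFBox method): replaces the characters
    // that must not appear literally in XML character data.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; c
        System.out.println(escapeXml("a < b & c"));
    }
}
```

The DOM parser has already resolved entities when getNodeValue() is called, so the value has to be escaped again whenever it is copied back into markup.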
 
I'll check this and update the issues accordingly.
 
-----Original Message-----
From: Kai Keggenhoff <keggenh...@conclude.com> 
Sent: Wednesday, 17 October 2018 10:41
To: users@pdfbox.apache.org
Subject: Re: Merging freetext annotations with missing spans


Hi Tilman, 

I've opened an issue (4345) and attached a possible fix, but as I'm not that 
familiar with the XFDF spec, the fix contains some guesswork, such as "do I 
really have to deal with CDATA?".
I'm not even contemplating comment nodes, processing instructions and all that 
unusual stuff.
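
To make the CDATA question concrete, here is a minimal sketch of a serializer for mixed content. It is not the patch attached to the issue and not the PDFBox implementation; MixedContentSketch, serialize and parseRoot are hypothetical names. It relies on the fact that org.w3c.dom.CDATASection extends org.w3c.dom.Text, so a single instanceof Text check covers both, and it silently skips comments and processing instructions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class MixedContentSketch {

    // Hypothetical entry point: serializes an element including text nodes
    // that are siblings of child elements.
    static String serialize(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString();
    }

    private static void append(Node node, StringBuilder sb) {
        if (node instanceof Text) {
            // CDATASection is a subtype of Text, so CDATA lands here as well.
            sb.append(escape(node.getNodeValue()));
        } else if (node instanceof Element) {
            sb.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                sb.append(' ').append(a.getNodeName())
                  .append("=\"").append(escape(a.getNodeValue())).append('"');
            }
            sb.append('>');
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                append(c, sb);
            }
            sb.append("</").append(node.getNodeName()).append('>');
        }
        // Comments, processing instructions etc. are skipped on purpose.
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses a string and returns the root element (default, non-coalescing parser).
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = parseRoot(
                "<p dir=\"ltr\">P1 <span>P2</span><![CDATA[ P3 & more]]></p>");
        // prints: <p dir="ltr">P1 <span>P2</span> P3 &amp; more</p>
        System.out.println(serialize(p));
    }
}
```

Writing CDATA content back as escaped text is a design choice of this sketch; whether XFDF rich contents ever contain CDATA sections in practice is exactly the open question.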

Meanwhile, the workaround for my application is to embed all text nodes which 
have siblings in "span" elements before I pass them to FDFAnnotation, which so 
far has not produced any visible side effects for the affected annotations.
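
A minimal sketch of such a pre-processing step follows. It is an illustration, not the code from my application: SpanWrapSketch, wrapLooseText and parseRoot are hypothetical names, "have siblings" is interpreted as "have at least one element sibling", and the span is created without a namespace on the assumption that the document was parsed with a non-namespace-aware parser, as in the sample code in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class SpanWrapSketch {

    // Hypothetical helper: wraps every text node that has an element sibling
    // in a new <span>, recursing into child elements.
    static void wrapLooseText(Element parent) {
        // Snapshot the children first, because the loop below modifies the tree.
        List<Node> children = new ArrayList<>();
        boolean hasElementChild = false;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
            if (c instanceof Element) {
                hasElementChild = true;
            }
        }
        for (Node c : children) {
            if (c instanceof Element) {
                wrapLooseText((Element) c);
            } else if (c instanceof Text && hasElementChild) {
                Element span = parent.getOwnerDocument().createElement("span");
                parent.replaceChild(span, c); // span takes the text node's position
                span.appendChild(c);          // the detached text node moves inside
            }
        }
    }

    // Parses a string and returns the root element (default parser settings).
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = parseRoot("<p>P1 <span>P2</span> P3</p>");
        wrapLooseText(p);
        System.out.println(p.getChildNodes().getLength()); // prints: 3
        System.out.println(p.getTextContent());            // prints: P1 P2 P3
    }
}
```

Note that whitespace-only text nodes between elements are wrapped as well, and that the new spans carry no style attribute, so they inherit whatever the enclosing elements define.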

Best regards,

Kai

-----Original Message-----
From: Tilman Hausherr <thaush...@t-online.de> 
Sent: Tuesday, 16 October 2018 17:48
To: users@pdfbox.apache.org
Subject: Re: Merging freetext annotations with missing spans

Hello Kai,

Sorry for never answering. I don't do much with XML and always hoped 
somebody else would take it up. Can you create an issue in JIRA with 
everything you have (source, XML, PDF)?
https://issues.apache.org/jira/browse/PDFBOX
This is better than mail because at least it would stay open until 
somebody does something. And yes, this does happen. Sometimes sooner, 
sometimes later.

(Of course, we'd be delighted if you can contribute a fix too :-))

Tilman

On 16.10.2018 at 10:47, Kai Keggenhoff wrote:
> Hello,
>
> I tracked down this behaviour to FDFAnnotation.richContentsToString
>
> This method ignores Text nodes if they are siblings of Elements and therefore 
> the rich contents of the annotation lack those parts.
>
> Since I have several examples of Adobe Acrobat Reader DC producing this 
> structure, I consider this a bug in PDFBox.
>
> Best regards,
>
> Kai Keggenhoff
>
> -----Original Message-----
> From: Kai Keggenhoff <keggenh...@conclude.com>
> Sent: Monday, 24 September 2018 11:10
> To: users@pdfbox.apache.org
> Subject: Re: Merging freetext annotations with missing spans
>
> Hello,
>
> in addition to my old email I would like to add sample code which produces 
> two PDF files showing the difference between freetext annotations containing
>
> <p dir="ltr"><span style="font-family:Helvetica">P1 </span><span 
> style="text-decoration:word;font-family:Helvetica">P2</span><span 
> style="font-family:Helvetica"> P3</span></p>
>
> in contrast to
>
> <p dir="ltr">P1 <span 
> style="text-decoration:word;font-family:Helvetica">P2</span> P3</p>
>
> The former produces the expected "P1 P2 P3", the latter shows only "P2".
>
> For my tests I used PDFBox 2.0.11.
>
> Thanks in advance,
>
> Kai Keggenhoff
>
>
>
> package xfdfannotation;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.fdf.FDFAnnotation;
> import org.apache.pdfbox.pdmodel.fdf.FDFDocument;
> import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
>
> import javax.xml.parsers.DocumentBuilderFactory;
> import org.xml.sax.InputSource;
> import java.io.StringReader;
> import java.util.List;
>
> public class MergeText {
>       public static void main(String args[]) {
>
>               String xfdf_without_spans = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
> "<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\"" +
> "><annots" +
> "><freetext color=\"#FFFFFF\" creationdate=\"D:20180924102518+02'00'\" " +
> "flags=\"print\" date=\"D:20180924102537+02'00'\" page=\"0\" " +
> "rect=\"17.382233,685.894287,121.675568,758.765869\" subject=\"Textfeld\" " +
> "title=\"keggenhoff\"" +
> "><contents-richtext" +
> "><body xmlns=\"http://www.w3.org/1999/xhtml\" " +
> "xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\" " +
> "xfa:APIVersion=\"Acrobat:18.11.0\" xfa:spec=\"2.0.2\" " +
> "style=\"font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal\"" +
> "><p dir=\"ltr\"" +
> ">P1 <span style=\"text-decoration:word;font-family:Helvetica\"" +
> ">P2</span" +
> "> P3</p" +
> "></body" +
> "></contents-richtext" +
> "><defaultappearance" +
> ">0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance" +
> "><defaultstyle" +
> ">font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 " +
> "</defaultstyle" +
> "></freetext" +
> "></annots" +
> "><f href=\"/C/Users/KEGGEN~1/AppData/Local/Temp/demo.pdf\"" +
> "/><fields" +
> "><field name=\"submit\"" +
> "/></fields" +
> "><ids original=\"F285D06ECA30C5579E72B6B7AE07BC0B\" " +
> "modified=\"1A190CB840919E279B93BF3D5D488C13\"" +
> "/></xfdf" +
> ">";
>
>               String xfdf_with_spans = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
> "<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\"" +
> "><annots" +
> "><freetext color=\"#FFFFFF\" creationdate=\"D:20180924102518+02'00'\" " +
> "flags=\"print\" date=\"D:20180924102537+02'00'\" page=\"0\" " +
> "rect=\"17.382233,685.894287,121.675568,758.765869\" subject=\"Textfeld\" " +
> "title=\"keggenhoff\"" +
> "><contents-richtext" +
> "><body xmlns=\"http://www.w3.org/1999/xhtml\" " +
> "xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\" " +
> "xfa:APIVersion=\"Acrobat:18.11.0\" xfa:spec=\"2.0.2\" " +
> "style=\"font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal\"" +
> "><p dir=\"ltr\"" +
> "><span style=\"font-family:Helvetica\"" +
> ">P1 </span" +
> "><span style=\"text-decoration:word;font-family:Helvetica\"" +
> ">P2</span" +
> "><span style=\"font-family:Helvetica\"" +
> "> P3</span" +
> "></p" +
> "></body" +
> "></contents-richtext" +
> "><defaultappearance" +
> ">0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance" +
> "><defaultstyle" +
> ">font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 " +
> "</defaultstyle" +
> "></freetext" +
> "></annots" +
> "><f href=\"/C/Users/KEGGEN~1/AppData/Local/Temp/demo.pdf\"" +
> "/><fields" +
> "><field name=\"submit\"" +
> "/></fields" +
> "><ids original=\"F285D06ECA30C5579E72B6B7AE07BC0B\" " +
> "modified=\"1A190CB840919E279B93BF3D5D488C13\"" +
> "/></xfdf" +
> ">";
>               createPdf("demo_no_spans.pdf", xfdf_without_spans);
>               createPdf("demo_with_spans.pdf", xfdf_with_spans);
>
>       }
>
>       private static void createPdf(String filename, String xfdf) {
>               try {
>                       org.w3c.dom.Document xfdf_doc = 
> DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new 
> InputSource(new StringReader(xfdf)));
>                       FDFDocument fdf_doc = new FDFDocument(xfdf_doc);
>
>                       PDPage page = new PDPage();
>
>                       List<FDFAnnotation> xfdfAnnotations = 
> fdf_doc.getCatalog().getFDF().getAnnotations();
>                       for (FDFAnnotation xfdfAnnotation : xfdfAnnotations) {
>                               PDAnnotation a = 
> PDAnnotation.createAnnotation(xfdfAnnotation.getCOSObject());
>                               page.getAnnotations().add(a);
>                       }
>
>                       // try-with-resources closes the document after saving
>                       try (PDDocument pdf = new PDDocument()) {
>                               pdf.addPage(page);
>                               pdf.save(filename);
>                       }
>               }
>               catch (Exception e) {
>                       e.printStackTrace();
>               }
>       }
> }
>
>
>
>
> -----Original Message-----
> From: Kai Keggenhoff <keggenh...@conclude.com>
> Sent: Thursday, 13 September 2018 13:44
> To: users@pdfbox.apache.org
> Subject: Merging freetext annotations with missing spans
>
> Hello,
>
> I'm working on an application which merges XFDF files with annotations with 
> PDF files and noticed some strange behaviour with certain types of text 
> annotations.
>
> It looks like text that is not contained in a span is ignored when merging.
>
> One user uploaded this annotation (not the actual texts) from an older 
> Acrobat:
>
> <freetext width="2.000000" color="#FFFFFF" 
> creationdate="D:20180910162711+02'00'" flags="print" 
> date="D:20180911172716+02'00'" page="0" 
> rect="1136.342529,3886.797363,1221.432617,4367.977539" rotation="90" 
> subject="Textfeld" title="username"
> ><contents-richtext
> ><body xmlns="http://www.w3.org/1999/xhtml" 
> xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" 
> xfa:APIVersion="Acrobat:11.0.15" xfa:spec="2.0.2" 
> style="font-size:11.0pt;text-align:left;color:#0000FF;font-weight:bold;font-style:normal;font-family:Arial;font-stretch:normal"
> ><p dir="ltr"
> >ABC <span style="text-decoration:underline"
> >DEF</span
> >GHI&#xD;</p
> ><p dir="ltr"
> ><span style="font-weight:normal"
> >More text&#xD;</span
> ></p
> ><p dir="ltr"
> ><span style="font-weight:normal"
> >More text</span
> ></p
> ><p dir="ltr"
> ><span style="font-weight:normal"
> >More text</span
> ></p
> ></body
> ></contents-richtext
> ><defaultappearance
> >0 0 1 rg /Arial,Bold 11 Tf</defaultappearance
> ><defaultstyle
> >font: bold Arial 11.0pt; text-align:left; color:#0000FF </defaultstyle
> ></freetext
> >
> After merging, the texts "ABC" and "GHI" are gone - they are not displayed 
> and not shown in the comments area in Acrobat Reader.
>
> When I tried to create a similar annotation using a current Acrobat Reader 
> DC, I got
>
> <freetext color="#FFFFFF" creationdate="D:20180913132943+02'00'" 
> flags="print" date="D:20180913132956+02'00'" page="0" 
> rect="181.799377,672.266907,326.595337,723.213623" subject="Textfeld" 
> title="keggenhoff"
> ><contents-richtext
> ><body xmlns="http://www.w3.org/1999/xhtml" 
> xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" 
> xfa:APIVersion="Acrobat:18.11.0" xfa:spec="2.0.2" 
> style="font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal"
> ><p dir="ltr"
> ><span style="font-family:Helvetica"
> >ABC</span
> ><span style="text-decoration:word;font-family:Helvetica"
> >DEF</span
> ><span style="font-family:Helvetica"
> >GHI</span
> ></p
> ></body
> ></contents-richtext
> ><defaultappearance
> >0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance
> ><defaultstyle
> >font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 
> </defaultstyle
> ></freetext
> >
> When I merge this annotation with the PDF, the text is complete.
> However, when I remove the span tags around ABC and GHI, both texts are again 
> missing after merging.
>
> Now my question is whether the (ancient) Acrobat should have included span 
> tags there or if PDFBox should process the text that is not inside a span.
>
> I tested this with PDFBox 2.0.6 and 2.0.11 and the behaviour was identical.
>
> Thanks in advance,
>
> Kai Keggenhoff
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
>


