On 19.06.2016 at 16:11, Tilman Hausherr wrote:
On 19.06.2016 at 08:52, John Hewson wrote:
>>JIRA, and attach your code as a patch / diff.
>There is already some code handling those operators, see
PDFMarkedContentExtractor. It could be moved to a more generic place so that
we only have to add some filtering.
Yes, that is the proper way to handle this. Operators are handled with an
OperatorProcessor, not by modifying the parser (e.g. processStreamOperators).
Better yet, we already have the code to handle BMC/EMC. All that is needed is
for PDFRenderer to add a constructor which accepts a list of layer names to
render, which are then passed as part of PageDrawerParameters.

The problem is that these two operators determine whether all the other
tokens in the content stream are used. So the method by C. makes sense to
me. The alternative would be to alter every operator processor to check whether
it is relevant or not.
Or they would have to be extended from some common class that does this check.
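
A minimal sketch of the "common class" alternative, written without PDFBox so it stands alone. FilteringProcessor, RecordingProcessor and processVisible are hypothetical names, and the real OperatorProcessor in PDFBox has a different signature. The visibility check is passed in as a BooleanSupplier, which is an assumption about how the current layer state would be exposed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical base class (not PDFBox API): performs the layer check once. */
abstract class FilteringProcessor
{
    private final BooleanSupplier visible;

    FilteringProcessor(BooleanSupplier visible)
    {
        this.visible = visible;
    }

    /** Template method: subclasses are only called while content is visible. */
    final void process(String operator, List<String> operands)
    {
        if (visible.getAsBoolean())
        {
            processVisible(operator, operands);
        }
    }

    abstract void processVisible(String operator, List<String> operands);
}

/** Example subclass that just records the operators it was allowed to handle. */
class RecordingProcessor extends FilteringProcessor
{
    final List<String> seen = new ArrayList<>();

    RecordingProcessor(BooleanSupplier visible)
    {
        super(visible);
    }

    @Override
    void processVisible(String operator, List<String> operands)
    {
        seen.add(operator);
    }
}
```

A real implementation would probably apply this only to painting operators, since graphics state changes made inside a hidden layer likely still need to take effect.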

PDFMarkedContentExtractor is not really helpful. Here's some code to show what
it does - it shows the objects that belong to a specific group. The output
cannot be used for rendering.
Maybe there is a misunderstanding. We need to track the current layer and the stack of all current layers. C. provided some code doing that and we already have some code doing it (I'm talking about the operators in org.apache.pdfbox.contentstream.operator.markedcontent). What is missing is some sort of filter based on that information.
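
A minimal sketch of the missing filter, again independent of PDFBox. LayerStack and its method names are hypothetical, and the caller is assumed to have already resolved the /OC property to a group name. It keeps one visibility flag per open marked-content sequence, so nested layers and plain BMC sequences stay balanced against EMC.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

/** Hypothetical helper (not PDFBox API): tracks nested layers and their visibility. */
class LayerStack
{
    // One entry per open marked-content sequence; true means "content is visible here".
    private final Deque<Boolean> visibility = new ArrayDeque<>();
    private final Set<String> enabledLayers;

    LayerStack(Set<String> enabledLayers)
    {
        this.enabledLayers = enabledLayers;
    }

    /** Call for "/OC /name BDC", passing the resolved group name. */
    void beginLayer(String groupName)
    {
        // A layer is visible only if it is enabled and everything enclosing it is visible.
        visibility.push(isVisible() && enabledLayers.contains(groupName));
    }

    /** Call for BMC, or for BDC with a tag other than /OC: inherits the current state. */
    void beginOther()
    {
        visibility.push(isVisible());
    }

    /** Call for EMC. An unbalanced EMC is ignored. */
    void end()
    {
        if (!visibility.isEmpty())
        {
            visibility.pop();
        }
    }

    /** True outside any marked content, or when all enclosing layers are enabled. */
    boolean isVisible()
    {
        return visibility.isEmpty() || visibility.peek();
    }
}
```

With the group names from the test file, a stack built with {background, enabled} would report isVisible() as false between the BDC and EMC of the "disabled" layer, and true again after its EMC.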

BR
Andreas


import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentProperties;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

public class ExtractMarkedContent extends PDFMarkedContentExtractor
{
    public ExtractMarkedContent() throws IOException
    {
    }

    public static void main(String[] args) throws IOException
    {
        PDDocument doc = PDDocument.load(
                new File("C......\\PDFBox reactor\\pdfbox\\target\\test-output", "ocg-generation.pdf"));

        // List the optional content groups (layers) declared in the document catalog.
        PDOptionalContentProperties ocp = doc.getDocumentCatalog().getOCProperties();
        System.out.println("Group names in document catalog: " + Arrays.toString(ocp.getGroupNames()));
        for (String groupName : ocp.getGroupNames())
        {
            PDOptionalContentGroup group = ocp.getGroup(groupName);
            System.out.println(group.getCOSObject());
        }

        // Extract the marked content sequences of the first page.
        ExtractMarkedContent extractMarkedContent = new ExtractMarkedContent();
        PDPage page = doc.getPage(0);
        System.out.println("Property names in page resources: " + page.getResources().getPropertiesNames());
        extractMarkedContent.processPage(page);
        List<PDMarkedContent> markedContents = extractMarkedContent.getMarkedContents();

        // Map each marked content tag back to the name of its optional content group.
        System.out.println("Extracted contents: ");
        for (PDMarkedContent mc : markedContents)
        {
            PDPropertyList propertyList = page.getResources().getProperties(COSName.getPDFName(mc.getTag()));
            String propName = propertyList.getCOSObject().getString(COSName.NAME);
            System.out.println(mc.getTag() + " (" + propName + "): " + mc.getContents());
        }
        doc.close();
    }
}


The output is:

Group names in document catalog: [background, enabled, disabled]
COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{background}) }
COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{enabled}) }
COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{disabled}) }
Property names in page resources: [COSName{oc1}, COSName{oc2}, COSName{oc3}]
Extracted contents:
oc1 (background): [P, D, F,  , 1, ., 5, :,  , O, p, t, i, o, n, a, l,  , C, o,
n, t, e, n, t,  , G, r, o, u, p, s, Y, o, u,  , s, h, o, u, l, d,  , s, e, e,  ,
a,  , g, r, e, e, n,  , t, e, x, t, l, i, n, e, ,,  , b, u, t,  , n, o,  , r, e,
d,  , t, e, x, t,  , l, i, n, e, .]
oc2 (enabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a, n,  , e, n, a, b, l,
e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , t, h, i, s, ,,  ,
t, h, a, t, ', s,  , g, o, o, d, .]
oc3 (disabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a,  , d, i, s, a, b, l,
e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , t, h, i, s, ,,  ,
t, h, a, t, ', s,  , N, O, T,  , g, o, o, d, !]





