Hi, this is Yan from Japan.
I'm also a PDFBox user.
I haven't fully understood your problem.
Do you want to process the contents inside a form?
Here is some sample code from my project.
It uses PDFStreamEngine to get the form objects in a PDF.
I hope it helps.
-----Original Message-----
From: Andrea Vacondio [mailto:[email protected]]
Sent: Thursday, December 1, 2016 6:02 PM
To: [email protected]
Subject: Text extraction and clip area
Hi, I had a couple of issues with text extraction and I tried to dig a bit into
the code. As far as I can see the "current clipping area" is never used during
text extraction, is this correct? My issue is with a form xobject where the
bounding box clips out part of the text but that text is returned by the text
stripper.
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroupAttributes;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
public class GetImageColorSpace extends PDFStreamEngine
{
    public GetImageColorSpace()
    {
        // Register only the operators needed to track the graphics state
        // and to be notified when an XObject is drawn.
        addOperator(new Concatenate());
        addOperator(new DrawObject());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new SetMatrix());
    }

    public static void main(String[] args) throws IOException
    {
        PDDocument document = null;
        try
        {
            document = PDDocument.load(new File(args[0]));
            GetImageColorSpace printer = new GetImageColorSpace();
            int pageNum = 0;
            for (PDPage page : document.getPages())
            {
                pageNum++;
                System.out.println("Processing page: " + pageNum);
                printer.processPage(page);
            }
        }
        finally
        {
            if (document != null)
            {
                document.close();
            }
        }
    }

    /**
     * This is used to handle an operation.
     *
     * @param operator The operation to perform.
     * @param operands The list of arguments.
     *
     * @throws IOException If there is an error processing the operation.
     */
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands)
            throws IOException
    {
        String operation = operator.getName();
        if ("Do".equals(operation))
        {
            // "Do" draws an XObject. It is intercepted here and not passed
            // to super, so the form's own content stream is not processed
            // unless that is done explicitly below.
            COSName objectName = (COSName) operands.get(0);
            PDXObject xobject = getResources().getXObject(objectName);
            if (xobject instanceof PDFormXObject)
            {
                PDFormXObject form = (PDFormXObject) xobject;
                PDTransparencyGroupAttributes forGroup = form.getGroup();
                // processing form's content goes here.
            }
        }
        else
        {
            super.processOperator(operator, operands);
        }
    }
}
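
To make the clipping question concrete, here is a minimal sketch of the geometry only. It does not use PDFBox and is not how the text stripper works; it relies on java.awt.geom from the JDK. The class name ClipCheck and its two methods are hypothetical names made up for this illustration. It assumes you already have the form's bounding box, the form matrix, and the current transformation matrix as plain Java objects, and it checks whether a point (for example a glyph origin in page space) falls inside the form's clip area.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
public class ClipCheck
{
    /**
     * Maps a form bounding box into page space.
     * The form matrix is applied first, then the CTM.
     */
    public static Area clipInPageSpace(Rectangle2D bbox,
                                       AffineTransform formMatrix,
                                       AffineTransform ctm)
    {
        // concatenate() makes the argument apply first: toPage = ctm x formMatrix
        AffineTransform toPage = new AffineTransform(ctm);
        toPage.concatenate(formMatrix);
        return new Area(toPage.createTransformedShape(bbox));
    }

    /** Returns true if the point (in page space) lies inside the clip area. */
    public static boolean isVisible(Area clip, Point2D pointInPageSpace)
    {
        return clip.contains(pointInPageSpace);
    }

    public static void main(String[] args)
    {
        // Assumed sample values: a 100 x 50 box, scaled by 2, moved to (200, 300).
        Rectangle2D bbox = new Rectangle2D.Double(0, 0, 100, 50);
        AffineTransform formMatrix = AffineTransform.getScaleInstance(2, 2);
        AffineTransform ctm = AffineTransform.getTranslateInstance(200, 300);

        // The clip area in page space spans (200, 300) to (400, 400).
        Area clip = clipInPageSpace(bbox, formMatrix, ctm);

        System.out.println(isVisible(clip, new Point2D.Double(350, 390))); // true
        System.out.println(isVisible(clip, new Point2D.Double(350, 450))); // false
    }
}
```

In a PDFStreamEngine subclass the same test could be run against each text position while a form is being processed, intersecting the clip areas of nested forms as you go. That wiring is left out here because it depends on how the form content is processed.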