Hi,
Below is the output from parsing the PDF document with parse-pdf.
I have tried every approach I can think of, but I cannot remove the JavaScript element:
PDF Comment '%PDF-1.5\n'
PDF Comment '%öäüß\n'
obj 1 0
Type: /Catalog
Referencing: 2 0 R, 3 0 R, 4 0 R, 5 0 R
<<
/Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
/Names 4 0 R
/AA 5 0 R
>>
obj 6 0
Type:
Referencing:
<<
>>
obj 2 0
Type: /Outlines
Referencing:
<<
/Type /Outlines
/Count 0
>>
obj 3 0
Type: /Pages
Referencing: 7 0 R
<<
/Type /Pages
/Kids [7 0 R]
/Count 1
>>
obj 4 0
Type:
Referencing:
<<
>>
obj 5 0
Type:
Referencing:
<<
>>
obj 7 0
Type: /Page
Referencing: 3 0 R, 8 0 R
<<
/Type /Page
/Parent 3 0 R
/MediaBox [184 105 295 275]
/AA
<<
/O
<<
/JS 8 0 R
/S /JavaScript
>>
>>
>>
obj 8 0
Type:
Referencing:
Contains stream
<<
/Length 22707
/Filter [/ASCIIHexDecode]
>>
xref
trailer
<<
/Size 9
/Root 1 0 R
/Info 6 0 R
/ID [<A907B1FADCDEB716192423CEBAF39A77><A907B1FADCDEB716192423CEBAF39A77>]
>>
startxref 23176
PDF Comment '%%EOF\n'
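Object 8 in the dump is the stream that /JS points to, and its /Filter is /ASCIIHexDecode, so the script is stored as plain hex digits. The following is a minimal, stdlib-only sketch of that decoding, for looking at such a stream by hand. The class name AsciiHexSketch and the sample input are made up for illustration; the real stream content is not shown in the dump.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Stdlib-only sketch of ASCIIHexDecode (hypothetical helper, not part of PDFBox). */
final class AsciiHexSketch {

    /** Decodes hex digit pairs, skipping whitespace and stopping at the '>' end-of-data marker. */
    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pending = -1; // high nibble waiting for its partner, or -1 if none
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '>') {
                break; // end-of-data marker
            }
            if (c == 0 || Character.isWhitespace(c)) {
                continue; // whitespace between digits is ignored
            }
            int nibble = Character.digit(c, 16);
            if (nibble < 0) {
                throw new IllegalArgumentException("Not a hex digit: " + c);
            }
            if (pending < 0) {
                pending = nibble;
            } else {
                out.write((pending << 4) | nibble);
                pending = -1;
            }
        }
        if (pending >= 0) {
            out.write(pending << 4); // an odd final digit is treated as if followed by 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for an extracted stream body.
        byte[] decoded = decode("61 70 70 2E 61 6C 65 72\n74 28 31 29 3B>");
        System.out.println(new String(decoded, StandardCharsets.ISO_8859_1)); // prints: app.alert(1);
    }
}
```

PDFBox applies stream filters itself when a stream is read through its API, so a helper like this is only useful for inspecting hex that has been copied out of the file.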
Thanks,
Kodjo
--- Begin Message ---
Hi Tilman,
Thank you for the quick response.
Below is the sample code I am using to strip out all the JavaScript.
I was surprised that this exploit managed to circumvent the process. The entry
point is sanitize(PDDocument pdfDoc).
I have been having the same problem: the virus scanner is preventing me from
viewing the file. I will do some more research to work out exactly where the
JavaScript or malicious code is, so that I can remove it.
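In the dump above, the only /JS key sits in object 7, the /Page, inside its /AA dictionary under /O, and it points at the stream in object 8. The catalog's own /AA (object 5) and /Names (object 4) are empty. A first pass for locating such entries is a plain token count over the parser output. The sketch below is a stdlib-only illustration; the class name PdfKeywordScan and the key list are assumptions, not part of parse-pdf or PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts script-related name tokens in PDF text (hypothetical helper). */
final class PdfKeywordScan {

    // Name tokens that commonly introduce actions or scripts in a PDF.
    private static final String[] KEYS = {"/JS", "/JavaScript", "/AA", "/OpenAction"};

    /** Returns how often each key occurs as a whole token in the given text. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : KEYS) {
            // The lookahead stops a key from matching the start of a longer name.
            Pattern p = Pattern.compile(Pattern.quote(key) + "(?![A-Za-z0-9])");
            Matcher m = p.matcher(text);
            int n = 0;
            while (m.find()) {
                n++;
            }
            counts.put(key, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Object 7 as it appears in the dump.
        String obj7 = String.join("\n",
                "<<", "/Type /Page", "/Parent 3 0 R", "/MediaBox [184 105 295 275]",
                "/AA", "<<", "/O", "<<", "/JS 8 0 R", "/S /JavaScript", ">>", ">>", ">>");
        System.out.println(count(obj7)); // prints: {/JS=1, /JavaScript=1, /AA=1, /OpenAction=0}
    }
}
```

A text scan can miss names written with #xx escapes or objects stored in compressed object streams, so a zero count is not proof that a file is clean. Note also that none of the remove* methods below operate on a page's /AA dictionary, which PDFBox exposes through PDPage.getActions().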
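// Imports assume PDFBox 2.0.x. PdfSanitizationException is the poster's own
// exception class and is not shown in this message.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDJavascriptNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDDocumentCatalogAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.action.PDFormFieldAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

/**
 * The following code was taken from here:
 * https://github.com/mjclemente/pdfbox.cfc
 */
public class PdfSanitizer {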
    /**
     * https://stackoverflow.com/questions/14454387/pdfbox-how-to-flatten-a-pdf-form#19723539
     * @hint Flattens any forms on the pdf
     * Note that data in XFA forms is not visible after this process.
     * Chrome/Firefox/Safari/Preview no longer support XFA PDFs; the format seems to
     * be on its way out and is only supported by Adobe (via Acrobat) and IE. Adobe
     * ColdFusion does not allow cfpdf's 'sanitize' action on PDFs with XFA content.
     */
    protected void flatten(PDDocument pdfDoc) throws PdfSanitizationException {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if ( acroForm != null ) {
            try {
                acroForm.flatten();
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
    }
    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotation.html
     * @hint returns all annotations within the pdf as a list; the type of
     * each object returned is PDAnnotation, so you'll need to look at the javadocs
     * for that to see what methods are available
     */
    protected List<PDAnnotation> listAnnotations(PDDocument pdfDoc) throws PdfSanitizationException {
        List<PDAnnotation> annotations = new ArrayList<>();
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while( iterator.hasNext() ) {
            PDPage page = iterator.next();
            try {
                annotations.addAll(page.getAnnotations());
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
        return annotations;
    }
    /**
     * https://stackoverflow.com/questions/32741468/how-to-delete-annotations-in-pdf-file-using-pdfbox
     * https://lists.apache.org/thread.html/d5b5f7a1d07d4eb9c515054ae7e87bdf4aefb3f138b235f82297401d@%3Cusers.pdfbox.apache.org%3E
     * @hint Strips out comments and other annotations
     * Form fields are made visible/usable via annotations (as I understand it);
     * consequently, removing all annotations renders forms, effectively, invisible
     * and unusable, though the markup remains present (visible via the Debugger).
     * The default behavior, therefore, is to leave annotations related to forms
     * present, so that the forms remain functional. While you can remove form
     * annotations by setting preserveForm = false, the better approach is to use
     * flatten().
     * Reminder: Added links are a type of annotation (PDAnnotationLink) so
     * they're removed by this method
     */
    protected void removeAnnotations( PDDocument pdfDoc, Boolean preserveForm) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while( iterator.hasNext() ) {
            PDPage page = iterator.next();
            if ( !preserveForm ) {
                page.setAnnotations(null);
            } else {
                List<PDAnnotation> annotations = new ArrayList<>();
                try {
                    for(PDAnnotation annotation: page.getAnnotations()) {
                        // Literal first, so a missing /Subtype cannot cause a NullPointerException.
                        if ("Widget".equalsIgnoreCase(annotation.getSubtype())) {
                            annotations.add(annotation);
                        }
                    }
                } catch (IOException e) {
                    throw new PdfSanitizationException(e);
                }
                page.setAnnotations( annotations );
            }
        }
    }
    /**
     * https://stackoverflow.com/questions/17019960/extract-embedded-files-from-pdf-using-pdfbox-in-net-application
     * https://github.com/Valuya/fontbox/blob/master/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java
     * @hint Removes embedded files
     */
    protected void removeEmbeddedFiles(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();
        if (efTree != null) {
            efTree.getCOSObject().clear();
        }
    }
    /**
     * @hint Attempts to remove all javascript from the pdf. Javascript can
     * appear in a lot of places; this tackles the standard locations. If more are
     * found, they'll be incorporated here.
     */
    protected void removeJavaScript(PDDocument pdfDoc) throws PdfSanitizationException {
        removeEmbeddedJavaScript(pdfDoc);
        removeDocumentJavaScriptActions(pdfDoc);
        removeFormFieldActions(pdfDoc);
        removeLinkActions(pdfDoc);
    }
    /**
     * @hint Removes the javascript embedded in the document itself
     */
    protected void removeEmbeddedJavaScript(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDJavascriptNameTreeNode embeddedJavaScript = namesDictionary.getJavaScript();
        if (embeddedJavaScript != null) {
            embeddedJavaScript.getCOSObject().clear();
        }
    }
    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDDocumentCatalogAdditionalActions.html
     * @hint Removes the actions that can be triggered on open, before close,
     * before/after printing, and before/after saving
     */
    protected void removeDocumentJavaScriptActions(PDDocument pdfDoc) {
        PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
        catalog.setOpenAction(null);
        PDDocumentCatalogAdditionalActions actions = catalog.getActions();
        if (actions != null) {
            actions.setDP( null);
            actions.setDS( null);
            actions.setWC( null);
            actions.setWP( null);
            actions.setWS( null);
        }
    }
    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDFormFieldAdditionalActions.html
     * There may be another class this needs to address:
     * PDAnnotationAdditionalActions (but I'm not sure exactly how these actions
     * differ from those handled here).
     * For reference and future examination, PDAnnotationAdditionalActions is
     * returned by PDAnnotationWidget
     * (https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationWidget.html),
     * which is the annotation type related to form fields.
     * @hint removes actions embedded in the form fields ( triggered onFocus,
     * onBlur, etc )
     */
    protected void removeFormFieldActions(PDDocument pdfDoc) {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if ( acroForm != null ) {
            Iterator<PDField> iterator = acroForm.getFieldIterator();
            while( iterator.hasNext() ) {
                PDField formField = iterator.next();
                PDFormFieldAdditionalActions formFieldActions = formField.getActions();
                if ( formFieldActions != null ) {
                    formFieldActions.getCOSObject().clear();
                }
            }
        }
    }
    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationLink.html
     * @hint removes actions embedded in the links ( triggered onFocus, onBlur,
     * etc )
     */
    protected void removeLinkActions(PDDocument pdfDoc) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while( iterator.hasNext() ) {
            PDPage page = iterator.next();
            try {
                List<PDAnnotation> annotations = page.getAnnotations();
                for(PDAnnotation annotation: annotations) {
                    // Check the type instead of using ==, which tests String identity, not content.
                    if (annotation instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        PDAction action = link.getAction();
                        // A link may have no action at all, so guard against null.
                        if (action != null && "JavaScript".equals(action.getSubType())) {
                            action.getCOSObject().clear();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
    }
    /**
     * @hint Removes metadata from the document
     *
     * Reference: metadata is stored in two separate locations in a document:
     * The Info (Document Information) - likely a key value pairing.
     * The XMP XML
     * Different PDF readers, when displaying document information may give
     * preference to different sources. For example, Preview may read the "Author A"
     * from Document Information, while Acrobat may ignore that and read dc:creator
     * element from the XML and display "Author B".
     * Using the PDFDebugger bundled with PDFBox, via
     * `java -jar pdfbox-app-2.0.11.jar PDFDebugger -viewstructure example.pdf`
     * will provide an accurate view of both Document Information and XML metadata,
     * and so is preferable to pdf readers
     */
    protected void removeMetaData(PDDocument pdfDoc) {
        PDDocumentInformation documentInfo = pdfDoc.getDocumentInformation();
        documentInfo.setAuthor(null);
        documentInfo.setCreationDate(null);
        documentInfo.setCreator(null);
        documentInfo.setKeywords(null);
        documentInfo.setModificationDate(null);
        documentInfo.setProducer(null);
        documentInfo.setSubject(null);
        documentInfo.setTitle(null);
        documentInfo.setTrapped(null);
        /*
        org.apache.xmpbox.XMPMetadata
        var XMPMetadata = createObject( 'java', 'org.apache.xmpbox.XMPMetadata' );
        var metadata = XMPMetadata.createXMPMetadata();
        var serializer = createObject( 'java', 'org.apache.xmpbox.xml.XmpSerializer' );
        var baos = createObject( 'java', 'java.io.ByteArrayOutputStream' ).init();
        serializer.serialize( metadata, baos, true );
        var metadataStream = createObject( 'java', 'org.apache.pdfbox.pdmodel.common.PDMetadata' ).init( variables.pdf );
        metadataStream.importXMPMetadata( baos.toByteArray() );
        variables.pdf.getDocumentCatalog().setMetadata( metadataStream );
        variables.hasMetadata = false;
        */
    }
    /**
     * https://lists.apache.org/thread.html/801ea985610d3adf51cb69103729797af3a745a9364bc3f442f80384@%3Cusers.pdfbox.apache.org%3E
     * @hint If there is an embedded search index, this removes it (at least
     * the instances of embedded search indexes that I've seen)
     */
    protected void removeEmbeddedIndex(PDDocument pdfDoc) {
        // getDictionaryObject resolves an indirect reference before the type check.
        COSBase pieceInfo = pdfDoc.getDocumentCatalog().getCOSObject()
                .getDictionaryObject(COSName.getPDFName("PieceInfo"));
        if (pieceInfo instanceof COSDictionary) {
            ((COSDictionary) pieceInfo).removeItem(COSName.getPDFName("SearchIndex"));
        }
    }
    /**
     * @hint Runs all data removal methods on the pdf. As new methods are added
     * to the component, they'll be added here as well. Please be aware that sensitive
     * data may remain in the pdf, even after running this method.
     */
    public void sanitize(PDDocument pdfDoc) throws PdfSanitizationException {
        removeAnnotations(pdfDoc, false);
        removeEmbeddedFiles(pdfDoc);
        removeJavaScript(pdfDoc);
        removeEmbeddedIndex(pdfDoc);
        removeMetaData(pdfDoc);
        flatten(pdfDoc);
    }
}
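Subtype checks such as the ones in removeLinkActions compare strings, and in Java that only works reliably with equals(): the == operator tests whether two references point at the same object. A small stdlib-only illustration follows. The class name SubtypeCompareSketch is made up, and the string is built by hand to stand in for a value produced by a parser.

```java
/** Shows why string content should be compared with equals() (hypothetical demo class). */
final class SubtypeCompareSketch {

    /** Null-safe content comparison: the literal is the receiver. */
    static boolean isLink(String subtype) {
        return "Link".equals(subtype);
    }

    public static void main(String[] args) {
        // Built at run time, so this is not the same object as the literal "Link".
        String subtype = new String(new char[] {'L', 'i', 'n', 'k'});

        System.out.println(subtype == "Link"); // false: different objects
        System.out.println(isLink(subtype));   // true: same characters
        System.out.println(isLink(null));      // false, and no NullPointerException
    }
}
```

Whether == happens to return true in a given program depends on how the string was created, which is why it can appear to work in one place and silently skip entries in another.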
________________________________________
From: Tilman Hausherr [thaush...@t-online.de]
Sent: 22 July 2019 18:03
To: users@pdfbox.apache.org
Subject: Re: Unable to remove the virus associated with the file
Hi,
I'm unable to download that file... nothing happens. Maybe my antivirus
is blocking the download. I suggest you upload it as text (rename it to .txt).
Anyway, you should just tell the sender that it is a virus. Either there
is some suspicious javascript in it, or something else that triggers mayhem.
Tilman
On 22.07.2019 at 16:22, Kodjo Afriyie - iSite Eng wrote:
> Hi,
>
> I have been trying to remove a virus that has been detected on a pdf file..
> The link below is the offending file..
>
> https://1drv.ms/u/s!AmNEMt7g6Kbuhh2hqVo8iKKEn9Tj?e=ERJ1uq
>
> Below is the message that is displayed when the file is downloaded onto my
> computer.
>
> [X]
>
>
> Thanks,
> Kodjo
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
--- End Message ---