RE: Unable to remove the virus associated with the file

Kodjo Afriyie - iSite Eng Tue, 23 Jul 2019 01:00:28 -0700

Hi Tilman,

Thank you for the quick response.
Below is the sample code I am using to strip out all the JavaScript.


I was surprised that this exploit managed to circumvent the process. The entry 
point is sanitize(PDDocument pdfDoc).

I have been having the same problem: the virus scan prevents me from viewing 
the file. I will do some more research to find out exactly where the JavaScript 
or malicious code is, so that I can remove it.
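
As a starting point for that research, here is a rough sketch (plain JDK, no PDFBox) that counts a few name tokens often associated with active content in the raw bytes of a file. The class name PdfMarkerScan and the marker list are assumptions made for illustration, not part of PDFBox. It will miss names stored inside compressed object streams or written with #xx hex escapes, so a zero count proves nothing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class PdfMarkerScan {

    // Assumed, non-exhaustive list of names that often accompany active content.
    static final String[] MARKERS = {
        "/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile", "/XFA"
    };

    static Map<String, Integer> countMarkers(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            int count = 0;
            int from = 0;
            while ((from = text.indexOf(marker, from)) >= 0) {
                int end = from + marker.length();
                // Heuristic whole-name check: "/AA" should not match inside "/AAPL".
                if (end == text.length() || !Character.isLetterOrDigit(text.charAt(end))) {
                    count++;
                }
                from = end;
            }
            counts.put(marker, count);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        countMarkers(bytes).forEach((marker, count) -> System.out.println(marker + ": " + count));
    }
}
```

Running `java PdfMarkerScan suspicious.pdf` (a hypothetical file name) prints one count per marker. Any non-zero entry is a hint about where to look in the PDFDebugger, not a verdict.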

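import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDJavascriptNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDDocumentCatalogAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.action.PDFormFieldAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

/**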
 * The following code was taken from here:
 * https://github.com/mjclemente/pdfbox.cfc
 */
public class PdfSanitizer {

    /**
     * https://stackoverflow.com/questions/14454387/pdfbox-how-to-flatten-a-pdf-form#19723539
     * @hint Flattens any forms on the pdf
     * Note that data in XFA forms is not visible after this process.
     * Chrome/Firefox/Safari/Preview no longer support XFA PDFs; the format seems to
     * be on its way out and is only supported by Adobe (via Acrobat) and IE. Adobe
     * ColdFusion does not allow cfpdf's 'sanitize' action on PDFs with XFA content.
     */
    protected void flatten(PDDocument pdfDoc) throws PdfSanitizationException {

        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if ( acroForm != null ) {
            try {
                acroForm.flatten();
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }

    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotation.html
     * @hint Returns all annotations within the pdf as a list; the type of
     * each object returned is PDAnnotation, so you'll need to look at the javadocs
     * for that to see what methods are available
     */
    protected List<PDAnnotation> listAnnotations(PDDocument pdfDoc)
            throws PdfSanitizationException {
        List<PDAnnotation> annotations = new ArrayList<>();
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while( iterator.hasNext() ) {
            PDPage page = iterator.next();
            try {
                annotations.addAll(page.getAnnotations());
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
        return annotations;
    }

    /**
     * https://stackoverflow.com/questions/32741468/how-to-delete-annotations-in-pdf-file-using-pdfbox
     * https://lists.apache.org/thread.html/d5b5f7a1d07d4eb9c515054ae7e87bdf4aefb3f138b235f82297401d@%3Cusers.pdfbox.apache.org%3E
     * @hint Strips out comments and other annotations
     * Form fields are made visible/usable via annotations (as I understand
     * it); consequently, removing all annotations renders forms,
     * effectively, invisible and unusable, though the markup remains present
     * (visible via the Debugger).
     * The default behavior, therefore, is to leave annotations related to
     * forms present, so that the forms remain functional. While you can remove
     * form annotations by setting preserveForm = false,
     * the better approach is to use flatten().
     * Reminder: Added links are a type of annotation (PDAnnotationLink) so
     * they're removed by this method
     */
    protected void removeAnnotations(PDDocument pdfDoc, Boolean preserveForm)
            throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();

        while( iterator.hasNext() ) {
            PDPage page = iterator.next();

            if ( !preserveForm ) {
                page.setAnnotations(null);
            } else {
                List<PDAnnotation> annotations = new ArrayList<>();

                try {
                    for(PDAnnotation annotation: page.getAnnotations()) {
                        // Literal first, so an annotation without a subtype cannot cause an NPE.
                        if ("Widget".equalsIgnoreCase(annotation.getSubtype())) {
                            annotations.add(annotation);
                        }
                    }
                } catch (IOException e) {
                    throw new PdfSanitizationException(e);
                }
                page.setAnnotations( annotations );
            }
        }
    }


    /**
     * https://stackoverflow.com/questions/17019960/extract-embedded-files-from-pdf-using-pdfbox-in-net-application
     * https://github.com/Valuya/fontbox/blob/master/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java
     * @hint Removes embedded files
     */
    protected void removeEmbeddedFiles(PDDocument pdfDoc) {

        PDDocumentNameDictionary namesDictionary =
                new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();

        if (efTree != null) {
            efTree.getCOSObject().clear();

        }
    }

    /**
     * @hint Attempts to remove all javascript from the pdf. Javascript can
     * appear in a lot of places; this tackles the standard locations. If more are
     * found, they'll be incorporated here.
     */
    protected void removeJavaScript(PDDocument pdfDoc)
            throws PdfSanitizationException {
        removeEmbeddedJavaScript(pdfDoc);
        removeDocumentJavaScriptActions(pdfDoc);
        removeFormFieldActions(pdfDoc);
        removeLinkActions(pdfDoc);
    }

    /**
     * @hint Removes the javascript embedded in the document itself
     */
    protected void removeEmbeddedJavaScript(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary =
                new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDJavascriptNameTreeNode embeddedJavaScript =
                namesDictionary.getJavaScript();
        if (embeddedJavaScript != null) {
            embeddedJavaScript.getCOSObject().clear();
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDDocumentCatalogAdditionalActions.html
     * @hint Removes the actions that can be triggered on open, before close,
     * before/after printing, and before/after saving
     */
    protected void removeDocumentJavaScriptActions(PDDocument pdfDoc) {

        PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
        catalog.setOpenAction(null);

        PDDocumentCatalogAdditionalActions actions = catalog.getActions();
        if (actions != null) {
            actions.setDP( null);
            actions.setDS( null);
            actions.setWC( null);
            actions.setWP( null);
            actions.setWS( null);
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDFormFieldAdditionalActions.html
     * There may be another class this needs to address:
     * PDAnnotationAdditionalActions (but I'm not sure exactly how these actions
     * differ from those handled here).
     * For reference and future examination, PDAnnotationAdditionalActions is
     * returned by PDAnnotationWidget
     * (https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationWidget.html),
     * which is the annotation type related to form fields.
     * @hint Removes actions embedded in the form fields ( triggered onFocus,
     * onBlur, etc )
     */
    protected void removeFormFieldActions(PDDocument pdfDoc) {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();

        if ( acroForm != null ) {
            Iterator<PDField> iterator = acroForm.getFieldIterator();

            while( iterator.hasNext() ) {
                PDField formField = iterator.next();
                PDFormFieldAdditionalActions formFieldActions =
                        formField.getActions();

                if ( formFieldActions != null ) {
                    formFieldActions.getCOSObject().clear();
                }
            }
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationLink.html
     * @hint Removes actions embedded in the links ( triggered onFocus, onBlur,
     * etc )
     */
    protected void removeLinkActions(PDDocument pdfDoc)
            throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();

        while( iterator.hasNext() ) {
            PDPage page = iterator.next();

            try {
                List<PDAnnotation> annotations = page.getAnnotations();

                for (PDAnnotation annotation : annotations) {
                    // Compare subtype names by value; == on strings compares references.
                    if ("Link".equals(annotation.getSubtype())) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        PDAction action = link.getAction();
                        // A link may carry only a destination and no action at all.
                        if (action != null && "JavaScript".equals(action.getSubType())) {
                            action.getCOSObject().clear();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
    }

    /**
     * @hint Removes metadata from the document
     *
     * Reference: metadata is stored in two separate locations in a document:
     * The Info (Document Information) - likely a key value pairing.
     * The XMP XML
     * Different PDF readers, when displaying document information, may give
     * preference to different sources. For example, Preview may read "Author A"
     * from Document Information, while Acrobat may ignore that and read the
     * dc:creator element from the XML and display "Author B".
     * Using the PDFDebugger bundled with PDFBox, via
     * `java -jar pdfbox-app-2.0.11.jar PDFDebugger -viewstructure example.pdf`
     * will provide an accurate view of both Document Information and XML
     * metadata, and so is preferable to pdf readers
     *
     */
    protected void removeMetaData(PDDocument pdfDoc) {
        PDDocumentInformation documentInfo = pdfDoc.getDocumentInformation();
        documentInfo.setAuthor(null);
        documentInfo.setCreationDate(null);
        documentInfo.setCreator(null);
        documentInfo.setKeywords(null);
        documentInfo.setModificationDate(null);
        documentInfo.setProducer(null);
        documentInfo.setSubject(null);
        documentInfo.setTitle(null);
        documentInfo.setTrapped(null);

        /*
        org.apache.xmpbox.XMPMetadata

        var XMPMetadata = createObject( 'java', 'org.apache.xmpbox.XMPMetadata' );
        var metadata = XMPMetadata.createXMPMetadata();

        var serializer = createObject( 'java', 'org.apache.xmpbox.xml.XmpSerializer' );
        var baos = createObject( 'java', 'java.io.ByteArrayOutputStream' ).init();
        serializer.serialize( metadata, baos, true );
        var metadataStream = createObject( 'java', 'org.apache.pdfbox.pdmodel.common.PDMetadata' ).init( variables.pdf );
        metadataStream.importXMPMetadata( baos.toByteArray() );
        variables.pdf.getDocumentCatalog().setMetadata( metadataStream );

        variables.hasMetadata = false;
         */
    }

    /**
     * https://lists.apache.org/thread.html/801ea985610d3adf51cb69103729797af3a745a9364bc3f442f80384@%3Cusers.pdfbox.apache.org%3E
     * @hint If there is an embedded search index, this removes it (at least
     * the instances of embedded search indexes that I've seen)
     */
    protected void removeEmbeddedIndex(PDDocument pdfDoc) {

        // getDictionaryObject resolves an indirect reference before the type check.
        COSBase pieceInfo =
                pdfDoc.getDocumentCatalog().getCOSObject().getDictionaryObject("PieceInfo");
        if (pieceInfo instanceof COSDictionary) {
            ((COSDictionary) pieceInfo).removeItem(COSName.getPDFName("SearchIndex"));
        }

    }

    /**
     * @hint Runs all data removal methods on the pdf. As new methods are added
     * to the component, they'll be added here as well. Please be aware that
     * sensitive data may remain in the pdf, even after running this method.
     */
    public void sanitize(PDDocument pdfDoc) throws PdfSanitizationException {
        removeAnnotations(pdfDoc, false);
        removeEmbeddedFiles(pdfDoc);
        removeJavaScript(pdfDoc);
        removeEmbeddedIndex(pdfDoc);
        removeMetaData(pdfDoc);
        flatten(pdfDoc);

    }

}
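
One detail that matters when checking subtypes such as "Link" or "JavaScript" in removeLinkActions: Java strings should be compared by value. Below is a minimal, PDFBox-free sketch of the difference; the class and method names are made up for illustration, and the `new String(...)` call merely stands in for a value that a library builds at runtime. Whether an identity comparison happens to succeed against a real library depends on how that library creates its strings, which is why the value comparison is the safer choice.

```java
public class SubtypeCheck {

    // Identity comparison: true only if both references point to the same String object.
    static boolean isLinkByIdentity(String subtype) {
        return subtype == "Link";
    }

    // Value comparison: true whenever the characters match; also safe for null input.
    static boolean isLinkByValue(String subtype) {
        return "Link".equals(subtype);
    }

    public static void main(String[] args) {
        // Stands in for a subtype string created at runtime rather than a compile-time literal.
        String parsed = new String("Link".toCharArray());

        System.out.println("identity: " + isLinkByIdentity(parsed));
        System.out.println("value:    " + isLinkByValue(parsed));
        System.out.println("null:     " + isLinkByValue(null));
    }
}
```

With an identity check, a JavaScript action can be skipped silently even though its subtype text matches, so the sanitizer would report success while leaving the script in place.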



________________________________________
From: Tilman Hausherr [thaush...@t-online.de]
Sent: 22 July 2019 18:03
To: users@pdfbox.apache.org
Subject: Re: Unable to remove the virus associated with the file

Hi,

I'm unable to download that file... nothing happens. Maybe my antivirus
prevents the download. I suggest you upload it as text (rename it to .txt).

Anyway, you should just tell the sender that it is a virus. Either there
is some suspicious javascript in it, or something else that triggers mayhem.

Tilman

Am 22.07.2019 um 16:22 schrieb Kodjo Afriyie - iSite Eng:
> Hi,
>
> I have been trying to remove a virus that has been detected on a pdf file..
> The link below is the offending file..
>
> https://1drv.ms/u/s!AmNEMt7g6Kbuhh2hqVo8iKKEn9Tj?e=ERJ1uq
>
> Below is the message that is displayed when the file is downloaded onto my 
> computer.
>
> [X]
>
>
> Thanks,
> Kodjo
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

