Hi Tilman,

Thank you for the quick response. Below is the sample code I am using to strip out all the JavaScript.
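I was surprised that this exploit managed to circumvent the process. The entry point is sanitize(PDDocument pdfDoc). I have been having the same problem: the virus scan is preventing me from viewing the file. I will do some more research to understand exactly where the JavaScript or the malicious code is, so that I can remove it.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDJavascriptNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDDocumentCatalogAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.action.PDFormFieldAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

/**
 * The following code was taken from here:
 * https://github.com/mjclemente/pdfbox.cfc
 */
public class PdfSanitizer {

    /**
     * https://stackoverflow.com/questions/14454387/pdfbox-how-to-flatten-a-pdf-form#19723539
     * @hint Flattens any forms on the pdf
     * Note that data in XFA forms is not visible after this process.
     * Chrome/Firefox/Safari/Preview no longer support XFA PDFs; the format seems to be
     * on its way out and is only supported by Adobe (via Acrobat) and IE.
     * Adobe ColdFusion does not allow cfpdf's 'sanitize' action on PDFs with XFA content.
     */
    protected void flatten(PDDocument pdfDoc) throws PdfSanitizationException {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if (acroForm != null) {
            try {
                acroForm.flatten();
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotation.html
     * @hint returns all annotations within the pdf as a list; the type of each object returned
     * is PDAnnotation, so you'll need to look at the javadocs for that to see what methods are available
     */
    protected List<PDAnnotation> listAnnotations(PDDocument pdfDoc) throws PdfSanitizationException {
        List<PDAnnotation> annotations = new ArrayList<>();
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while (iterator.hasNext()) {
            PDPage page = iterator.next();
            try {
                annotations.addAll(page.getAnnotations());
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
        return annotations;
    }

    /**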
     * https://stackoverflow.com/questions/32741468/how-to-delete-annotations-in-pdf-file-using-pdfbox
     * https://lists.apache.org/thread.html/d5b5f7a1d07d4eb9c515054ae7e87bdf4aefb3f138b235f82297401d@%3Cusers.pdfbox.apache.org%3E
     * @hint Strips out comments and other annotations
     * Form fields are made visible/usable via annotations (as I understand it); consequently,
     * removing all annotations renders forms, effectively, invisible and unusable, though the
     * markup remains present (visible via the Debugger).
     * The default behavior, therefore, is to leave annotations related to forms present,
     * so that the forms remain functional. While you can remove form annotations by setting
     * preserveForm = false, the better approach is to use flatten().
     * Reminder: Added links are a type of annotation (PDAnnotationLink) so they're removed by this method
     */
    protected void removeAnnotations(PDDocument pdfDoc, Boolean preserveForm) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while (iterator.hasNext()) {
            PDPage page = iterator.next();
            if (!preserveForm) {
                page.setAnnotations(null);
            } else {
                List<PDAnnotation> annotations = new ArrayList<>();
                try {
                    for (PDAnnotation annotation : page.getAnnotations()) {
                        if (annotation.getSubtype().equalsIgnoreCase("Widget")) {
                            annotations.add(annotation);
                        }
                    }
                } catch (IOException e) {
                    throw new PdfSanitizationException(e);
                }
                page.setAnnotations(annotations);
            }
        }
    }

    /**
     * https://stackoverflow.com/questions/17019960/extract-embedded-files-from-pdf-using-pdfbox-in-net-application
     * https://github.com/Valuya/fontbox/blob/master/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java
     * @hint Removes embedded files
     */
    protected void removeEmbeddedFiles(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();
        if (efTree != null) {
            efTree.getCOSObject().clear();
        }
    }

    /**
     * @hint Attempts to remove all javascript from the pdf. Javascript can appear in a lot of places;
     * this tackles the standard locations. If more are found, they'll be incorporated here.
     */
    protected void removeJavaScript(PDDocument pdfDoc) throws PdfSanitizationException {
        removeEmbeddedJavaScript(pdfDoc);
        removeDocumentJavaScriptActions(pdfDoc);
        removeFormFieldActions(pdfDoc);
        removeLinkActions(pdfDoc);
    }

    /**
     * @hint Removes the javascript embedded in the document itself
     */
    protected void removeEmbeddedJavaScript(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDJavascriptNameTreeNode embeddedJavaScript = namesDictionary.getJavaScript();
        if (embeddedJavaScript != null) {
            embeddedJavaScript.getCOSObject().clear();
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDDocumentCatalogAdditionalActions.html
     * @hint Removes the actions that can be triggered on open, before close, before/after printing,
     * and before/after saving
     */
    protected void removeDocumentJavaScriptActions(PDDocument pdfDoc) {
        PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
        catalog.setOpenAction(null);
        PDDocumentCatalogAdditionalActions actions = catalog.getActions();
        if (actions != null) {
            actions.setDP(null);
            actions.setDS(null);
            actions.setWC(null);
            actions.setWP(null);
            actions.setWS(null);
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDFormFieldAdditionalActions.html
     * There may be another class this needs to address: PDAnnotationAdditionalActions
     * (but I'm not sure exactly how these actions differ from those handled here).
     * For reference and future examination, PDAnnotationAdditionalActions is returned by PDAnnotationWidget
     * (https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationWidget.html),
     * which is the annotation type related to form fields.
     * @hint removes actions embedded in the form fields ( triggered onFocus, onBlur, etc )
     */
    protected void removeFormFieldActions(PDDocument pdfDoc) {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if (acroForm != null) {
            Iterator<PDField> iterator = acroForm.getFieldIterator();
            while (iterator.hasNext()) {
                PDField formField = iterator.next();
                PDFormFieldAdditionalActions formFieldActions = formField.getActions();
                if (formFieldActions != null) {
                    formFieldActions.getCOSObject().clear();
                }
            }
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationLink.html
     * @hint removes actions embedded in the links ( triggered onFocus, onBlur, etc )
     */
    protected void removeLinkActions(PDDocument pdfDoc) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while (iterator.hasNext()) {
            PDPage page = iterator.next();
            try {
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    // Compare strings with equals(); == only compares object references
                    if ("Link".equals(annotation.getSubtype())) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        PDAction action = link.getAction();
                        // getAction() returns null when the link has no action (e.g. destination only)
                        if (action != null && "JavaScript".equals(action.getSubType())) {
                            action.getCOSObject().clear();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
    }

    /**
     * @hint Removes metadata from the document
     *
     * Reference: metadata is stored in two separate locations in a document:
     *   The Info (Document Information) - likely a key value pairing.
     *   The XMP XML
     * Different PDF readers, when displaying document information, may give preference to different sources.
     * For example, Preview may read the "Author A" from Document Information, while Acrobat may ignore
     * that and read the dc:creator element from the XML and display "Author B".
     * Using the PDFDebugger bundled with PDFBox, via
     * `java -jar pdfbox-app-2.0.11.jar PDFDebugger -viewstructure example.pdf`
     * will provide an accurate view of both Document Information and XML metadata,
     * and so is preferable to pdf readers
     */
    protected void removeMetaData(PDDocument pdfDoc) {
        PDDocumentInformation documentInfo = pdfDoc.getDocumentInformation();
        documentInfo.setAuthor(null);
        documentInfo.setCreationDate(null);
        documentInfo.setCreator(null);
        documentInfo.setKeywords(null);
        documentInfo.setModificationDate(null);
        documentInfo.setProducer(null);
        documentInfo.setSubject(null);
        documentInfo.setTitle(null);
        documentInfo.setTrapped(null);
        /*
        org.apache.xmpbox.XMPMetadata
        var XMPMetadata = createObject( 'java', 'org.apache.xmpbox.XMPMetadata' );
        var metadata = XMPMetadata.createXMPMetadata();
        var serializer = createObject( 'java', 'org.apache.xmpbox.xml.XmpSerializer' );
        var baos = createObject( 'java', 'java.io.ByteArrayOutputStream' ).init();
        serializer.serialize( metadata, baos, true );
        var metadataStream = createObject( 'java', 'org.apache.pdfbox.pdmodel.common.PDMetadata' ).init( variables.pdf );
        metadataStream.importXMPMetadata( baos.toByteArray() );
        variables.pdf.getDocumentCatalog().setMetadata( metadataStream );
        variables.hasMetadata = false;
        */
    }

    /**
     * https://lists.apache.org/thread.html/801ea985610d3adf51cb69103729797af3a745a9364bc3f442f80384@%3Cusers.pdfbox.apache.org%3E
     * @hint If there is an embedded search index, this removes it (at least the instances of
     * embedded search indexes that I've seen)
     */
    protected void removeEmbeddedIndex(PDDocument pdfDoc) {
        // getDictionaryObject() resolves indirect references, so the instanceof check below is safe
        COSBase pieceInfo = pdfDoc.getDocumentCatalog().getCOSObject()
                .getDictionaryObject(COSName.getPDFName("PieceInfo"));
        if (pieceInfo instanceof COSDictionary) {
            ((COSDictionary) pieceInfo).removeItem(COSName.getPDFName("SearchIndex"));
        }
    }

    /**
     * @hint Runs all data removal methods on the
     * pdf. As new methods are added to the component, they'll be added here as well.
     * Please be aware that sensitive data may remain in the pdf, even after running this method.
     */
    public void sanitize(PDDocument pdfDoc) throws PdfSanitizationException {
        removeAnnotations(pdfDoc, false);
        removeEmbeddedFiles(pdfDoc);
        removeJavaScript(pdfDoc);
        removeEmbeddedIndex(pdfDoc);
        removeMetaData(pdfDoc);
        flatten(pdfDoc);
    }
}

________________________________________
From: Tilman Hausherr [thaush...@t-online.de]
Sent: 22 July 2019 18:03
To: users@pdfbox.apache.org
Subject: Re: Unable to remove the virus associated with the file

Hi,

I'm unable to download that file... nothing happens. Maybe my antivirus prevents the download. I suggest you upload it as text (rename it to .txt).

Anyway, you should just tell the sender that it is a virus. Either there is some suspicious JavaScript in it, or something else that triggers mayhem.

Tilman

On 22.07.2019 at 16:22, Kodjo Afriyie - iSite Eng wrote:
> Hi,
>
> I have been trying to remove a virus that has been detected on a pdf file..
> The link below is the offending file..
>
> https://1drv.ms/u/s!AmNEMt7g6Kbuhh2hqVo8iKKEn9Tj?e=ERJ1uq
>
> Below is the message that is displayed when the file is downloaded onto my
> computer.
>
> [X]
>
>
> Thanks,
> Kodjo
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
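
P.S. One possible first step for the research mentioned at the top (finding where exactly the JavaScript or other active content sits) is to scan the raw file for the PDF name tokens that usually introduce it. The following is only a rough sketch under stated assumptions: it uses plain Java rather than PDFBox, the class name PdfTokenScan and the token list are illustrative choices of mine and not part of PDFBox or pdfbox.cfc, and it will miss names hidden inside compressed object streams or written with #xx hex escapes. A count of zero therefore does not prove the file is clean.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for a first look at a suspicious PDF; not part of PDFBox.
final class PdfTokenScan {

    // Illustrative, non-exhaustive list of names that often accompany active content.
    static final String[] TOKENS = {
        "/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch", "/EmbeddedFiles"
    };

    // Characters that end a PDF name: delimiters and whitespace.
    private static final String NAME_TERMINATORS = "()<>[]{}/% \t\r\n\f\0";

    // Counts whole-name occurrences of each token in the uncompressed part of the file.
    static Map<String, Integer> countTokens(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : TOKENS) {
            int count = 0;
            int from = text.indexOf(token);
            while (from >= 0) {
                int end = from + token.length();
                // Skip longer names that merely start with the token (e.g. /JavaScriptX).
                if (end == text.length() || NAME_TERMINATORS.indexOf(text.charAt(end)) >= 0) {
                    count++;
                }
                from = text.indexOf(token, end);
            }
            counts.put(token, count);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfTokenScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        countTokens(bytes).forEach((token, n) -> System.out.println(token + ": " + n));
    }
}
```

Any non-zero count only tells you which kind of entry to look for in the PDFDebugger structure view; it does not say whether the sanitizer above has already handled it.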