Thanks for the info Tim. I will ask the authors of that parser if they are 
interested in switching the license and also check whether I can use it as-is. If I add 
the abstraction code myself, what do you think is a good model to copy from? 
PDF parsing?

Obfuscation is a big deal, as you say. Perhaps I can help on that later down 
the line. I understand your conundrum regarding a general tool vs specific 
forensic analysis. If you try to support every nuance of every format, then 
perhaps the utility of a generic abstraction is lost because every format has 
"special" cases.

I will develop with the latest snapshot of the system and see where I get to. I 
am able to see at least some VBA macros as embedded documents using the current 
snapshot and that is good enough for the moment.
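
For reference, the sketch below is roughly the kind of thing I mean, written 
against the 1.13-era RecursiveParserWrapper API (I gather this API may change in 
later releases); the file name is purely illustrative:

import java.io.InputStream;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ListEmbeddedDocs {
    public static void main(String[] args) throws Exception {
        // Wrap AutoDetectParser so that every embedded document (e.g. a VBA
        // macro module) gets its own Metadata entry and its own text.
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(Paths.get("sample.docm"))) {
            wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
        }

        // One Metadata per (sub)document: the container first, then each
        // embedded item, with its extracted text under the X-TIKA:content key.
        List<Metadata> all = wrapper.getMetadata();
        for (Metadata m : all) {
            System.out.println(m.get(Metadata.CONTENT_TYPE) + "  "
                    + m.get(Metadata.RESOURCE_NAME_KEY));
        }
    }
}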

I have an ultimate goal of trying to identify at least some malicious items 
using machine learning techniques, but I am quite a way off that as of now I 
think :)

Jim

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, November 7, 2016 21:06
To: user@tika.apache.org
Subject: RE: PDF Processing

>https://www.free-decompiler.com/flash/
> and perhaps a thing for me to do would be to add abstraction support for this 
>parser to Tika.

Yes, the license on that is incompatible with the Apache License [1], so we can't 
include it unless we get the authors to change the license.  Also, it looks 
like it requires native libs?  But _you_ could easily write a wrapper for it 
for your use [2].
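
A skeleton of such a wrapper, following [2], might look roughly like this; the 
call into the decompiler is just a placeholder for whatever that library actually 
exposes:

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class SwfDecompilerParser extends AbstractParser {

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("x-shockwave-flash"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        metadata.set(Metadata.CONTENT_TYPE, "application/x-shockwave-flash");

        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // Hand the stream to the external decompiler and emit whatever
        // ActionScript it recovers; this call is a made-up placeholder:
        // String script = FfdecWrapper.extractActionScript(stream);
        // xhtml.element("p", script);
        xhtml.endDocument();
    }
}

You'd then register it via a META-INF/services/org.apache.tika.parser.Parser 
entry or a tika-config.xml, as described in [2].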

> I am using Tika to extract and later scan/process components of any document 
> that may perform malicious actions ... either add enhancements or maybe add 
> new parsers.

This is something I've been struggling with: how much processing is the right 
amount for a general tool like Tika? True forensic analysis is, indeed, a tall 
order, and there is an abundance of file-format-specific scripts to handle 
various aspects. In short, be careful about relying on what we have so far, and 
please do open issues as needed:

1) We made some recent improvements to macro extraction in POI, but those 
won't be folded in until Tika 1.15.

2) My initial patch for javascript extraction from PDFs will not handle 
the more fun obfuscation techniques [3].

Best,

Tim

[1] https://www.apache.org/legal/resolved#category-x
[2] https://tika.apache.org/1.13/parser_guide.html
[3] just google pdf javascript obfuscation

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Sunday, November 6, 2016 8:41 PM
To: user@tika.apache.org
Subject: RE: PDF Processing

I forgot to answer "If there are other components that you'd like to have 
extracted, let us know, and we'll consider adding them." I am using Tika to 
extract and later scan/process components of any document that may perform 
malicious actions. So that is any script-like or macro-like construct, plus any 
binary data, embedded images and so forth. So essentially I need to break down 
all components of all documents, which is a tall order of course. But the 
collection of parsers that Tika provides seems like my best bet, and I can 
either add enhancements or perhaps write new parsers.
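
For the binary pieces, my rough plan is something like the sketch below: drop a 
custom EmbeddedDocumentExtractor into the ParseContext so the raw bytes of each 
embedded item are written out, not just their extracted text (the output 
directory, file naming, and input file are made up for illustration):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class SaveEmbeddedBytes {

    // Writes the raw bytes of every embedded item (scripts, images, OLE
    // streams, ...) to disk instead of parsing them for text.
    static class SavingExtractor implements EmbeddedDocumentExtractor {
        private final Path outDir;
        private int count = 0;

        SavingExtractor(Path outDir) {
            this.outDir = outDir;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Files.copy(stream, outDir.resolve("embedded-" + (count++)));
        }
    }

    public static void main(String[] args) throws Exception {
        Path outDir = Paths.get("extracted");
        Files.createDirectories(outDir);

        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class, new SavingExtractor(outDir));

        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream is = Files.newInputStream(Paths.get("sample.pdf"))) {
            parser.parse(is, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}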

For instance, it seems that Flash is only supported via FLV. There is what looks 
like a good parser here: 
https://www.free-decompiler.com/flash/
and perhaps something I could do would be to add abstraction support for this 
parser to Tika.

Jim

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Thursday, November 3, 2016 10:11
To: user@tika.apache.org
Subject: RE: PDF Processing

PDAction extraction is probably what I need. Embedded streams in general, 
though for non-text "pieces" it would be fine to get offset and length 
information from some event. I will take a look at your example output below.

I'll press on with Tika as an abstraction for now as I generally like what I 
see. I am just a bit worried that the one abstraction to rule them all may 
preclude me from easily handling more esoteric parts of some document formats.

I presume that the best way to request enhancements is to create a JIRA entry 
so it can be tracked?

Thanks for your help,

Jim

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, November 2, 2016 19:02
To: user@tika.apache.org
Subject: RE: PDF Processing

It depends (tm).  As soon as 1.14 is released, I'll add PDAction extraction 
from PDFs (TIKA-2090), and that will include javascript (as stored in 
PDActions)... that capability doesn't currently exist.  If there are other 
components that you'd like to have extracted, let us know, and we'll consider 
adding them.

If you want a look at what javascript extraction will look like, I recently 
extracted ~70k javascript elements from our 500k regression corpus:
http://162.242.228.174/embedded_files

specifically:

http://162.242.228.174/embedded_files/js_in_pdfs.tar.bz2

> entire structure of a document and extract any or all pieces from it.

Within reason(tm), that _is_ the goal of Tika.  The focus is text, but we try 
to maintain some structural information where we can, e.g. bold/italic/lists 
and paragraph boundaries in MSOffice and related formats.  We do not do full 
stylistic extraction (font name, size, etc), but the general formatting 
components that apply across formats, we try to maintain.



From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Wednesday, November 2, 2016 3:30 AM
To: user@tika.apache.org
Subject: PDF Processing

I am wondering if I am using Tika for purposes it was not aimed at. I am 
beginning to think that its main aim is to extract text from documents, whereas 
I really want to get the entire structure of a document and extract any or all 
pieces from it. For instance, when parsing a PDF that has embedded streams, I 
want to be able to extract each embedded stream (for instance, JavaScript). 
PDFBox can do this, but such information does not turn up in a ContentHandler 
passed to Tika.
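
By "PDFBox can do this" I mean something roughly like the sketch below, written 
against the PDFBox 2.x API as I understand it (exact accessor names may differ 
between versions, and the file name is just illustrative); it pulls 
document-level JavaScript out directly:

import java.io.File;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionJavaScript;

public class DumpPdfJavaScript {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();

            // JavaScript attached as the document's OpenAction (runs on open).
            if (catalog.getOpenAction() instanceof PDActionJavaScript) {
                PDActionJavaScript js = (PDActionJavaScript) catalog.getOpenAction();
                System.out.println("OpenAction:\n" + js.getAction());
            }

            // JavaScript registered in the document-level name tree.
            PDDocumentNameDictionary names = catalog.getNames();
            if (names != null && names.getJavaScript() != null) {
                Map<String, PDActionJavaScript> scripts = names.getJavaScript().getNames();
                if (scripts != null) {
                    for (Map.Entry<String, PDActionJavaScript> e : scripts.entrySet()) {
                        System.out.println(e.getKey() + ":\n" + e.getValue().getAction());
                    }
                }
            }
        }
    }
}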

If I want to do more than get just the text, should I really use the underlying 
parsers directly and not try to abstract them using Tika?

Many thanks,

Jim
