RE: Macro enabled Office documents - extract Macros

Jim Idle Wed, 02 Nov 2016 19:09:13 -0700

Thanks for the information. I will try the 1.14 snapshot and look forward to 
its release. I will also incorporate the latest POI. I misspoke when I 
mentioned contributors to POI: I had read only that the older Word format 
processor has no maintainer, but took that to mean the whole Office parser suite.


I will prepare and submit a JIRA for the link issue shortly. I may fix it 
unless the learning curve is too long. I am also happy to contribute to any 
open source project, especially if I have a need to use it :)

The RecursiveParserWrapper does sound like exactly what I want structurally - I 
read this example and breezed right past that piece for some reason.

Cheers,

Jim

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, November 2, 2016 18:54
To: user@tika.apache.org
Subject: RE: Macro enabled Office documents - extract Macros

>According to TIKA-2069, extracting macros from Office documents will now be 
>enabled on Tika 1.14. Any plans to release that soon?

Y, probably at the end of this week.  But...we made quite a few improvements to 
macro extraction in POI, and those won't get folded in until we upgrade POI and 
release Tika 1.15.


>Otherwise, is there a snapshot repo I can refer to in my Maven project such 
>that I can use the latest and greatest?
https://repository.apache.org/content/repositories/snapshots/
However, that doesn't integrate the snapshot version of POI.  To integrate the 
latest version of POI, see: 
https://wiki.apache.org/tika/MSOfficeParsers

>Also, what about POI? It seems to have no maintainer these days - is anyone 
>here actively patching it for bugs, or will I need to do that myself as I find 
>them?
Uh, there is actually quite a bit of activity on POI.  However, bug reports 
with patches are the fastest way to improve POI.


>Related to this, I see that if I embed a link into a Word document and then 
>parse it, I am given a startElement event for the 'a' element, but looking at 
>the attributes supplied to the call, I don't see the target for the link, 
>though the next characters event does give me the text of the link. How can I 
>extract the target?
>
>What I get is one attribute:
>
>Attr(0).localName = name
>Attr(0).QName = name
>Attr(0).Type = CDATA
>Attr(0).URI =
>Attr(0).Value = _GoBack
>
>I can't quite see why the value for "name" would be "_GoBack" unless there is 
>something I am missing about the format of links in (in this case) a Word 
>.docm document. I believe that this has to do with supplying display text for 
>the link rather than just putting in a direct http reference (where the 
>display text is the link). When I add just the href target without any display 
>text, I do get the target value in the href attribute (which is not given when 
>the link is covered by text).

This might be worth opening an issue on our Jira.  Please supply an example 
document and the expected output and/or a unit test.
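
For a report like this, a small harness that records every attribute passed to 
startElement for 'a' elements may help pin down the expected output. The sketch 
below uses only the JDK's SAX classes and does not call Tika at all. The class 
name LinkAttributeDump, the method linkAttributes, and the sample XHTML string 
are made up for illustration. The sample is hand-written, not real Tika output, 
and assumes the "_GoBack" value arrives on its own 'a' element separate from 
the link.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Tika or POI.
public class LinkAttributeDump {

    /** Records every attribute seen on 'a' elements as "name=value". */
    static class LinkHandler extends DefaultHandler {
        final List<String> seen = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            // localName is empty when the parser is not namespace-aware.
            String element = localName.isEmpty() ? qName : localName;
            if (!"a".equals(element)) {
                return;
            }
            for (int i = 0; i < atts.getLength(); i++) {
                String name = atts.getLocalName(i).isEmpty()
                        ? atts.getQName(i) : atts.getLocalName(i);
                seen.add(name + "=" + atts.getValue(i));
            }
        }
    }

    /** Parses an XHTML fragment and returns the attributes found on 'a' elements. */
    static List<String> linkAttributes(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        LinkHandler handler = new LinkHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: an anchor carrying only a name, then a link with an href.
        String sample = "<p><a name=\"_GoBack\"/>"
                + "<a href=\"http://example.com/\">display text</a></p>";
        for (String attr : linkAttributes(sample)) {
            System.out.println(attr);
        }
    }
}
```

The same LinkHandler could be passed as the ContentHandler to a Tika parse of 
the example document, which would show whether an href attribute is ever 
delivered for the link that has display text.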



>Finally, when dealing with composite documents and archives such as zip, 
>should I handle these with a ParsingEmbeddedDocumentExtractor and create a new 
>handler for each embedded document? I don't need to extract the embedded docs 
>as per the example of this, just parse them as new documents and create my own 
>information structure, which I will track myself. This seems to be the way to 
>do it. I can experiment and work all this out, but if anyone has any pointers 
>or links to examples other than the ones that come with the source, then I 
>would be grateful.

If you want to treat embedded documents as their own documents, you might want 
to look into the RecursiveParserWrapper. It returns a list of Metadata objects 
per input document.  The first item contains the "container" document, and 
then there's a separate Metadata object for each embedded document.  The 
content is stored in "X-TIKA:content" IIRC.

For an example, see: 
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
