[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534262#comment-15534262
 ] 

Tim Allison commented on TIKA-2069:
---

Yes, sorry.  I opened TIKA-2104 to track this.

> Extract Macro text from Microsoft Office documents
> --
>
> Key: TIKA-2069
> URL: https://issues.apache.org/jira/browse/TIKA-2069
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, parser
>Affects Versions: 1.13
> Environment: RHEL 5.x, Apache Tomcat
>Reporter: Jeff Swindle
>  Labels: features
> Fix For: 2.0, 1.14
>
> Attachments: excel-macro.PNG, test-macro-doc.docm, 
> test-macro-doc.docm-tika-app-output.txt, 
> tika-app-1.14-20160928.19-109-test-macro-doc.docm.output, 
> tika-app-1.14-20160928.19-109-xlsmacro.xlsm.output, word-macro.PNG, 
> xlsmacro.xlsm, xlsmacro.xlsm.tika-app-output.txt
>
>
> Tika supports macro-enabled Microsoft Office documents by extracting metadata 
> and contents, however, macros within the document are not in the metadata or 
> content output.
> Desire is to have the macro text extracted also.
> Info regarding macro extraction: http://www.decalage.info/vba_tools



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-29 Thread Jeff Swindle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534259#comment-15534259
 ] 

Jeff Swindle commented on TIKA-2069:


Thanks.



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-29 Thread Jeff Swindle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534253#comment-15534253
 ] 

Jeff Swindle commented on TIKA-2069:


[~talli...@apache.org] I tried a tika-app 1.14 snapshot and didn't get the 
expected output for the test-macro-doc.docm file. I also tried another internal 
file and didn't see macro output.
Executing against xlsmacro.xlsm provided expected output of macro content.

I've attached the output from tika-app against xlsmacro.xlsm and 
test-macro-doc.docm.
Here are the commands I used:
# java -jar tika-app-1.14-20160928.19-109.jar test-macro-doc.docm > tika-app-1.14-20160928.19-109-test-macro-doc.docm.output
# java -jar tika-app-1.14-20160928.19-109.jar xlsmacro.xlsm > tika-app-1.14-20160928.19-109-xlsmacro.xlsm.output
Is there something specific I need to add to the command to extract the macro 
in the docm?




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534244#comment-15534244
 ] 

Tim Allison commented on TIKA-2069:
---

Right.  Sorry. Unfortunately, there's a bug in POI that prevents reading the 
macro in your docm file.  See 
[above|https://issues.apache.org/jira/browse/TIKA-2069?focusedCommentId=15510207&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15510207].

There's still some work to do on the POI side.



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513354#comment-15513354
 ] 

Hudson commented on TIKA-2069:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1105 (See [https://builds.apache.org/job/Tika-trunk/1105/])
TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target (tallison: rev 8a45f67a2e3641b08fcfb5e2283e4a43ff86f3cd)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513351#comment-15513351
 ] 

Hudson commented on TIKA-2069:
--

SUCCESS: Integrated in Jenkins build tika-2.x #147 (See [https://builds.apache.org/job/tika-2.x/147/])
TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target (tallison: rev d543378a88aeca574d15ab31d13b6316fb938f7f)
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513279#comment-15513279
 ] 

Hudson commented on TIKA-2069:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #51 (See [https://builds.apache.org/job/tika-2.x-windows/51/])
TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target (tallison: rev d543378a88aeca574d15ab31d13b6316fb938f7f)
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511987#comment-15511987
 ] 

Hudson commented on TIKA-2069:
--

ABORTED: Integrated in Jenkins build Tika-trunk #1104 (See [https://builds.apache.org/job/Tika-trunk/1104/])
TIKA-2069 -- extract macros from MSOffice docs (tallison: rev 2ae7206d9c99fb553314cff21bb155d4e6f06d12)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (add) tika-parsers/src/test/resources/test-documents/testWORD_macros.docm
* (add) tika-parsers/src/test/resources/test-documents/testPPT_macros.pptm
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/testWORD_macros.doc
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_macro.xlsm
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_macro.xls
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (add) tika-parsers/src/test/resources/test-documents/testPPT_macros.ppt




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511970#comment-15511970
 ] 

Hudson commented on TIKA-2069:
--

ABORTED: Integrated in Jenkins build tika-2.x #146 (See [https://builds.apache.org/job/tika-2.x/146/])
TIKA-2069 -- extract macros from MSOffice files. (tallison: rev 66f433471f59d5af931f0a49bf8bddd33a7f27a7)
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.docm
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xls
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.doc
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.pptm
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) CHANGES.txt
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xlsm
* (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.ppt




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511744#comment-15511744
 ] 

Hudson commented on TIKA-2069:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #50 (See [https://builds.apache.org/job/tika-2.x-windows/50/])
TIKA-2069 -- extract macros from MSOffice files. (tallison: rev 66f433471f59d5af931f0a49bf8bddd33a7f27a7)
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.doc
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xlsm
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.pptm
* (edit) CHANGES.txt
* (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xls
* (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.ppt
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.docm




[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510446#comment-15510446
 ] 

Tim Allison commented on TIKA-2069:
---

Currently, multiple macros are appended to one string in POI.

{noformat}
Attribute VB_Name = "NewMacros"
Sub Embolden()
Attribute Embolden.VB_Description = "This tests changing the selection to bold"
Attribute Embolden.VB_ProcData.VB_Invoke_Func = "Project.NewMacros.Embolden"
'
' Embolden Macro
'
'
Selection.Font.Bold = wdToggle
Selection.Font.BoldBi = wdToggle
End Sub

Sub Italicize()
Attribute Italicize.VB_Description = "This tests italicizing"
Attribute Italicize.VB_ProcData.VB_Invoke_Func = "Project.NewMacros.Italicize"
'
' Italicize Macro
'
'
Selection.Font.Italic = wdToggle
Selection.Font.ItalicBi = wdToggle
End Sub
{noformat}
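
Since each module currently comes back as one string, a consumer that wants the individual procedures has to split that string itself. The following is only a minimal sketch of one way to do so, assuming every procedure starts on a line beginning with "Sub " and ends on a line reading "End Sub" (Function and Property procedures are ignored). The class and method names (MacroSplitter, splitProcedures) are hypothetical and are not part of POI or Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of POI or Tika.
class MacroSplitter {

    /** Maps procedure name to procedure text, in source order. */
    static Map<String, String> splitProcedures(String moduleSource) {
        Map<String, String> procedures = new LinkedHashMap<>();
        String currentName = null;
        StringBuilder body = new StringBuilder();
        for (String line : moduleSource.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (currentName == null && trimmed.startsWith("Sub ")) {
                // "Sub Embolden()" -> "Embolden"
                int paren = trimmed.indexOf('(');
                currentName = (paren < 0
                        ? trimmed.substring(4)
                        : trimmed.substring(4, paren)).trim();
                body.setLength(0);
            }
            if (currentName != null) {
                // Inside a procedure: keep the line, close on "End Sub"
                body.append(line).append('\n');
                if (trimmed.equals("End Sub")) {
                    procedures.put(currentName, body.toString());
                    currentName = null;
                }
            }
        }
        return procedures;
    }
}
```

Applied to the module text above, this would yield two entries keyed "Embolden" and "Italicize"; module-level lines such as Attribute VB_Name are dropped.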



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510207#comment-15510207
 ] 

Tim Allison commented on TIKA-2069:
---

[~jeffswindle], I should point out that the VBAMacroReader is still relatively 
new in POI, and there are currently 3 open bugs, one triggered by the docm file 
that you submitted.

* [60158|https://bz.apache.org/bugzilla/show_bug.cgi?id=60158]
* [59830|https://bz.apache.org/bugzilla/show_bug.cgi?id=59830]
* [59858|https://bz.apache.org/bugzilla/show_bug.cgi?id=59858]

For now, we'll swallow the exceptions in Tika, but there's much more work to be 
done.  Patches to POI would be welcomed! :)
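
To illustrate what "swallow the exceptions" could look like, here is a minimal sketch that wraps a macro read so a failure yields an empty result instead of aborting the parse of the main document. It is an assumption about the general shape, not the code committed to Tika; SafeMacroExtraction and readMacrosQuietly are hypothetical names, and a plain Callable stands in for the POI reader.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper, not the actual Tika implementation.
class SafeMacroExtraction {

    /**
     * Runs the supplied macro reader and returns its result, or an
     * empty map if the reader throws or returns null.
     */
    static Map<String, String> readMacrosQuietly(
            Callable<Map<String, String>> reader) {
        try {
            Map<String, String> macros = reader.call();
            return macros == null
                    ? Collections.<String, String>emptyMap()
                    : macros;
        } catch (Exception e) {
            // Swallowed on purpose: a macro-reading bug should not
            // prevent extraction of the document's text and metadata.
            return Collections.<String, String>emptyMap();
        }
    }
}
```

The trade-off is that a document whose macros cannot be read looks the same as a document with no macros, which matches the behaviour reported later in this thread for the docm file.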



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Jeff Swindle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510157#comment-15510157
 ] 

Jeff Swindle commented on TIKA-2069:


For my purposes, the output shown is good. I need the macro text content 
primarily.

Thanks Tim!







[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509832#comment-15509832
 ] 

Tim Allison commented on TIKA-2069:
---

This reminds me that I need to commit TIKA-2047 so that the mime-type isn't 
overwritten.



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509825#comment-15509825
 ] 

Tim Allison commented on TIKA-2069:
---

I think I get it.  One challenge is that we're currently getting a 
{{Map<String, String>}} from POI, and there doesn't currently seem to be an 
obvious way to link metadata to the actual text.  On POI's test doc, 
with this code:
{noformat}
VBAMacroReader reader = new VBAMacroReader(fs);
// readMacros() maps each module name to its source text
for (Map.Entry<String, String> e : reader.readMacros().entrySet()) {
    // Tag each macro as an embedded MACRO resource of type text/x-vbasic
    Metadata m = new Metadata();
    m.set(Metadata.EMBEDDED_RESOURCE_TYPE,
            TikaCoreProperties.EmbeddedResourceType.MACRO.toString());
    m.set(Metadata.CONTENT_TYPE, "text/x-vbasic");
    // Use the caller's extractor if one was supplied, else the default
    EmbeddedDocumentExtractor ex =
            context.get(EmbeddedDocumentExtractor.class);
    if (ex == null) {
        ex = new ParsingEmbeddedDocumentExtractor(context);
    }
    if (ex.shouldParseEmbedded(m)) {
        ex.parseEmbedded(
                new ByteArrayInputStream(e.getValue().getBytes(StandardCharsets.UTF_8)),
                xhtml, m, true);
    }
}
{noformat}

we get: 

{noformat}
1: X-Parsed-By : org.apache.tika.parser.DefaultParser
1: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
1: embeddedResourceType : MACRO
1: Content-Encoding : windows-1252
1: X-TIKA:parse_time_millis : 27
1: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">

Attribute VB_Name = "Module1"
Sub TestMacro()
'
' TestMacro Macro
' This is a test macro
'

'
ActiveDocument.Paragraphs(1).Range.Text = "This is a macro word processing document"
End Sub

1: X-TIKA:embedded_resource_path : /embedded-1
1: Content-Type : text/plain; charset=windows-1252
2: X-Parsed-By : org.apache.tika.parser.DefaultParser
2: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
2: embeddedResourceType : MACRO
2: Content-Encoding : windows-1252
2: X-TIKA:parse_time_millis : 4
2: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">

Attribute VB_Name = "ThisDocument"
Attribute VB_Base = "1Normal.ThisDocument"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = True
Attribute VB_Exposed = True
Attribute VB_TemplateDerived = True
Attribute VB_Customizable = True

2: X-TIKA:embedded_resource_path : /embedded-2
2: Content-Type : text/plain; charset=windows-1252
{noformat}

Is this good enough for now?
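
For a consumer of output like the above, picking out the macro text could be as simple as filtering on the embeddedResourceType key. A minimal sketch follows, assuming each extracted document's metadata has been loaded into a plain Map<String, String>; the maps stand in for Tika Metadata objects, the key names are copied from the output above, and MacroFilter / macroContents are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical consumer-side helper; plain maps stand in for Tika Metadata.
class MacroFilter {

    /** Returns the X-TIKA:content of every entry tagged as a MACRO. */
    static List<String> macroContents(List<Map<String, String>> metadataList) {
        List<String> macros = new ArrayList<>();
        for (Map<String, String> m : metadataList) {
            if ("MACRO".equals(m.get("embeddedResourceType"))) {
                String content = m.get("X-TIKA:content");
                if (content != null) {
                    macros.add(content);
                }
            }
        }
        return macros;
    }
}
```

Entries without the MACRO tag, such as the container document itself, are skipped, so the result holds only macro source text.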



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504643#comment-15504643
 ] 

Tim Allison commented on TIKA-2069:
---

Just realized that we might want to handle extraction of Actions and/or 
JavaScript from PDFs in a similar way.  Should we open a new, related ticket 
if anyone has an interest?



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493924#comment-15493924
 ] 

Nick Burch commented on TIKA-2069:
--

Yes! If you wrote a VB Script, and zipped it up, it'd be a {{text/x-vbasic}} 
with no extra metadata. When you add a macro to an office doc, you get the 
macro text but also some metadata. We wouldn't need a parser for 
{{text/x-vbasic}}, only for {{application/vnd.ms-office.vbaProject}}, which 
would expose the embedded script text + metadata.



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493780#comment-15493780
 ] 

Tim Allison commented on TIKA-2069:
---

Makes sense, although I'd prefer to write one parser rather than two.  :)   
Would the {{application/vnd.ms-office.vbaProject}} ever have any content?  
Would its metadata be different from the vbscript?



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487662#comment-15487662
 ] 

Nick Burch commented on TIKA-2069:
--

I think the idea of a Macro is probably general enough across a range of file 
formats that we could add it as an embedded type.

However, there are actually two levels to an OOXML macro. The OOXML file contains a 
binary vba project bin file, and within that is the actual macro text + its 
properties. Maybe we should have the ooxml extractor first expose a 
{{application/vnd.ms-office.vbaProject}} embedded resource, then use a second 
parser which extracts the body of the macro vbscript as {{text/x-vbasic}}, with 
the other macro properties/attributes (name, sid, various boolean flags) as 
metadata?

eg {{application/vnd.ms-excel.sheet.macroenabled.12}} -> 
{{application/vnd.ms-office.vbaProject}} -> {{text/x-vbasic}} + metadata
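
The two-level nesting proposed above can be pictured with a small, self-contained model. This is only a sketch of the idea, not Tika code: the {{EmbeddedNode}} class and its fields are hypothetical names invented for illustration, and the metadata key {{name}} is taken from the properties listed in the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a parsed resource and the resources embedded in it.
class EmbeddedNode {
    final String mimeType;
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<EmbeddedNode> children = new ArrayList<>();

    EmbeddedNode(String mimeType) {
        this.mimeType = mimeType;
    }

    // Attaches a child resource and returns it so calls can be chained downward.
    EmbeddedNode add(EmbeddedNode child) {
        children.add(child);
        return child;
    }

    // Collects every mime-type path from the given node down to each leaf.
    static List<String> paths(EmbeddedNode node) {
        List<String> out = new ArrayList<>();
        collect(node, "", out);
        return out;
    }

    private static void collect(EmbeddedNode node, String prefix, List<String> out) {
        String path = prefix.isEmpty() ? node.mimeType : prefix + " -> " + node.mimeType;
        if (node.children.isEmpty()) {
            out.add(path);
            return;
        }
        for (EmbeddedNode child : node.children) {
            collect(child, path, out);
        }
    }
}
```

Building a workbook node, adding a vbaProject child, and adding a {{text/x-vbasic}} child under that yields a single path matching the chain in the comment, with the macro properties stored on the innermost node.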



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484929#comment-15484929
 ] 

Tim Allison commented on TIKA-2069:
---

Sounds good.  Thank you, [~gagravarr].  

Do we want to distinguish between an attached vba/text file and a macro?  
Perhaps add {{MACRO}} to {{TikaCoreProperties.EmbeddedResourceType}}?  Or, do 
we want to distinguish between the two by using a different mime type?  I think 
I'd prefer the former.
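
To make the two options concrete, here is a rough sketch of the first one: the mime type stays the same and a separate resource-type flag tells a macro apart from an attached script. The {{ResourceKind}} enum and {{EmbeddedScript}} class are hypothetical stand-ins written for this illustration; they are not the actual {{TikaCoreProperties.EmbeddedResourceType}} definition.

```java
// Hypothetical stand-in for an embedded-resource-type enum with a MACRO value added.
enum ResourceKind { ATTACHMENT, INLINE, MACRO }

// Hypothetical description of an embedded script: what it is (mime type)
// and how it relates to its container (kind).
final class EmbeddedScript {
    final String mimeType;
    final ResourceKind kind;

    EmbeddedScript(String mimeType, ResourceKind kind) {
        this.mimeType = mimeType;
        this.kind = kind;
    }

    // A consumer can filter on the kind without inventing a second mime type.
    boolean isMacro() {
        return kind == ResourceKind.MACRO;
    }
}
```

With this shape, an attached VBA text file and a macro can both report {{text/x-vbasic}} while still being distinguishable, which is the trade-off the comment weighs against using a different mime type.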



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484834#comment-15484834
 ] 

Nick Burch commented on TIKA-2069:
--

I think that, given both how big macros can get and how they logically fit with 
the document, exposing them as an embedded document might be best.

Mimetype-wise, some people seem to use {{application/x-vba}}, but the office 
content types file uses {{application/vnd.ms-office.vbaProject}}. Our own tika 
mimetypes file defines {{text/x-vbasic}}. I'd lean towards one of the latter two.



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477438#comment-15477438
 ] 

Tim Allison commented on TIKA-2069:
---

Once we upgrade to POI 3.15-beta3, this _should_ be fairly straightforward, 
thanks to the work of others on POI.  We may want to copy/modify the "find the 
vba.bin file" logic at the Tika level for OOXML files, so that we can pass an npoifs into 
VBAMacroReader from an open OOXML/zip file.
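
As a rough illustration of the "find the vba.bin file" step, the sketch below scans an OOXML (zip) stream for a VBA project part using only the JDK. It assumes the part is named {{vbaProject.bin}} (for example {{word/vbaProject.bin}} or {{xl/vbaProject.bin}}); matching on the name is a heuristic chosen for this example, since the real location is defined by the package relationships. The class and method names are hypothetical, and handing the bytes on to POI is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: locates the VBA project part inside an OOXML zip stream.
class VbaProjectLocator {

    // Assumption: the part is called vbaProject.bin, compared case-insensitively.
    static boolean isVbaProjectEntry(String entryName) {
        String lower = entryName.toLowerCase(Locale.ROOT);
        return lower.equals("vbaproject.bin") || lower.endsWith("/vbaproject.bin");
    }

    // Returns the raw bytes of the first matching part, or empty if none exists.
    // Note: this closes the supplied stream when it is done.
    static Optional<byte[]> findVbaProject(InputStream ooxml) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !isVbaProjectEntry(entry.getName())) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                // read() returns -1 at the end of the current zip entry.
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return Optional.of(out.toByteArray());
            }
        }
        return Optional.empty();
    }
}
```

The returned bytes are the binary project container; parsing them into macro text is the part the comment expects POI to handle.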



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477415#comment-15477415
 ] 

Tim Allison commented on TIKA-2069:
---

Thanks to  [~blagerw...@gmail.com], [~gagravarr] and [~onealj] among others, it 
looks like this is all nicely handled by POI now as of 
[bug-52949|https://bz.apache.org/bugzilla/show_bug.cgi?id=52949].



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-09 Thread Jeff Swindle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477255#comment-15477255
 ] 

Jeff Swindle commented on TIKA-2069:


OOXML would be great.
It should not be limited to Word and Excel, though. We need PowerPoint also. 



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477188#comment-15477188
 ] 

Tim Allison commented on TIKA-2069:
---

Thank you!  

This question is for [~jeffswindle] and fellow Tika devs (esp. [~rgauss]): 
should we add macros as metadata items or inline them in the content via  
elements?

I'd prefer a metadata item for each macro, but could go either way.

[~jeffswindle], the title of this issue is for msoffice...is it ok to limit 
this to ooxml?  Do you need this for the older doc and xls?



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-09 Thread Jeff Swindle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477176#comment-15477176
 ] 

Jeff Swindle commented on TIKA-2069:


The desire is for Tika to extract macro text from Microsoft Office files, as it 
does metadata and content.
The need is to search for specific signatures that may be present in macros; if 
present, they should be removed before the document is distributed. Tika would facilitate 
the search.
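
The search described here could look roughly like the following once macro text is available. This is a minimal sketch under stated assumptions: the macro text is assumed to arrive as a map from module name to source, the signatures are plain substrings matched case-insensitively, and {{MacroSignatureScanner}} is a hypothetical name, not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: reports which signatures occur in which macro modules.
class MacroSignatureScanner {

    // macros: module name -> macro source text (assumed input shape).
    // Returns only the modules with at least one hit, sorted by module name;
    // signatures are listed in the order they were supplied.
    static Map<String, List<String>> scan(Map<String, String> macros, List<String> signatures) {
        Map<String, List<String>> hits = new TreeMap<>();
        for (Map.Entry<String, String> macro : macros.entrySet()) {
            String source = macro.getValue().toLowerCase(Locale.ROOT);
            List<String> found = new ArrayList<>();
            for (String signature : signatures) {
                if (source.contains(signature.toLowerCase(Locale.ROOT))) {
                    found.add(signature);
                }
            }
            if (!found.isEmpty()) {
                hits.put(macro.getKey(), found);
            }
        }
        return hits;
    }
}
```

Plain substring matching is the simplest possible choice; real signature checks may need regular expressions or tokenization, which this sketch does not attempt.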



[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475422#comment-15475422
 ] 

Tim Allison commented on TIKA-2069:
---

[~jeffswindle], thank you for opening this.  Would you be able to share some 
example test documents and expected output?  Bonus points for a unit test or 
two... :)  
