[ https://issues.apache.org/jira/browse/TIKA-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121546#comment-16121546 ]
Tim Allison commented on TIKA-2441:
-----------------------------------

Thank you for opening this issue. I'll look into it at the POI level.

As a workaround, you can switch to our experimental SAX-based parser for DOCX, which does work on this document.

In your tika-config file, do something like this:
{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
{noformat}

Or if you want to use SolrJ:
{noformat}
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;

ParseContext pc = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
pc.set(OfficeParserConfig.class, officeParserConfig);
// ...then pass pc as the ParseContext argument in your call to parse,
// e.g. parser.parse(stream, handler, metadata, pc)
{noformat}

> Unable to extract text present in a table inside a textbox in MS Word
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2441
>                 URL: https://issues.apache.org/jira/browse/TIKA-2441
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>         Environment: Windows, Linux, Apache Tika 1.15 used with Apache
> Solr-6.6.0
>            Reporter: Amit Humnabadkar
>              Labels: sax_docx_fixes
>         Attachments: doc001.zip
>
>
> Hello,
> I am using Tika-1.15 with Solr-6.6.0 for indexing and searching. This setup
> fails to index text present in a table inside a textbox in a Word document.
> An MS Word document contains two words:
> 1. Germany - present in a table inside a textbox
> 2. Africa - present in a textbox
> Germany is not getting indexed while Africa gets indexed successfully. It looks
> like Tika fails to extract the content present in a table inside a textbox.
> Please have a look.
> Thanks,
> Amit Humnabadkar
> [^doc001.zip]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
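
For readers unfamiliar with what a SAX-based text extractor does, the following is a minimal sketch of the general idea, using only the JDK. It is not Tika's implementation: the class name `TextRunCollector`, the helper `extractRuns`, and the heavily simplified WordprocessingML fragment are all hypothetical and chosen for illustration. The handler collects every `w:t` text run as the parser streams through the XML, so the nesting depth of a run (for example, a table inside textbox content) does not affect whether it is seen. Whether this mirrors the behaviour of Tika's experimental extractor is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Tika or POI.
public class TextRunCollector extends DefaultHandler {
    // Main WordprocessingML namespace used by DOCX document parts.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private final List<String> runs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a w:t element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (W_NS.equals(uri) && "t".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && W_NS.equals(uri) && "t".equals(localName)) {
            runs.add(current.toString());
            current = null;
        }
    }

    // Parses an XML string and returns the text of every w:t run in document order.
    public static List<String> extractRuns(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        SAXParser parser = factory.newSAXParser();
        TextRunCollector handler = new TextRunCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.runs;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for word/document.xml; a real DOCX has far more markup.
        String xml =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:txbxContent><w:tbl><w:tr><w:tc>"
            + "<w:p><w:r><w:t>Germany</w:t></w:r></w:p>"
            + "</w:tc></w:tr></w:tbl></w:txbxContent></w:r></w:p>"
            + "<w:p><w:r><w:txbxContent>"
            + "<w:p><w:r><w:t>Africa</w:t></w:r></w:p>"
            + "</w:txbxContent></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.println(extractRuns(xml)); // prints [Germany, Africa]
    }
}
```

With a real document, the XML would come from the `word/document.xml` entry of the DOCX zip package rather than from a string literal.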