[ https://issues.apache.org/jira/browse/TIKA-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506438#comment-15506438 ]
Tim Allison edited comment on TIKA-2077 at 9/20/16 2:12 PM:
------------------------------------------------------------

In the xml, the AAAAAAA... exists as a paragraph within the text box:
{noformat}
<w:p>
  <w:pPr>
    <w:pStyle w:val="FrameContents"/>
    <w:rPr></w:rPr>
  </w:pPr>
  <w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
{noformat}

Is there any signal that we should skip that section?

When I save the file as text, the entire text box disappears. When I save it as html, the AAAAA... appears:
{noformat}
<span style='mso-ignore:vglayout;position: relative;z-index:251657728;left:-98px;top:47px;width:1795px;height:130px'><img width=718 height=33 src="TestData_files/image001.gif" alt="Text Box: TEST: AAAAAAAAAAAAAA " v:shapes="_x0000_s1026"></span>
{noformat}

was (Author: talli...@mitre.org):
In the xml, there AAAAAAA... exists as a paragraph within the text box:
{noformat}
<w:p>
  <w:pPr>
    <w:pStyle w:val="FrameContents"/>
    <w:rPr></w:rPr>
  </w:pPr>
  <w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
{noformat}

Is there any signal that we should skip that section?

When I save the file as text, the entire text box disappears. When I save it as html, the AAAAA... appears:
{noformat}
<span style='mso-ignore:vglayout;position: relative;z-index:251657728;left:-98px;top:47px;width:1795px;height:130px'><img width=718 height=33 src="TestData_files/image001.gif" alt="Text Box: TEST: AAAAAAAAAAAAAA " v:shapes="_x0000_s1026"></span>
{noformat}

> Special character extracted as AAAAAAAA in docx file extraction
> ---------------------------------------------------------------
>
>                 Key: TIKA-2077
>                 URL: https://issues.apache.org/jira/browse/TIKA-2077
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Akash Sudhakar
>         Attachments: TestData.docx
>
>
> During docx file extraction using tika 1.13, special character is extracted
> as AAAAAAAA.
> How to avoid this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
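
One way to look for a "skip" signal like the one asked about in the comment is to list each paragraph's style next to its text. The following is a minimal sketch using only the Python standard library, not Tika's own code. The `w:body` wrapper and the names `FRAGMENT` and `paragraphs_with_style` are assumptions added so that the quoted fragment parses on its own; in a real .docx the XML would come from `word/document.xml`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The fragment quoted in the comment, wrapped in a root element that declares
# the namespace. The wrapper is an assumption made for this sketch.
FRAGMENT = f"""
<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:pPr>
      <w:pStyle w:val="FrameContents"/>
      <w:rPr></w:rPr>
    </w:pPr>
    <w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
  </w:p>
</w:body>
"""


def paragraphs_with_style(xml_text):
    """Return a (style, text) pair per w:p; style is None if no w:pStyle is set."""
    root = ET.fromstring(xml_text)
    results = []
    for p in root.iter(f"{{{W_NS}}}p"):
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W_NS}}}val") if style_el is not None else None
        # Concatenate every w:t text node found under this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        results.append((style, text))
    return results


if __name__ == "__main__":
    for style, text in paragraphs_with_style(FRAGMENT):
        print(style, text)
```

The sketch only surfaces the style name (`FrameContents` here) alongside the text. Whether that style, or anything else in the paragraph, is a reliable reason to skip the text remains the open question from the comment.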