Hi, I need some advice regarding embedded OLE extraction from Microsoft documents such as Word and Excel.
Is there any way to exclude the embedded OLE information that is returned when calling wordExtractor.getText()?

For example, I get the following output when I call Apache POI WordExtractor's getText() on a test Word document that has other documents embedded in it:

Extracted Text-------->
We have an excel sheet embedded in this doc. Test test test test. Blah blah.blah EMBED Excel.Sheet.8 EMBED PowerPoint.Show.8 EMBED Word.Document.8 \s EMBED AcroExch.Document.7

I don't want the 'EMBED' entries shown above to appear in the extracted text. Is there any way to achieve this with the existing Apache POI HWPF API?

Thanks & Regards,
Som Ranjan
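
P.S. To make the desired result concrete, here is a rough sketch of a post-processing workaround that strips the 'EMBED' entries from the string returned by getText(). It uses only java.util.regex and does not call POI at all. The class name EmbedFieldFilter and the method stripEmbedFields are hypothetical names made up for this illustration, and the pattern simply assumes that each entry looks like "EMBED <ProgID>" followed by optional backslash switches such as "\s", as in the output above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Apache POI): removes "EMBED <ProgID>"
// entries from text that has already been extracted.
public class EmbedFieldFilter {

    // Assumed shape of an entry: the word EMBED, a ProgID such as
    // Excel.Sheet.8, then zero or more switches such as "\s".
    private static final Pattern EMBED_FIELD = Pattern.compile(
            "\\s*\\bEMBED\\s+[A-Za-z0-9_.]+(?:\\s+\\\\[A-Za-z*]+)*");

    public static String stripEmbedFields(String text) {
        // Drop every matching entry, then trim leftover edge whitespace.
        return EMBED_FIELD.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String extracted = "We have an excel sheet embedded in this doc. "
                + "Test test test test. Blah blah.blah "
                + "EMBED Excel.Sheet.8 EMBED PowerPoint.Show.8 "
                + "EMBED Word.Document.8 \\s EMBED AcroExch.Document.7";
        System.out.println(stripEmbedFields(extracted));
    }
}
```

The obvious drawback is that ordinary body text which happens to contain the word "EMBED" followed by another word would also be removed, which is why a way to do this inside HWPF itself would be preferable.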
