Hi

I need some advice on embedded OLE extraction from Microsoft documents
such as Word and Excel.

Is there any way to exclude the embedded OLE information that is returned
when calling wordExtractor.getText()?

For example, this is the output I get when I call Apache POI
WordExtractor's getText() on a test Word document that contains other
embedded documents:


Extracted Text--------> We have an excel sheet embedded in this doc. Test
test test test. Blah blah.blah



EMBED Excel.Sheet.8

EMBED PowerPoint.Show.8
EMBED Word.Document.8 \s



EMBED AcroExch.Document.7


I don't want the lines with the 'EMBED' tag shown above. Is there any way
to exclude them using the existing Apache POI HWPF API?
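
As a stopgap, the extracted string could be post-processed by hand. The
following is only a minimal sketch in plain Java and uses no POI calls: the
EmbedFilter class and the stripEmbedLines method are hypothetical names, not
part of Apache POI, and it assumes every EMBED field code sits on its own
line, as in the output above. A way to do this within POI itself would still
be preferable.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Apache POI.
public class EmbedFilter {

    // Drops every line whose trimmed content starts with "EMBED ".
    // Assumption: each embedded-object field code occupies its own line.
    // String.trim() also removes control characters below U+0020, so any
    // Word field markers surrounding the code do not prevent a match.
    public static String stripEmbedLines(String text) {
        return Arrays.stream(text.split("\\r?\\n", -1))
                .filter(line -> !line.trim().startsWith("EMBED "))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by wordExtractor.getText()
        String extracted = "Test test test. Blah blah.blah\n"
                + "\n"
                + "EMBED Excel.Sheet.8\n"
                + "\n"
                + "EMBED Word.Document.8 \\s\n";
        System.out.println(stripEmbedLines(extracted));
    }
}
```

This keeps the blank lines that surrounded the removed entries, and it would
also drop any genuine body text line that happens to begin with "EMBED ".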


Thanks & Regards
Som Ranjan
