This is deeply embarrassing: I completely forgot about a piece of code that Christian and I worked on some months ago. It was designed to extract the merge fields from a Word document, and it may provide a template that you could use to address this problem, assuming it still exists, that is.
You can download the code from here: http://rapidshare.com/files/208737669/MailMerge.rar

Just copy and paste the address into a browser, choose the free download option and, once you have the archive, unpack it into a folder somewhere.

The first thing to do is have a look at the FieldDelimiters class. As Nick's last reply suggested, Word uses delimiters placed within the file's contents to indicate that what follows is not text but something 'special'. Christian and I used the POIFSViewer class that is part of POI to identify the delimiters that surround a field, and you can do something similar to identify those that surround the OLE insertion. As you will see, I chose to use the numeric value of the 'special' characters in my searches, whilst I think that the stripFields() method probably uses their hex value.

Next, have a look at the MergeMasterCheck class, because this is where the action occurs. It uses the field delimiters to identify and extract the merge fields from the document's text. I guess that you want to do the reverse (get at the text and leave everything else behind), but it should be easy enough to modify the existing code to accomplish this.

Sorry it took so long for me to remember this work; I hope that it can help now. It had been our intention to submit it to the developers of POI, but I was uncertain about it and simply decided not to, which was silly in retrospect, I suppose. If you want to discuss it further, just drop me an email.

Som Satpathy wrote:
>
> Yes I tried using stripFields(). It strips some part of the unwanted text
> (with the EMBED tag), but some part still remains.
>
> I suspect the problem might be with the encoding format of the "embedded
> object strings" (the ones starting with EMBED tag and ending with embedded
> doc's progID).
>
> The stripFields() does not strip all of the encoded text.
>
> Regards
> Som Ranjan
>
>
> On Tue, Apr 28, 2009 at 2:44 PM, Nick Burch <[email protected]> wrote:
>
>> On Tue, 28 Apr 2009, Som Satpathy wrote:
>>
>>> Is there any way by which we can *exclude* embedded ole information
>>> which we get on calling *wordExtractor.getText() ?
>>>
>>
>> Did you try stripFields?
>>
>> http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)
>>
>> Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

--
View this message in context: http://www.nabble.com/Advice-needed-regarding-embedded-ole-extraction-tp23269803p23457846.html
Sent from the POI - User mailing list archive at Nabble.com.
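
For anyone who wants a starting point without downloading the archive, here is a rough sketch of the "reverse" approach mentioned above: walk the extracted text and drop whatever sits between the field delimiters. This is not the code from MailMerge.rar. The class and method names (FieldStripSketch, dropFields) are made up for illustration, and it assumes the delimiters are the usual Word field marker characters 0x13 (field begin), 0x14 (separator) and 0x15 (field end); it is worth confirming with POIFSViewer that the EMBED entries in your documents really are wrapped this way.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI and not the MailMerge.rar code.
final class FieldStripSketch {
    // Assumed field marker characters in the text of a binary Word document.
    static final char FIELD_BEGIN = '\u0013';     // decimal 19
    static final char FIELD_SEPARATOR = '\u0014'; // decimal 20
    static final char FIELD_END = '\u0015';       // decimal 21

    /**
     * Returns the text with field instructions removed. When keepResults is
     * true, the result part of each field (between separator and end) is
     * kept; otherwise whole fields are dropped. Nested fields are handled.
     */
    static String dropFields(String text, boolean keepResults) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while inside its instruction part,
        // false once its separator has been seen (the result part).
        Deque<Boolean> openFields = new ArrayDeque<>();
        int instructionDepth = 0; // open fields still in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                instructionDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    instructionDepth--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; stray end markers are ignored.
                if (!openFields.isEmpty() && openFields.pop()) {
                    instructionDepth--;
                }
            } else if (openFields.isEmpty()
                    || (keepResults && instructionDepth == 0)) {
                // Ordinary text, or a field result we were asked to keep.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

You would call it on the string returned by wordExtractor.getText(), passing false to discard fields entirely. If some of the EMBED text still survives, that would suggest those strings are not enclosed by these three markers, and other delimiters would need to be identified and added.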
