Re: Advice needed regarding embedded ole extraction

MSB Fri, 08 May 2009 23:57:36 -0700

This is deeply embarrassing because I completely forgot about a piece of
code that Christian and I worked on some months ago. It was designed to
extract the merge fields from a Word document, and it may provide a template
that you could use to address this problem - assuming it still exists, that
is.

You can download the code from here:

http://rapidshare.com/files/208737669/MailMerge.rar

Just copy and paste the address into a browser, choose the free download
option and, once you have the archive, unpack it into a folder somewhere.
The first thing to do is have a look at the FieldDelimiters class. As Nick's
last reply suggested, Word uses delimiters placed within the file's contents
to indicate that what follows is not text but something 'special'. Christian
and I used the POIFSViewer class that is part of POI to identify the
delimiters that surrounded a field, and you can do something similar to
identify those that surround the OLE insertion. As you will see, I chose to
use the decimal value of the 'special' characters in my searches, whilst I
think that the stripFields() method probably uses their hex value.
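
To make the delimiter idea concrete, here is a minimal sketch of how fields
can be located in the extracted text. It is not the FieldDelimiters class
from the archive; the class and method names are hypothetical. It assumes
the usual Word field marks: character 19 (0x13) opens a field, 20 (0x14)
separates the field instruction from its result, and 21 (0x15) closes it.
The delimiters around an OLE insertion in your documents may differ, so
check them in a viewer first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of POI or the MailMerge archive.
final class FieldMarkers {
    // Control characters assumed to surround a field in the text stream.
    static final char FIELD_BEGIN = 19;     // 0x13
    static final char FIELD_SEPARATOR = 20; // 0x14
    static final char FIELD_END = 21;       // 0x15

    private FieldMarkers() {}

    /**
     * Returns the instruction part of every top-level field, for example
     * " MERGEFIELD Name ". Nested fields are left inside the text of the
     * field that contains them.
     */
    static List<String> fieldInstructions(String text) {
        List<String> instructions = new ArrayList<>();
        int depth = 0;                 // number of fields currently open
        int start = -1;                // index just after the outermost begin mark
        boolean inInstruction = false; // true until the outermost separator or end mark
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                if (depth == 0) {
                    start = i + 1;
                    inInstruction = true;
                }
                depth++;
            } else if (c == FIELD_SEPARATOR && depth == 1 && inInstruction) {
                instructions.add(text.substring(start, i));
                inInstruction = false;
            } else if (c == FIELD_END && depth > 0) {
                if (depth == 1 && inInstruction) {
                    // Field without a result part: the instruction runs to the end mark.
                    instructions.add(text.substring(start, i));
                    inInstruction = false;
                }
                depth--;
            }
        }
        return instructions;
    }
}
```

Passing the string returned by wordExtractor.getText() to
FieldMarkers.fieldInstructions() should list the instruction of each field
found, which is a quick way to see whether the EMBED entries really are
wrapped in these delimiters.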

Next have a look at the MergeMasterCheck class because this is where the
action occurs. It uses the field delimiters to identify and extract the
merge fields from the document's text. I guess that you want to do the
reverse - get at the text and leave everything else behind - but it should
be easy enough to modify the existing code to accomplish this.
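
A rough sketch of that reverse operation might look like the following. It
is not the MergeMasterCheck code and it is not how POI's stripFields() is
implemented; the class and method names are hypothetical, and it assumes
the same three field marks (19, 20 and 21) surround the embedded object
strings. The keepResults flag chooses whether the displayed result of a
field is kept or the whole field is dropped.

```java
// Hypothetical helper for illustration; not part of POI or the MailMerge archive.
final class FieldStripper {
    // Control characters assumed to surround a field in the text stream.
    static final char FIELD_BEGIN = 19;     // 0x13
    static final char FIELD_SEPARATOR = 20; // 0x14
    static final char FIELD_END = 21;       // 0x15

    private FieldStripper() {}

    /**
     * Returns the text with field delimiters and field instructions removed.
     * When keepResults is true, the result text of each top-level field is
     * kept; anything inside a nested field is always dropped. Delimiter
     * characters that appear outside any field are left untouched.
     */
    static String strip(String text, boolean keepResults) {
        StringBuilder out = new StringBuilder(text.length());
        int depth = 0;            // number of fields currently open
        boolean inResult = false; // true after the separator of the outermost field
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                depth++;
            } else if (c == FIELD_SEPARATOR && depth > 0) {
                if (depth == 1) {
                    inResult = true;
                }
            } else if (c == FIELD_END && depth > 0) {
                depth--;
                if (depth == 0) {
                    inResult = false;
                }
            } else if (depth == 0 || (keepResults && depth == 1 && inResult)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling FieldStripper.strip(text, false) on the extracted text drops each
field completely, which is probably closer to what you want for the EMBED
entries, whereas passing true keeps whatever the field displays in the
document.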

Sorry it took so long for me to remember about this work and I hope that it
can help now. It had been our intention to submit it to the developers of
POI but I was uncertain about it and simply decided not to - silly in
retrospect I suppose. If you want to discuss it further, just drop me an
email.

Som Satpathy wrote:
> 
> Yes I tried using stripFields(). It strips some part of the unwanted text
> (with the EMBED tag), but some part still remains.
> 
> I suspect the problem might be with the encoding format of the "embedded
> object strings" (the ones starting with EMBED tag and ending with embedded
> doc's progID).
> 
> The stripFields() does not strip all of the encoded text.
> 
> 
> Regards
> Som Ranjan
> 
> 
> On Tue, Apr 28, 2009 at 2:44 PM, Nick Burch <[email protected]> wrote:
> 
>> On Tue, 28 Apr 2009, Som Satpathy wrote:
>>
>>> Is there any way by which we can *exclude* embedded ole information
>>> which we get on calling *wordExtractor.getText()*?
>>>
>>
>> Did you try stripFields?
>>
>> http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)
>>
>> Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Advice-needed-regarding-embedded-ole-extraction-tp23269803p23457846.html
Sent from the POI - User mailing list archive at Nabble.com.
