Over the last day or so, I have had the opportunity to dig around a little.
First, I made myself a test document by embedding one Word document into
another, using the Insert...Object...Create Object From File menu options
to insert the EMBED field into my test document.

Next, I used WordExtractor to recover the contents of the document and
found that the EMBED field was returned. I then called the stripFields()
method and it worked as Nick suggested it should; the EMBED field was
removed from the paragraph text. It seems, therefore, as though there is
something different about the EMBED fields in your document.

When we were working on 'our' code, Christian and I used a very simple piece
of code to look at the structure of the fields:

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class FieldDump {
    public static void main(String[] args) throws Exception {
        StringBuilder charString = new StringBuilder();
        StringBuilder intString = new StringBuilder();
        StringBuilder hexString = new StringBuilder();

        // try-with-resources closes the stream even if extraction fails
        try (FileInputStream fis = new FileInputStream(
                new File("C:\\temp\\embedded.doc"))) {
            WordExtractor we = new WordExtractor(fis);
            for (String item : we.getParagraphText()) {
                for (char aChar : item.toCharArray()) {
                    // Pad every entry to seven columns so the rows line up
                    charString.append(String.format("%-7s", aChar));
                    intString.append(String.format("%-7d", (int) aChar));
                    // Four zero-padded hex digits in Java escape notation
                    hexString.append(String.format("\\u%04x ", (int) aChar));
                }
            }
        }

        System.out.println("Characters:     [" + charString + " ]");
        System.out.println("Numeric Values: [" + intString + " ]");
        System.out.println("Hex Values:     [" + hexString + " ]");
    }
}

Running that against my test file showed us that the fields had the
following structure:

{ INSTRUCTION } CURRENT VALUE }

The braces shown above were in fact control characters with the Unicode
values \u0013, \u0014 and \u0015 respectively: \u0013 opens the field,
\u0014 separates the instruction from the current value, and \u0015 closes
the field. Between \u0013 and \u0014 was the instruction (EMBED
Word.Document.8, for example) and between \u0014 and \u0015 was the current
value, if any. As you no doubt know, fields can be used to insert a very
wide range of values, such as the date the document was created, and that
value may be stored when the user saves the file.
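
To make that structure concrete, here is a minimal sketch of how the three
markers could be used to drop the instruction part of each field while
keeping the current value. This is not the code inside stripFields(); the
class and method names (FieldMarkerSketch, keepCurrentValues) are my own
invention for illustration, and the handling of nested or stray markers is
an assumption rather than anything I have checked against Word itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI.
final class FieldMarkerSketch {
    static final char FIELD_BEGIN = 0x13;      // opens a field
    static final char FIELD_SEPARATOR = 0x14;  // instruction | current value
    static final char FIELD_END = 0x15;        // closes a field

    // Removes the markers and every instruction, keeping current values.
    static String keepCurrentValues(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while still in its instruction.
        Deque<Boolean> open = new ArrayDeque<>();
        int inInstruction = 0; // number of open fields still in instruction
        for (char c : text.toCharArray()) {
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                inInstruction++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to value.
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    inInstruction--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator ends while still in instruction.
                if (!open.isEmpty() && open.pop()) {
                    inInstruction--;
                }
            } else if (inInstruction == 0) {
                // Ordinary text, or the current value of a field.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Markers that appear outside any open field are simply dropped here, which
may or may not be what you want for a damaged document.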

If you run the simple code above against your file with the EMBED fields,
then it may help to identify whether there are any differences in the field
structure.
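
If the dump is too long to read by eye, a small check along the following
lines might help to spot paragraphs whose markers do not pair up. It is only
a sketch: FieldBalanceSketch and describe are hypothetical names, and the
idea that an unbalanced paragraph is what leaves text behind after
stripFields() is a guess on my part that I have not confirmed against your
document.

```java
// Hypothetical diagnostic, not part of POI.
final class FieldBalanceSketch {

    // Counts the three field markers in one paragraph and reports whether
    // every begin marker (0x13) is matched by a later end marker (0x15).
    static String describe(String paragraph) {
        int begins = 0;
        int separators = 0;
        int ends = 0;
        int depth = 0;              // fields currently open
        boolean strayEnd = false;   // an end marker seen with no field open
        for (int i = 0; i < paragraph.length(); i++) {
            switch (paragraph.charAt(i)) {
                case 0x13:
                    begins++;
                    depth++;
                    break;
                case 0x14:
                    separators++;
                    break;
                case 0x15:
                    ends++;
                    if (depth == 0) {
                        strayEnd = true;
                    } else {
                        depth--;
                    }
                    break;
                default:
                    break;
            }
        }
        boolean balanced = depth == 0 && !strayEnd;
        return "begin=" + begins + " separator=" + separators
                + " end=" + ends + " balanced=" + balanced;
    }
}
```

You could call describe() on each element of the array returned by
getParagraphText() and print only the paragraphs reported as unbalanced.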



Som Satpathy wrote:
> 
> Yes I tried using stripFields(). It strips some part of the unwanted text
> (with the EMBED tag), but some part still remains.
> 
> I suspect the problem might be with the encoding format of the "embedded
> object strings" (the ones starting with EMBED tag and ending with embedded
> doc's progID).
> 
> The stripFields() does not strip all of the encoded text.
> 
> 
> Regards
> Som Ranjan
> 
> 
> On Tue, Apr 28, 2009 at 2:44 PM, Nick Burch <[email protected]> wrote:
> 
>> On Tue, 28 Apr 2009, Som Satpathy wrote:
>>
>>> Is there any way by which we can *exclude* embedded ole information
>>> which we get on calling *wordExtractor.getText() ?
>>>
>>
>> Did you try stripFields?
>>
>> http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)
>>
>> Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
> 
> 
