Thanks a lot for the suggestion, Mark. I will try the code and see if it helps. The 'EMBED' + progID information was causing problems when I tried to add the extracted text, including the embedded-object information, to an XML file of mine. Because wordExtractor.getText() was returning mojibake for the embedded objects, the XML never accepted it.
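One likely reason the XML was rejected: the field markers (\u0013, \u0014, \u0015) are control characters that XML 1.0 does not allow, even as character references. Below is a minimal sketch, assuming the extracted text is already held in a String, that drops every code point outside the XML 1.0 Char range before the text is written out. The class XmlSafeText and its methods are hypothetical helpers for illustration, not part of POI.

```java
// Hypothetical helper (not part of POI): removes characters that
// XML 1.0 does not permit anywhere in a document.
final class XmlSafeText {
    private XmlSafeText() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the text with every non-XML character dropped. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

This only removes the illegal characters. The field instruction text (EMBED plus the progID) stays in the string, so it complements field stripping rather than replacing it.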
My Word document has embedded Excel, PowerPoint and PDF objects; maybe that's why stripFields() didn't work. While working with stripFields(), I found that it removes only one set of \u0013, \u0014 and \u0015 at a time. I still don't understand, though, why wordExtractor.getText() returns unrecognizable text for the embedded objects. It should really be omitted, since we can still read the embedded text through event listeners.

But thanks for your input, I will do some investigation with it.

Cheers
Som Ranjan

On Sun, May 10, 2009 at 2:32 PM, MSB <[email protected]> wrote:

>
> Over the last day or so, I have had the opportunity to dig around a little.
> Firstly, I made myself a test document by embedding one Word document into
> another. To do this, I used the Insert...Object...Create Object From File
> menu options to insert the EMBED field into my test document.
>
> Next, I used WordExtractor to recover the contents of the document and
> found that the EMBED field was returned. Then I called the stripFields()
> method and it worked as Nick suggested it should; the EMBED field was
> removed from the paragraph text. It seems, therefore, as though there is
> something different about the EMBED fields in your document.
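On the point that stripFields() seemed to remove only one set of markers at a time: one alternative is a hand-rolled pass that tracks nesting depth and drops everything between \u0013 and its matching \u0015. This is only a sketch, assuming the fields are delimited by \u0013, \u0014 and \u0015 as discussed in this thread and that the markers are properly nested. FieldStripper is a hypothetical class, not POI's implementation, and unlike a field-aware extractor it discards the current value as well as the instruction.

```java
// Hypothetical helper (not POI's stripFields): removes every field,
// including nested ones, in a single pass over the text.
final class FieldStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    private FieldStripper() {}

    static String stripAllFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int depth = 0; // how many fields we are currently inside
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                depth++;
            } else if (c == FIELD_END) {
                // Ignore a stray end marker that has no matching begin.
                if (depth > 0) {
                    depth--;
                }
            } else if (depth == 0 && c != FIELD_SEPARATOR) {
                // Keep only text outside all fields; stray separators are dropped.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Dropping the whole field suits EMBED fields, where the value is not readable text. For fields such as dates, where the current value is wanted, the loop would need to keep the characters between \u0014 and \u0015 instead.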
>
> When we were working on 'our' code, Christian and I used a very simple piece
> of code to look at the structure of the fields;
>
> // Needs java.io.FileInputStream and
> // org.apache.poi.hwpf.extractor.WordExtractor; the enclosing method
> // must declare or handle IOException.
> StringBuilder charString = new StringBuilder();
> StringBuilder intString = new StringBuilder();
> StringBuilder hexString = new StringBuilder();
>
> String[] text;
> try (FileInputStream fis = new FileInputStream("C:\\temp\\embedded.doc")) {
>     WordExtractor we = new WordExtractor(fis);
>     text = we.getParagraphText();
> }
>
> for (String item : text) {
>     for (char aChar : item.toCharArray()) {
>         // Pad each entry to seven columns so the three rows line up.
>         charString.append(String.format("%-7s", aChar));
>         intString.append(String.format("%-7d", (int) aChar));
>         // Hex value written as a four-digit \uXXXX escape.
>         hexString.append(String.format("\\u%04x ", (int) aChar));
>     }
> }
> System.out.println("Characters:     [" + charString + " ]");
> System.out.println("Numeric Values: [" + intString + " ]");
> System.out.println("Hex Values:     [" + hexString + " ]");
>
> Running that against my test file showed us that the fields had the
> following structure;
>
> { INSTRUCTION } CURRENT VALUE }
>
> The opening and closing braces were in fact control characters with the
> following unicode values, \u0013, \u0014 and \u0015 respectively.
> Between \u0013 and \u0014 was the instruction - EMBED Word.Document.8 for
> example - and between \u0014 and \u0015 was the current value, if any. As you
> no doubt know, fields can be used to insert a very wide range of values, such
> as the date the document was created, which may be stored when the user saves
> the file.
>
> If you run the simple code above against your file with the EMBED fields,
> then it may help to identify whether there are any differences in the field
> structure.
>
>
> Som Satpathy wrote:
> >
> > Yes I tried using stripFields(). It strips some part of the unwanted text
> > (with the EMBED tag), but some part still remains.
> >
> > I suspect the problem might be with the encoding format of the "embedded
> > object strings" (the ones starting with EMBED tag and ending with embedded
> > doc's progID).
> >
> > The stripFields() does not strip all of the encoded text.
> >
> > Regards
> > Som Ranjan
> >
> > On Tue, Apr 28, 2009 at 2:44 PM, Nick Burch <[email protected]> wrote:
> >
> >> On Tue, 28 Apr 2009, Som Satpathy wrote:
> >>
> >>> Is there any way by which we can *exclude* embedded ole information
> >>> which we get on calling *wordExtractor.getText()*?
> >>
> >> Did you try stripFields?
> >>
> >> http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)
> >>
> >> Nick
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >
>
> --
> View this message in context:
> http://www.nabble.com/Advice-needed-regarding-embedded-ole-extraction-tp23269803p23468348.html
> Sent from the POI - User mailing list archive at Nabble.com.
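To compare field structure between documents, as suggested earlier in the thread, it may help to pull each field apart into its instruction and current value rather than reading the raw character dump. The following is a rough sketch, assuming the \u0013 instruction \u0014 value \u0015 layout described above and no nested fields; FieldScanner is a hypothetical name, not a POI class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of POI): lists the fields found in
// extracted text as {instruction, value} pairs. Nested fields are not
// handled; each field is assumed to end at the first \u0015 after its start.
final class FieldScanner {
    private FieldScanner() {}

    static List<String[]> scan(String text) {
        List<String[]> fields = new ArrayList<>();
        int begin = text.indexOf('\u0013');
        while (begin >= 0) {
            int end = text.indexOf('\u0015', begin + 1);
            if (end < 0) {
                break; // unterminated field: stop scanning
            }
            String body = text.substring(begin + 1, end);
            int sep = body.indexOf('\u0014');
            // Without a separator the whole body is the instruction.
            String instruction = (sep < 0 ? body : body.substring(0, sep)).trim();
            String value = sep < 0 ? "" : body.substring(sep + 1);
            fields.add(new String[] {instruction, value});
            begin = text.indexOf('\u0013', end + 1);
        }
        return fields;
    }
}
```

Printing the instruction of each pair for a paragraph from the problem document should show whether its EMBED fields follow the same layout as the test document or contain extra markers.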
