This morning, I had the chance to try what I described in my previous email: using Java to replace text in a copy of the Paragraph's text and then replacing all of the text in the Paragraph with the modified copy. It worked, but only up to a point. If the replacement text was exactly the same length as the search term, the technique worked, but any difference in length rendered the resulting file corrupt. I am guessing that this is the same problem you originally encountered.
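For what it is worth, one way to stay inside the limit described above is to check the lengths before touching the document at all. What follows is only a sketch in plain Java and is not part of HWPF: the class LengthSafeReplace and its methods fitToLength and replaceFirstSameLength are names made up for illustration, and padding a shorter replacement with trailing spaces is an assumption that may not suit every document.

```java
final class LengthSafeReplace {

    // Returns the replacement padded with trailing spaces so that it is
    // exactly as long as the search term. A longer replacement is rejected
    // because it cannot be made to fit.
    static String fitToLength(String searchTerm, String replacement) {
        if (replacement.length() > searchTerm.length()) {
            throw new IllegalArgumentException(
                    "Replacement is longer than the search term");
        }
        StringBuilder padded = new StringBuilder(replacement);
        while (padded.length() < searchTerm.length()) {
            padded.append(' ');
        }
        return padded.toString();
    }

    // Replaces the first occurrence of searchTerm in a copy of text and
    // returns the copy; the result always has the same length as the input.
    // If the search term is not found, the text is returned unchanged.
    static String replaceFirstSameLength(String text,
                                         String searchTerm,
                                         String replacement) {
        int start = text.indexOf(searchTerm);
        if (start < 0) {
            return text;
        }
        String fitted = fitToLength(searchTerm, replacement);
        return text.substring(0, start)
                + fitted
                + text.substring(start + searchTerm.length());
    }

    public static void main(String[] args) {
        String original = "Dear customer name, welcome.";
        String modified =
                replaceFirstSameLength(original, "customer name", "Karthik");
        System.out.println("[" + modified + "]");
        // The two lengths are equal by construction.
        System.out.println(original.length() == modified.length());
    }
}
```

The String this returns could then be handed to whichever replacement call is in use. Whether HWPF then writes a valid file would still need to be tested; the sketch only guarantees that the text length does not change.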
So, I think I can conclude that if you are trying to replace the search term with a String of text that is longer than it, you will run into problems. As long as the replacement is shorter than the search term, or at most the same length, then the previous piece of code I posted seems to work well enough.

To my mind, then, you have two options. Option 1 would be to patch HWPF so that it works as you wish; the API is very immature and has not been the focus of the same sort of development effort as HSSF, for example. Option 2 is to use an alternative such as SWT/OLE or OpenOffice. The limitation of the OpenOffice approach is that whilst it can read OpenXML documents (Office 2007 and beyond, with the .docx or similar extension), it cannot save a document in this format.

Sorry for the bad news.

Yours

Mark B

karthik-33 wrote:
>
> Hi Mark, Thanks for sending this program.
> I tried this program with POI 3.2 final version, which I am currently
> using.
> CharacterRun doesn't behave consistently; sometimes it splits the
> paragraph text into several CharacterRuns, sometimes it doesn't split
> and I see the whole paragraph text in one CharacterRun. So the search
> text is not getting replaced.
> Is there any way to solve this issue?
>
> On Sat, Aug 8, 2009 at 6:34 AM, MSB <[email protected]> wrote:
>
>>
>> Here is the HWPF based code that I put together to play around with. It was
>> written a very long time ago so I am not sure what testing I undertook and
>> exactly what the results were but I have run it this morning just to ensure
>> that it does not crash the PC and all seems to be well. This section has been
>> cut from a much larger class that is full of other test code that I play
>> around with periodically. Everything you need is there I believe but on the
>> off-chance that it calls another method whose source I have neglected to
>> include, just drop an email to the list please.
>>
>> Currently, I am running POI version 3.5 beta 7 on a PC operating under
>> Windows XP SP2. Office 2007 is installed now and it seems able to open the
>> files this code produces quite happily. In the back of my mind, I seem to
>> remember that the files produced by some search and replace code I put
>> together could be opened but not modified; I tested that problem this morning
>> and the files this code produces seem fine, I can open them, make changes and
>> then save the results again. But do please be prepared for problems like
>> that.
>>
>> Again, can I emphasise this is test code; it is scruffy and there are going to
>> be variables I put in there so that I could monitor the progress of the code
>> by dumping messages to the screen. As you go through, if something seems to
>> be superfluous, then this is likely the reason and you can comment it out or
>> delete it.
>>
>> Good luck and I do hope it all works. If you have any problems, just drop a
>> message onto the list.
>>
>> Yours
>>
>> Mark B
>>
>>
>> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
>> import org.apache.poi.hwpf.HWPFDocument;
>> import org.apache.poi.hwpf.usermodel.Range;
>> import org.apache.poi.hwpf.usermodel.Paragraph;
>> import org.apache.poi.hwpf.usermodel.CharacterRun;
>>
>> import java.io.File;
>> import java.io.FileOutputStream;
>> import java.io.FileInputStream;
>> import java.io.BufferedOutputStream;
>> import java.io.BufferedInputStream;
>> import java.util.HashMap;
>> import java.util.Iterator;
>> import java.util.Set;
>>
>> /**
>>  * This code is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR
>>  * CONDITIONS OF ANY KIND, either express or implied. It is not intended to
>>  * be used in a 'production' environment without undergoing rigorous testing.
>>  *
>>  * With that out of the way, an instance of this class can be used to search for
>>  * and replace Strings of text within a Word document.
>>  * To see how the code may be used, look into the main() method for examples.
>>  *
>>  * Note the replacements made by the code contained within this class ignore
>>  * any formatting that may have been applied to the text that is replaced. That
>>  * is to say that if the text was originally formatted to use the Arial font,
>>  * was sized to 24 points, emboldened, underlined and red in colour, then all
>>  * of this will be lost if it is replaced. Further, if any text is replaced in a
>>  * Paragraph, all the formatting applied to that Paragraph's contents is likely
>>  * to be lost.
>>  *
>>  * @author Mark Beardsley [msb at apache.org]
>>  * @version 1.00 8th August 2009 (cannot remember when originally put together)
>>  */
>> public class SearchReplace {
>>
>>
>>     /**
>>      * Search for and replace a single occurrence of a string of text within a
>>      * Word document.
>>      *
>>      * Note that no checks are made on the parameters' values; that is to say
>>      * that the file named in the inputFilename parameter will not be checked
>>      * to ensure the file exists and neither of the searchTerm nor
>>      * replacementText parameters will be checked to ensure they are not null.
>>      * Also, note that I have never tested passing the same String to the
>>      * inputFilename and outputFilename parameters but cannot see why that
>>      * should not be possible.
>>      *
>>      * @param inputFilename An instance of the String class that encapsulates
>>      *                      the name of and path to a Word document which is
>>      *                      in the binary (OLE2CDF) format. The contents of this
>>      *                      document will be searched for occurrences of the
>>      *                      search term.
>>      * @param outputFilename An instance of the String class that encapsulates
>>      *                       the name of and path to a Word document which is
>>      *                       in the binary (OLE2CDF) format. This document will
>>      *                       contain the results of the search and replace
>>      *                       operation.
>>      * @param searchTerm An instance of the String class that encapsulates a
>>      *                   series of characters, a word or words. The document
>>      *                   will be searched for occurrences of this String.
>>      * @param replacementText An instance of the String class that contains a
>>      *                        series of characters, a word or words. The String
>>      *                        encapsulated by the searchTerm parameter will be
>>      *                        replaced by the 'contents' of this parameter.
>>      *
>>      */
>>     public void searchAndReplace(String inputFilename,
>>                                  String outputFilename,
>>                                  String searchTerm,
>>                                  String replacementText) {
>>
>>         File inputFile = null;
>>         File outputFile = null;
>>         FileInputStream fileIStream = null;
>>         FileOutputStream fileOStream = null;
>>         BufferedInputStream bufIStream = null;
>>         BufferedOutputStream bufOStream = null;
>>         POIFSFileSystem fileSystem = null;
>>         HWPFDocument document = null;
>>         Range docRange = null;
>>         Paragraph paragraph = null;
>>         CharacterRun charRun = null;
>>         int numParagraphs = 0;
>>         int numCharRuns = 0;
>>         String text = null;
>>
>>         try {
>>             // Create an instance of the POIFSFileSystem class and
>>             // attach it to the Word document using an InputStream.
>>             inputFile = new File(inputFilename);
>>             fileIStream = new FileInputStream(inputFile);
>>             bufIStream = new BufferedInputStream(fileIStream);
>>             fileSystem = new POIFSFileSystem(bufIStream);
>>             document = new HWPFDocument(fileSystem);
>>
>>             // Get the overall Range object for the document. Note the
>>             // use of the getRange() method and not the getOverallRange()
>>             // method, this is just historic - when the code was originally
>>             // written, I do not believe the latter method was part of the API.
>>             docRange = document.getRange();
>>
>>             // Get the number of Paragraph(s) in the overall range and iterate
>>             // through them
>>             numParagraphs = docRange.numParagraphs();
>>             for(int i = 0; i < numParagraphs; i++) {
>>
>>                 // Get a Paragraph and recover the text from it.
>>                 // This step is far from
>>                 // necessary and I think I only got the text so that I could print
>>                 // it to screen as a diagnostic check to ensure that the Paragraph
>>                 // contained the text I was searching for. Experiment with this.
>>                 paragraph = docRange.getParagraph(i);
>>                 text = paragraph.text();
>>
>>                 // Get the number of CharacterRuns in the Paragraph
>>                 numCharRuns = paragraph.numCharacterRuns();
>>                 for(int j = 0; j < numCharRuns; j++) {
>>
>>                     // Get a character run and recover its text - note that
>>                     // the same text variable is used as for the Paragraph above.
>>                     // So, it MUST be safe to remove the text = paragraph.text()
>>                     // line above.
>>                     charRun = paragraph.getCharacterRun(j);
>>                     text = charRun.text();
>>
>>                     // Check to see if the text of the CharacterRun contains the
>>                     // search term. If it does, find out where that term starts
>>                     // and call the replaceText() method passing the index.
>>                     // Maybe this is the key difference between what we are
>>                     // doing.
>>                     if(text.contains(searchTerm)) {
>>                         int start = text.indexOf(searchTerm);
>>                         charRun.replaceText(searchTerm, replacementText, start);
>>                     }
>>                 }
>>             }
>>
>>             // Close the InputStream
>>             bufIStream.close();
>>             bufIStream = null;
>>
>>             // Open an OutputStream and write the document away.
>>             outputFile = new File(outputFilename);
>>             fileOStream = new FileOutputStream(outputFile);
>>             bufOStream = new BufferedOutputStream(fileOStream);
>>
>>             document.write(bufOStream);
>>
>>         }
>>         catch(Exception ex) {
>>             System.out.println("Caught an: " + ex.getClass().getName());
>>             System.out.println("Message: " + ex.getMessage());
>>             System.out.println("Stacktrace follows.............");
>>             ex.printStackTrace(System.out);
>>         }
>>         finally {
>>             if(bufOStream != null) {
>>                 try {
>>                     //bufOStream.flush();
>>                     bufOStream.close();
>>                     bufOStream = null;
>>                 }
>>                 catch(Exception ex) {
>>                     // I G N O R E //
>>                 }
>>             }
>>             if(bufIStream != null) {
>>                 try {
>>                     bufIStream.close();
>>                     bufIStream = null;
>>                 }
>>                 catch(Exception ex) {
>>                     // I G N O R E //
>>                 }
>>             }
>>         }
>>
>>     }
>>
>>     /**
>>      * Search for and replace a series of strings of text within a
>>      * Word document.
>>      *
>>      * Note that no checks are made on the parameters' values; that is to say
>>      * that the file named in the inputFilename parameter will not be checked
>>      * to ensure the file exists and the replacements parameter will not be
>>      * checked to ensure it is not null.
>>      * Also, note that I have never tested passing the same String to the
>>      * inputFilename and outputFilename parameters but cannot see why that
>>      * should not be possible.
>>      *
>>      * @param inputFilename An instance of the String class that encapsulates
>>      *                      the name of and path to a Word document which is
>>      *                      in the binary (OLE2CDF) format. The contents of this
>>      *                      document will be searched for occurrences of the
>>      *                      search term.
>>      * @param outputFilename An instance of the String class that encapsulates
>>      *                       the name of and path to a Word document which is
>>      *                       in the binary (OLE2CDF) format. This document will
>>      *                       contain the results of the search and replace
>>      *                       operation.
>>      * @param replacements An instance of the java.util.HashMap class that
>>      *                     contains a series of key, value pairs. Each key
>>      *                     is an instance of the String class that encapsulates
>>      *                     a series of characters, a word or words that the
>>      *                     code will search for and the accompanying value is
>>      *                     also an instance of the String class that likewise
>>      *                     encapsulates a series of characters, a word or words.
>>      *                     The 'contents' of the value's String will be used to
>>      *                     replace the contents of the key's String if an
>>      *                     occurrence of the latter is found.
>>      */
>>     public void searchAndReplace(String inputFilename,
>>                                  String outputFilename,
>>                                  HashMap<String, String> replacements) {
>>
>>         File inputFile = null;
>>         File outputFile = null;
>>         FileInputStream fileIStream = null;
>>         FileOutputStream fileOStream = null;
>>         BufferedInputStream bufIStream = null;
>>         BufferedOutputStream bufOStream = null;
>>         POIFSFileSystem fileSystem = null;
>>         HWPFDocument document = null;
>>         Range docRange = null;
>>         Paragraph paragraph = null;
>>         CharacterRun charRun = null;
>>         Set<String> keySet = null;
>>         Iterator<String> keySetIterator = null;
>>         int numParagraphs = 0;
>>         int numCharRuns = 0;
>>         String text = null;
>>         String key = null;
>>         String value = null;
>>
>>         try {
>>             // Create an instance of the POIFSFileSystem class and
>>             // attach it to the Word document using an InputStream.
>>             inputFile = new File(inputFilename);
>>             fileIStream = new FileInputStream(inputFile);
>>             bufIStream = new BufferedInputStream(fileIStream);
>>             fileSystem = new POIFSFileSystem(bufIStream);
>>             document = new HWPFDocument(fileSystem);
>>
>>             // Get a reference to the overall Range for the document
>>             // and discover how many Paragraph objects there are
>>             // in the document.
>>             docRange = document.getRange();
>>             numParagraphs = docRange.numParagraphs();
>>
>>             // Recover a Set of the keys in the HashMap
>>             keySet = replacements.keySet();
>>
>>             // Step through each Paragraph
>>             for(int i = 0; i < numParagraphs; i++) {
>>                 paragraph = docRange.getParagraph(i);
>>                 // This line can almost certainly be removed - see
>>                 // the comments in the method above.
>>                 text = paragraph.text();
>>
>>                 // Get the number of CharacterRuns in the Paragraph
>>                 // and step through each one.
>>                 numCharRuns = paragraph.numCharacterRuns();
>>                 for(int j = 0; j < numCharRuns; j++) {
>>                     charRun = paragraph.getCharacterRun(j);
>>
>>                     // Get the text from the CharacterRun and recover an
>>                     // Iterator to step through the Set of keys.
>>                     text = charRun.text();
>>                     keySetIterator = keySet.iterator();
>>                     while(keySetIterator.hasNext()) {
>>
>>                         // Get the key - which is also the search term - and
>>                         // check to see if it can be found within the
>>                         // CharacterRun's text.
>>                         key = keySetIterator.next();
>>                         if(text.contains(key)) {
>>
>>                             // If the search term was found in the text, get the
>>                             // matching value from the HashMap, find out whereabouts
>>                             // in the CharacterRun's text the search term is
>>                             // and call the replaceText() method to substitute
>>                             // the replacement term for the search term.
>>                             value = replacements.get(key);
>>                             int start = text.indexOf(key);
>>                             charRun.replaceText(key, value, start);
>>
>>                             // Note that this code was added to test whether
>>                             // it was possible to replace multiple occurrences
>>                             // of the search term. I cannot remember if I tested
>>                             // it but believe that it did work; either way,
>>                             // it could be tested now and if it succeeds, then the
>>                             // searchAndReplace() method above could be modified
>>                             // to include this.
>>                             docRange = document.getRange();
>>                             paragraph = docRange.getParagraph(i);
>>                             charRun = paragraph.getCharacterRun(j);
>>                             text = charRun.text();
>>                         }
>>                     }
>>                 }
>>             }
>>
>>             // Close the InputStream
>>             bufIStream.close();
>>             bufIStream = null;
>>
>>             // Open an OutputStream and save the modified document away.
>>             outputFile = new File(outputFilename);
>>             fileOStream = new FileOutputStream(outputFile);
>>             bufOStream = new BufferedOutputStream(fileOStream);
>>             document.write(bufOStream);
>>         }
>>         catch(Exception ex) {
>>             System.out.println("Caught an: " + ex.getClass().getName());
>>             System.out.println("Message: " + ex.getMessage());
>>             System.out.println("Stacktrace follows.............");
>>             ex.printStackTrace(System.out);
>>         }
>>         finally {
>>             if(bufIStream != null) {
>>                 try {
>>                     bufIStream.close();
>>                     bufIStream = null;
>>                 }
>>                 catch(Exception ex) {
>>                     // I G N O R E //
>>                 }
>>             }
>>             if(bufOStream != null) {
>>                 try {
>>                     bufOStream.flush();
>>                     bufOStream.close();
>>                     bufOStream = null;
>>                 }
>>                 catch(Exception ex) {
>>                     // I G N O R E //
>>                 }
>>             }
>>         }
>>
>>     }
>>
>>     /**
>>      * The main entry point to the program demonstrating how the code may
>>      * be utilised.
>>      *
>>      * @param args An array of type String containing arguments passed to the
>>      *             program on execution.
>>      */
>>     public static void main(String[] args) {
>>         SearchReplace replacer = new SearchReplace();
>>
>>         // To search for and replace single items. Note, the code has not, at
>>         // least as far as I can remember, been tested by passing the same
>>         // file to both the inputFilename and outputFilename parameters. It ought
>>         // to work but has NOT been tested I believe.
>>         replacer.searchAndReplace("Document.doc",          // Source Document
>>                                   "Replaced Document.doc", // Result Document
>>                                   "search term",           // Search term
>>                                   "replacement term");     // Replacement term
>>
>>         // To search for and replace a series of items
>>         HashMap<String, String> searchTerms = new HashMap<String, String>();
>>         searchTerms.put("search term 1", "replacement term 1");
>>         searchTerms.put("search term 2", "replacement term 2");
>>         searchTerms.put("search term 3", "replacement term 3");
>>         searchTerms.put("search term 4", "replacement term 4");
>>
>>         replacer.searchAndReplace("Document.doc",          // Source Document
>>                                   "Replaced Document.doc", // Result Document
>>                                   searchTerms);            // Search/replacement items
>>     }
>> }
>>
>>
>>
>> karthik-33 wrote:
>> >
>> > Thanks for the reply Mark.
>> > I don't think I need to preserve text formatting, but I would like to try
>> > your code and see how it works.
>> > I think that would help me too.
>> >
>> > I can't go the OpenOffice route since my business requirement is to use
>> > Microsoft Word documents.
>> > I will be using this search and replace function on the same PC as that of
>> > the application.
>> >
>> > If you can send me that code, I will try it and let you know how it works.
>> >
>> > Thanks
>> > Karthik
>> >
>> >
>> > On Fri, Aug 7, 2009 at 11:49 AM, MSB <[email protected]> wrote:
>> >
>> >>
>> >> Can I ask two questions please?
>> >>
>> >> Do you need to preserve the formatting applied to the text? If not, then I
>> >> think that somewhere I have a piece of HWPF code that does a search and
>> >> replace. I am not at all certain about the state of the code and cannot
>> >> remember if I hit the same problem as you - and I may well have - but I am
>> >> willing to look it out if you think it might help.
>> >>
>> >> Secondly, do you have to use HWPF/XWPF? The API is still immature and it is
>> >> really only suitable for relatively simple tasks.
>> >> Better alternatives might
>> >> be OpenOffice, which you can 'control' through its UNO API, or Word itself,
>> >> which can be manipulated using OLE. You can ONLY use OLE if you are working
>> >> on a Windows based PC and you have Word installed on that PC. OpenOffice is
>> >> more flexible but it still cannot be used - at least as far as I am aware -
>> >> as a document server, so it is best to have that application installed on
>> >> the PC you will be using for the search/replace operation.
>> >>
>> >> Yours
>> >>
>> >> Mark B
>> >>
>> >>
>> >> karthik-33 wrote:
>> >> >
>> >> > I have Microsoft Office 2007 and while saving the document, I save it as
>> >> > a Microsoft 2003 document.
>> >> > I am trying to replace the text using the replaceText method in Paragraph.
>> >> > It works fine when the replacement text and search text are of equal
>> >> > length.
>> >> > It corrupts the document when the length of the replacement string is
>> >> > either greater or less.
>> >> > If anyone has gone through this issue and resolved it or has any idea,
>> >> > please let me know, it will be useful for me.
>> >> > I am not sure what is causing the document to become corrupt.
>> >> >
>> >> > Code is:
>> >> >
>> >> > String replaceTxt = "Replacement";
>> >> > String searchText = "Orginial";
>> >> > POIFSFileSystem ps = new POIFSFileSystem(new
>> >> > FileInputStream("C:/Document.doc"));
>> >> > HWPFDocument doc = new HWPFDocument(ps);
>> >> > Range range = doc.getRange();
>> >> > for(int x=0;x<range.numSections();x++)
>> >> > {
>> >> >     Section s = range.getSection(x);
>> >> >     for(int y=0;y<s.numParagraphs();y++)
>> >> >     {
>> >> >         Paragraph p = s.getParagraph(y);
>> >> >         String paraText = p.text();
>> >> >         int offset = paraText.indexOf(searchText);
>> >> >         if(offset != -1)
>> >> >         {
>> >> >             p.replaceText(searchText,replaceTxt,offset);
>> >> >         }
>> >> >     }
>> >> > }
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24867251.html
>> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >>
>> >
>> >
>> > --
>> > karthik
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24876942.html
>> Sent from the POI - User mailing list archive at Nabble.com.
>>
>
>
> --
> karthik
>
>

--
View this message in context: http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24885699.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
