Here is the HWPF-based code that I put together to play around with. It was
written a very long time ago, so I am not sure what testing I undertook or
exactly what the results were, but I ran it this morning just to ensure that
it does not crash the PC, and all seems to be well. This section has been cut
from a much larger class that is full of other test code that I play around
with periodically. Everything you need is there, I believe, but on the
off-chance that it calls another method whose source I have neglected to
include, just drop an email to the list please.
Currently, I am running POI version 3.5 beta 7 on a PC operating under
Windows XP SP2. Office 2007 is installed now and it seems able to open the
files this code produces quite happily. In the back of my mind, I seem to
remember that the files produced by some search and replace code I put
together could be opened but not modified; I tested for that problem this
morning and the files this code produces seem fine: I can open them, make
changes and then save the results again. But do please be prepared for
problems like that.
Again, can I emphasise this is test code; it is scruffy and there are going
to be variables I put in there so that I could monitor the progress of the
code by dumping messages to the screen. As you go through, if something seems
to be superfluous, then this is likely the reason and you can comment it out
or delete it.
Good luck and I do hope it all works. If you have any problems, just drop a
message onto the list.
Yours
Mark B
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedInputStream;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
/**
* This code is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR
* CONDITIONS OF ANY KIND, either express or implied. It is not intended to
* be used in a 'production' environment without undergoing rigorous
* testing.
*
* With that out of the way, an instance of this class can be used to search
* for and replace Strings of text within a Word document. To see how the
* code may be used, look into the main() method for examples.
*
* Note the replacements made by the code contained within this class ignore
* any formatting that may have been applied to the text that is replaced.
* That is to say that if the text was originally formatted to use the Arial
* font, was sized to 24 points, emboldened, underlined and red in colour,
* then all of this will be lost if it is replaced. Further, if any text is
* replaced in a Paragraph, all the formatting applied to that Paragraph's
* contents is likely to be lost.
*
* @author Mark Beardsley [msb at apache.org]
* @version 1.00 8th August 2009 (cannot remember when originally put
* together)
*/
public class SearchReplace {
/**
* Search for and replace a string of text within a Word document. Only
* the first occurrence found in each CharacterRun is replaced.
*
* Note that no checks are made on the parameters' values; that is to say
* that the file named in the inputFilename parameter will not be checked
* to ensure the file exists and neither the searchTerm nor the
* replacementText parameter will be checked to ensure it is not null.
* Also, note that I have never tested passing the same String to the
* inputFilename and outputFilename parameters but cannot see why that
* should not be possible.
*
* @param inputFilename An instance of the String class that encapsulates
* the name of and path to a Word document which is
* in the binary (OLE2CDF) format. The contents of
* this document will be searched for occurrences of
* the search term.
* @param outputFilename An instance of the String class that encapsulates
* the name of and path to a Word document which is
* in the binary (OLE2CDF) format. This document
* will contain the results of the search and
* replace operation.
* @param searchTerm An instance of the String class that encapsulates a
* series of characters, a word or words. The document
* will be searched for occurrences of this String.
* @param replacementText An instance of the String class that contains a
* series of characters, a word or words. The
* String encapsulated by the searchTerm parameter
* will be replaced by the 'contents' of this
* parameter.
*/
public void searchAndReplace(String inputFilename,
String outputFilename,
String searchTerm,
String replacementText) {
File inputFile = null;
File outputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
// Get the overall Range object for the document. Note the
// use of the getRange() method and not the getOverallRange()
// method; this is just historic - when the code was originally
// written, I do not believe the latter method was part of the API.
docRange = document.getRange();
// Get the number of Paragraph(s) in the overall range and iterate
// through them.
numParagraphs = docRange.numParagraphs();
for(int i = 0; i < numParagraphs; i++) {
// Get a Paragraph and recover the text from it. This step is
// not strictly necessary; I think I only got the text so that
// I could print it to screen as a diagnostic check to ensure
// that the Paragraph contained the text I was searching for.
// Experiment with this.
paragraph = docRange.getParagraph(i);
text = paragraph.text();
// Get the number of CharacterRuns in the Paragraph
numCharRuns = paragraph.numCharacterRuns();
for(int j = 0; j < numCharRuns; j++) {
// Get a character run and recover its text - note that
// the same text variable is used as for the Paragraph above.
// So, it MUST be safe to remove the text = paragraph.text()
// line above.
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
// Check to see if the text of the CharacterRun contains the
// search term. If it does, find out where that term starts
// and call the replaceText() method passing the index.
// Maybe this is the key difference between what we are
// doing: the replacement is made on the CharacterRun, with
// an index into that run's text, not on the Paragraph.
if(text.contains(searchTerm)) {
int start = text.indexOf(searchTerm);
charRun.replaceText(searchTerm, replacementText,
start);
}
}
}
// Close the InputStream
bufIStream.close();
bufIStream = null;
// Open an OutputStream and write the document away.
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
}
catch(Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
finally {
if(bufOStream != null) {
try {
//bufOStream.flush();
bufOStream.close();
bufOStream = null;
}
catch(Exception ex) {
}
}
if(bufIStream != null) {
try {
bufIStream.close();
bufIStream = null;
}
catch(Exception ex) {
// I G N O R E //
}
}
}
}
/**
* Search for and replace a series of strings of text within a Word
* document.
*
* Note that no checks are made on the parameters' values; that is to say
* that the file named in the inputFilename parameter will not be checked
* to ensure the file exists and the replacements parameter will not be
* checked to ensure it is not null.
* Also, note that I have never tested passing the same String to the
* inputFilename and outputFilename parameters but cannot see why that
* should not be possible.
*
* @param inputFilename An instance of the String class that encapsulates
* the name of and path to a Word document which is
* in the binary (OLE2CDF) format. The contents of
* this document will be searched for occurrences of
* the search terms.
* @param outputFilename An instance of the String class that encapsulates
* the name of and path to a Word document which is
* in the binary (OLE2CDF) format. This document
* will contain the results of the search and
* replace operation.
* @param replacements An instance of the java.util.HashMap class that
* contains a series of key, value pairs. Each key
* is an instance of the String class that
* encapsulates a series of characters, a word or
* words that the code will search for, and the
* accompanying value is also an instance of the
* String class that likewise encapsulates a series
* of characters, a word or words. The 'contents' of
* the value's String will be used to replace the
* contents of the key's String if an occurrence of
* the latter is found.
*/
public void searchAndReplace(String inputFilename,
String outputFilename,
HashMap<String, String> replacements) {
File inputFile = null;
File outputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
// Get a reference to the overall Range for the document
// and discover how many Paragraphs objects there are
// in the document.
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
// Recover a Set of the keys in the HashMap
keySet = replacements.keySet();
// Step through each Paragraph
for(int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
// This line can almost certainly be removed - see
// the comments in the method above.
text = paragraph.text();
// Get the number of CharacterRuns in the Paragraph
// and step through each one.
numCharRuns = paragraph.numCharacterRuns();
for(int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
// Get the text from the CharacterRun and recover an
// Iterator to step through the Set of keys.
text = charRun.text();
keySetIterator = keySet.iterator();
while(keySetIterator.hasNext()) {
// Get the key - which is also the search term - and
// check to see if it can be found within the
// CharacterRuns text.
key = keySetIterator.next();
if(text.contains(key)) {
// If the search term was found in the text, get the
// matching value from the HashMap, find out whereabouts
// in the CharacterRun's text the search term is
// and call the replaceText() method to substitute
// the replacement term for the search term.
value = replacements.get(key);
int start = text.indexOf(key);
charRun.replaceText(key, value, start);
// Note that this code was added to test whether
// it was possible to replace multiple occurrences
// of the search term. I cannot remember if I tested
// it but believe that it did work; either way,
// it could be tested now and, if it succeeds, then the
// searchAndReplace() method above could be modified
// to include this.
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
// Close the InputStream
bufIStream.close();
bufIStream = null;
// Open an OutputStream and save the modified document away.
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
}
catch(Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
finally {
if(bufIStream != null) {
try {
bufIStream.close();
bufIStream = null;
}
catch(Exception ex) {
// I G N O R E //
}
}
if(bufOStream != null) {
try {
bufOStream.flush();
bufOStream.close();
bufOStream = null;
}
catch(Exception ex) {
}
}
}
}
/**
* The main entry point to the program demonstrating how the code may
* be utilised.
*
* @param args An array of type String containing arguments passed to the
* program on execution.
*/
public static void main(String[] args) {
SearchReplace replacer = new SearchReplace();
// To search for and replace single items. Note, the code has not, at
// least as far as I can remember, been tested by passing the same
// file to both the inputFilename and outputFilename parameters. It
// ought to work but has NOT been tested, I believe.
replacer.searchAndReplace("Document.doc", // Source document
"Replaced Document.doc", // Result document
"search term", // Search term
"replacement term"); // Replacement term
// To search for and replace a series of items
HashMap<String, String> searchTerms = new HashMap<String, String>();
searchTerms.put("search term 1", "replacement term 1");
searchTerms.put("search term 2", "replacement term 2");
searchTerms.put("search term 3", "replacement term 3");
searchTerms.put("search term 4", "replacement term 4");
replacer.searchAndReplace("Document.doc", // Source document
"Replaced Document.doc", // Result document
searchTerms); // Search/replacement items
}
}
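
On the multiple-occurrence question raised in the comments of the second
searchAndReplace() method, here is a minimal sketch of one way the looping
could be arranged. It is NOT part of the listing above and has not been
tested against HWPF: it works on a plain String rather than a CharacterRun,
and the class name ReplaceAllSketch and the method name replaceEvery are
made up purely for illustration. The point it tries to show is that the next
search should resume after the text just inserted; otherwise a replacement
that itself contains the search term would be matched again and again.

```java
public class ReplaceAllSketch {

    /**
     * Replaces every occurrence of searchTerm in text. A plain String
     * stands in for the text of a CharacterRun here (an assumption made
     * so the sketch can run without POI on the classpath).
     */
    public static String replaceEvery(String text,
                                      String searchTerm,
                                      String replacement) {
        if (searchTerm.isEmpty()) {
            // An empty search term would match everywhere; leave text alone.
            return text;
        }
        StringBuilder buffer = new StringBuilder(text);
        int start = buffer.indexOf(searchTerm);
        while (start != -1) {
            buffer.replace(start, start + searchTerm.length(), replacement);
            // Resume AFTER the inserted text so that a replacement which
            // contains the search term is not found a second time.
            start = buffer.indexOf(searchTerm, start + replacement.length());
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // The replacement deliberately contains the search term.
        System.out.println(replaceEvery("one cat, two cats", "cat", "wildcat"));
    }
}
```

Whether the same index arithmetic carries over unchanged to
CharacterRun.replaceText() is something that would need to be tested.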
karthik-33 wrote:
>
> Thanks for the reply mark.
> I dont think i need to preserve text formatting, but i would like to try
> your code and see how it works.
> I think that would help me too.
>
> I cant go the open office since my business requirement is to use
> microsoft
> word documents.
> I will be using this search and replace function in the same PC as that of
> the application.
>
> If you can send me that code, i will try and let u know how it works.
>
> Thanks
> Karthik
>
>
> On Fri, Aug 7, 2009 at 11:49 AM, MSB <[email protected]> wrote:
>
>>
>> Can I ask two questions please?
>>
>> Do you need to preserve the formatting applied to the text? If not, then
>> I think that somewhere I have a piece of HWPF code that does a search
>> and replace. I am not at all certain about the state of the code and
>> cannot remember if I hit the same problem as you - and I may well have -
>> but I am willing to look it out if you think it might help.
>>
>> Secondly, do you have to use HWPF/XWPF? The API is still immature and it
>> is really only suitable for relatively simple tasks. Better alternatives
>> might be OpenOffice, which you can 'control' through its UNO API, or Word
>> itself, which can be manipulated using OLE. You can ONLY use OLE if you
>> are working on a Windows-based PC and you have Word installed on that PC.
>> OpenOffice is more flexible but it still cannot be used - at least as far
>> as I am aware - as a document server, so it is best to have that
>> application installed on the PC you will be using for the search/replace
>> operation.
>>
>> Yours
>>
>> Mark B
>>
>>
>> karthik-33 wrote:
>> >
>> > I have Microsoft Office 2007 and while saving the document, I save it
>> > as a Microsoft 2003 document.
>> > I am trying to replace the text using the replaceText method in
>> > Paragraph.
>> > It works fine when the replacement text and search text are of equal
>> > length.
>> > It corrupts the document when the length of the replacement string is
>> > either greater or less.
>> > If anyone has gone through the issue and resolved it, or has any idea,
>> > please let me know; it will be useful for me.
>> > I am not sure what is causing the problem to corrupt the document.
>> >
>> > Code is:
>> >
>> > String replaceTxt = "Replacement";
>> > String searchText = "Orginial";
>> > POIFSFileSystem ps = new POIFSFileSystem (new
>> > FileInputStream("C:/Document.doc"));
>> > HWPFDocument doc = new HWPFDocument ();
>> > Range range = doc.getRange();
>> > for(int x=0;x<range.numSections();x++)
>> > {
>> > Section s = range.getSection(x);
>> > for(int y=0;y<s.numParagraphs();y++)
>> > {
>> > Paragraph p = s.getParagraph(y);
>> > String paraText = p.text();
>> > int offset = paraText.indexOf(searchText );
>> > if(offset != -1)
>> > {
>> > p.replaceText(searchText,replaceTxt,offset);
>> >
>> > }
>> > }
>> >
>> > }
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24867251.html
>> Sent from the POI - User mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
> --
> karthik
>
>
--
View this message in context:
http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24876942.html
Sent from the POI - User mailing list archive at Nabble.com.