Re: Replace Text Problem (Document Corrupt) - POI HWPFDocument

MSB Sun, 09 Aug 2009 03:07:41 -0700

This morning, I had the chance to try what I spoke of in my previous email:
using Java to replace text in a copy of the Paragraph's text and then
replacing all of the text in the Paragraph with the modified copy. It worked,
but only up to a point. If the replacement text was exactly the same length
as the search term, the technique worked, but any difference in length
rendered the resulting file corrupt. I am guessing that this is the same
problem as you originally encountered.
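
To illustrate why searching the whole Paragraph's text finds matches that a
CharacterRun-by-CharacterRun search can miss, here is a minimal sketch in
plain Java. It does not call HWPF at all: a List of Strings stands in for the
texts of a Paragraph's CharacterRuns, and the class and method names
(RunSplitDemo, foundInSingleRun, paragraphText, lengthChange) are made up for
this example. The last method only reports how much a replacement would
change the length of the text, which is the quantity that seemed to decide
whether the saved file was corrupt.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; plain Strings stand in for HWPF CharacterRun texts.
class RunSplitDemo {

    /** Returns true only if one single run contains the whole search term. */
    static boolean foundInSingleRun(List<String> runTexts, String searchTerm) {
        for (String run : runTexts) {
            if (run.contains(searchTerm)) {
                return true;
            }
        }
        return false;
    }

    /** Joins the run texts in order, standing in for the Paragraph's text. */
    static String paragraphText(List<String> runTexts) {
        StringBuilder joined = new StringBuilder();
        for (String run : runTexts) {
            joined.append(run);
        }
        return joined.toString();
    }

    /** Number of characters the text grows (positive) or shrinks (negative). */
    static int lengthChange(String text, String searchTerm, String replacement) {
        return text.replace(searchTerm, replacement).length() - text.length();
    }

    public static void main(String[] args) {
        // The search term straddles the boundary between the first two runs.
        List<String> runs = Arrays.asList("Find the sea", "rch term", " here.");
        String term = "search term";
        String text = paragraphText(runs);

        System.out.println("Found in a single run:   " + foundInSingleRun(runs, term));
        System.out.println("Found in paragraph text: " + text.contains(term));
        System.out.println("Length change (same length): "
                + lengthChange(text, term, "replacement"));
        System.out.println("Length change (longer):      "
                + lengthChange(text, term, "a much longer replacement"));
    }
}
```

How the modified text is written back into the Paragraph is deliberately left
out, since that is the step that misbehaves when the length changes.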


So, I think I can conclude that if you are trying to replace the search term
with a String of text that is longer than it, you will run into problems. As
long as the replacement is shorter than the search term, or at most the
same length, the CharacterRun based code I posted previously seems to work
well enough.
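
Since same-length replacements were the only ones that survived in every
test, one possible workaround is to pad a shorter replacement with trailing
spaces until it matches the length of the search term, and to refuse a longer
one outright. This is only a sketch and I have NOT tested it against HWPF;
the class and method names (SameLengthReplacement, padToSearchTermLength) are
hypothetical, and whether trailing spaces are acceptable will depend on the
document.

```java
// Hypothetical helper; it does not call HWPF and only prepares the String.
class SameLengthReplacement {

    /**
     * Returns the replacement padded with trailing spaces so that it is
     * exactly as long as the search term.
     *
     * @throws IllegalArgumentException if the replacement is longer than
     *         the search term and therefore cannot be made to fit.
     */
    static String padToSearchTermLength(String searchTerm, String replacement) {
        int target = searchTerm.length();
        if (replacement.length() > target) {
            throw new IllegalArgumentException("Replacement is "
                    + (replacement.length() - target)
                    + " character(s) longer than the search term");
        }
        StringBuilder padded = new StringBuilder(replacement);
        while (padded.length() < target) {
            padded.append(' ');
        }
        return padded.toString();
    }

    public static void main(String[] args) {
        String padded = padToSearchTermLength("search term", "new text");
        // Brackets make the trailing spaces visible on screen.
        System.out.println("[" + padded + "] length " + padded.length());
    }
}
```

The padded String would then be passed to replaceText() in place of the
original replacement.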

To my mind then you have two options. Option 1 would be to patch HWPF so
that it will work as you wish - the API is very immature and has not been
the focus of the same sort of development effort as has HSSF for example.
Option 2 is to use an alternative such as SWT/OLE or OpenOffice. The
limitation with the OpenOffice approach is that whilst it can read OpenXML
documents - Office 2007 and beyond with the .docx or similar extension - it
cannot save a document in this format.

Sorry for the bad news.

Yours

Mark B


karthik-33 wrote:
> 
> Hi Mark, Thanks for sending this program.
> I tried this program with POI 3.2 final version, which I am currently using.
> CharacterRun doesn't behave consistently; sometimes it splits the
> paragraph text into several character runs, and sometimes it doesn't split
> and I see the whole paragraph text in one character run. So the search text
> is not getting replaced.
> Is there any way to solve this issue?
> On Sat, Aug 8, 2009 at 6:34 AM, MSB <[email protected]> wrote:
> 
>>
>> Here is the HWPF based code that I put together to play around with. It was
>> written a very long time ago so I am not sure what testing I undertook and
>> exactly what the results were but I have run it this morning just to ensure
>> that it does not crash the PC and all seems to be well. This section has
>> been cut from a much larger class that is full of other test code that I
>> play around with periodically. Everything you need is there I believe but
>> on the off-chance that it calls another method whose source I have
>> neglected to include, just drop an email to the list please.
>>
>> Currently, I am running POI version 3.5 beta 7 on a PC operating under
>> Windows XP SP2. Office 2007 is installed now and it seems able to open the
>> files this code produces quite happily. In the back of my mind, I seem to
>> remember that the files produced by some search and replace code I put
>> together could be opened but not modified; I tested that problem this
>> morning and the files this code produces seem fine, I can open them, make
>> changes and then save the results again. But do please be prepared for
>> problems like that.
>>
>> Again, can I emphasise this is test code; it is scruffy and there are going
>> to be variables I put in there so that I could monitor the progress of the
>> code by dumping messages to the screen. As you go through, if something
>> seems to be superfluous, then this is likely the reason and you can comment
>> it out or delete it.
>>
>> Good luck and I do hope it all works. If you have any problems, just drop a
>> message onto the list.
>>
>> Yours
>>
>> Mark B
>>
>>
>> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
>> import org.apache.poi.hwpf.HWPFDocument;
>> import org.apache.poi.hwpf.usermodel.Range;
>> import org.apache.poi.hwpf.usermodel.Paragraph;
>> import org.apache.poi.hwpf.usermodel.CharacterRun;
>>
>> import java.io.File;
>> import java.io.FileOutputStream;
>> import java.io.FileInputStream;
>> import java.io.BufferedOutputStream;
>> import java.io.BufferedInputStream;
>> import java.util.HashMap;
>> import java.util.Iterator;
>> import java.util.Set;
>>
>> /**
>>  * This code is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR
>>  * CONDITIONS OF ANY KIND, either express or implied. It is not intended to
>>  * be used in a 'production' environment without undergoing rigorous testing.
>>  *
>>  * With that out of the way, an instance of this class can be used to search
>>  * for and replace Strings of text within a Word document. To see how the
>>  * code may be used, look into the main() method for examples.
>>  *
>>  * Note the replacements made by the code contained within this class ignore
>>  * any formatting that may have been applied to the text that is replaced.
>>  * That is to say that if the text was originally formatted to use the Arial
>>  * font, was sized to 24 points, emboldened, underlined and red in colour,
>>  * then all of this will be lost if it is replaced. Further if any text is
>>  * replaced in a Paragraph, all the formatting applied to that Paragraph's
>>  * contents is likely to be lost.
>>  *
>>  * @author Mark Beardsley [msb at apache.org]
>>  * @version 1.00 8th August 2009 (cannot remember when originally put
>>  *          together)
>>  */
>> public class SearchReplace {
>>
>>
>>    /**
>>     * Search for and replace a single occurrence of a string of text within a
>>     * Word document.
>>     *
>>     * Note that no checks are made on the parameters' values; that is to say
>>     * that the file named in the inputFilename parameter will not be checked
>>     * to ensure the file exists and neither the searchTerm nor the
>>     * replacementText parameter will be checked to ensure it is not null.
>>     * Also, note that I have never tested passing the same String to the
>>     * inputFilename and outputFilename parameters but cannot see why that
>>     * should not be possible.
>>     *
>>     * @param inputFilename An instance of the String class that encapsulates
>>     *                      the name of and path to a Word document which is
>>     *                      in the binary (OLE2CDF) format. The contents of
>>     *                      this document will be searched for occurrences of
>>     *                      the search term.
>>     * @param outputFilename An instance of the String class that encapsulates
>>     *                       the name of and path to a Word document which is
>>     *                       in the binary (OLE2CDF) format. This document
>>     *                       will contain the results of the search and
>>     *                       replace operation.
>>     * @param searchTerm An instance of the String class that encapsulates a
>>     *                   series of characters, a word or words. The document
>>     *                   will be searched for occurrences of this String.
>>     * @param replacementText An instance of the String class that contains a
>>     *                        series of characters, a word or words. The
>>     *                        String encapsulated by the searchTerm parameter
>>     *                        will be replaced by the 'contents' of this
>>     *                        parameter.
>>     */
>>    public void searchAndReplace(String inputFilename,
>>                                 String outputFilename,
>>                                 String searchTerm,
>>                                 String replacementText) {
>>
>>        File inputFile = null;
>>        File outputFile = null;
>>        FileInputStream fileIStream = null;
>>        FileOutputStream fileOStream = null;
>>        BufferedInputStream bufIStream = null;
>>        BufferedOutputStream bufOStream = null;
>>        POIFSFileSystem fileSystem = null;
>>        HWPFDocument document = null;
>>        Range docRange = null;
>>        Paragraph paragraph = null;
>>        CharacterRun charRun = null;
>>        int numParagraphs = 0;
>>        int numCharRuns = 0;
>>        String text = null;
>>
>>        try {
>>            // Create an instance of the POIFSFileSystem class and
>>            // attach it to the Word document using an InputStream.
>>            inputFile = new File(inputFilename);
>>            fileIStream = new FileInputStream(inputFile);
>>            bufIStream = new BufferedInputStream(fileIStream);
>>            fileSystem = new POIFSFileSystem(bufIStream);
>>            document = new HWPFDocument(fileSystem);
>>
>>            // Get the overall Range object for the document. Note the
>>            // use of the getRange() method and not the getOverallRange()
>>            // method, this is just historic - when the code was originally
>>            // written, I do not believe the latter method was part of the
>>            // API.
>>            docRange = document.getRange();
>>
>>            // Get the number of Paragraph(s) in the overall range and
>>            // iterate through them
>>            numParagraphs = docRange.numParagraphs();
>>            for(int i = 0; i < numParagraphs; i++) {
>>
>>                // Get a Paragraph and recover the text from it. This step is
>>                // far from necessary and I think I only got the text so that
>>                // I could print it to screen as a diagnostic check to ensure
>>                // that the Paragraph contained the text I was searching for.
>>                // Experiment with this.
>>                paragraph = docRange.getParagraph(i);
>>                text = paragraph.text();
>>
>>                // Get the number of CharacterRuns in the Paragraph
>>                numCharRuns = paragraph.numCharacterRuns();
>>                for(int j = 0; j < numCharRuns; j++) {
>>
>>                    // Get a character run and recover its text - note that
>>                    // the same text variable is used as for the Paragraph
>>                    // above. So, it MUST be safe to remove the
>>                    // text = paragraph.text() line above.
>>                    charRun = paragraph.getCharacterRun(j);
>>                    text = charRun.text();
>>
>>                    // Check to see if the text of the CharacterRun contains
>>                    // the search term. If it does, find out where that term
>>                    // starts and call the replaceText() method passing the
>>                    // index. Maybe this is the key difference between what
>>                    // we are doing.
>>                    if(text.contains(searchTerm)) {
>>                        int start = text.indexOf(searchTerm);
>>                        charRun.replaceText(searchTerm, replacementText,
>>                                start);
>>                    }
>>                }
>>            }
>>
>>            // Close the InputStream
>>            bufIStream.close();
>>            bufIStream = null;
>>
>>            // Open an OutputStream and write the document away.
>>            outputFile = new File(outputFilename);
>>            fileOStream = new FileOutputStream(outputFile);
>>            bufOStream = new BufferedOutputStream(fileOStream);
>>
>>            document.write(bufOStream);
>>
>>        }
>>        catch(Exception ex) {
>>            System.out.println("Caught an: " + ex.getClass().getName());
>>            System.out.println("Message: " + ex.getMessage());
>>            System.out.println("Stacktrace follows.............");
>>            ex.printStackTrace(System.out);
>>        }
>>        finally {
>>            if(bufOStream != null) {
>>                try {
>>                    //bufOStream.flush();
>>                    bufOStream.close();
>>                    bufOStream = null;
>>                }
>>                catch(Exception ex) {
>>
>>                }
>>            }
>>            if(bufIStream != null) {
>>                try {
>>                    bufIStream.close();
>>                    bufIStream = null;
>>                }
>>                catch(Exception ex) {
>>                    // I G N O R E //
>>                }
>>            }
>>        }
>>
>>    }
>>
>>    /**
>>     * Search for and replace occurrences of one or more strings of text
>>     * within a Word document.
>>     *
>>     * Note that no checks are made on the parameters' values; that is to say
>>     * that the file named in the inputFilename parameter will not be checked
>>     * to ensure the file exists and the replacements parameter will not be
>>     * checked to ensure it is not null.
>>     * Also, note that I have never tested passing the same String to the
>>     * inputFilename and outputFilename parameters but cannot see why that
>>     * should not be possible.
>>     *
>>     * @param inputFilename An instance of the String class that encapsulates
>>     *                      the name of and path to a Word document which is
>>     *                      in the binary (OLE2CDF) format. The contents of
>>     *                      this document will be searched for occurrences of
>>     *                      the search terms.
>>     * @param outputFilename An instance of the String class that encapsulates
>>     *                       the name of and path to a Word document which is
>>     *                       in the binary (OLE2CDF) format. This document
>>     *                       will contain the results of the search and
>>     *                       replace operation.
>>     * @param replacements An instance of the java.util.HashMap class that
>>     *                     contains a series of key, value pairs. Each key
>>     *                     is an instance of the String class that
>>     *                     encapsulates a series of characters, a word or
>>     *                     words that the code will search for and the
>>     *                     accompanying value is also an instance of the
>>     *                     String class that likewise encapsulates a series
>>     *                     of characters, a word or words. The 'contents' of
>>     *                     the value's String will be used to replace the
>>     *                     contents of the key's String if an occurrence of
>>     *                     the latter is found.
>>     */
>>    public void searchAndReplace(String inputFilename,
>>                                 String outputFilename,
>>                                 HashMap<String, String> replacements) {
>>
>>        File inputFile = null;
>>        File outputFile = null;
>>        FileInputStream fileIStream = null;
>>        FileOutputStream fileOStream = null;
>>        BufferedInputStream bufIStream = null;
>>        BufferedOutputStream bufOStream = null;
>>        POIFSFileSystem fileSystem = null;
>>        HWPFDocument document = null;
>>        Range docRange = null;
>>        Paragraph paragraph = null;
>>        CharacterRun charRun = null;
>>        Set<String> keySet = null;
>>        Iterator<String> keySetIterator = null;
>>        int numParagraphs = 0;
>>        int numCharRuns = 0;
>>        String text = null;
>>        String key = null;
>>        String value = null;
>>
>>        try {
>>            // Create an instance of the POIFSFileSystem class and
>>            // attach it to the Word document using an InputStream.
>>            inputFile = new File(inputFilename);
>>            fileIStream = new FileInputStream(inputFile);
>>            bufIStream = new BufferedInputStream(fileIStream);
>>            fileSystem = new POIFSFileSystem(bufIStream);
>>            document = new HWPFDocument(fileSystem);
>>
>>            // Get a reference to the overall Range for the document
>>            // and discover how many Paragraph objects there are
>>            // in the document.
>>            docRange = document.getRange();
>>            numParagraphs = docRange.numParagraphs();
>>
>>            // Recover a Set of the keys in the HashMap
>>            keySet = replacements.keySet();
>>
>>            // Step through each Paragraph
>>            for(int i = 0; i < numParagraphs; i++) {
>>                paragraph = docRange.getParagraph(i);
>>                // This line can almost certainly be removed - see
>>                // the comments in the method above.
>>                text = paragraph.text();
>>
>>                // Get the number of CharacterRuns in the Paragraph
>>                // and step through each one.
>>                numCharRuns = paragraph.numCharacterRuns();
>>                for(int j = 0; j < numCharRuns; j++) {
>>                    charRun = paragraph.getCharacterRun(j);
>>
>>                    // Get the text from the CharacterRun and recover an
>>                    // Iterator to step through the Set of keys.
>>                    text = charRun.text();
>>                    keySetIterator = keySet.iterator();
>>                    while(keySetIterator.hasNext()) {
>>
>>                        // Get the key - which is also the search term - and
>>                        // check to see if it can be found within the
>>                        // CharacterRun's text.
>>                        key = keySetIterator.next();
>>                        if(text.contains(key)) {
>>
>>                            // If the search term was found in the text, get
>>                            // the matching value from the HashMap, find out
>>                            // whereabouts in the CharacterRun's text the
>>                            // search term is and call the replaceText()
>>                            // method to substitute the replacement term for
>>                            // the search term.
>>                            value = replacements.get(key);
>>                            int start = text.indexOf(key);
>>                            charRun.replaceText(key, value, start);
>>
>>                            // Note that this code was added to test whether
>>                            // it was possible to replace multiple occurrences
>>                            // of the search term. I cannot remember if I
>>                            // tested it but believe that it did work; either
>>                            // way, it could be tested now and if it succeeds,
>>                            // then the searchAndReplace() method above could
>>                            // be modified to include this.
>>                            docRange = document.getRange();
>>                            paragraph = docRange.getParagraph(i);
>>                            charRun = paragraph.getCharacterRun(j);
>>                            text = charRun.text();
>>                        }
>>                    }
>>                }
>>            }
>>
>>            // Close the InputStream
>>            bufIStream.close();
>>            bufIStream = null;
>>
>>            // Open an OutputStream and save the modified document away.
>>            outputFile = new File(outputFilename);
>>            fileOStream = new FileOutputStream(outputFile);
>>            bufOStream = new BufferedOutputStream(fileOStream);
>>            document.write(bufOStream);
>>        }
>>        catch(Exception ex) {
>>            System.out.println("Caught an: " + ex.getClass().getName());
>>            System.out.println("Message: " + ex.getMessage());
>>            System.out.println("Stacktrace follows.............");
>>            ex.printStackTrace(System.out);
>>        }
>>        finally {
>>            if(bufIStream != null) {
>>                try {
>>                    bufIStream.close();
>>                    bufIStream = null;
>>                }
>>                catch(Exception ex) {
>>                    // I G N O R E //
>>                }
>>            }
>>            if(bufOStream != null) {
>>                try {
>>                    bufOStream.flush();
>>                    bufOStream.close();
>>                    bufOStream = null;
>>                }
>>                catch(Exception ex) {
>>
>>                }
>>            }
>>        }
>>
>>    }
>>
>>    /**
>>     * The main entry point to the program demonstrating how the code may
>>     * be utilised.
>>     *
>>     * @param args An array of type String containing arguments passed to the
>>     *             program on execution.
>>     */
>>    public static void main(String[] args) {
>>        SearchReplace replacer = new SearchReplace();
>>
>>        // To search for and replace single items. Note, the code has not, at
>>        // least as far as I can remember, been tested by passing the same
>>        // file to both the inputFilename and outputFilename parameters. It
>>        // ought to work but has NOT been tested I believe.
>>        replacer.searchAndReplace("Document.doc",   // Source Document
>>                "Replaced Document.doc",            // Result Document
>>                "search term",                      // Search term
>>                "replacement term");                // Replacement term
>>
>>        // To search for and replace a series of items
>>        HashMap<String, String> searchTerms = new HashMap<String, String>();
>>        searchTerms.put("search term 1", "replacement term 1");
>>        searchTerms.put("search term 2", "replacement term 2");
>>        searchTerms.put("search term 3", "replacement term 3");
>>        searchTerms.put("search term 4", "replacement term 4");
>>
>>        replacer.searchAndReplace("Document.doc",   // Source Document
>>                "Replaced Document.doc",            // Result Document
>>                searchTerms);                       // Search/replacement items
>>    }
>> }
>>
>>
>>
>> karthik-33 wrote:
>> >
>> > Thanks for the reply Mark.
>> > I don't think I need to preserve text formatting, but I would like to try
>> > your code and see how it works.
>> > I think that would help me too.
>> >
>> > I can't go the OpenOffice route since my business requirement is to use
>> > Microsoft Word documents.
>> > I will be using this search and replace function on the same PC as that of
>> > the application.
>> >
>> > If you can send me that code, I will try it and let you know how it works.
>> >
>> > Thanks
>> > Karthik
>> >
>> >
>> > On Fri, Aug 7, 2009 at 11:49 AM, MSB <[email protected]> wrote:
>> >
>> >>
>> >> Can I ask two questions please?
>> >>
>> >> Do you need to preserve the formatting applied to the text? If not, then
>> >> I think that somewhere I have a piece of HWPF code that does a search and
>> >> replace. I am not at all certain about the state of the code and cannot
>> >> remember if I hit the same problem as you - and I may well have - but I
>> >> am willing to look it out if you think it might help.
>> >>
>> >> Secondly, do you have to use HWPF/XWPF? The API is still immature and it
>> >> is really only suitable for relatively simple tasks. Better alternatives
>> >> might be OpenOffice which you can 'control' through its UNO API or Word
>> >> itself that can be manipulated using OLE. You can ONLY use OLE if you are
>> >> working on a Windows based PC and you have Word installed on that PC.
>> >> OpenOffice is more flexible but it still cannot be used - at least as far
>> >> as I am aware - as a document server, so it is best to have that
>> >> application installed on the PC you will be using for the search/replace
>> >> operation.
>> >>
>> >> Yours
>> >>
>> >> Mark B
>> >>
>> >>
>> >> karthik-33 wrote:
>> >> >
>> >> > I have Microsoft Office 2007 and while saving the document, I save it
>> >> > as a Microsoft 2003 document.
>> >> > I am trying to replace the text using the replaceText method in
>> >> > Paragraph.
>> >> > It works fine when the replacement text and search text are of equal
>> >> > length.
>> >> > It corrupts the document when the length of the replacement string is
>> >> > either greater or less.
>> >> > If anyone has gone through the issue and resolved it or has any idea,
>> >> > please let me know, it will be useful for me.
>> >> > I am not sure what is causing the problem that corrupts the document.
>> >> >
>> >> > Code is:
>> >> >
>> >> > String replaceTxt = "Replacement";
>> >> > String searchText = "Orginial";
>> >> > POIFSFileSystem ps = new POIFSFileSystem (new
>> >> > FileInputStream("C:/Document.doc"));
>> >> > HWPFDocument doc = new HWPFDocument (ps);
>> >> > Range range = doc.getRange();
>> >> > for(int x=0;x<range.numSections();x++)
>> >> > {
>> >> >     Section s = range.getSection(x);
>> >> >     for(int y=0;y<s.numParagraphs();y++)
>> >> >     {
>> >> >         Paragraph p = s.getParagraph(y);
>> >> >         String paraText = p.text();
>> >> >         int offset = paraText.indexOf(searchText );
>> >> >         if(offset != -1)
>> >> >         {
>> >> >              p.replaceText(searchText,replaceTxt,offset);
>> >> >
>> >> >         }
>> >> >     }
>> >> >
>> >> > }
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24867251.html
>> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>> >
>> > --
>> > karthik
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24876942.html
>>  Sent from the POI - User mailing list archive at Nabble.com.
>>
> 
> 
> -- 
> karthik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Replace-Text-Problem-%28Document-Corrupt%29---POI-HWPFDocument-tp24864855p24885699.html
Sent from the POI - User mailing list archive at Nabble.com.

