Re: How to read the value of bookmarks? (docx)

Mark Beardsley Tue, 19 Jun 2012 06:42:29 -0700

It ought to work for bookmarks contained within tables. If you take a look at
the getBookmarkText() method, it contains code that recovers all of the
paragraphs from tables and processes them for bookmarks.
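
For anyone who wants to see the idea in isolation, the walk from a bookmarkStart tag to its matching bookmarkEnd tag can be sketched without POI at all. This is only an illustration under stated assumptions, not the class below: it uses the JDK's own DOM parser on a hand-written paragraph, and the names BookmarkWalkSketch, bookmarkText() and SAMPLE are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; it does not use or depend on Apache POI.
class BookmarkWalkSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph with one bookmark nested inside another.
    static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">Before </w:t></w:r>"
            + "<w:bookmarkStart w:id=\"0\" w:name=\"OUTER\"/>"
            + "<w:r><w:t xml:space=\"preserve\">outer </w:t></w:r>"
            + "<w:bookmarkStart w:id=\"1\" w:name=\"INNER\"/>"
            + "<w:r><w:t>inner</w:t></w:r>"
            + "<w:bookmarkEnd w:id=\"1\"/>"
            + "<w:r><w:t>.</w:t></w:r>"
            + "<w:bookmarkEnd w:id=\"0\"/>"
            + "</w:p>";

    // Returns the text of the named bookmark, or null if it is not found.
    static String bookmarkText(String paragraphXml, String bookmarkName) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(paragraphXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the markup", e);
        }
        NodeList starts = doc.getElementsByTagNameNS(W_NS, "bookmarkStart");
        for (int i = 0; i < starts.getLength(); i++) {
            Element start = (Element) starts.item(i);
            if (bookmarkName.equals(start.getAttributeNS(W_NS, "name"))) {
                return textUntilMatchingEnd(start);
            }
        }
        return null;
    }

    // Steps from sibling to sibling, collecting the w:t text of every run,
    // until the bookmarkEnd tag with the same id is reached or the
    // paragraph runs out of nodes.
    private static String textUntilMatchingEnd(Element start) {
        String id = start.getAttributeNS(W_NS, "id");
        StringBuilder builder = new StringBuilder();
        for (Node n = start.getNextSibling(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) n;
            if ("bookmarkEnd".equals(e.getLocalName())
                    && id.equals(e.getAttributeNS(W_NS, "id"))) {
                break;
            }
            if ("r".equals(e.getLocalName())) {
                NodeList texts = e.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    builder.append(texts.item(j).getTextContent());
                }
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(bookmarkText(SAMPLE, "OUTER")); // prints: outer inner.
        System.out.println(bookmarkText(SAMPLE, "INNER")); // prints: inner
    }
}
```

Because the loop only stops at the bookmarkEnd tag whose id matches, the nested INNER bookmark does not cut the OUTER text short, and the null check on the sibling keeps the walk safe if the end tag is missing from the paragraph.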


If the code does not work, can you let me have a file so that I can use it
for testing purposes please? You can send it to me at [email protected]

Yours

Mark B


import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTMarkupRange;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr;
import java.util.List;
import java.util.Iterator;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlException;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Second attempt at inserting text at a bookmark defined within a Word
 * document. Note that there is one SERIOUS limitation with the code as it
 * stands, and only one as far as I am aware: nested bookmarks.
 *
 * It is possible to create a document and to nest one bookmark within
 * another. Typically, a bookmark is inserted into a piece of text, that text
 * is then selected and another bookmark is added to the selection. The XML
 * markup might look something like this:
 *
 * <pre>
 * <w:p w:rsidR="00945150" w:rsidRDefault="00945150">
 *   <w:r>
 *     <w:t xml:space="preserve">
 *     Imagine I want to insert one bookmark at the start of this
 *     </w:t>
 *   </w:r>
 *     <w:bookmarkStart w:id="0" w:name="OUTER"/>
 *       <w:r>
 *         <w:t xml:space="preserve">piece of text and another just
 *         </w:t>
 *     </w:r>
 *   <w:proofErr w:type="gramStart"/>
 *   <w:r>
 *     <w:t xml:space="preserve">here
 *     </w:t>
 *   </w:r>
 *   <w:bookmarkStart w:id="1" w:name="INNER"/>
 *   <w:bookmarkEnd w:id="1"/>
 *     <w:r>
 *       <w:t>.
 *       </w:t>
 *     </w:r>
 *   <w:bookmarkEnd w:id="0"/>
 *   <w:proofErr w:type="gramEnd"/>
 * </w:p>
 * </pre>
 *
 * Using Word macros to conduct tests, a number of things have become
 * apparent. The first is that within the Word Object Model, a Bookmark is
 * considered to be a Range object and this limits the operations that can be
 * performed on it. It is quite possible to insert some text in front of the
 * bookmark's start tag or behind the bookmark's end tag. The text is never
 * inserted into the markup between the bookmarkStart and bookmarkEnd tags
 * and, therefore, will not replace any text that already appears between
 * them. With regard to styling, it seems as though the following guidelines
 * hold true:
 *
 * 1. If text is being inserted before the bookmark, then it will 'inherit'
 * any style information from the character run that immediately precedes
 * it, if any.
 * 2. If the text is being inserted after the bookmarkEnd tag, then it will
 * inherit its styling from the nearest run element that is contained
 * between the bookmarkEnd tag and its matching bookmarkStart tag, if any.
 *
 * Currently, I am unsure on a couple of points:
 *
 * 1. Whether it is possible for a document to contain two or more bookmarks
 * with the same name. Initial testing suggested that this is not possible
 * but the code has been written as if it is. That is to say that once a
 * bookmark is found, the search for a subsequent bookmark with the same
 * name will continue. This behaviour is easy to amend, however.
 * 2. Should the code offer a third option, to replace the text, if any,
 * contained between the bookmarkStart and bookmarkEnd tags? If so, what
 * should happen to any bookmarks that are contained between the start and
 * end tags?
 *
 * @author Mark Beardsley
 * @version 1.00 16th June 2012
 */
public class DOCXTest {

    public static final int INSERT_BEFORE = 0;
    public static final int INSERT_AFTER = 1;
    private XWPFDocument document = null;

    public DOCXTest() {
    }

    /**
     * Opens a Word OOXML file.
     *
     * @param filename An instance of the String class that encapsulates the
     * path to and name of a Word OOXML (.docx) file.
     * @throws IOException Thrown if a problem occurs within the underlying
     * file system.
     */
    public final void openFile(String filename) throws IOException {
        File file = null;
        FileInputStream fis = null;
        try {
            // Simply open the file and store a reference in the 'document'
            // instance variable.
            file = new File(filename);
            fis = new FileInputStream(file);
            this.document = new XWPFDocument(fis);
        } finally {
            try {
                if (fis != null) {
                    fis.close();
                    fis = null;
                }
            } catch (IOException ioEx) {
                // Swallow this exception. It would have occurred only
                // when releasing the file handle and should not pose
                // problems to later processing.
            }
        }
    }

    /**
     * Saves a Word OOXML file away under the name, and to the location,
     * specified.
     *
     * @param filename An instance of the String class that encapsulates the
     * name of the file and the location into which it should be stored.
     * @throws IOException Thrown if a problem occurs in the underlying file
     * system.
     */
    public final void saveAs(String filename) throws IOException {
        File file = null;
        FileOutputStream fos = null;
        try {
            file = new File(filename);
            fos = new FileOutputStream(file);
            this.document.write(fos);
        } finally {
            if (fos != null) {
                fos.close();
                fos = null;
            }
        }
    }

    /**
     * Inserts a value at a location within the Word document specified by a
     * named bookmark.
     *
     * @param bookmarkName An instance of the String class that encapsulates
     * the name of the bookmark. Note that case is important and the case of
     * the bookmark's name within the document and that of the value passed
     * to this parameter must match.
     * @param bookmarkValue An instance of the String class that encapsulates
     * the value that should be inserted into the document at the location
     * specified by the bookmark.
     * @param where A primitive int whose value indicates whether the text
     * should be inserted before or after the bookmark. Note that constants
     * have been defined - DOCXTest.INSERT_BEFORE and DOCXTest.INSERT_AFTER -
     * for this purpose.
     */
    public final void insertAtBookmark(String bookmarkName,
            String bookmarkValue, int where) throws XmlException {
        List<XWPFTable> tableList = null;
        Iterator<XWPFTable> tableIter = null;
        List<XWPFTableRow> rowList = null;
        Iterator<XWPFTableRow> rowIter = null;
        List<XWPFTableCell> cellList = null;
        Iterator<XWPFTableCell> cellIter = null;
        XWPFTable table = null;
        XWPFTableRow row = null;
        XWPFTableCell cell = null;

        // Firstly, deal with any paragraphs in the body of the document.
        this.procParaList(this.document.getParagraphs(), bookmarkName,
                bookmarkValue, where);

        // Then check to see if there are any bookmarks in table cells. To do
        // this it is necessary to get at the list of paragraphs 'stored'
        // within the individual table cells, hence this code which gets the
        // tables from the document, the rows from each table, the cells from
        // each row and the paragraphs from each cell.
        tableList = this.document.getTables();
        tableIter = tableList.iterator();
        while (tableIter.hasNext()) {
            table = tableIter.next();
            rowList = table.getRows();
            rowIter = rowList.iterator();
            while (rowIter.hasNext()) {
                row = rowIter.next();
                cellList = row.getTableCells();
                cellIter = cellList.iterator();
                while (cellIter.hasNext()) {
                    cell = cellIter.next();
                    this.procParaList(cell.getParagraphs(),
                            bookmarkName,
                            bookmarkValue,
                            where);
                }
            }
        }
    }

    /**
     * Inserts text into the document at the position indicated by a specific
     * bookmark. Note that the current implementation does not take account
     * of nested bookmarks, that is bookmarks that contain other bookmarks.
     * Note also that any text contained within the bookmark itself is left
     * in place; the new text is only ever inserted before or after it.
     *
     * @param paraList An instance of a class that implements the List
     * interface and which encapsulates references to one or more instances
     * of the XWPFParagraph class.
     * @param bookmarkName An instance of the String class that encapsulates
     * the name of the bookmark that identifies the position within the
     * document where some text should be inserted.
     * @param bookmarkValue An instance of the String class that encapsulates
     * the text that should be inserted at the location specified by the
     * bookmark.
     * @param where A primitive int whose value indicates where the text
     * should be inserted relative to the bookmark, i.e. before or after the
     * bookmark.
     */
    private final void procParaList(List<XWPFParagraph> paraList,
            String bookmarkName, String bookmarkValue, int where)
            throws XmlException {
        Iterator<XWPFParagraph> paraIter = null;
        XWPFParagraph para = null;
        List<CTBookmark> bookmarkList = null;
        Iterator<CTBookmark> bookmarkIter = null;
        CTBookmark bookmark = null;
        XWPFRun run = null;

        // Get an Iterator for the XWPFParagraph objects and step through
        // them one at a time.
        paraIter = paraList.iterator();
        while (paraIter.hasNext()) {
            para = paraIter.next();

            // Get a List of the CTBookmark objects that the paragraph
            // 'contains' and step through these one at a time.
            bookmarkList = para.getCTP().getBookmarkStartList();
            bookmarkIter = bookmarkList.iterator();
            while (bookmarkIter.hasNext()) {
                bookmark = bookmarkIter.next();

                // If the name of the CTBookmark object matches the value
                // encapsulated within the argument passed to the
                // bookmarkName parameter then this is where the text should
                // be inserted.
                if (bookmark.getName().equals(bookmarkName)) {

                    // Create a new character run to hold the value
                    // encapsulated within the argument passed to the
                    // bookmarkValue parameter and then test whether this new
                    // run should be inserted into the document before or
                    // after the bookmark.
                    run = para.createRun();
                    run.setText(bookmarkValue);
                    if (where == DOCXTest.INSERT_BEFORE) {
                        this.insertBeforeBookmark(bookmark, run, para);
                    } else {
                        this.insertAfterBookmark(bookmark, run, para);
                    }
                }
            }
        }
    }

    /**
     * Inserts some text into a Word document in a position that is
     * immediately after a named bookmark.
     *
     * Bookmarks can take two forms: they can either simply mark a location
     * within a document or they can do this but also contain some text. The
     * difference is obvious from looking at some XML markup. The simple
     * placeholder bookmark will look like this:
     *
     * <pre>
     *
     * <w:bookmarkStart w:name="AllAlone" w:id="0"/><w:bookmarkEnd w:id="0"/>
     *
     * </pre>
     *
     * Simply a pair of tags where one tag has the name bookmarkStart, the
     * other the name bookmarkEnd and both share matching id attributes. In
     * this case, the text will simply be inserted into the document at a
     * point immediately after the bookmarkEnd tag. No styling will be
     * applied to the text; it will simply inherit the document's defaults.
     *
     * The more complex case looks like this:
     *
     * <pre>
     *
     * <w:bookmarkStart w:name="InStyledText" w:id="3"/>
     *   <w:r w:rsidRPr="00DA438C">
     *     <w:rPr>
     *       <w:rFonts w:hAnsi="Engravers MT" w:ascii="Engravers MT"
     *                 w:cs="Arimo"/>
     *       <w:color w:val="FF0000"/>
     *     </w:rPr>
     *     <w:t>text</w:t>
     *   </w:r>
     * <w:bookmarkEnd w:id="3"/>
     *
     * </pre>
     *
     * Here, the user has selected the word 'text' and chosen to insert a
     * bookmark into the document at that point. So, the bookmark tags
     * 'contain' a character run that is styled. When inserting any text
     * after this bookmark, it is important to ensure that the styling is
     * preserved and copied over to the newly inserted text.
     *
     * The approach taken to dealing with both cases is much the same. In
     * both cases, the code simply steps along the document nodes until it
     * finds the bookmarkEnd tag whose ID matches that of the bookmarkStart
     * tag. Then, it will look to see if there is one further node following
     * the bookmarkEnd tag. If there is, it will insert the text into the
     * paragraph immediately in front of this node. If, on the other hand,
     * there are no more nodes following the bookmarkEnd tag, then the new
     * run will simply be positioned at the end of the paragraph.
     *
     * Styles are dealt with by 'looking' for a 'w:rPr' element whilst
     * iterating through the nodes. If one is found, its details will be
     * captured and applied to the run before the run is inserted into the
     * paragraph. If there are multiple runs between the bookmarkStart and
     * bookmarkEnd tags and these have different styles applied to them,
     * then the style applied to the last run before the bookmarkEnd tag -
     * if any - will be cloned and applied to the newly inserted text.
     *
     * @param bookmark An instance of the CTBookmark class that encapsulates
     * information about the bookmark.
     * @param run An instance of the XWPFRun class that encapsulates the text
     * that is to be inserted into the document following the bookmark.
     * @param para An instance of the XWPFParagraph class that encapsulates
     * that part of the document, a paragraph, into which the run will be
     * inserted.
     */
    private void insertAfterBookmark(CTBookmark bookmark, XWPFRun run,
            XWPFParagraph para) {
        Node nextNode = null;
        Node insertBeforeNode = null;
        Node styleNode = null;
        int bookmarkStartID = 0;
        int bookmarkEndID = -1;

        // Capture the id of the bookmarkStart tag. The code will step
        // through the document nodes 'contained' within the start and end
        // tags that have matching id numbers.
        bookmarkStartID = bookmark.getId().intValue();

        // Get the node for the bookmark start tag and then enter a loop that
        // will step from one node to the next until the bookmarkEnd tag with
        // a matching id is found.
        nextNode = bookmark.getDomNode();
        while (bookmarkStartID != bookmarkEndID) {

            // Get the next node along and check to see if it is a
            // bookmarkEnd tag. If it is, get its id so that the containing
            // while loop can be terminated once the correct end tag is
            // found. Note that the id will be obtained as a String and must
            // be converted into an integer. This has been coded to fail
            // safely so that if an error is encountered converting the id to
            // an int value, the while loop will still terminate.
            nextNode = nextNode.getNextSibling();
            if (nextNode == null) {
                // No matching bookmarkEnd tag was found in this paragraph
                // (it may close in a later paragraph), so stop stepping
                // rather than dereference a null node.
                break;
            }
            if (nextNode.getNodeName().contains("bookmarkEnd")) {
                try {
                    bookmarkEndID = Integer.parseInt(
                            nextNode.getAttributes().getNamedItem("w:id")
                            .getNodeValue());
                } catch (NumberFormatException nfe) {
                    bookmarkEndID = bookmarkStartID;
                }
            } else if (nextNode.getNodeName().equals("w:r")) {
                // Not a bookmarkEnd node, so check whether it is a run node
                // that MAY contain styling information. If so, get that
                // style information from the run.
                styleNode = this.getStyleNode(nextNode);
            }
        }

        // After the while loop completes, it should have located the correct
        // bookmarkEnd tag, but the DOM offers only an insert-before
        // operation, not an insert-after, so the node that follows the
        // bookmarkEnd tag must be obtained.
        insertBeforeNode = (nextNode == null)
                ? null : nextNode.getNextSibling();

        // Style the newly inserted text. Note that the code copies, or
        // clones, the style it found in another run; failure to do this
        // would remove the style from one node and apply it to another.
        if (styleNode != null) {
            run.getCTR().getDomNode().insertBefore(
                    styleNode.cloneNode(true),
                    run.getCTR().getDomNode().getFirstChild());
        }

        // Finally, check to see if there was a node after the bookmarkEnd
        // tag. If there was, then this code will insert the run in front of
        // that node. If there was no node following the bookmarkEnd tag, or
        // if that node is the new run itself (createRun() has already
        // appended it to the end of the paragraph), the run is left where it
        // is.
        Node runNode = run.getCTR().getDomNode();
        if (insertBeforeNode != null && insertBeforeNode != runNode) {
            para.getCTP().getDomNode().insertBefore(runNode, insertBeforeNode);
        }
    }

    /**
     * Inserts some text into a Word document immediately in front of the
     * location of a named bookmark.
     *
     * This case is slightly more straightforward than inserting after the
     * bookmark because the DOM only allows a new node to be inserted in
     * front of an existing node. When inserting after the bookmark, the end
     * node had to be located whereas, in this case, the node is already
     * known; it is the CTBookmark itself. The only information that must be
     * discovered is whether there is a run immediately in front of the
     * bookmarkStart tag and whether that run is styled. If there is and if
     * it is, then this style must be cloned and applied to the text which
     * will be inserted into the paragraph.
     *
     * @param bookmark An instance of the CTBookmark class that encapsulates
     * information about the bookmark.
     * @param run An instance of the XWPFRun class that encapsulates the text
     * that is to be inserted into the document in front of the bookmark.
     * @param para An instance of the XWPFParagraph class that encapsulates
     * that part of the document, a paragraph, into which the run will be
     * inserted.
     */
    private void insertBeforeBookmark(CTBookmark bookmark, XWPFRun run,
            XWPFParagraph para) {
        Node insertBeforeNode = null;
        Node childNode = null;
        Node styleNode = null;

        // Get the dom node from the bookmarkStart tag and look for another
        // node immediately preceding it.
        insertBeforeNode = bookmark.getDomNode();
        childNode = insertBeforeNode.getPreviousSibling();

        // If a node is found, try to get the styling from it.
        if (childNode != null) {
            styleNode = this.getStyleNode(childNode);

            // If that previous node was styled, then apply this style to the
            // text which will be inserted.
            if (styleNode != null) {
                run.getCTR().getDomNode().insertBefore(
                        styleNode.cloneNode(true),
                        run.getCTR().getDomNode().getFirstChild());
            }
        }

        // Insert the text into the paragraph immediately in front of the
        // bookmarkStart tag.
        para.getCTP().getDomNode().insertBefore(
                run.getCTR().getDomNode(), insertBeforeNode);
    }

    /**
     * Recover styling information - if any - from another document node.
     * Note that it is only possible to accomplish this if the node is a run
     * (w:r) and this could be tested for in the code that calls this method.
     * However, a check is made in the calling code as to whether a style has
     * been found and only if a style is found is it applied. This method
     * always returns null if it does not find a style, making that checking
     * process easier.
     *
     * @param parentNode An instance of the Node class that encapsulates a
     * reference to a document node.
     * @return An instance of the Node class that encapsulates the styling
     * information applied to a character run. Note that if no styling
     * information is found in the run OR if the node passed as an argument
     * to the parentNode parameter is NOT a run, then a null value will be
     * returned.
     */
    private Node getStyleNode(Node parentNode) {
        Node childNode = null;
        Node styleNode = null;
        if (parentNode != null) {

            // If the node represents a run and it has child nodes then
            // it can be processed further. Note, whilst testing the code, it
            // was observed that although it is possible to get a list of a
            // node's children, even when a node did have children, trying to
            // obtain this list would often return a null value. This is the
            // reason why the technique of stepping from one node to the next
            // is used here.
            if (parentNode.getNodeName().equalsIgnoreCase("w:r")
                    && parentNode.hasChildNodes()) {

                // Step through the child nodes, starting with the first,
                // until either a style node (w:rPr) is found or all child
                // nodes have been processed. The loop is left with a break
                // as soon as a style node is found.
                childNode = parentNode.getFirstChild();
                while (childNode != null) {
                    if (childNode.getNodeName().equals("w:rPr")) {
                        styleNode = childNode;
                        break;
                    }
                    childNode = childNode.getNextSibling();
                }
            }
        }
        return (styleNode);
    }

    /**
     * Recover and return any text that may exist within the document between
     * the opening and closing brackets ([ and ]) of the named bookmark.
     *
     * @param bookmarkName An instance of the String class that encapsulates
     * the name of the bookmark.
     * @return An instance of the String class that encapsulates the text
     * discovered between the opening and closing brackets (as seen when
     * viewing the document with Word), if any. Note that a null value will
     * be returned if the bookmark cannot be found. Also note that the code
     * will look for bookmarks in the body of the document and individual
     * table cells.
     * @throws XmlException Thrown if a problem is encountered parsing the
     * XML markup recovered from the document.
     * @throws IOException Thrown if a problem is encountered within the
     * underlying file system.
     */
    public String getBookmarkText(String bookmarkName) throws XmlException,
            IOException {
        List<XWPFTable> tableList = null;
        Iterator<XWPFTable> tableIter = null;
        List<XWPFTableRow> rowList = null;
        Iterator<XWPFTableRow> rowIter = null;
        List<XWPFTableCell> cellList = null;
        Iterator<XWPFTableCell> cellIter = null;
        XWPFTable table = null;
        XWPFTableRow row = null;
        XWPFTableCell cell = null;
        String text = null;

        // Firstly, deal with any paragraphs in the body of the document.
        text = this.procParasForBookmarkText(this.document.getParagraphs(),
                bookmarkName);

        // Then check to see if there are any bookmarks in table cells. To do
        // this it is necessary to get at the list of paragraphs 'stored'
        // within the individual table cells, hence this code which gets the
        // tables from the document, the rows from each table, the cells from
        // each row and the paragraphs from each cell. Each loop stops as
        // soon as the bookmark has been found so that a later cell, which
        // does not contain the bookmark, cannot overwrite the text with null.
        if (text == null) {
            tableList = this.document.getTables();
            tableIter = tableList.iterator();
            while (text == null && tableIter.hasNext()) {
                table = tableIter.next();
                rowList = table.getRows();
                rowIter = rowList.iterator();
                while (text == null && rowIter.hasNext()) {
                    row = rowIter.next();
                    cellList = row.getTableCells();
                    cellIter = cellList.iterator();
                    while (text == null && cellIter.hasNext()) {
                        cell = cellIter.next();
                        text = this.procParasForBookmarkText(
                                cell.getParagraphs(), bookmarkName);
                    }
                }
            }
        }
        return (text);
    }

    /**
     * Processes a List of XWPFParagraph objects searching for the named
     * bookmark. When the bookmark is found, any text that would appear
     * between a bookmark's enclosing brackets ([ and ]) in the document as
     * viewed using Word will actually be contained within one or more
     * character run (w:r) elements that appear in the XML markup between the
     * bookmarkStart and bookmarkEnd tags, a little like this:
     *
     * <pre>
     *
     * <w:bookmarkStart w:id="3" w:name="InStyledText"/>
     *   <w:r w:rsidRPr="00DA438C">
     *     <w:rPr>
     *       <w:rFonts w:ascii="Engravers MT" w:hAnsi="Engravers MT"
     *                 w:cs="Arimo"/>
     *       <w:color w:val="FF0000"/>
     *     </w:rPr>
     *     <w:t>
     *       text
     *     </w:t>
     *   </w:r>
     * <w:bookmarkEnd w:id="3"/>
     *
     * </pre>
     *
     * which shows the markup for a bookmark called InStyledText. It has a
     * single run that has a style applied to it and which contains a single
     * piece of text. This text is held in a child node (w:t) and it is this
     * child node (or these child nodes in case the run contains more than
     * one piece of text) that this code recovers.
     *
     * @param paraList A List containing one or more instances of the
     * XWPFParagraph class. These are to be searched for the named bookmark.
     * @param bookmarkName An instance of the String class that encapsulates
     * the name of the bookmark.
     * @return An instance of the String class encapsulating the text - if
     * any - found between the bookmark's start and end tags. A null value
     * will be returned if the bookmark cannot be found.
     * @throws XmlException Thrown if a problem is encountered parsing the
     * XML markup recovered from the document in order to construct a CTText
     * instance which is required to obtain the bookmark's text.
     * @throws IOException An OutputStream is used to read the contents of
     * the CTText object and an IOException will be thrown if any problems
     * are encountered.
     */
    public String procParasForBookmarkText(List<XWPFParagraph> paraList,
            String bookmarkName) throws XmlException, IOException {
        Iterator<XWPFParagraph> paraIter = null;
        XWPFParagraph para = null;
        XWPFRun run = null;
        List<CTBookmark> bookmarkList = null;
        Iterator<CTBookmark> bookmarkIter = null;
        CTBookmark bookmark = null;
        StringBuilder builder = null;

        // Get an Iterator to step through the contents of the paragraph
        // list.
        paraIter = paraList.iterator();
        while (paraIter.hasNext()) {

            // Get the paragraph, a list of CTBookmark objects and an
            // Iterator to step through the list of CTBookmarks.
            para = paraIter.next();
            bookmarkList = para.getCTP().getBookmarkStartList();
            bookmarkIter = bookmarkList.iterator();
            while (bookmarkIter.hasNext()) {

                // Get a bookmark and check its name. If the name of the
                // bookmark matches the name the user has specified then
                // recover the text it contains. The bookmark's ID is needed
                // for this to cope with the situation where one bookmark is
                // nested within another; we do not want to end processing
                // until we hit the matching bookmarkEnd tag.
                bookmark = bookmarkIter.next();
                if (bookmark.getName().equals(bookmarkName)) {
                    builder = this.getTextFromBookmark(bookmark);
                }
            }
        }
        return (builder == null ? null : builder.toString());
    }

    /**
     * There are two types of bookmarks. One is a simple placeholder whilst
     * the second is still a placeholder but it 'contains' some text. In the
     * second instance, the creator of the document has selected some text
     * and then chosen to insert a bookmark there, and the difference is
     * obvious when looking at the XML markup.
     *
     * The simple case:
     *
     * <pre>
     *
     * <w:bookmarkStart w:name="AllAlone" w:id="0"/><w:bookmarkEnd w:id="0"/>
     *
     * </pre>
     *
     * The more complex case:
     *
     * <pre>
     *
     * <w:bookmarkStart w:name="InStyledText" w:id="3"/>
     *   <w:r w:rsidRPr="00DA438C">
     *     <w:rPr>
     *       <w:rFonts w:hAnsi="Engravers MT" w:ascii="Engravers MT"
     *                 w:cs="Arimo"/>
     *       <w:color w:val="FF0000"/>
     *     </w:rPr>
     *     <w:t>text</w:t>
     *   </w:r>
     * <w:bookmarkEnd w:id="3"/>
     *
     * </pre>
     *
     * This method assumes that the user wishes to recover the content from
     * any character run that appears in the markup between a matching pair
     * of bookmarkStart and bookmarkEnd tags; thus, using the example above
     * again, this method would return the String 'text' to the user. It is
     * possible, however, for a bookmark to contain more than one run and for
     * a bookmark to contain other bookmarks. In both of these cases, this
     * code will return the text contained within any and all runs that
     * appear in the XML markup between matching bookmarkStart and
     * bookmarkEnd tags. The term 'matching bookmarkStart and bookmarkEnd
     * tags' here means tags whose id attributes have matching values.
     *
     * @param bookmark An instance of the CTBookmark class that encapsulates
     * information about a bookmark in a Word document.
     * @return An instance of the StringBuilder class encapsulating the text
     * recovered from any character run elements found between the bookmark's
     * start and end tags. If no text is found then a null value will be
     * returned.
     * @throws XmlException Thrown if a problem is encountered parsing the
     * XML markup recovered from the document in order to construct a CTText
     * instance which is required to obtain the bookmark's text.
     * @throws IOException An OutputStream is used to read the contents of
     * the CTText object and an IOException will be thrown if any problems
     * are encountered.
     */
    private StringBuilder getTextFromBookmark(CTBookmark bookmark)
            throws IOException, XmlException {
        int startBookmarkID = 0;
        int endBookmarkID = -1;
        Node nextNode = null;
        Node childNode = null;
        CTText text = null;
        ByteArrayOutputStream baos = null;
        StringBuilder builder = null;
        String rawXML = null;

        // Get the ID of the bookmark from its start tag, the DOM node from
        // the bookmark (to make looping easier) and initialise the
        // StringBuilder.
        startBookmarkID = bookmark.getId().intValue();
        nextNode = bookmark.getDomNode();
        builder = new StringBuilder();

        // Loop through the nodes held between the bookmark's start and end 
        // tags.
        while (startBookmarkID != endBookmarkID) {

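            // Get the next node and, if it is a bookmarkEnd tag, get its ID
            // as matching ids will terminate the while loop. Stop if there
            // are no more siblings, e.g. when the bookmarkEnd tag sits in a
            // later paragraph, rather than fail with a NullPointerException.
            nextNode = nextNode.getNextSibling();
            if (nextNode == null) {
                break;
            }
            if (nextNode.getNodeName().contains("bookmarkEnd")) {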

                // Get the ID attribute from the node. It is a String that
                // must be converted into an int. An exception could be
                // thrown and so the catch clause will ensure the loop ends
                // neatly even if the value might be incorrect. Must inform
                // the user.
                try {
                    endBookmarkID = Integer.parseInt(
                            nextNode.getAttributes().
                            getNamedItem("w:id").getNodeValue());
                } catch (NumberFormatException nfe) {
                    endBookmarkID = startBookmarkID;
                }
            } else {
                // This is not a bookmarkEnd node and can be processed for
                // any text it may contain. Note the check for both type - it
                // must be a run - and children. Interestingly, it seems as
                // though the node may contain children and yet the call to
                // nextNode.getChildNodes() will still return an empty list,
                // hence the need to step through the child nodes.
                if (nextNode.getNodeName().equals("w:r")
                        && nextNode.hasChildNodes()) {

                    // Get the text from the child nodes.
                    builder.append(this.getTextFromChildNodes(nextNode));
                }
            }
        }
        return (builder);
    }

    /**
     * Iterates through any and all children of the Node whose reference is
     * passed as an argument to the node parameter, and recovers the contents
     * of any text nodes. Testing revealed that a node can be called a text
     * node and yet report its type as being something different, an element
     * node for example. Calling the getNodeValue() method on a text node
     * will return the text the node encapsulates but doing the same on an
     * element node will not. In fact, the call will simply return a null
     * value. As a result, this method will test the node's name to catch
     * all text nodes - those whose name is 'w:t' - and then its type. If
     * the type is reported to be a text node, it is a trivial task to get
     * at its contents. However, if the type is not reported as a text type,
     * then it is necessary to parse the raw XML markup for the node to
     * recover its value.
     *
     * @param node An instance of the Node class that encapsulates a
     * reference to a node recovered from the document being processed. It
     * should be passed a reference to a character run - 'w:r' - node.
     * @return An instance of the String class that encapsulates the text
     * recovered from the node's children, if they are text nodes.
     * @throws IOException Thrown if a problem occurs in the underlying file
     * system and only necessary as a stream may be used to recover the raw
     * XML markup for a child node.
     * @throws XmlException Thrown if a problem is encountered parsing a
     * node's raw XML markup in order to construct a CTText object.
     */
    private String getTextFromChildNodes(Node node) throws IOException,
            XmlException {
        NodeList childNodes = null;
        Node childNode = null;
        CTText text = null;
        StringBuilder builder = new StringBuilder();
        int numChildNodes = 0;

        // Get a list of child nodes from the node passed to the method and
        // find out how many children there are in the list.
        childNodes = node.getChildNodes();
        numChildNodes = childNodes.getLength();

        // Iterate through the children one at a time - it is possible for a
        // run to contain zero, one or more text nodes - and recover the
        // text from any text type child nodes.
        for (int i = 0; i < numChildNodes; i++) {

            // Get a node and check its name. If this is 'w:t' then process
            // it as a text type node.
            childNode = childNodes.item(i);

            if (childNode.getNodeName().equals("w:t")) {

                // If the node reports its type as text, then simply call
                // the getNodeValue() method to get at its text.
                if (childNode.getNodeType() == Node.TEXT_NODE) {
                    builder.append(childNode.getNodeValue());
                } else {
                    // Correct the type by parsing the node's XML markup and
                    // creating a CTText object. Call the getStringValue()
                    // method on that to get the text.
                    text = CTText.Factory.parse(childNode);
                    builder.append(text.getStringValue());
                }
            }
        }
        return (builder.toString());
    }

    public static void main(String[] args) {
        try {
            DOCXTest docxTest = new DOCXTest();
            docxTest.openFile("C:/temp/Doc2.docx");
            System.out.println(docxTest.getBookmarkText("InStyledText"));
            System.out.println(docxTest.getBookmarkText("AnotherOuter"));
            System.out.println(docxTest.getBookmarkText("MultiStyles"));
        } catch (Exception ex) {
            System.out.println("Caught a: " + ex.getClass().getName());
            System.out.println("Message: " + ex.getMessage());
            System.out.println("Stacktrace follows:.....");
            ex.printStackTrace(System.out);
        }
    }
}
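
The sibling walk used by getTextFromBookmark() can also be tried out without
POI. What follows is only a minimal sketch, assuming the paragraph markup is
available as a String. The class name BookmarkTextSketch, the method
textOfBookmark() and the SAMPLE markup are made up for illustration and are
not part of POI or of the DOCXTest class above. It uses the JDK's
namespace-aware DOM parser, so elements are matched on the WordprocessingML
namespace rather than on the 'w:' prefix, and the loop simply stops when it
runs out of siblings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of Apache POI.
class BookmarkTextSketch {

    // The WordprocessingML main namespace that the 'w' prefix is bound to.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Made-up sample paragraph: bookmark 'Inner' is nested in 'InStyledText'.
    static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>before </w:t></w:r>"
            + "<w:bookmarkStart w:id=\"3\" w:name=\"InStyledText\"/>"
            + "<w:r><w:rPr><w:color w:val=\"FF0000\"/></w:rPr>"
            + "<w:t>text</w:t></w:r>"
            + "<w:bookmarkStart w:id=\"4\" w:name=\"Inner\"/>"
            + "<w:r><w:t> more</w:t></w:r>"
            + "<w:bookmarkEnd w:id=\"4\"/>"
            + "<w:bookmarkEnd w:id=\"3\"/>"
            + "<w:r><w:t> after</w:t></w:r>"
            + "</w:p>";

    // Returns the text of the runs between the named bookmark's start tag
    // and its matching end tag, or null if no such bookmark exists.
    static String textOfBookmark(String paragraphXml, String bookmarkName) {
        try {
            DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(paragraphXml)));
            NodeList starts =
                    doc.getElementsByTagNameNS(W_NS, "bookmarkStart");
            for (int i = 0; i < starts.getLength(); i++) {
                Element start = (Element) starts.item(i);
                if (bookmarkName.equals(start.getAttributeNS(W_NS, "name"))) {
                    return collectRunText(start);
                }
            }
            return null;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Cannot parse markup", ex);
        }
    }

    // Walks the following siblings of the start tag. The loop ends at the
    // bookmarkEnd whose id matches, or when there are no siblings left.
    private static String collectRunText(Element start) {
        String id = start.getAttributeNS(W_NS, "id");
        StringBuilder builder = new StringBuilder();
        for (Node n = start.getNextSibling(); n != null;
                n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE
                    || !W_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            Element element = (Element) n;
            if ("bookmarkEnd".equals(element.getLocalName())
                    && id.equals(element.getAttributeNS(W_NS, "id"))) {
                break;
            }
            if ("r".equals(element.getLocalName())) {
                // Append the text of every w:t element found inside the run.
                NodeList texts = element.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    builder.append(texts.item(j).getTextContent());
                }
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOfBookmark(SAMPLE, "InStyledText"));
        System.out.println(textOfBookmark(SAMPLE, "Inner"));
    }
}
```

With the SAMPLE markup, asking for "InStyledText" gives 'text more' and
asking for "Inner" gives ' more', which mirrors the nested bookmark behaviour
described in the Javadoc: an outer bookmark returns the text of every run up
to its own end tag, including runs that belong to an inner bookmark.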


