I would like to manipulate text at the word level in a docx file based upon
neighboring words (1 word to the left, 1 word to the right).
With a txt file this process is very simple. Now I would like to do the same
with docx files and then later doc files.
I spent quite a bit of time searching for an example I could reproduce that
solves this problem, but to no avail. I therefore thought of a possible hack
to achieve this and would like some feedback.
Assuming each word has consistent styling:
1. Change the file extension from .docx to .zip.
2. Unzip the file.
3. Open the word folder and open the document.xml file (I assume this is where
   all the content is?).
4. Ignore the content contained in "<...>" and concatenate each remaining
   fragment, making sure to separate fragments with a space.
5. Split the single concatenated string into words, then map each word to its
   desired form.
6. Go back through the original document.xml and change each word to its
   mapped value.
7. Re-zip the files and change the file extension from .zip to .docx.
8. Finished.
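
To make the idea concrete, here is a minimal sketch of how these steps might look in Python using only the standard library. It rests on several assumptions: the text lives in <w:t> elements inside word/document.xml, that file is UTF-8, and no word is split across two <w:t> fragments (Word can split text into several runs even when the styling looks consistent). The names map_word, rewrite_xml and rewrite_docx are hypothetical, and the rule inside map_word is only a placeholder.

```python
import re
import zipfile
from xml.sax.saxutils import escape, unescape

# Matches <w:t>...</w:t> text elements, but not <w:tab/>, <w:tbl>, etc.
W_T = re.compile(r"(<w:t(?:\s[^>]*)?>)([^<]*)(</w:t>)")
# Simplification: a "word" is a run of letters, digits or underscores,
# so "it's" is treated as two tokens.
WORD = re.compile(r"\w+")


def map_word(prev_word, word, next_word):
    """Placeholder rule (an assumption): upper-case a word that follows 'very'."""
    return word.upper() if prev_word == "very" else word


def rewrite_xml(xml, mapper=map_word):
    # Pass 1: collect every word in document order, ignoring all markup.
    words = [w for m in W_T.finditer(xml)
             for w in WORD.findall(unescape(m.group(2)))]
    # Map each word using its left and right neighbours (None at the edges).
    padded = [None] + words + [None]
    mapped = iter([mapper(padded[i], padded[i + 1], padded[i + 2])
                   for i in range(len(words))])

    # Pass 2: substitute the words in the same order; tags stay untouched.
    def fix_fragment(m):
        text = WORD.sub(lambda _: next(mapped), unescape(m.group(2)))
        return m.group(1) + escape(text) + m.group(3)

    return W_T.sub(fix_fragment, xml)


def rewrite_docx(src_path, dst_path, mapper=map_word):
    # A .docx file is already a zip archive, so no renaming is needed.
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = rewrite_xml(data.decode("utf-8"), mapper).encode("utf-8")
            dst.writestr(item, data)
```

Calling rewrite_docx("in.docx", "out.docx") would copy every part of the archive unchanged except word/document.xml. Working on the raw XML string rather than parsing and re-serialising it keeps the rest of the markup byte-for-byte identical, which seemed the safest way to avoid producing a file Word refuses to open.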
To reiterate, I have yet to try this approach, as it's merely an idea. Hopefully
someone can set me in the right direction. It's also not important in this
case to separate titles from paragraphs, if that makes things easier.
Thanks