To clarify, I would like to manipulate text at the word level in an arbitrary
docx file while preserving its formatting/styling. The resulting docx file
should still be editable and unchanged apart from the word manipulations.
In terms of word manipulations, I have constructed several algorithms, for
example an algorithm that restores capitalization that may be missing from
words in the docx file. These algorithms depend on looking at the neighboring
words of each focus word, usually a window of 1-2 words to the left and right.
Because these algorithms can apply to any word, searching for a particular word
and then finding its context would not work in this scenario. What is required
is that every single word in the document is inspected and its neighboring
context (left and right words) determined. To determine the left word(s) for
the first word in a paragraph, it is OK to use the last word(s) of the previous
paragraph. Therefore the entire text document can be treated by the algorithm
as one continuous text.
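To illustrate the kind of traversal I mean, here is a rough sketch only. The
function name `iter_contexts` and its `window` parameter are made up for this
example, and plain strings stand in for the paragraphs that would actually be
read from the docx file. It treats the document as one continuous text while
still remembering which paragraph each word came from:

```python
from typing import Iterator, List, Tuple

Context = Tuple[int, int, List[str], str, List[str]]

def iter_contexts(paragraphs: List[str], window: int = 2) -> Iterator[Context]:
    """Yield (paragraph_index, word_index, left, word, right) for every word.

    Hypothetical helper: the context window crosses paragraph boundaries,
    so the document behaves like one continuous text.
    """
    # Flatten all paragraphs into one token list, keeping each token's origin.
    flat = [
        (p_idx, w_idx, token)
        for p_idx, text in enumerate(paragraphs)
        for w_idx, token in enumerate(text.split())  # whitespace tokenization
    ]
    words = [token for _, _, token in flat]

    for i, (p_idx, w_idx, token) in enumerate(flat):
        left = words[max(0, i - window):i]      # up to `window` words before
        right = words[i + 1:i + 1 + window]     # up to `window` words after
        yield p_idx, w_idx, left, token, right


if __name__ == "__main__":
    for ctx in iter_contexts(["the quick brown", "fox jumps"], window=2):
        print(ctx)
```

The paragraph and word indices are what I would then need in order to write a
changed word back to the right place in the docx file without disturbing its
formatting, which is the part I am unsure about.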
I have a method to split text into tokenized units such as words and
punctuation, but for simplicity we can just assume that the input is tokenized
by whitespace.
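For the whitespace assumption, this is roughly what I have in mind. Again it
is only a sketch, and `split_keep_whitespace` and `map_words` are hypothetical
names. Splitting with a capturing group keeps the whitespace pieces, so the
text can be reassembled exactly after the words have been changed:

```python
import re
from typing import Callable, List

def split_keep_whitespace(text: str) -> List[str]:
    """Split into alternating word and whitespace pieces, dropping empties."""
    # A capturing group makes re.split return the separators as well.
    return [piece for piece in re.split(r"(\s+)", text) if piece != ""]

def map_words(text: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` to each word, leaving all whitespace untouched."""
    pieces = split_keep_whitespace(text)
    return "".join(p if p.isspace() else transform(p) for p in pieces)


if __name__ == "__main__":
    print(map_words("hello  world\tfoo", str.capitalize))
```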
Thanks