Couldn't you just run an XSLT against the document.xml file and convert it to text? Then you would simply run the converted text through your existing code. Or am I missing something?
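To make the idea concrete, here is a rough sketch of the "convert it to text" step. It is in Python rather than XSLT, since the Python standard library has no XSLT processor, so treat it as a stand-in for the stylesheet and not the same thing. The function name docx_to_text is made up for illustration, and the sketch assumes the body text lives in word/document.xml inside w:t elements, ignoring tabs, line breaks, headers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx):
    """Return the document's text, one line per paragraph.

    `docx` may be a filename or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    # A .docx is a zip archive, so no renaming or manual unzipping is needed.
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the w:t pieces as-is,
        # because the spaces between words are already part of the text.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return "\n".join(paragraphs)
```

Something like docx_to_text("report.docx") would then give a plain string that the existing txt code could consume ("report.docx" being just a placeholder name).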
Regards,
Mark F

On Tue, Mar 8, 2011 at 6:04 AM, John C <[email protected]> wrote:
>
> I would like to manipulate text at the word level in a docx file based upon
> neighboring words (1 word to the left, 1 word to the right).
> With a txt file this process is very simple. Now I would like to do the
> same with docx files and then later doc files.
> I spent quite a bit of time searching for an example to solve this problem
> that I could reproduce, but to no avail. Therefore I thought of a
> possible hack to achieve this and would like some feedback.
> Assuming each word has a consistent styling...
> 1. Change the file extension from .docx to .zip
> 2. Unzip the file
> 3. Open the word folder and open the document.xml file (I assume this is
>    where all the content is?)
> 4. Ignore the content contained in "<...>" and concatenate each remaining
>    fragment, making sure to separate fragments with a space.
> 5. Split the single concatenated string into words, then map each word to
>    its desired form.
> 6. Go back through the original document.xml and change each word to its
>    mapped value.
> 7. Change the file extension from .zip to .docx
> 8. Finished
> To reiterate, I have yet to try this approach as it's merely an idea.
> Hopefully someone can set me in the right direction. It's also not
> important in this case to separate titles from paragraphs if that makes
> things easier.
> Thanks
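
For what it's worth, the steps quoted above could be sketched roughly like this in Python. This is untested against real Word files and rests on several assumptions: the text sits in w:t elements written with the usual "w:" prefix, document.xml is UTF-8, and no word is split across runs (the "consistent styling" assumption). The names rewrite_docx and mapper are invented for the example; mapper is whatever function decides a word's new form from its left and right neighbours. Punctuation stays attached to words, since words are simply runs of non-whitespace.

```python
import re
import zipfile
from xml.sax.saxutils import escape, unescape

# Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>,
# but not <w:tab/>, <w:tbl> or <w:tc>. Assumes the "w:" prefix.
W_T = re.compile(r"(<w:t(?:\s[^>]*)?>)([^<]*)(</w:t>)")


def _map_document(xml, mapper):
    # Pass 1: collect every word in document order.
    words = [w for m in W_T.finditer(xml)
             for w in re.findall(r"\S+", unescape(m.group(2)))]
    # Decide each replacement from the ORIGINAL neighbours (None at the edges).
    last = len(words) - 1
    mapped = iter([
        mapper(words[i - 1] if i > 0 else None,
               word,
               words[i + 1] if i < last else None)
        for i, word in enumerate(words)
    ])

    # Pass 2: swap words in place; tags and whitespace are left untouched.
    def fix(m):
        text = re.sub(r"\S+", lambda _: next(mapped), unescape(m.group(2)))
        return m.group(1) + escape(text) + m.group(3)

    return W_T.sub(fix, xml)


def rewrite_docx(src, dst, mapper):
    """Copy docx `src` to `dst`, replacing each word with mapper(left, word, right).

    `src` and `dst` must be different files (or file-like objects).
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == "word/document.xml":
                data = _map_document(data.decode("utf-8"), mapper).encode("utf-8")
            # Every other part of the package is copied through unchanged.
            zout.writestr(info, data)
```

Because the archive is read and written directly, the rename-to-.zip and rename-back steps drop out. Editing the raw XML with a regular expression, rather than parsing and re-serialising it, is a deliberate choice here so that everything outside the text runs is preserved as written.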
