RE: Questions about manipulating text or words in a docx file.

John C Tue, 08 Mar 2011 17:19:23 -0800

Thanks Dave. I am quite comfortable implementing the approach you described; in 
fact, I had a little play around with it last night. I was just hoping there was 
a more efficient way of doing it.


> Subject: Re: Questions about manipulating text or words in a docx file.
> From: [email protected]
> Date: Tue, 8 Mar 2011 17:03:54 -0800
> To: [email protected]
> 
> > To clarify, I would like to manipulate text at the word level in an 
> > arbitrary docx file and preserve formatting/styling. The resulting docx 
> > file should still be editable, essentially preserved apart from the word 
> > manipulations.
> > In terms of word manipulations, I have constructed several algorithms, for 
> > example one that restores capitalization that may be missing from words in 
> > the docx file. These algorithms depend on looking at neighboring 
> > words for each focus word, usually a window of 1-2 words to the left and 
> > right. These algorithms can be applied to all words, so searching for a 
> > particular word and then finding its context would not work in this 
> > scenario. What is required is that every single word in the document is 
> > inspected and its neighboring context (left and right words) determined. To 
> > determine the left word(s) for the first word in a paragraph, it is OK to 
> > use the last word(s) of the previous paragraph. Therefore the entire text 
> > document can be treated by the algorithm as one continuous text.
> > I have a method to split text into tokenized units such as words and 
> > punctuation, but for simplicity we can just assume that the input is 
> > tokenized by whitespace.
> > Thanks
> 
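
For reference, the windowing described above can be sketched roughly as
follows. This is only an illustration in Python (chosen for brevity, not
because anything in this thread uses it), assuming whitespace tokenization
as stated; the names context_windows and width are made up for the example.

```python
from typing import Iterator, List, Tuple


def context_windows(
    paragraphs: List[str], width: int = 2
) -> Iterator[Tuple[List[str], str, List[str]]]:
    """Yield (left, focus, right) for every whitespace-separated token.

    All paragraphs are flattened into one token stream, so the first word
    of a paragraph gets the last word(s) of the previous paragraph as its
    left context.
    """
    tokens = [tok for para in paragraphs for tok in para.split()]
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - width):i]      # up to `width` tokens before
        right = tokens[i + 1:i + 1 + width]     # up to `width` tokens after
        yield left, focus, right
```

Every token is visited exactly once, and the windows are simply shorter at
the very start and end of the document.
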
>  
> I think you need to take a two-step approach.
> 
> (1) You need an un-styled run of text to do your analysis. There is a project 
> which grew out of Apache Lucene called Apache Tika. Apache Tika is all about 
> getting text out of any document type. Tika depends on POI for Office 
> Documents.
> 
> See http://poi.apache.org/text-extraction.html
> 
> Text extraction doesn't care about formatting. It should give you the text 
> view you need to do your analysis.
> 
> (2) If you then grab the document.xml part of your docx, you can simply 
> find and modify the pieces of content. As long as you are preserving style 
> and just replacing characters you should be able to do it.
> 
> Others here should be able to help with the details; I only have the time and 
> knowledge to suggest an approach.
> 
> Regards,
> Dave
> 
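
A rough sketch of how the two steps might fit together when the edit only
replaces characters one-for-one (as restoring capitalization does). It is
written in Python with the standard library only, not with POI, and the
function name rewrite_text is hypothetical. It assumes the transform returns
a string of exactly the same length, so each character can be written back
into the run it came from and the run formatting is left untouched. A word
split across several runs is still seen as one word by the transform.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W}}}p"   # paragraph
T_TAG = f"{{{W}}}t"   # text inside a run


def rewrite_text(document_xml: bytes, transform: Callable[[str], str]) -> bytes:
    """Apply a length-preserving transform to the text of document.xml."""
    root = ET.fromstring(document_xml)
    pieces = []   # (w:t element, start offset in the flat text)
    chunks = []   # strings that make up the flat text
    pos = 0
    for el in root.iter():              # document order, each element once
        if el.tag == P_TAG and chunks:
            # Virtual space so words in adjacent paragraphs do not fuse.
            chunks.append(" ")
            pos += 1
        elif el.tag == T_TAG:
            text = el.text or ""
            pieces.append((el, pos))
            chunks.append(text)
            pos += len(text)

    flat = "".join(chunks)              # step 1: un-styled text for analysis
    fixed = transform(flat)
    if len(fixed) != len(flat):
        raise ValueError("transform must preserve the text length")

    for el, start in pieces:            # step 2: write characters back
        length = len(el.text or "")
        el.text = fixed[start:start + length]
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The input would be the bytes of the word/document.xml entry in the docx zip.
Two caveats: tabs and line breaks stored as separate elements are ignored
here, and ElementTree rewrites namespace prefixes on output, so a real
document may need a serializer that preserves the original declarations.
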
> 
