Hi,

Thank you for the prompt reply. Unfortunately, the getParagraph() function does not fetch paragraphs but single lines, each ending with a newline <http://en.wikipedia.org/wiki/Newline> symbol. I consider a paragraph to be any text preceded by a newline <http://en.wikipedia.org/wiki/Newline> or page break <http://en.wikipedia.org/wiki/Page_break> symbol. Anyhow, although I'm not sure I can read complicated professional code, I will try to analyse it, do my best to understand it, and find a way to resolve the issue above.
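
To make the definition above concrete, here is a minimal sketch of the hand-made splitting I have in mind. It is plain Java and does not touch POI at all; it assumes the whole document text is already available as one String (for example from WordExtractor's getText()), and that paragraph boundaries show up as carriage return, line feed, or form feed (page break) characters. The class and method names (ParagraphSplitter, splitParagraphs, isParagraphBreak) are made up for illustration and are not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitter {

    // Assumption: these three control characters mark a paragraph boundary.
    // '\f' (form feed, 0x0C) stands in for a page break.
    static boolean isParagraphBreak(char c) {
        return c == '\r' || c == '\n' || c == '\f';
    }

    // Splits the text at every break character and drops empty pieces,
    // so "\r\n" or several breaks in a row do not yield empty paragraphs.
    public static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isParagraphBreak(c)) {
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Keep trailing text that is not followed by a break character.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String sample = "First paragraph\rSecond paragraph\fThird paragraph\r\n";
        for (String p : splitParagraphs(sample)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

Other control characters, such as a manual line break inside a paragraph, are deliberately left in the text here; whether they should also split is a separate decision.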

On Tue, Jan 10, 2012 at 6:56 PM, Nick Burch <[email protected]> wrote:

> On Tue, 10 Jan 2012, Andrei Khveras wrote:
>
>> I'm trying to use the class org.apache.poi.hwpf.extractor.WordExtractor,
>> what I downloaded as a part of Apache POI
>> <http://poi.apache.org/download.html>.
>>
>> *Could somebody, please*, kindly help me to resolve this little issue. My
>> goal is to get MS Word file contents as one single String, containing all
>> control characters. I need it for further (hand-made!) splitting text into
>> paragraphs, words, etc.
>
> Why not fetch the paragraphs directly then? That'd give you full control
> over which bit of text is in which paragraph, and will let you decide if
> you want to display or hide control characters etc
>
> I'd suggest you look at the code for WordExtractor to get an idea of how
> to go about doing it, then do your own version that implements your
> required logic
>
> Nick

-- 
Best regards,
Andrei
229-507-907 <http://wwp.icq.com/scripts/contact.dll?msgto=229507907>
Skype: tenety
BOOKRIVER.RU <http://bookriver.ru>
