Re: Tab symbols parsing in WORD document issue: org.apache.poi.hwpf.extractor.WordExtractor

Andrei Khveras Tue, 10 Jan 2012 08:18:14 -0800

Hi. Thank you for the prompt reply. Unfortunately, the getParagraph() function
fetches not paragraphs but single lines, each ending with a
newline <http://en.wikipedia.org/wiki/Newline> symbol. I consider a
paragraph to be any text preceded by a
newline <http://en.wikipedia.org/wiki/Newline> or
page break <http://en.wikipedia.org/wiki/Page_break> symbol and then a
TAB <http://en.wikipedia.org/wiki/Tab_key> symbol. Anyhow, although I'm
not sure I can read complicated professional code, I will try to analyse it,
do my best to understand it, and find a way to resolve the
issue above.
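
A minimal sketch of the splitting rule described above, in plain Java with no
POI dependency. It rests on assumptions that are not confirmed anywhere in
this thread: that the extracted text marks line ends with CR or LF, marks page
breaks with a form feed, and that a new paragraph starts wherever one of those
characters is immediately followed by a TAB. The class name ParagraphSplitter
and its split() method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitter {
    // Zero-width boundary: just after CR, LF or form feed (assumed line-end
    // and page-break characters) and just before a TAB.
    private static final String BOUNDARY = "(?<=[\\r\\n\\f])(?=\\t)";

    // Splits text into paragraphs, keeping all control characters intact.
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return paragraphs;
        }
        for (String piece : text.split(BOUNDARY)) {
            if (!piece.isEmpty()) {
                paragraphs.add(piece);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Hand-made sample; real input would come from the Word document.
        String sample = "\tFirst line\rstill first\r\tSecond\f\tThird";
        List<String> parts = split(sample);
        System.out.println("paragraphs: " + parts.size());
        for (String p : parts) {
            // trim() only for display; the list itself keeps control characters.
            System.out.println("[" + p.trim() + "]");
        }
    }
}
```

A line end that is not followed by a TAB stays inside the current paragraph,
so lines are only merged, never lost. If the document uses other characters
for line ends or page breaks, only the BOUNDARY pattern needs to change.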


On Tue, Jan 10, 2012 at 6:56 PM, Nick Burch <[email protected]> wrote:

> On Tue, 10 Jan 2012, Andrei Khveras wrote:
>
>> I'm trying to use the class org.apache.poi.hwpf.extractor.WordExtractor,
>> which I downloaded as part of Apache POI
>> <http://poi.apache.org/download.html>.
>>
>> *Could somebody, please*, kindly help me to resolve this little issue. My
>> goal is to get MS Word file contents as one single String, containing all
>> control characters. I need it for further (hand-made!) splitting text into
>> paragraphs, words, etc.
>>
>
> Why not fetch the paragraphs directly then? That'd give you full control
> over which bit of text is in which paragraph, and will let you decide if
> you want to display or hide control characters etc
>
> I'd suggest you look at the code for WordExtractor to get an idea of how
> to go about doing it, then do your own version that implements your
> required logic
>
> Nick


-- 
Best regards,
Andrei

ICQ: 229-507-907 <http://wwp.icq.com/scripts/contact.dll?msgto=229507907>
Skype: tenety

BOOKRIVER.RU <http://bookriver.ru>
