Re: Rubbish in extracted text

Rainer Schwarze Thu, 15 May 2008 15:30:48 -0700

tamski wrote:

Hi.
When I try to extract plain text from a .doc file with
org.apache.poi.hwpf.extractor.WordExtractor, I get text with rubbish like


\* ARABIC 3
PAGE

Hi,

these are fields. A quick solution is this: Pass the extracted text string through a filter which removes the field codes. Fields are delimited by 0x13 (start), 0x14 (separator) and 0x15 (end) bytes. With fields which don't have a separator (0x14), remove all from 0x13 to 0x15. If a separator exists between start and end, remove from 0x13 to 0x14 and then remove the 0x15 (keep text between 0x14 and 0x15). However, beware that fields can be nested, so you can well encounter sequences like 0x13 ... 0x13 ... 0x15 ... 0x15 and much more complicated stuff.
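
To illustrate, here is a minimal sketch of such a filter in Java. It is not part of POI: the class name FieldCodeFilter and the method stripFieldCodes are made up for this example. It assumes the extracted text still contains the 0x13, 0x14 and 0x15 marker characters, and it handles nesting by tracking every open field on a stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a POI class: strips Word field codes from extracted text.
class FieldCodeFilter {
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE once that field's separator has been seen.
        Deque<Boolean> open = new ArrayDeque<>();
        // Number of open fields that are still in their code part (before 0x14).
        int inCode = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_START) {
                open.push(Boolean.FALSE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!open.isEmpty() && !open.peek()) {
                    open.pop();
                    open.push(Boolean.TRUE);
                    inCode--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; a field without a separator was all code.
                if (!open.isEmpty() && !open.pop()) {
                    inCode--;
                }
            } else if (inCode == 0) {
                // Ordinary character: keep it only if no enclosing field is in its code part.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

You would call it on the extractor output, e.g. FieldCodeFilter.stripFieldCodes(text). Two choices in this sketch are assumptions rather than anything Word prescribes: stray 0x14 or 0x15 characters outside a field are simply dropped, and if a field is never closed, everything after its 0x13 up to the next separator (or the end of the text) is discarded.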


Best wishes, Rainer

PAGE  7

and other unreadable characters.

Is it possible to suppress this while extracting, or by using some additional
POI tools?

Thanks in advance.



