tamski wrote:
Hi.
When I try to extract plain text from a .doc file with
org.apache.poi.hwpf.extractor.WordExtractor, I get text with rubbish like


\* ARABIC 3
PAGE  

Hi,

these are fields. A quick solution is to pass the extracted text string through a filter that removes the field codes. Fields are delimited by the control characters 0x13 (start), 0x14 (separator) and 0x15 (end). For a field without a separator (0x14), remove everything from 0x13 through 0x15. If a separator exists between start and end, remove everything from 0x13 through 0x14 and then remove the 0x15, keeping the text between 0x14 and 0x15, which is the field's displayed result. Beware, however, that fields can be nested, so you may well encounter sequences like 0x13 ... 0x13 ... 0x15 ... 0x15 and much more complicated ones.
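
A minimal sketch of such a filter in Java, in case it helps. The class name FieldStripper and the method stripFields are made up for this example and are not part of POI. It assumes the extracted string still contains the 0x13/0x14/0x15 markers as ordinary characters, and it handles nesting by tracking, for every open field, whether its separator has been seen yet. Text is kept only while no enclosing field is still in its code part.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI: removes Word field codes from extracted text.
final class FieldStripper {
    private static final char FIELD_BEGIN = '\u0013';     // start of a field
    private static final char FIELD_SEPARATOR = '\u0014'; // between field code and result
    private static final char FIELD_END = '\u0015';       // end of a field

    private FieldStripper() {}

    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true once that field's separator has been seen.
        Deque<Boolean> open = new ArrayDeque<>();
        // Number of open fields that are still in their code part.
        int inCode = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.FALSE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from code part to result part.
                if (!open.isEmpty() && !open.peek()) {
                    open.pop();
                    open.push(Boolean.TRUE);
                    inCode--;
                }
                // A stray separator is simply dropped.
            } else if (c == FIELD_END) {
                if (!open.isEmpty()) {
                    boolean sawSeparator = open.pop();
                    if (!sawSeparator) {
                        inCode--; // field had no result part at all
                    }
                }
                // A stray end marker is simply dropped.
            } else if (inCode == 0) {
                // Plain text, or result text of fields whose codes are all closed.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

You would call it on the string returned by the extractor, for example FieldStripper.stripFields(extractor.getText()). Unbalanced markers are tolerated rather than reported: an unclosed field without a separator swallows the text up to the end of the string, which may or may not be what you want.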

Best wishes, Rainer



PAGE  7

and other unreadable characters.

Is it possible to suppress this during extraction, or by using some additional
POI tools?

Thanks in advance.

