tamski wrote:
> Hi.
> When I try to extract plain text from a .doc file with
> org.apache.poi.hwpf.extractor.WordExtractor, I get text with rubbish like
> \* ARABIC 3
> PAGE
Hi,
these are fields. A quick solution is this: pass the extracted text
string through a filter which removes the field codes. Fields are
delimited by the characters 0x13 (start), 0x14 (separator) and 0x15
(end). For a field without a separator (0x14), remove everything from
0x13 to 0x15 inclusive. If a separator exists between start and end,
remove everything from 0x13 to 0x14 inclusive and then remove the 0x15,
keeping the text between 0x14 and 0x15 (that is the field's displayed
result). However, beware that fields can be nested, so you may well
encounter sequences like 0x13 ... 0x13 ... 0x15 ... 0x15 and much more
complicated stuff. A filter therefore has to track which fields are
currently open rather than just search for the next 0x15.
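
A minimal sketch of such a filter, in case it helps. It is not part of
POI; the class name FieldCodeFilter and the method stripFieldCodes are
made up for illustration. It keeps a stack with one entry per open
field, so nested fields are handled, and it assumes that a field nested
inside the code part of another field should be dropped entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a POI class.
public class FieldCodeFilter {
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still in its code
        // part (before the separator), FALSE once we are in its result.
        Deque<Boolean> open = new ArrayDeque<>();
        int inCode = 0; // how many open fields are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_START) {
                open.push(Boolean.TRUE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from code part to result part.
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    inCode--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it had no separator if it
                // was still in its code part.
                if (!open.isEmpty() && open.pop()) {
                    inCode--;
                }
            } else if (inCode == 0) {
                // Plain text, or field result text not hidden by any
                // enclosing field code.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

You would call it on the string you get back from the extractor, e.g.
FieldCodeFilter.stripFieldCodes(extractor.getText()). Stray 0x14 or
0x15 characters outside any field are simply dropped.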
Best wishes, Rainer
> PAGE 7
> and other unreadable characters.
> Is it possible to suppress this during extraction, or by using some
> additional POI tools?
> Thanks in advance.