Jackrabbit text extractors return Readers from their extractText methods.
In the case of PowerPoint files, I am finding that, only on Linux, I get the
following exception stack trace when I attempt to read anything from the Reader
returned from the MsPowerPointTextExtractor.extractText method:
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:345)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:199)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:185)
at java.io.InputStreamReader.read(InputStreamReader.java:196)
Of course I have no control over what encoding any PowerPoint documents happen
to be in (nor can I determine the encoding without using some sort of parser to
read the file). I also know of no way to tell an InputStreamReader which
encoding to decode from. It simply appears to use whatever the default encoding
of the operating system is (in this case, UTF-8).
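For reference, the standard java.io.InputStreamReader does have constructors
that take a charset, so a Reader does not have to fall back on the platform
default. Below is a minimal sketch of that JDK behaviour only. The class name
ExplicitCharsetDemo, the helper readAll, and the sample bytes are made up for
illustration, and whether this helps here depends on how the extractor builds
its Reader internally, which I have not verified.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {

    // Hypothetical helper: reads all characters from the byte array, decoding
    // with the named charset instead of the platform default encoding.
    static String readAll(byte[] data, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(charsetName));
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes (an assumption for the demo): 0xE9 is e-acute in
        // ISO-8859-1, but on its own it is not a valid UTF-8 sequence.
        byte[] data = { 'c', 'a', 'f', (byte) 0xE9 };

        // Decoding with the matching charset recovers the text.
        System.out.println(readAll(data, "ISO-8859-1"));

        // Decoding the same bytes as UTF-8 cannot recover it. Depending on the
        // JDK version, this either throws or yields a replacement character.
        System.out.println(readAll(data, "UTF-8"));
    }
}
```

If the bytes behind the Reader are not valid UTF-8, the second call shows the
same kind of mismatch that a UTF-8 default would produce on Linux.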
As of now, I have no way to reliably use the Jackrabbit
MsPowerPointTextExtractor on Linux at all -- it works fine for me on Windows.
Any suggestions?