Jackrabbit text extractors return Readers from their extractText methods.

In the case of PowerPoint files, I am finding that, on Linux only, I get the 
following exception stack trace when I attempt to read anything from the Reader 
returned by the MsPowerPointTextExtractor.extractText method:

sun.io.MalformedInputException
        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
        at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
        at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:345)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
        at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:199)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:185)
        at java.io.InputStreamReader.read(InputStreamReader.java:196)

Of course I have no control over what encoding any given PowerPoint document 
happens to be in (nor can I determine the encoding without using some sort of 
parser to read the file).  I also know of no way to tell the InputStreamReader 
what encoding to decode with.  It simply appears that whatever the default 
encoding of the operating system is (in this case, UTF-8) will be used.
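
For reference, java.io.InputStreamReader itself does have constructors that 
take an explicit charset, so the platform default applies only when none is 
passed.  That helps only where the reader is created by code one controls and 
the encoding of the bytes is known, which may not be the case inside the 
extractor.  The following is a minimal sketch, independent of Jackrabbit, of 
the same bytes decoded with two different charsets.  The class name 
CharsetDemo, the method decode, and the sample bytes are hypothetical and 
chosen only for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jackrabbit.
class CharsetDemo {

    // Decode all bytes with an explicitly named charset
    // instead of relying on the platform default.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is 'e' with an acute accent in ISO-8859-1, but a lone 0xE9
        // is not a valid UTF-8 sequence.
        byte[] bytes = { 'c', 'a', 'f', (byte) 0xE9 };

        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decode(bytes, StandardCharsets.UTF_8));
    }
}
```

Decoding with ISO-8859-1 recovers the accented character, while decoding the 
same bytes as UTF-8 does not.  How the mismatch surfaces (an exception such as 
the one above, or a substituted replacement character) appears to depend on 
the JVM version.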

As of now, I have no way to reliably use the Jackrabbit 
MsPowerPointTextExtractor on Linux at all -- it works fine for me on Windows.  
Any suggestions?



      
