Hello!

I am trying to use Tika within Apache Sling to parse documents uploaded
into file nodes. Ideally, I want to take an uploaded file from Sling's file
node structure, extract the text contained in the document, and return that
text as a string. Currently, I have my program set up in the following way:

1. Construct an InputStream from the binary document data returned by
Sling's node methods, getBinary().getStream().
2. Use Tika.detect() to determine the MIME type of the input stream. Then,
set the content type of a newly constructed Metadata object to that MIME type.
3. Construct a new Reader of type ParsingReader, passing into it a newly
constructed AutoDetectParser, the InputStream, the Metadata object, and a
new ParseContext.
4. In a while loop, read the character stream from the Reader into a
character buffer, which is eventually converted into a string. The loop
stops reading from the Reader when read() returns -1.
5. Close the Reader.
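
For reference, steps 4 and 5 look roughly like the sketch below. This is a
simplified illustration rather than my exact code: the class name
ReaderDrain and the method readAll are made up for this example, and a plain
StringReader stands in for the ParsingReader from step 3 so that the snippet
runs without Tika on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderDrain {
    // Hypothetical helper: reads every character from the reader, then closes it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        // try-with-resources closes the reader (step 5) even if read() throws.
        try (Reader in = reader) {
            int count;
            // read(char[]) returns the number of chars read, or -1 at end of stream.
            while ((count = in.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: StringReader is only a stand-in for the ParsingReader.
        System.out.println(readAll(new StringReader("sample document text")));
    }
}
```

With a reader that actually has content, the loop appends each chunk and
ends on -1. In my case the loop body never runs, because the first read()
already returns -1.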

The Problem:
The very first call to read() on the newly constructed Reader returns -1,
suggesting that it has already reached the end of the stream. The ready()
function also always returns false, if called. I have taken a look at the
setup for ParsingReader on Tika's website, as well as a couple of other
forums showing how to use it, and haven't seen any reason why this approach
shouldn't work. I have also determined, through some debugging, that the
input stream does contain data when it is passed into the ParsingReader
constructor.

Do you have any ideas as to why this might be occurring? I am using Tika
version 1.3. Any help at all would be greatly appreciated, and I can provide
more information on request.

Thank you!
Matthew Taylor

-- 
Matthew Taylor
Software Consultant
Behavioral Media Networks - http://launch.bmedianet.com/
Email: [email protected]
