Re: Issues extracting contents of .doc and .txt files after upgrading to Tika 1.11

Carlos A Fri, 05 Feb 2016 16:29:58 -0800

So sorry about this! I had ended up commenting out the parsers in my Maven
pom. Issue fixed!
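
For anyone who hits the same symptom: an X-Parsed-By value of
org.apache.tika.parser.EmptyParser with empty output and no exception is
consistent with no real parsers being on the classpath. Below is a minimal
sketch of a sanity check. ClasspathCheck and isOnClasspath are hypothetical
names, not part of Tika, and the sketch assumes that in Tika 1.x the .doc
parser is org.apache.tika.parser.microsoft.OfficeParser from the tika-parsers
artifact.

```java
// Hypothetical helper, not part of Tika: reports whether a class can be
// located on the classpath, without initializing it.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: in Tika 1.x the .doc parser ships in tika-parsers under this name.
        String officeParser = "org.apache.tika.parser.microsoft.OfficeParser";
        if (isOnClasspath(officeParser)) {
            System.out.println("Found " + officeParser);
        } else {
            System.err.println("tika-parsers seems to be missing; "
                    + "AutoDetectParser would fall back to EmptyParser.");
        }
    }
}
```

Running this from the same module as the extraction code should show quickly
whether the parsers dependency made it onto the classpath.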


On Fri, Feb 5, 2016 at 5:24 PM, Carlos A <[email protected]> wrote:

> Hello all,
>
> This was not an issue before but now it is.
>
> I have tried checking the manual and searching online to see what has
> changed so I can update my code, but with no success, hence I decided to
> email the users list with a detailed walk-through of my code and the
> debugger.
>
> Basically I was doing the following quite successfully until 1.11:
>
> 1) First I read a file into bytes:
>
> String originalFilename = "/MyBio.doc";
>
> InputStream stream = this.getClass().getResourceAsStream(originalFilename);
> byte[] bytes = null; // initialized so the later use of bytes compiles
> try {
>   bytes = IOUtils.toByteArray(stream);
> } catch (Exception e) {
>   e.printStackTrace();
> }
>
> So far, so good as bytes are now filled.
>
> Then comes the part that used to work fine but does not anymore.
>
>
> ByteArrayInputStream is = new ByteArrayInputStream(bytes);
> Metadata metadata = new Metadata();
> if (originalFilename.length() > 0) {
>   metadata.set(Metadata.RESOURCE_NAME_KEY, originalFilename);
> }
> Parser parser = new AutoDetectParser(); // Should auto-detect!
> StringWriter textBuffer = new StringWriter();
> BodyContentHandler handler = new BodyContentHandler(textBuffer);
> ParseContext context = new ParseContext();
> parser.parse(is, handler, metadata, context);
> // How I originally got the output
> System.out.println(textBuffer.toString());
> // Also tried this; it doesn't work either
> System.out.println(handler.toString());
>
> In the debugger all looks fine. The Metadata object is properly created.
>
> I have a BodyContentHandler initialized with an empty textBuffer.
>
> It is passed to the parser along with the ByteArrayInputStream is (which
> is full), the metadata and the ParseContext.
>
> Looking inside the method parser.parse, I can see that the variables are
> correctly populated.
>
> The mediaType is properly identified as application/msword
>
> The Metadata object has resourceName=/MyBio.doc Content-Type=application/msword
>
> The Stream object has the full buffer as passed on the call.
>
> From the AutoDetectParser.parse() method:
>
> The TikaInputStream object has the stream as passed.
>
> The MediaType object is correctly application/msword
>
>
>
> The SecureContentHandler is properly created at the line:
>
> // TIKA-216: Zip bomb prevention
> SecureContentHandler sch =
>     handler != null ? new SecureContentHandler(handler, tis) : null;
>
>
> In the parse() method of the CompositeParser instance I have:
>
> TikaInputStream taggedStream correctly populated with the stream contents.
>
> TaggedContentHandler taggedHandler gets the BodyContentHandler object
> passed and it is not null.
>
> However on the call:
>
> if (parser instanceof ParserDecorator) {
>     metadata.add("X-Parsed-By",
>         ((ParserDecorator) parser).getWrappedParser().getClass().getName());
> } else {
>     metadata.add("X-Parsed-By", parser.getClass().getName());
> }
>
> It goes to the else branch and records the EmptyParser, so the Metadata
> object now reads:
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> resourceName=/MyBio.doc Content-Type=application/msword
>
> No exceptions are thrown.
>
> When the original call above, parser.parse(is, handler, metadata, context),
> returns, both handler.toString() and textBuffer.toString() are empty. It
> used to work really well before Tika 1.11.
>
> I wonder if I need to do something so that the EmptyParser is not used,
> since this was working before.
>
> Thank you,
>
> C.
>
>
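Because the parse in the message above returns without any exception, a guard
on the X-Parsed-By values could surface the problem earlier. This is only a
sketch: ParsedByCheck and onlyEmptyParser are hypothetical names, and the
input array is assumed to come from metadata.getValues("X-Parsed-By") after
parser.parse(...) has returned.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
final class ParsedByCheck {

    // Class name as it appears in the X-Parsed-By value quoted in the message.
    static final String EMPTY_PARSER = "org.apache.tika.parser.EmptyParser";

    // True when no parser other than EmptyParser was recorded.
    // An empty array also counts as "no real parser ran".
    static boolean onlyEmptyParser(String[] parsedBy) {
        return Arrays.stream(parsedBy).allMatch(EMPTY_PARSER::equals);
    }

    public static void main(String[] args) {
        // Assumption: in real code this array would be
        // metadata.getValues("X-Parsed-By") after the parse call.
        String[] parsedBy = { EMPTY_PARSER };
        if (onlyEmptyParser(parsedBy)) {
            System.err.println("No real parser handled the document; "
                    + "check that the parsers dependency is on the classpath.");
        }
    }
}
```

Failing loudly here, instead of printing an empty string, makes a missing
dependency much easier to spot.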
