So sorry about this! I had ended up commenting out the parsers in my Maven pom, which is why EmptyParser was being picked. Issue fixed!
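
For anyone who hits the same symptom (no exception, empty output, X-Parsed-By showing EmptyParser), here is a minimal sketch of a classpath check that would have caught my mistake sooner. It uses only the JDK. The class name ParserClasspathCheck is made up for this example, and the assumption that .doc files are handled by org.apache.tika.parser.microsoft.OfficeParser in Tika 1.x should be verified against your own version.

```java
// Diagnostic sketch: is a given parser implementation visible to the class loader?
// ParserClasspathCheck is a hypothetical helper, not part of Tika.
final class ParserClasspathCheck {

    // Returns true if the named class can be located (without initializing it).
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this is the class that parses application/msword in Tika 1.x.
        String wordParser = "org.apache.tika.parser.microsoft.OfficeParser";
        System.out.println(wordParser + " loadable: " + isLoadable(wordParser));
    }
}
```

If this reports false, the parsers dependency is probably missing from the build, and AutoDetectParser has nothing to delegate to even though type detection still works.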
On Fri, Feb 5, 2016 at 5:24 PM, Carlos A <[email protected]> wrote:

> Hello all,
>
> This was not an issue before, but now it is.
>
> I tried checking the manual and searching online to see what has changed
> so I could update my code, but with no success. Hence I decided to email
> the users list with a detailed walk-through of my code and the debugger.
>
> Basically, I was doing the following quite successfully until 1.11:
>
> 1) First I read a file into bytes:
>
> String originalFilename = "/MyBio.doc";
>
> InputStream stream = this.getClass().getResourceAsStream(originalFilename);
> byte[] bytes;
> try {
>     bytes = IOUtils.toByteArray(stream);
> } catch (Exception e) {
>     e.printStackTrace();
> }
>
> So far, so good, as bytes is now filled.
>
> The next part used to work fine but no longer does:
>
> ByteArrayInputStream is = new ByteArrayInputStream(bytes);
> Metadata metadata = new Metadata();
> if (originalFilename.length() > 0) {
>     metadata.set(Metadata.RESOURCE_NAME_KEY, originalFilename);
> }
> Parser parser = new AutoDetectParser(); // Should auto-detect!
> StringWriter textBuffer = new StringWriter();
> BodyContentHandler handler = new BodyContentHandler(textBuffer);
> ParseContext context = new ParseContext();
> parser.parse(is, handler, metadata, context);
> // How I originally got the output
> System.out.println(textBuffer.toString());
> // Tried this; it doesn't work either
> System.out.println(handler.toString());
>
> In the debugger all looks fine. The Metadata object is properly created.
>
> I have a BodyContentHandler initialized with an empty textBuffer.
>
> It is passed to the parser along with the ByteArrayInputStream is (which
> is full), the metadata and the ParseContext.
>
> Looking inside the method parser.parse, I can see that the variables are
> correctly populated.
>
> The mediaType is properly identified as application/msword.
>
> The Metadata object has resourceName=/MyBio.doc Content-Type=application/msword
>
> The stream object has the full buffer as passed in the call.
>
> From the AutoDetectParser.parse() method:
>
> The TikaInputStream object has the stream as passed.
>
> The MediaType object is correctly application/msword.
>
> The SecureContentHandler is properly created at the line:
>
> // TIKA-216: Zip bomb prevention
> SecureContentHandler sch =
>     handler != null ? new SecureContentHandler(handler, tis) : null;
>
> From the CompositeParser instance, in the parse() method, I have:
>
> TikaInputStream taggedStream correctly populated with the stream contents.
>
> TaggedContentHandler taggedHandler gets the BodyContentHandler object
> passed in, and it is not null.
>
> However, on the call:
>
> if (parser instanceof ParserDecorator) {
>     metadata.add("X-Parsed-By",
>         ((ParserDecorator) parser).getWrappedParser().getClass().getName());
> } else {
>     metadata.add("X-Parsed-By", parser.getClass().getName());
> }
>
> it goes to the else branch and records the EmptyParser, so the Metadata
> object now reads:
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser resourceName=/MyBio.doc Content-Type=application/msword
>
> No exceptions are thrown.
>
> When the original call above, parser.parse(is, handler, metadata, context),
> returns, handler.toString() is empty, as is textBuffer.toString(). It used
> to work really well before Tika 1.11.
>
> I wonder if I need to do something so that the EmptyParser is not used,
> as it was working before.
>
> Thank you,
>
> C.
