Good to hear. Let us know if you have any other questions or when you run into surprises.
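For anyone landing on this thread later: the approach discussed in the quoted messages below (get XML output, then scrape each <div class="package-entry"> and read the file name from its <h1>) could be sketched roughly as follows. This is only an illustration using the JDK's XPath API on a small hand-written sample. The class name PackageEntryScraper, the method mapNamesToText, and the sample XML are made up for this example and are not part of Tika. Elements are matched by local-name() so the sketch does not depend on whether the input declares the XHTML namespace.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PackageEntryScraper {

    /**
     * Maps each package-entry's <h1> text (the file name) to the text of the
     * <p> elements that are direct children of that entry. Entries with the
     * same name overwrite each other; a real implementation would need a
     * better key (hypothetical helper, not a Tika API).
     */
    static Map<String, String> mapNamesToText(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> result = new LinkedHashMap<>();
        try {
            NodeList entries = (NodeList) xpath.evaluate(
                    "//*[local-name()='div'][@class='package-entry']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            for (int i = 0; i < entries.getLength(); i++) {
                Node entry = entries.item(i);
                // String value of the first <h1> child: the embedded file name.
                String name = xpath.evaluate("*[local-name()='h1']", entry).trim();
                // Only direct <p> children, so nested entries keep their own text.
                NodeList paragraphs = (NodeList) xpath.evaluate(
                        "*[local-name()='p']", entry, XPathConstants.NODESET);
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < paragraphs.getLength(); j++) {
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(paragraphs.item(j).getTextContent().trim());
                }
                result.put(name, text.toString());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written, well-formed sample modeled on the fragment quoted below.
        String xml = "<body>"
                + "<div class=\"package-entry\"><h1>embed4.zip</h1>"
                + "<div class=\"package-entry\"><h1>embed4.txt</h1><p>embed_4</p></div>"
                + "</div></body>";
        mapNamesToText(xml).forEach((name, text) ->
                System.out.println(name + " -> [" + text + "]"));
    }
}
```

A container such as the zip file has no <p> children of its own here, so it maps to an empty string while the text file inside it maps to its content.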
From: yeshwanth kumar [mailto:[email protected]]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i forgot to change the BodyContentHandler to ToXMLContentHandler in RecursiveMetadata; i had changed it only in my calling method. now i am getting the entire document in the structure you specified.

thanks a ton.
-yeshwanth

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. <[email protected]> wrote:

Hmmm…. When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see this:

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>

That’s a text file inside of a zip file that is itself embedded. I could see doing some parsing on the XML to scrape out <div class="package-entry"> contents and grab the file name from the <h1> element.

If I committed TIKA-1329, would that be of any use to you? That returns a list of metadata objects. There is one metadata object per embedded file. The text content of each file can be retrieved from each metadata object by this key: “tika:content.”

Best,
Tim

From: yeshwanth kumar [mailto:[email protected]]
Sent: Tuesday, July 01, 2014 9:00 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

output is the same even with ToXMLHandler

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <[email protected]> wrote:

Did you try the ToXMLHandler?

From: yeshwanth kumar [mailto:[email protected]]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways. instead of reading the entire zip file i parsed individual zip entries, but even then i faced exceptions such as:

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
org.apache.tika.exception.TikaException: Unable to unpack document stream
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a
org.apache.tika.exception.TikaException: Error creating OOXML extractor

any suggestions regarding these issues?

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <[email protected]> wrote:

hi tim,

thanks for sharing the resources, but i am unable to figure out how to implement it in my code. what i didn't understand is the flow and the recursive steps. when i ran the RecursiveMetadataParser it still gave the same kind of output, filenames combined with the content of the files. i am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <[email protected]> wrote:

Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B.
[mailto:[email protected]]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: [email protected]
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata
Or
https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true

From: yeshwanth kumar [mailto:[email protected]]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for the quick reply. i changed the content handler to BodyContentHandler and got an exception for the maximum word limit, so i used -1 in the BodyContentHandler constructor. now there is another problem: filenames and content are both present in the string returned from handler.toString(). how can i map a fileName to its content?

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <[email protected]> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.

If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler.
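The point about DefaultHandler can be seen without Tika at all. Below is a minimal sketch using only the JDK's SAX parser: DefaultHandler ignores every event, and its toString() is the one inherited from Object, which is why logging it prints something like org.xml.sax.helpers.DefaultHandler@5bd8e367. TextCollector and parseWith are hypothetical names invented for this illustration; Tika's BodyContentHandler plays a similar role for the body text of a parsed document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerDemo {

    /** A handler that actually stores the character events it receives. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Sends the same SAX events to any handler and returns handler.toString(). */
    static String parseWith(DefaultHandler handler, String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
            return handler.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>embed_4</p></doc>";
        // Plain DefaultHandler: events are discarded, toString() is Object's default.
        System.out.println(parseWith(new DefaultHandler(), xml));
        // Collecting handler: the text seen during parsing is returned.
        System.out.println(parseWith(new TextCollector(), xml));
    }
}
```

The first line printed is the class name followed by a hash code; the second is the collected text.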
QUOTE:

http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2

i am using Apache Tika 1.5 for parsing the contents present in a zip file, here's my sample code

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help

-----Original Message-----
From: yeshwanth kumar [mailto:[email protected]]
Sent: Monday, June 30, 2014 1:28 PM
To: [email protected]
Subject: Stack Overflow Question

Unable tp read zipfile using Apache Tika
http://stackoverflow.com/q/24495504/1899893?sem=2
