RE: Stack Overflow Question

Allison, Timothy B. Tue, 01 Jul 2014 07:47:01 -0700

Good to hear.  Let us know if you have any other questions or when you run into 
surprises.

From: yeshwanth kumar [mailto:[email protected]]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

I forgot to change the BodyContentHandler to ToXMLContentHandler in 
RecursiveMetadataParser; I had changed it only in my
calling method.

Now I am getting the entire document in the structure you specified.

thanks a ton.

-yeshwanth

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. 
<[email protected]> wrote:
Hmmm….

When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see 
this:

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>

That’s a text file inside of a zip file that is itself embedded.  I could see 
doing some parsing on the XML to scrape out <div class="package-entry"> 
contents and grab the file name from the <h1> element.
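
A minimal sketch of that scraping idea, using only the JDK's SAX parser rather than anything Tika-specific. The class name PackageEntryScraper and the method scrape are hypothetical, and the sample input is the fragment above trimmed to its two balanced package-entry divs and wrapped in a <body> element so that it is well-formed. Text is attributed to the innermost open package-entry only.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: maps each package-entry's <h1> file name to its text.
class PackageEntryScraper extends DefaultHandler {
    private static final class Entry {
        String name;
        final StringBuilder text = new StringBuilder();
    }

    private final Map<String, String> contentByName = new LinkedHashMap<>();
    private final Deque<Entry> openEntries = new ArrayDeque<>();
    // One flag per open <div>: is it a package-entry div?
    private final Deque<Boolean> divIsEntry = new ArrayDeque<>();
    private StringBuilder heading; // non-null only while inside an <h1>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("div".equals(qName)) {
            boolean isEntry = "package-entry".equals(attrs.getValue("class"));
            divIsEntry.push(isEntry);
            if (isEntry) {
                openEntries.push(new Entry());
            }
        } else if ("h1".equals(qName) && !openEntries.isEmpty()) {
            heading = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (heading != null) {
            heading.append(ch, start, length);
        } else if (!openEntries.isEmpty()) {
            openEntries.peek().text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("h1".equals(qName) && heading != null) {
            Entry entry = openEntries.peek();
            if (entry.name == null) {
                entry.name = heading.toString().trim(); // first <h1> is the file name
            }
            heading = null;
        } else if ("div".equals(qName) && divIsEntry.pop()) {
            Entry entry = openEntries.pop();
            if (entry.name != null) {
                contentByName.put(entry.name, entry.text.toString().trim());
            }
        }
    }

    static Map<String, String> scrape(String xhtml) {
        PackageEntryScraper handler = new PackageEntryScraper();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.contentByName;
    }

    public static void main(String[] args) {
        String xhtml = "<body>"
                + "<div class=\"embedded\" id=\"embed4.zip\" />"
                + "<div class=\"package-entry\"><h1>embed4.zip</h1>"
                + "<div class=\"embedded\" id=\"embed4.txt\" />"
                + "<div class=\"package-entry\"><h1>embed4.txt</h1>"
                + "<p>embed_4</p>"
                + "</div></div></body>";
        System.out.println(scrape(xhtml).get("embed4.txt")); // embed_4
    }
}
```

Because the zip entry holds no text of its own in this sample, its map value is an empty string; only embed4.txt maps to actual content.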

If I committed TIKA-1329, would that be of any use to you?   That returns a 
list of metadata objects.  There is one metadata object per embedded file.  The 
text content of each file can be retrieved from each metadata object by this 
key: “tika:content”.

Best,

        Tim
From: yeshwanth kumar 
[mailto:[email protected]]
Sent: Tuesday, July 01, 2014 9:00 AM

To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

The output is the same even with ToXMLContentHandler.

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. 
<[email protected]> wrote:
Did you try the ToXMLContentHandler?

From: yeshwanth kumar 
[mailto:[email protected]]
Sent: Monday, June 30, 2014 4:50 PM

To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

I tried every approach I could think of.
Instead of reading the entire zip file, I parsed the individual zip entries,
but even then I hit exceptions such as:

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 
0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a 
valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor
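
One way to read the "Invalid header signature" message above is to decode the two numbers it reports. The sketch below assumes the value in the message is the first eight bytes of the stream interpreted as a little-endian long; the class and method names are hypothetical. Under that assumption the expected value corresponds to the usual OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1, while the value actually read decodes to printable ASCII, which would suggest the parser was handed plain text rather than the start of an OLE2 file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of Tika or POI.
class Ole2SignatureCheck {
    // "expected" value copied from the exception message above.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Assumption: the message reports the first 8 bytes as a little-endian long.
    static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
    }

    // True if the first 8 bytes of a header match the OLE2 signature.
    static boolean looksLikeOle2(byte[] header) {
        if (header.length < 8) {
            return false;
        }
        long first = ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
        return first == OLE2_SIGNATURE;
    }

    // Renders the 8 bytes behind a reported value as ASCII for inspection.
    static String asAscii(long value) {
        return new String(toLittleEndianBytes(value), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "read" value copied from the exception message above.
        System.out.println("[" + asAscii(0x725020706968736EL) + "]"); // [nship Pr]
    }
}
```

If the bytes really are text, possible causes include an entry that is not an Office document or a stream that is no longer positioned at the start of the entry; the message alone does not say which.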

Any suggestions regarding these issues?

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar 
<[email protected]> wrote:

hi tim,

Thanks for sharing the resources, but I am unable to figure out how to 
implement this in my code.
What I don't understand is the flow and the recursive steps. When I ran the 
RecursiveMetadataParser,
it still gave the same kind of output: file names combined with the content of 
the files.

I am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. 
<[email protected]> wrote:
Or use the ToXMLContentHandler and parse the XML?

From: Allison, Timothy B. [mailto:[email protected]]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: [email protected]
Subject: RE: Stack Overflow Question

You might want to look into the RecursiveMetadataParser:
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:[email protected]]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

Thanks for the quick reply.

I changed the ContentHandler to BodyContentHandler and got an exception about 
the maximum character (write) limit,
so I passed -1 to the BodyContentHandler constructor.

Now there is another problem: the file names and the content are combined in the 
string returned from handler.toString().

How can I map a file name to its content?

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
<[email protected]> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.

Try BodyContentHandler or ToXMLContentHandler or maybe WriteOutContentHandler.

If you want to write out each embedded file as a binary, try implementing 
EmbeddedResourceHandler.
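
A minimal sketch of why DefaultHandler yields nothing useful, using only the JDK's SAX parser. CollectingHandler is a hypothetical stand-in for a text-collecting handler such as BodyContentHandler, not Tika's implementation: it overrides characters() to store text and toString() to return it, whereas DefaultHandler ignores every event and inherits Object.toString().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HandlerComparison {
    // Hypothetical stand-in for a handler that keeps the text it receives.
    static final class CollectingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses the XML with the given handler and returns handler.toString().
    static String parseWith(DefaultHandler handler, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xml = "<body><p>embed_4</p></body>";
        // Prints the class name and an identity hash: no content was stored.
        System.out.println(parseWith(new DefaultHandler(), xml));
        // Prints the collected text: embed_4
        System.out.println(parseWith(new CollectingHandler(), xml));
    }
}
```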

QUOTE:

I am using Apache Tika 1.5 to parse the contents of a zip file.

Here's my sample code:

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

In the logger statement all I see is 
org.xml.sax.helpers.DefaultHandler@5bd8e367

I am missing something but am unable to figure it out; looking for some help.

-----Original Message-----
From: yeshwanth kumar [mailto:[email protected]]
Sent: Monday, June 30, 2014 1:28 PM
To: [email protected]
Subject: Stack Overflow Question

Unable tp read zipfile using Apache Tika
http://stackoverflow.com/q/24495504/1899893?sem=2
