Cross-posting to the Tika list for help there too.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: yeshwanth kumar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 11, 2014 3:48 AM
To: "[email protected]" <[email protected]>
Subject: tika parser is not parsing the BytesWritable in mapreduce

>I am writing a MapReduce job that takes a zip file as input. The zip
>file contains different types of documents, such as docx, odt, pdf and
>txt. I am using the Tika parser to parse the documents.
>Here is the code snippet of my mapper method:
>
>public void map(Text key, BytesWritable value, Context context)
>        throws IOException, InterruptedException {
>
>    ------------------------------
>    ------------------------------
>        logger.info("Length:\t" + value.getLength());
>        byte[] bytesbefore = value.getBytes();
>        logger.info("CONTENT BEFORE" + new String(bytesbefore));
>        InputStream in = new ByteArrayInputStream(bytesbefore);
>        Metadata metadata = new Metadata();
>        String mimeType = new Tika().detect(in);
>        metadata.set(Metadata.CONTENT_TYPE, mimeType);
>        Parser parser = new AutoDetectParser();
>        ContentHandler handler = new BodyContentHandler(
>                value.getLength());
>        try {
>            parser.parse(in, handler, metadata, new ParseContext());
>        } catch (SAXException e1) {
>            logger.info(e1.getMessage());
>            e1.printStackTrace();
>        } catch (TikaException e1) {
>            logger.info(e1.getMessage());
>            e1.printStackTrace();
>        }
>        in.close();
>        logger.info("Content AFTER" + handler.toString());
>    ------------------------------
>}
>The output is written to HBase, but the content of the document is
>empty after parsing. Am I missing anything here?
>
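
One thing that may be worth checking in the mapper above: in Hadoop, BytesWritable.getBytes() returns the backing array, which can be longer than getLength(), so the stream handed to Tika may include trailing padding bytes. Whether that explains the empty content here is not confirmed. The sketch below uses only the JDK and simulates a padded buffer with a plain array; the class name ValidBytes and its helpers are hypothetical and are not part of Hadoop or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
public class ValidBytes {

    // Copy only the first validLength bytes of a possibly padded backing array.
    static byte[] trim(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    // Wrap only the valid region of the backing array, without copying it.
    static InputStream open(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // Simulate a padded backing buffer: 5 data bytes followed by 3 zero bytes.
        byte[] backing = Arrays.copyOf(payload, 8);

        System.out.println("backing length: " + backing.length);
        System.out.println("trimmed length: " + trim(backing, payload.length).length);
        try (InputStream in = open(backing, payload.length)) {
            System.out.println("stream bytes:   " + in.available());
        }
    }
}
```

In the mapper, the equivalent would presumably be new ByteArrayInputStream(value.getBytes(), 0, value.getLength()), so that both detection and parsing see only the valid bytes.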
