Re: tika parser is not parsing the BytesWritable in mapreduce

Julien Nioche Wed, 11 Jun 2014 04:31:31 -0700

I don't know what the issue is here, but the Tika module in Behemoth is a good
example of how to use Tika over MapReduce:
https://github.com/DigitalPebble/behemoth/tree/master/tika
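One thing that may be worth checking in the quoted mapper, whether or not it
explains the empty content: BytesWritable.getBytes() returns the backing
array, which can be longer than getLength(), so only the first getLength()
bytes are valid data. The sketch below uses only the JDK and a plain byte[]
as a stand-in for the BytesWritable; the class and method names
(ValidBytesSketch, streamOfValidBytes, readAll) are hypothetical. In the real
mapper the equivalent would be
new ByteArrayInputStream(value.getBytes(), 0, value.getLength()).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ValidBytesSketch {

    // Wrap only the first validLength bytes of a possibly padded backing array.
    public static InputStream streamOfValidBytes(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    // Drain a stream into a byte array (avoids InputStream.readAllBytes, which needs Java 9+).
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello tika".getBytes(StandardCharsets.UTF_8);

        // Stand-in for a BytesWritable: the backing array is larger than the
        // valid data and the tail is zero padding.
        byte[] backing = Arrays.copyOf(payload, payload.length + 6);
        int validLength = payload.length;

        // Wrapping the whole backing array also feeds the padding downstream.
        byte[] naive = readAll(new ByteArrayInputStream(backing));
        // Wrapping only the valid range feeds exactly the document bytes.
        byte[] trimmed = readAll(streamOfValidBytes(backing, validLength));

        System.out.println("naive length:   " + naive.length);
        System.out.println("trimmed length: " + trimmed.length);
        System.out.println("trimmed equals payload: " + Arrays.equals(trimmed, payload));
    }
}
```

This only shows how to restrict the stream to the valid bytes. It is an
assumption, not a diagnosis, that the padding plays any part in the problem
reported here.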


J.


On 11 June 2014 11:59, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Cross-posting to the Tika list for help there too.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: yeshwanth kumar <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, June 11, 2014 3:48 AM
> To: "[email protected]" <[email protected]>
> Subject: tika parser is not parsing the BytesWritable in mapreduce
>
> >I am writing a MapReduce job that takes a zip file as input. The zip file
> >contains different types of documents, such as docx, odt, pdf, and txt.
> >I am using the Tika parser to parse the documents.
> >Here's the code snippet of my mapper method:
> >
> >public void map(Text key, BytesWritable value, Context context)
> >        throws IOException, InterruptedException {
> >
> >    ------------------------------
> >    ------------------------------
> >
> >        logger.info("Length:\t" + value.getLength());
> >        byte[] bytesbefore = value.getBytes();
> >
> >        logger.info("CONTENT BEFORE" + new String(bytesbefore));
> >        InputStream in = new ByteArrayInputStream(bytesbefore);
> >        Metadata metadata = new Metadata();
> >        String mimeType = new Tika().detect(in);
> >        metadata.set(Metadata.CONTENT_TYPE, mimeType);
> >        Parser parser = new AutoDetectParser();
> >        ContentHandler handler = new BodyContentHandler(
> >                value.getLength());
> >        try {
> >            parser.parse(in, handler, metadata, new ParseContext());
> >        } catch (SAXException e1) {
> >            logger.info(e1.getMessage());
> >            e1.printStackTrace();
> >        } catch (TikaException e1) {
> >            logger.info(e1.getMessage());
> >            e1.printStackTrace();
> >        }
> >        in.close();
> >
> >        logger.info("Content AFTER" + handler.toString());
> >    ------------------------------
> >                   }
> >The output is written to HBase, but the content of the document is empty
> >after parsing. Am I missing anything here?
> >
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
