I don't know what the issue is here, but the Tika module in Behemoth is a good example of how to use Tika over MapReduce: https://github.com/DigitalPebble/behemoth/tree/master/tika
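
One thing that may be worth checking in the mapper quoted below: BytesWritable.getBytes() returns the backing array, which can be longer than getLength(), so a stream built from the whole array may hand Tika trailing padding after the real document. I have not verified that this is what empties the output. The sketch below uses only the JDK to simulate such a padded buffer; ValidBytesSketch, validBytes and validStream are hypothetical names for illustration, not Hadoop or Tika API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ValidBytesSketch {

    // Copy only the valid prefix of a backing array that may carry padding.
    static byte[] validBytes(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    // Stream over the valid prefix without copying the array.
    static ByteArrayInputStream validStream(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // Assumption: simulate a reused buffer with 5 valid bytes and 3 bytes of padding.
        byte[] backing = Arrays.copyOf(payload, 8);

        // Streaming the whole backing array exposes the padding as well.
        System.out.println(new ByteArrayInputStream(backing).available()); // prints 8
        // Limiting the stream to the valid length exposes only the payload.
        System.out.println(validStream(backing, payload.length).available()); // prints 5
        System.out.println(new String(validBytes(backing, payload.length), StandardCharsets.UTF_8)); // prints hello
    }
}
```

In the mapper, the equivalent would be new ByteArrayInputStream(value.getBytes(), 0, value.getLength()), or value.copyBytes() on Hadoop versions that provide it.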

J.

On 11 June 2014 11:59, Mattmann, Chris A (3980) <[email protected]> wrote:

> cross posting to Tika list for help there too.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: yeshwanth kumar <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, June 11, 2014 3:48 AM
> To: "[email protected]" <[email protected]>
> Subject: tika parser is not parsing the BytesWritable in mapreduce
>
> >i am writing a mapreduce job,
> >where it takes a zip file as input, zip file contains different types of
> >documents such as docx odt pdf txt,
> >i am using tika parser to parse the documents.
> >here's the code snippet of my mapper method
> >
> >public void map(Text key, BytesWritable value, Context context)
> >        throws IOException, InterruptedException {
> >
> >    ------------------------------
> >    ------------------------------
> >
> >    logger.info("Length:\t" + value.getLength());
> >    byte[] bytesbefore = value.getBytes();
> >
> >    logger.info("CONTENT BEFORE" + new String(bytesbefore));
> >    InputStream in = new ByteArrayInputStream(bytesbefore);
> >    Metadata metadata = new Metadata();
> >    String mimeType = new Tika().detect(in);
> >    metadata.set(Metadata.CONTENT_TYPE, mimeType);
> >    Parser parser = new AutoDetectParser();
> >    ContentHandler handler = new BodyContentHandler(value.getLength());
> >    try {
> >        parser.parse(in, handler, metadata, new ParseContext());
> >    } catch (SAXException e1) {
> >        logger.info(e1.getMessage());
> >        e1.printStackTrace();
> >    } catch (TikaException e1) {
> >        logger.info(e1.getMessage());
> >        e1.printStackTrace();
> >    }
> >    in.close();
> >
> >    logger.info("Content AFTER" + handler.toString());
> >    ------------------------------
> >}
> >
> >output is written to hbase, content
> >of the document is empty after parsing ,
> >am i missing anything here??

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
