cross posting to Tika list for help there too.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: yeshwanth kumar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 11, 2014 3:48 AM
To: "[email protected]" <[email protected]>
Subject: tika parser is not parsing the BytesWritable in mapreduce

>I am writing a MapReduce job that takes a zip file as input. The zip file
>contains different types of documents, such as docx, odt, pdf, and txt.
>I am using the Tika parser to parse the documents.
>Here is the code snippet from my mapper method:
>
>public void map(Text key, BytesWritable value, Context context)
>        throws IOException, InterruptedException {
>
>    ------------------------------
>    ------------------------------
>
>    logger.info("Length:\t" + value.getLength());
>    byte[] bytesbefore = value.getBytes();
>
>    logger.info("CONTENT BEFORE" + new String(bytesbefore));
>    InputStream in = new ByteArrayInputStream(bytesbefore);
>    Metadata metadata = new Metadata();
>    String mimeType = new Tika().detect(in);
>    metadata.set(Metadata.CONTENT_TYPE, mimeType);
>    Parser parser = new AutoDetectParser();
>    ContentHandler handler = new BodyContentHandler(
>            value.getLength());
>    try {
>        parser.parse(in, handler, metadata, new ParseContext());
>    } catch (SAXException e1) {
>        logger.info(e1.getMessage());
>        e1.printStackTrace();
>    } catch (TikaException e1) {
>        logger.info(e1.getMessage());
>        e1.printStackTrace();
>    }
>    in.close();
>
>    logger.info("Content AFTER" + handler.toString());
>    ------------------------------
>}
>
>The output is written to HBase. The content of the document is empty
>after parsing.
>Am I missing anything here?
>
