Hi Rupak,

You're parsing XML? That's an important bit of information.
In that case, you don't want to be using Tika - just use Dom4J or any one of the many other XML parsers.

Tika is designed to extract "text content" from a variety of input sources. Its parse output is designed to be XHTML 1.0-compatible, which is why your handler only sees events for html, head, body and p rather than for the elements in your file. That makes it the wrong tool for precise extraction of XML data.

-- Ken

On Feb 11, 2014, at 12:13pm, Rupak Khurana wrote:

> Hello,
>
> I have a small XML document that I want to parse using Tika and expect to get
> SAX events for each element in the input XML file. However I get the output
> only for html, head, meta, body & p. I don't get the events for each element
> in the XML file. See the code for ContentHandler further below. Please
> advise.
>
> **** Output *****
>
> StartDocument
> StartElement html
> StartElement head
> StartElement meta
> EndElement meta
> StartElement title
> EndElement title
> EndElement head
> StartElement body
> StartElement p
> EndElement p
> EndElement body
> EndElement html
> EndDocument
>
>
> ***** sample.xml ******
>
> <transformation>
>   <info>
>     <name>sample_normalize</name>
>     <description/>
>     <parameters>
>       <parameter>
>         <name>AS_OF_DATE</name>
>         <default_value>2012-06-01</default_value>
>         <description/>
>       </parameter>
>     </parameters>
>   </info>
> </transformation>
>
>
> ***************** XYZContentHandler ****************
>
> public class XYZContentHandler extends DefaultHandler {
>
>     public XYZContentHandler() {
>     }
>
>     @Override
>     public void startElement(String uri, String localName, String qName,
>             Attributes attributes)
>             throws SAXException {
>         System.out.println("StartElement "+qName);
>     }
>
>     @Override
>     public void endElement(String uri, String local, String name) throws
>             SAXException {
>         System.out.println("EndElement "+name);
>     }
>
>     @Override
>     public void startDocument() throws SAXException {
>         System.out.println("StartDocument");
>     }
>
>     @Override
>     public void endDocument() throws SAXException {
>         System.out.println("EndDocument");
>     }
> }
>
>
> ****** Actual Code *******
>
> stream = new FileInputStream(new File(filename));
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>
> XYZContentHandler handler = new XYZContentHandler();
> ParseContext context = new ParseContext();
>
> //Parser parser = new AutoDetectParser();
> Parser parser = new XMLParser();
> parser.parse(stream, handler, metadata, context);
>
>
> On Mon, Feb 10, 2014 at 3:30 PM, Nick Burch wrote:
> On Mon, 10 Feb 2014, Rupak Khurana wrote:
> I am trying to parse out JIL(Job Information Language) scripts that happen
> to have Name:Value pairs. Perhaps Tika is an overkill but wanted to use its
> parsing ability and SAX event firing to make life easier.
>
> Sounds like you'll want to define / identify a suitable mimetype for these,
> add some mime magic so they get detected, then write your own parser that
> spots these name/value pairs and emits suitable sax events for you to consume
>
> See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to do
> all of that
>
> Nick
>

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
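
P.S. For reference, a minimal sketch of the plain-XML-parser route suggested above. It uses the SAX parser that ships with the JDK rather than Dom4J, so no extra dependency is needed. The class name SaxEventDump and the events() helper are made up for illustration; the handler logic mirrors the quoted XYZContentHandler, except that it collects events in a list instead of printing them directly. It assumes the default, non-namespace-aware parser configuration, where qName carries the element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Tika or the original code.
class SaxEventDump {

    /** Parses the XML string and returns one entry per SAX event, in order. */
    static List<String> events(String xml) throws Exception {
        final List<String> out = new ArrayList<>();

        // Same callbacks as the quoted XYZContentHandler, collecting into a list.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                out.add("StartDocument");
            }

            @Override
            public void endDocument() {
                out.add("EndDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                out.add("StartElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.add("EndElement " + qName);
            }
        };

        // Plain JDK SAX parser: fires an event for every element in the input.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of sample.xml from the quoted message.
        String xml = "<transformation><info><name>sample_normalize</name>"
                + "<description/></info></transformation>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

To read from a file instead of a string, SAXParser.parse also accepts a java.io.File or an InputStream along with the handler.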
