Re: Parse out Name:Value pairs

Ken Krugler Tue, 11 Feb 2014 13:41:12 -0800

Hi Rupak,

You're parsing XML? That's an important bit of information.


In that case, you don't want to be using Tika. Just use Dom4J or one of the 
many other XML parsers.

Tika is designed to extract "text content" from a variety of input sources. Its 
parse output is designed to be XHTML 1.0-compatible, so your handler sees 
XHTML elements (html, head, body, p) rather than the elements of the source 
document. That means it's not what you want to be using for precise extraction 
of XML data.
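
For illustration, here is a minimal sketch of getting one SAX event per 
element using only the SAX parser that ships with the JDK (javax.xml.parsers), 
with no Tika involved. The class name SaxEventDemo and the helper 
collectEvents are made up for this example; the handler logic mirrors the 
XYZContentHandler quoted below. Treat it as a starting point, not a tested 
drop-in.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventDemo {

    // Hypothetical helper: parses an XML string with the JDK's SAX parser
    // and returns one entry per start/end element event, in document order.
    public static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                events.add("StartElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("EndElement " + qName);
            }
        };

        // The default factory is not namespace-aware, so qName carries the
        // element name exactly as written in the document.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the sample.xml from the quoted message.
        String xml = "<transformation><info><name>sample_normalize</name>"
                   + "<description/></info></transformation>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

To read from a file instead, SAXParser.parse also accepts a java.io.File or an 
InputStream such as the FileInputStream used in the quoted code.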

-- Ken

On Feb 11, 2014, at 12:13pm, Rupak Khurana <[email protected]> wrote:

> Hello,
> 
> I have a small XML document that I want to parse using Tika, and I expect to 
> get SAX events for each element in the input XML file. However, I get output 
> only for html, head, meta, title, body & p. I don't get events for each 
> element in the XML file. See the code for the ContentHandler further below. 
> Please advise.
> 
> **** Output *****
> 
> StartDocument
> StartElement html
> StartElement head
> StartElement meta
> EndElement meta
> StartElement title
> EndElement title
> EndElement head
> StartElement body
> StartElement p
> EndElement p
> EndElement body
> EndElement html
> EndDocument
> 
> 
> ***** sample.xml ******
> 
> <transformation>
>   <info>
>     <name>sample_normalize</name>
>     <description/>
>     <parameters>
>        <parameter>
>             <name>AS_OF_DATE</name>
>             <default_value>2012-06-01</default_value>
>             <description/>
>         </parameter>
>     </parameters>
>   </info>
> </transformation>
> 
> 
> ***************** XYZContentHandler ****************
> 
> public class XYZContentHandler extends DefaultHandler {
> 
>     public XYZContentHandler() {
>     }
>     
>     @Override
>     public void startElement(String uri, String localName, String qName,
>             Attributes attributes) throws SAXException {
>         System.out.println("StartElement "+qName);
>     }
>     
>     @Override
>     public void endElement(String uri, String local, String name)
>             throws SAXException {
>         System.out.println("EndElement "+name);
>     }
> 
>     @Override
>     public void startDocument() throws SAXException {
>         System.out.println("StartDocument");
>     }
> 
>     @Override
>     public void endDocument() throws SAXException {
>         System.out.println("EndDocument");
>     }
> }
> 
> 
> ****** Actual Code *******
> 
>            stream = new FileInputStream(new File(filename));
>            Metadata metadata = new Metadata();            
>            metadata.set(Metadata.CONTENT_TYPE, "application/xml");
> 
>             XYZContentHandler handler = new XYZContentHandler();
>             ParseContext context = new ParseContext();
> 
>             //Parser parser = new AutoDetectParser();
>             Parser parser = new XMLParser();            
>             parser.parse(stream, handler, metadata, context);
> 
> 
> On Mon, Feb 10, 2014 at 3:30 PM, Nick Burch <[email protected]> wrote:
> > On Mon, 10 Feb 2014, Rupak Khurana wrote:
> > > I am trying to parse out JIL (Job Information Language) scripts that
> > > happen to have Name:Value pairs. Perhaps Tika is overkill, but I wanted
> > > to use its parsing ability and SAX event firing to make life easier.
> > 
> > Sounds like you'll want to define / identify a suitable mimetype for these, 
> > add some mime magic so they get detected, then write your own parser that 
> > spots these name/value pairs and emits suitable SAX events for you to 
> > consume.
> > 
> > See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to 
> > do all of that.
> > 
> > Nick
> 
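
As a rough illustration of the "spot name/value pairs and emit SAX events" 
step Nick describes above, here is a sketch that uses only the org.xml.sax 
classes from the JDK. It is not a Tika Parser implementation (the parser guide 
linked above covers that part), and it assumes one "name: value" pair per 
line, which may not match real JIL scripts. The class name, the pairs/pair 
element names, and the sample input are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class NameValueSaxEmitter {

    // Hypothetical emitter: turns each "name: value" line into
    // <pair name="...">value</pair> events inside a <pairs> root element.
    public static void emit(String text, ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "pairs", "pairs", new AttributesImpl());
        for (String line : text.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // assumption: lines without "name:" are ignored
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue;
            }
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "name", "name", "CDATA", name);
            handler.startElement("", "pair", "pair", attrs);
            handler.characters(value.toCharArray(), 0, value.length());
            handler.endElement("", "pair", "pair");
        }
        handler.endElement("", "pairs", "pairs");
        handler.endDocument();
    }

    // Example consumer: collects "name=value" strings from the events.
    public static List<String> collect(String text) throws SAXException {
        final List<String> out = new ArrayList<>();
        emit(text, new DefaultHandler() {
            private String currentName;
            private final StringBuilder buf = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                currentName = atts.getValue("name");
                buf.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("pair".equals(qName)) {
                    out.add(currentName + "=" + buf);
                }
            }
        });
        return out;
    }

    public static void main(String[] args) throws SAXException {
        // Made-up input, not taken from a real JIL script.
        System.out.println(collect("owner: alice\npriority: 5"));
    }
}
```

Any ContentHandler can be passed to emit, so a handler like the 
XYZContentHandler quoted above would receive a StartElement/EndElement pair 
for every name/value line.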

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
