Hello,
I have a small XML document that I want to parse with Tika, and I expected to
get SAX events for each element in the input XML file. However, I only get
events for html, head, meta, title, body and p, not for the elements in the
XML file itself. The code for my ContentHandler is further below.
Please advise.
**** Output *****
StartDocument
StartElement html
StartElement head
StartElement meta
EndElement meta
StartElement title
EndElement title
EndElement head
StartElement body
StartElement p
EndElement p
EndElement body
EndElement html
EndDocument
***** sample.xml ******
<transformation>
<info>
<name>sample_normalize</name>
<description/>
<parameters>
<parameter>
<name>AS_OF_DATE</name>
<default_value>2012-06-01</default_value>
<description/>
</parameter>
</parameters>
</info>
</transformation>
***************** XYZContentHandler ****************
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Prints one line per SAX event it receives.
public class XYZContentHandler extends DefaultHandler {

    public XYZContentHandler() {
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        System.out.println("StartElement " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.println("EndElement " + qName);
    }

    @Override
    public void startDocument() throws SAXException {
        System.out.println("StartDocument");
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("EndDocument");
    }
}
****** Actual Code *******
InputStream stream = new FileInputStream(new File(filename));
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "application/xml");
XYZContentHandler handler = new XYZContentHandler();
ParseContext context = new ParseContext();
//Parser parser = new AutoDetectParser();
Parser parser = new XMLParser();
parser.parse(stream, handler, metadata, context);
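
Note: the output above is consistent with Tika parsers reporting document content as XHTML (html, head, body, p) rather than passing the source file's own elements through to the handler. If the goal is one event per element of sample.xml, one option is to skip Tika for this step and use the JDK's SAX parser directly. The following is only a minimal sketch under that assumption; the class name RawSaxEvents and the method collectEvents are hypothetical and not part of Tika or of the code above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RawSaxEvents {

    // Parses the XML with the JDK's SAX parser and records one entry
    // per start/end element event, in document order.
    static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                events.add("StartElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("EndElement " + qName);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of sample.xml, inlined to keep the sketch self-contained.
        String xml = "<transformation><info><name>sample_normalize</name>"
                + "<description/></info></transformation>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

Here the handler sees transformation, info, name and description directly, because nothing sits between the XML parser and the handler.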
On Mon, Feb 10, 2014 at 3:30 PM, Nick Burch <[email protected]> wrote:
> On Mon, 10 Feb 2014, Rupak Khurana wrote:
>
>> I am trying to parse out JIL (Job Information Language) scripts that happen
>> to have Name:Value pairs. Perhaps Tika is overkill, but I wanted to use its
>> parsing ability and SAX event firing to make life easier.
>>
>
> Sounds like you'll want to define / identify a suitable mimetype for
> these, add some mime magic so they get detected, then write your own parser
> that spots these name/value pairs and emits suitable SAX events for you to
> consume.
>
> See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to
> do all of that
>
> Nick
>
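
The "spot name/value pairs and emit SAX events" step suggested above could look roughly like the sketch below. It is not a Tika Parser implementation and does not cover mimetype detection; it only uses the org.xml.sax classes from the JDK. It assumes one "name: value" pair per line, which may not match real JIL syntax, and the class name NameValueSaxEmitter and the element names "entries" and "entry" are made up for illustration. Inside Tika, the same logic would presumably live in the parse method of a custom parser, as described in the parser guide.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class NameValueSaxEmitter {

    // Reads "name: value" lines and reports each pair to the handler as an
    // <entry name="..."> element whose text content is the value.
    static void parse(Reader input, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(input);
        handler.startDocument();
        handler.startElement("", "entries", "entries", new AttributesImpl());
        String line;
        while ((line = reader.readLine()) != null) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no colon, or nothing before it: not a pair
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue; // only whitespace before the colon
            }
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "name", "name", "CDATA", name);
            handler.startElement("", "entry", "entry", attrs);
            handler.characters(value.toCharArray(), 0, value.length());
            handler.endElement("", "entry", "entry");
        }
        handler.endElement("", "entries", "entries");
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; real JIL scripts may be laid out differently.
        String script = "insert_job: demo_job\njob_type: c\n";
        parse(new StringReader(script), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                System.out.println("StartElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                System.out.println("EndElement " + qName);
            }
        });
    }
}
```

Any ContentHandler, including the XYZContentHandler from the original message, can be passed as the second argument to receive the events.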