Hello,

I would like to write a custom Tika Parser based on an XML Schema (by "based on" I 
mean one that uses the attributes defined in that file).

The files I would like to parse have the following structure:

----------------------------------

<?xml version="1.0" encoding="ISO-8859-1" ?>
<article artName="abc" _Title="Maastricht" _Byline="By abc" _Dateline="abc" 
_CategoryId="news" _SourceId="Others" _AltSource="newspaper" Summary="An abc 
businessman ...">
<head>
<clipsHead>New York</clipsHead>
</head>
<body>
<dateline>
<txt it="No" bd="Yes">New York</txt>
</dateline>
<byline>By L Z</byline>
<credit>Newspaper</credit>

........................

----------------------------------

And here is the structure of the XML Schema file:

----------------------------------

<Schema name="GN3" xmlns="urn:schemas-microsoft-com:xml-data" 
xmlns:dt="urn:schemas-microsoft-com:datatypes" 
xmlns:gn3="urn:schemas-teradp-com:gn3">

        <!-- METADATA 2.0 - ATTRIBUTES -->

        <AttributeType
                name="_UID"
                required="no"
                dt:type="string"
                gn3:label="Reference UID:"
                default=""
        />

        <AttributeType
                name="_Priority"
                required="no"
                dt:type="i4"
                gn3:label="Priority:"
                default="2"
        />

        <AttributeType
                name="_Byline"
                required="no"
                dt:type="string"
                gn3:style="multiLine"
                gn3:label="Byline:"
                default=""
        />

......

----------------------------------
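
To make the idea concrete, here is a rough sketch of the kind of lookup I mean. It uses only the JDK's DOM API and does not touch Tika itself. The class name SchemaAttributeSketch and its helper methods are made up for illustration, and it assumes the attributes of interest sit on the root <article> element and that the schema uses the xml-data namespace shown above:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: reads attribute names from the
// schema and pulls the matching attribute values from an article.
public class SchemaAttributeSketch {

    private static final String XML_DATA_NS = "urn:schemas-microsoft-com:xml-data";

    // Parse an XML string into a namespace-aware DOM document.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Collect the name="..." value of every <AttributeType> in the schema.
    public static List<String> attributeNames(String schemaXml) throws Exception {
        NodeList types = parse(schemaXml).getElementsByTagNameNS(XML_DATA_NS, "AttributeType");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < types.getLength(); i++) {
            names.add(((Element) types.item(i)).getAttribute("name"));
        }
        return names;
    }

    // Keep only the schema-declared attributes that are present on the root element.
    public static Map<String, String> extract(String articleXml, List<String> names) throws Exception {
        Element root = parse(articleXml).getDocumentElement();
        Map<String, String> found = new LinkedHashMap<>();
        for (String name : names) {
            if (root.hasAttribute(name)) {
                found.put(name, root.getAttribute(name));
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-ins for the two files quoted above.
        String schema = "<Schema name=\"GN3\" xmlns=\"urn:schemas-microsoft-com:xml-data\" "
                + "xmlns:dt=\"urn:schemas-microsoft-com:datatypes\">"
                + "<AttributeType name=\"_UID\" dt:type=\"string\"/>"
                + "<AttributeType name=\"_Byline\" dt:type=\"string\"/>"
                + "</Schema>";
        String article = "<article artName=\"abc\" _Title=\"Maastricht\" _Byline=\"By abc\"/>";
        System.out.println(extract(article, attributeNames(schema)));
    }
}
```

Inside a real Tika parser, the resulting name/value pairs would presumably be copied into the Metadata object instead of being printed, but that part is not shown here.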

Any suggestions would be greatly appreciated.

Philippe






