package parser ignoring tika-config.xml 
----------------------------------------

                 Key: TIKA-251
                 URL: https://issues.apache.org/jira/browse/TIKA-251
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Jonathan Koren


I created my own ContentHandler and an XmlParser that echoes out the DOM tree 
of the XML file being parsed.  I modified tika-config.xml so that 
AutoDetectParser will call this parser for XML files:

       <parser name="parse-xml" class="XmlParser">
               <mime>application/xml</mime>
       </parser>

If Tika parses an XML file directly, the right thing is done:

        resourceName: 1001281.xml
        ComplexIndexerTaskThread()
        XmlParser Begins
        SCH: start document
        SCH: start element nitf
        SCH: a: change.date=June 10, 2005
        SCH: a: change.time=19:30
        SCH: a: version=-//IPTC//DTD NITF 3.3//EN
        SCH: start element head
        SCH: start element title
        Apprentices Sample Life Of Doctors In Villages
        SCH: end element title
        SCH: start element meta
        SCH: a: content=Y11DOC$01
        SCH: a: name=slug

and so on for the fragment:

        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
        <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
        <head>
        <title>Apprentices Sample Life Of Doctors In Villages</title>
        <meta content="Y11DOC$01" name="slug"/>
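
For anyone trying to reproduce the expected output, the echoing handler 
described above can be approximated with the JDK's SAX classes alone.  This is 
only a minimal sketch, not my actual code: the class names EchoHandlerSketch 
and EchoHandler and the echo() helper are hypothetical, and it parses with the 
JDK SAX parser rather than through Tika.  The sample input omits the DOCTYPE 
so that no DTD has to be fetched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EchoHandlerSketch {

    /** Hypothetical stand-in for the echoing handler: records one line per SAX event. */
    public static class EchoHandler extends DefaultHandler {
        public final List<String> lines = new ArrayList<>();

        @Override
        public void startDocument() {
            lines.add("SCH: start document");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            lines.add("SCH: start element " + qName);
            // One "a:" line per attribute, mirroring the log format in this report.
            for (int i = 0; i < atts.getLength(); i++) {
                lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Skip whitespace-only text so indentation does not clutter the output.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            lines.add("SCH: end element " + qName);
        }
    }

    /** Parses the given XML string with the JDK SAX parser and returns the echoed lines. */
    public static List<String> echo(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nitf change.time=\"19:30\"><head>"
                + "<title>Apprentices Sample Life Of Doctors In Villages</title>"
                + "</head></nitf>";
        for (String line : echo(xml)) {
            System.out.println(line);
        }
    }
}
```

A handler like this sees the real element names (nitf, head, title) only when 
the XML parser is the one driving it, which is the behavior I expect for the 
entry inside the archive as well.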


Now, if I put this XML file inside a gzipped tar file, my XmlParser isn't 
called.  Instead, the file is somehow converted to plain text, which is not 
correct.  Example output:

        fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
        resourceName: aaa.tar.gz
        ComplexIndexerTaskThread()
        SCH: start document
        SCH: start element html
        SCH: start element head
        SCH: start element title

        SCH: end element title

        SCH: end element head
        SCH: start element body
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 1 detected!
        SCH: start element h1
        aaa.tar
        SCH: subfile 1's name is aaa.tar

        SCH: end element h1
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 2 detected!
        SCH: start element h1
        1001281.xml
        SCH: subfile 2's name is 1001281.xml

        SCH: end element h1
        SCH: start element p


   Apprentices Sample Life Of Doctors In Villages


and so on.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
