[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml

Jonathan Koren (JIRA) Thu, 25 Jun 2009 17:16:38 -0700

    [ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724352#action_12724352 ]


Jonathan Koren commented on TIKA-251:
-------------------------------------

Just updated and reran `mvn install` to make sure.  

bash-3.2# svn update
At revision 788551.



> package parser ignoring tika-config.xml 
> ----------------------------------------
>
>                 Key: TIKA-251
>                 URL: https://issues.apache.org/jira/browse/TIKA-251
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Jonathan Koren
>            Priority: Minor
>
> I created my own ContentHandler, XmlParser, that echoes out the DOM tree of 
> the XML file being parsed.  I modified tika-config so that AutoDetectParser 
> will call this parser for XML files:
>        <parser name="parse-xml" class="XmlParser">
>                <mime>application/xml</mime>
>        </parser>
> If Tika parses an XML file directly, the right thing is done:
>       resourceName: 1001281.xml
>       ComplexIndexerTaskThread()
>       XmlParser Begins
>       SCH: start document
>       SCH: start element nitf
>       SCH: a: change.date=June 10, 2005
>       SCH: a: change.time=19:30
>       SCH: a: version=-//IPTC//DTD NITF 3.3//EN
>       SCH: start element head
>       SCH: start element title
>       Apprentices Sample Life Of Doctors In Villages
>       SCH: end element title
>       SCH: start element meta
>       SCH: a: content=Y11DOC$01
>       SCH: a: name=slug
> and so on for the fragment:
>       <?xml version="1.0" encoding="UTF-8"?>
>       <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
>       <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
>       <head>
>       <title>Apprentices Sample Life Of Doctors In Villages</title>
>       <meta content="Y11DOC$01" name="slug"/>
> Now, if I put this XML file within a gzipped tar file, my XmlParser isn't 
> called.  Instead, the file is somehow converted to plain text, which is not 
> correct. Example output:
>       fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
>       resourceName: aaa.tar.gz
>       ComplexIndexerTaskThread()
>       SCH: start document
>       SCH: start element html
>       SCH: start element head
>       SCH: start element title
>       SCH: end element title
>       SCH: end element head
>       SCH: start element body
>       SCH: start element div
>       SCH: a: class=package-entry
>       SCH: subfile 1 detected!
>       SCH: start element h1
>       aaa.tar
>       SCH: subfile 1's name is aaa.tar
>       SCH: end element h1
>       SCH: start element div
>       SCH: a: class=package-entry
>       SCH: subfile 2 detected!
>       SCH: start element h1
>       1001281.xml
>       SCH: subfile 2's name is 1001281.xml
>       SCH: end element h1
>       SCH: start element p
>    Apprentices Sample Life Of Doctors In Villages
> and so on.
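
For readers who want to reproduce the kind of trace quoted above, here is a minimal sketch of an echoing SAX handler. It is not the reporter's code: the class name `EchoHandler` and the helper `echo` are hypothetical, and only the `SCH:` line format is copied from the output in the description. It uses the JDK's built-in SAX parser rather than Tika, so it shows the handler side only, not how AutoDetectParser or the package parser picks a parser for an entry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the handler described in the issue.
// It records one line per SAX event, mimicking the "SCH:" trace format.
class EchoHandler extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startDocument() {
        lines.add("SCH: start document");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        lines.add("SCH: start element " + qName);
        // One line per attribute, in the order the parser reports them.
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; this sketch records
        // each non-blank chunk as its own line.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            lines.add(text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lines.add("SCH: end element " + qName);
    }

    // Parses an XML string with the JDK's default SAX parser and returns the trace.
    static List<String> echo(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        // Fragment from the issue, without the DOCTYPE so no external DTD is fetched.
        String xml = "<nitf change.date=\"June 10, 2005\" change.time=\"19:30\">"
                + "<head><title>Apprentices Sample Life Of Doctors In Villages</title>"
                + "<meta content=\"Y11DOC$01\" name=\"slug\"/></head></nitf>";
        echo(xml).forEach(System.out::println);
    }
}
```

Running `main` prints a trace in the same style as the direct-parse output quoted in the description. In the tar.gz case, the quoted trace shows XHTML events (`html`, `div class=package-entry`, `p`) rather than `nitf` events, which is the discrepancy this issue reports.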

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
