Problem with indexing xml using DataImportHandler and XPath

Farhan Ali Wed, 05 Mar 2014 17:04:07 -0800

Hi,
I am new to Solr and am trying to index some XML documents using DIH
and XPath, but I cannot get it to work. The import reports success,
but no documents are added to the index. I do not know what I am
doing wrong.


This is my data config XML file:


<dataConfig>
        <dataSource type="FileDataSource"/>
        <document>
                <entity name="nytxmldir" rootEntity="false" datasource="null"
                        processor="FileListEntityProcessor"
                        fileName=".*\.xml"
                        recursive="true"
                        baseDir="/home/farhan/Downloads/nytxml">

                        <entity name="nytxml"
                                pk="id"
                                datasource="nytxmldir"
                                url="${nytxmldir.fileAbsolutePath}"
                                processor="XPathEntityProcessor"
                                forEach="/ntif"
                                transformer="RegexTransformer">

                                <field column="id" xpath="/ntif/head/docdata/doc-id/@id-string"/>
                                <field column="title" xpath="/ntif/head/title"/>
                                <field column="paragraph" xpath="/ntif/body/body.content/block[@class='full_text']/p"/>

                        </entity>
                </entity>
        </document>
</dataConfig>





This is my XML document:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
  <head>
    <title>Paid Notice: Deaths   BRADLEY, CAROL L.</title>
    <meta content="dn010107" name="slug"/>
    <meta content="1" name="publication_day_of_month"/>
    <meta content="1" name="publication_month"/>
    <meta content="2007" name="publication_year"/>
    <meta content="Monday" name="publication_day_of_week"/>
    <meta content="Classified" name="dsk"/>
    <meta content="7" name="print_page_number"/>
    <meta content="B" name="print_section"/>
    <meta content="3" name="print_column"/>
    <meta content="Paid Death Notices" name="online_sections"/>
    <docdata>
      <doc-id id-string="1815719"/>
      <doc.copyright holder="The New York Times" year="2007"/>
      <identified-content>
        <person class="indexing_service">BRADLEY, CAROL L.</person>
        <classifier class="online_producer" type="types_of_material">Paid Death Notice</classifier>
        <classifier class="online_producer" type="taxonomic_classifier">Top/Classifieds/Paid Death Notices</classifier>
      </identified-content>
    </docdata>
    <pubdata date.publication="20070101T000000" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9B06E1DE1E3AF932A35752C0A9619C8B63" item-length="49" name="The New York Times" unit-of-measure="word"/>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>Paid Notice: Deaths   BRADLEY, CAROL L.</hl1>
      </hedline>
    </body.head>
    <body.content>
      <block class="lead_paragraph">
        <p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
      </block>
      <block class="full_text">
        <p>BRADLEY--Carol L., 84, of Tinton Falls, NJ died peacefully at
Seabrook Village on December 27. Beloved wife of Floyd (Pete) Bradley, Jr.;
loving mother of Steven, Floyd and Lynette Bradley; adored grandmother of
Victoria Kent and Camilla, William and Melissa Bradley; caring
stepgrandmother of Matthew and Charlton Field.</p>
      </block>
    </body.content>
  </body>
</nitf>


I am really stumped as to why it is not working. I know DIH does not
support full XPath syntax, but according to the wiki it supports the limited
XPath syntax that I am using. I have also read on various internet forums that
people suggest using Groovy and XSLT, which I am unfamiliar with.
I hope someone can help me.
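
As a sanity check outside Solr, the element names used in the config paths
can be compared with the element names in the sample document. The following
is only a minimal Python sketch and not part of my Solr setup: it uses the
standard library's xml.etree.ElementTree rather than DIH's own XPath reader,
the helper names root_matches and doc_id are made up for illustration, and
SAMPLE is a trimmed copy of the document above with the DOCTYPE left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document above (DOCTYPE and most metadata omitted).
SAMPLE = """<nitf>
  <head>
    <title>Paid Notice: Deaths   BRADLEY, CAROL L.</title>
    <docdata>
      <doc-id id-string="1815719"/>
    </docdata>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>BRADLEY--Carol L., 84, of Tinton Falls, NJ</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def root_matches(xml_text, for_each):
    """Return True if the first step of an absolute path names the root element."""
    root = ET.fromstring(xml_text)
    first_step = for_each.strip("/").split("/")[0]
    return root.tag == first_step


def doc_id(xml_text):
    """Return the id-string attribute of head/docdata/doc-id, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find("head/docdata/doc-id")
    return node.get("id-string") if node is not None else None


if __name__ == "__main__":
    # Compare the forEach value from the config with the document's root element.
    for path in ("/ntif", "/nitf"):
        print(path, root_matches(SAMPLE, path))
    print("doc id:", doc_id(SAMPLE))
```

Running it compares the forEach value from my config ("/ntif") and the root
element name from the document ("/nitf") against the parsed root, so any
mismatch between the two spellings shows up directly.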

Thanks
Farhan
