The configuration is fine except for one detail: the documents are to be created for the entity 'oldsearchcontent', not for the root entity. So add the attribute rootEntity="false" to the entity 'oldsearchcontentlist', as follows:

   <entity name="oldsearchcontentlist"

url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1";
                               processor="XPathEntityProcessor"
                               forEach="/root/entries/entry"
                               rootEntity="false">

This means that the entity directly under it ('oldsearchcontent') will be treated as the root entity, and documents will be created for that.
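For completeness, here is a sketch of the full data-config.xml with that single attribute in place. It is just your posted config plus rootEntity="false"; the only other change is that the trailing slash on the 'source' xpath is dropped here, on the assumption that it was accidental:

<dataConfig>
    <dataSource type="HttpDataSource"/>
    <document>
        <entity name="oldsearchcontentlist"
                pk="m_guid"
                url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
                processor="XPathEntityProcessor"
                forEach="/root/entries/entry"
                rootEntity="false">
            <!-- one row per <entry> in the listing; elementurl feeds the child entity -->
            <field column="elementurl" xpath="/root/entries/entry/source" />

            <!-- the child entity now acts as the root entity:
                 one Solr document per /root/contenido record -->
            <entity name="oldsearchcontent"
                    pk="m_guid"
                    url="${oldsearchcontentlist.elementurl}"
                    processor="XPathEntityProcessor"
                    forEach="/root/contenido"
                    transformer="TemplateTransformer">
                <field column="m_guid" xpath="/root/contenido/titulo" />
                <!-- map the remaining fields (titulo, texto, autor, ...) the same way,
                     so they line up with the fields declared in your schema.xml -->
            </entity>
        </entity>
    </document>
</dataConfig>

With rootEntity="false" on the outer entity, a document is built for each /root/contenido record fetched from the <source> URLs, rather than one per listing row. Re-running the full-import with debug=on should then show documents being created.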
--Noble

On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> Hello fellow Solr users !
>
>
> I am in the process of trying to index XML documents in Solr. I went for the
> DataImportHandler approach, which seemed to suit this need perfectly. Due to
> the large number of XML documents to be indexed ( ~60MB ), I thought it would
> hardly be possible to feed Solr the concatenation of all these docs at once.
> Hence the small PHP script I wrote, which serves over HTTP the list of these
> documents in the following form ( available from a local URL referenced in
> data-config.xml ) :
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <entries>
>        <entry>
>                <realm>old_search_content</realm>
>                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>        </entry>
>        <entry>
>                <realm>old_search_content</realm>
>                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>        </entry>
>        <entry>
>                <realm>old_search_content</realm>
>                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>        </entry>
> </entries>
> </root>
>
>
> The idea is to have a single data-config.xml configuration file for the
> DataImportHandler, which reads the listing presented above, then fetches
> and indexes every subitem. Each subitem has the following structure :
> <?xml version="1.0" encoding="ISO-8859-1" ?>
> <root>
>        <contenido id="10099" idioma="cat">
>                <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>                <titulo><![CDATA[This is a title]]></titulo>
>                <resumen><![CDATA[ This a a summary ]]></resumen>
>                <texto><![CDATA[This is the body of my article<br><br>]]>
>                </texto>
>                <autor><![CDATA[John Doe]]></autor>
>                <fecha><![CDATA[31/10/2001]]></fecha>
>                <fuente><![CDATA[]]></fuente>
>                <webexterna><![CDATA[]]></webexterna>
>                <recursos></recursos>
>                <ambitos></ambitos>
>        </contenido>
> </root>
>
>
>
> After struggling for a ( long ) while with different configuration
> scenarios, here is the data-config.xml I ended up with :
>
>
> <dataConfig>
>        <dataSource type="HttpDataSource"/>
>        <document>
>                <entity name="oldsearchcontentlist"
>                                pk="m_guid"
>                                url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
>                                processor="XPathEntityProcessor"
>                                forEach="/root/entries/entry">
>
>                        <field column="elementurl" xpath="/root/entries/entry/source/" />
>
>                        <entity name="oldsearchcontent"
>                                pk="m_guid"
>                                url="${oldsearchcontentlist.elementurl}"
>                                processor="XPathEntityProcessor"
>                                forEach="/root/contenido"
>                                transformer="TemplateTransformer">
>                                <field column="m_guid" xpath="/root/contenido/titulo" />
>                        </entity>
>                </entity>
>        </document>
> </dataConfig>
>
>
> As a note, I had to check out Solr's trunk, apply the following patch :
> https://issues.apache.org/jira/browse/SOLR-469 (
> https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
> and recompile.
> Running the following command :
> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
> tells me that no document was created at all, and does not throw any
> error... Here is the full output :
>
>
> <response>
>        <lst name="responseHeader">
>                <int name="status">0</int>
>                <int name="QTime">39</int>
>        </lst>
>        <lst name="initArgs">
>                <lst name="defaults">
>                        <str name="config">data-config.xml</str>
>                        <lst name="datasource">
>                                <str name="type">HttpDataSource</str>
>                        </lst>
>                </lst>
>        </lst>
>        <str name="command">full-import</str>
>        <str name="mode">debug</str>
>        <null name="documents"/>
>                <lst name="verbose-output">
>                <lst name="entity:oldsearchcontentlist">
>                <lst name="document#1">
>                        <str name="query">
>
>  http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
>                        </str>
>                        <str name="time-taken">0:0:0.23</str>
>                </lst>
>                </lst>
>                </lst>
>        <str name="status">idle</str>
>        <str name="importResponse">Configuration Re-loaded sucessfully</str>
>        <lst name="statusMessages">
>                <str name="Total Requests made to DataSource">1</str>
>                <str name="Total Rows Fetched">0</str>
>                <str name="Total Documents Skipped">0</str>
>                <str name="Full Dump Started">2008-06-10 14:38:56</str>
>                <str name="">
>                        Indexing completed. Added/Updated: 0 documents.
> Deleted 0 documents.
>                </str>
>                <str name="Committed">2008-06-10 14:38:56</str>
>                <str name="Time taken ">0:0:0.32</str>
>        </lst>
>        <str name="WARNING">
>                This response format is experimental.  It is likely to change
> in the future.
>        </str>
> </response>
>
>
> I am sure I am doing something wrong, but cannot figure out what. I have read
> through all the online documentation several times, plus the full examples (
> Slashdot RSS feed ).
> I would gladly have feedback from anyone who has tried to index HTTP/XML
> sources and got it to work smoothly.
>
> Thanks a million in advance,
>
> Regards,
> Nicolas
> --
> Nicolas Pastorino
> eZ Systems ( Western Europe )  |  http://ez.no


-- 
--Noble Paul
