The configuration is fine but for one detail: the documents are to be created for the entity 'oldsearchcontent', not for the root entity. So add the attribute rootEntity="false" to the entity 'oldsearchcontentlist', as follows:

<entity name="oldsearchcontentlist"
        url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
        processor="XPathEntityProcessor"
        forEach="/root/entries/entry"
        rootEntity="false">

This means that the entity directly under it ('oldsearchcontent') will be treated as the root, and documents will be created for that entity.

--Noble

On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> Hello fellow Solr users !
>
> I am in the process of trying to index XML documents in Solr. I went for the
> DataImportHandler approach, which seemed to perfectly suit this need. Due to
> the large amount of XML documents to be indexed ( ~60MB ), I thought it would
> hardly be possible to feed Solr with the concatenation of all these docs at
> once. Hence this small PHP script I wrote, serving over HTTP the list of these
> documents, in the following form ( available from a local URL replicated
> in data-config.xml ) :
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>   <entries>
>     <entry>
>       <realm>old_search_content</realm>
>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>     </entry>
>     <entry>
>       <realm>old_search_content</realm>
>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>     </entry>
>     <entry>
>       <realm>old_search_content</realm>
>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>     </entry>
>   </entries>
> </root>
>
> The idea would be to have one single data-config.xml configuration file for
> the DataImportHandler, which would read the listing presented above, and
> request every single subitem and index it.
> Every subitem has the following structure :
>
> <?xml version="1.0" encoding="ISO-8859-1" ?>
> <root>
>   <contenido id="10099" idioma="cat">
>     <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>     <titulo><![CDATA[This is a title]]></titulo>
>     <resumen><![CDATA[ This a a summary ]]></resumen>
>     <texto><![CDATA[This is the body of my article<br><br>]]>
>     </texto>
>     <autor><![CDATA[John Doe]]></autor>
>     <fecha><![CDATA[31/10/2001]]></fecha>
>     <fuente><![CDATA[]]></fuente>
>     <webexterna><![CDATA[]]></webexterna>
>     <recursos></recursos>
>     <ambitos></ambitos>
>   </contenido>
> </root>
>
> After struggling for a ( long ) while with different configuration
> scenarios, here is the data-config.xml I ended up with :
>
> <dataConfig>
>   <dataSource type="HttpDataSource"/>
>   <document>
>     <entity name="oldsearchcontentlist"
>             pk="m_guid"
>             url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
>             processor="XPathEntityProcessor"
>             forEach="/root/entries/entry">
>
>       <field column="elementurl"
>              xpath="/root/entries/entry/source/" />
>
>       <entity name="oldsearchcontent"
>               pk="m_guid"
>               url="${oldsearchcontentlist.elementurl}"
>               processor="XPathEntityProcessor"
>               forEach="/root/contenido"
>               transformer="TemplateTransformer">
>         <field column="m_guid"
>                xpath="/root/contenido/titulo" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> As a note, I had to check out Solr's trunk, patch it with the following :
> https://issues.apache.org/jira/browse/SOLR-469
> ( https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
> and recompile.
> Running the following command :
> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
> tells me that no Document was created at all, and does not throw any
> error. Here is the full output :
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">39</int>
>   </lst>
>   <lst name="initArgs">
>     <lst name="defaults">
>       <str name="config">data-config.xml</str>
>       <lst name="datasource">
>         <str name="type">HttpDataSource</str>
>       </lst>
>     </lst>
>   </lst>
>   <str name="command">full-import</str>
>   <str name="mode">debug</str>
>   <null name="documents"/>
>   <lst name="verbose-output">
>     <lst name="entity:oldsearchcontentlist">
>       <lst name="document#1">
>         <str name="query">
>           http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1
>         </str>
>         <str name="time-taken">0:0:0.23</str>
>       </lst>
>     </lst>
>   </lst>
>   <str name="status">idle</str>
>   <str name="importResponse">Configuration Re-loaded sucessfully</str>
>   <lst name="statusMessages">
>     <str name="Total Requests made to DataSource">1</str>
>     <str name="Total Rows Fetched">0</str>
>     <str name="Total Documents Skipped">0</str>
>     <str name="Full Dump Started">2008-06-10 14:38:56</str>
>     <str name="">
>       Indexing completed. Added/Updated: 0 documents.
>       Deleted 0 documents.
>     </str>
>     <str name="Committed">2008-06-10 14:38:56</str>
>     <str name="Time taken ">0:0:0.32</str>
>   </lst>
>   <str name="WARNING">
>     This response format is experimental. It is likely to change
>     in the future.
>   </str>
> </response>
>
> I am sure I am doing something wrong, but cannot figure out what. I read
> through all the online documentation several times, plus the full examples
> ( slashdot RSS feed ).
> I would gladly have feedback from anyone who tried to index HTTP/XML
> sources, and got it to work smoothly.
>
> Thanks a million in advance,
>
> Regards,
> Nicolas
> --
> Nicolas Pastorino
> eZ Systems ( Western Europe ) | http://ez.no

--
--Noble Paul
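
For reference, here is a sketch of how the whole data-config.xml might look with that one change applied. It assumes everything else in the quoted configuration is kept exactly as posted; the only new attribute is rootEntity="false" on the outer entity. The "&" in the URL is written as "&amp;" so that the file stays well-formed XML. The comments are added for illustration only.

```xml
<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- Outer entity: only reads the listing of URLs.
         rootEntity="false" means no documents are created at this level. -->
    <entity name="oldsearchcontentlist"
            pk="m_guid"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry"
            rootEntity="false">

      <field column="elementurl"
             xpath="/root/entries/entry/source/" />

      <!-- Inner entity: fetched once per listed URL.
           With the change above it is treated as the root,
           so one document is created per /root/contenido element. -->
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido"
              transformer="TemplateTransformer">
        <field column="m_guid"
               xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

After reloading this configuration, re-running the same full-import command in debug mode should show whether rows are now fetched for 'oldsearchcontent'.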