Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Added wikipedia example

------------------------------------------------------------------------------
  You can use this feature for indexing from REST APIs such as RSS/Atom feeds, XML data feeds, other Solr servers, or even well-formed XHTML documents. Our XPath support has its limitations (no wildcards, only full paths, etc.), but we have tried to make sure that common use-cases are covered, and since it is based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XML files. It does not support namespaces, but it can handle XML with namespaces: when you provide the XPath, just drop the namespace and give the rest (e.g. if the tag is `'<dc:subject>'`, the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
- /!\ Note: Unlike with a database, it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the XPaths declared in the fields to identify what to extract from the XML.
+ /!\ Note: Unlike with a database, it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the XPaths declared in the fields to identify what to extract from the XML.
+ 
+ [[Anchor(wikipedia)]]
+ == Example: Indexing wikipedia ==
+ The following data-config.xml was used to index a full (en-articles, recent only) [http://download.wikimedia.org/enwiki/20080724/ wikipedia dump]. The file downloaded from wikipedia was pages-articles.xml.bz2, which when uncompressed is around 18GB on disk.
+ 
+ {{{
+ <dataConfig>
+   <dataSource type="FileDataSource" encoding="UTF-8" />
+   <document>
+     <entity name="page"
+             processor="XPathEntityProcessor"
+             stream="true"
+             forEach="/mediawiki/page/"
+             url="/data/enwiki-20080724-pages-articles.xml">
+       <field column="id" xpath="/mediawiki/page/id" />
+       <field column="title" xpath="/mediawiki/page/title" />
+       <field column="revision" xpath="/mediawiki/page/revision/id" />
+       <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
+       <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
+       <field column="text" xpath="/mediawiki/page/revision/text" />
+       <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+     </entity>
+   </document>
+ </dataConfig>
+ }}}
+ The relevant portion of schema.xml is below:
+ {{{
+ <field name="id" type="integer" indexed="true" stored="true" required="true"/>
+ <field name="title" type="string" indexed="true" stored="false"/>
+ <field name="revision" type="sint" indexed="true" stored="true"/>
+ <field name="user" type="string" indexed="true" stored="true"/>
+ <field name="userId" type="integer" indexed="true" stored="true"/>
+ <field name="text" type="text" indexed="true" stored="false"/>
+ <field name="timestamp" type="date" indexed="true" stored="true"/>
+ <field name="titleText" type="text" indexed="true" stored="true"/>
+ ...
+ <uniqueKey>id</uniqueKey>
+ <copyField source="title" dest="titleText"/>
+ }}}
+ 
+ Time taken was around 2 hours 40 minutes to index 7278241 articles, with peak memory usage at around 4GB.
+ 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to meet every user's needs through XML configuration alone, so we expose a few abstract classes which can be implemented by the user to enhance the functionality.
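As a minimal sketch of what such an extension can look like, here is a hypothetical custom transformer for DataImportHandler. It assumes the DIH transformer contract (a `transformRow` method that receives the row as a `Map` of column names to values, as declared in data-config.xml); the class name `TrimTransformer` and the trimming behavior are illustrative, not part of Solr itself.

```java
import java.util.Map;

// Hypothetical DIH transformer: trims whitespace from the "title" column.
// DIH looks up a transformRow(Map) method by reflection, so a custom
// transformer does not strictly need to extend the Transformer base class.
public class TrimTransformer {

    // Called once per row. Mutate and return the row to keep it;
    // returning null would drop the row from the index.
    public Object transformRow(Map<String, Object> row) {
        Object title = row.get("title"); // matches <field column="title" .../>
        if (title instanceof String) {
            row.put("title", ((String) title).trim());
        }
        return row;
    }
}
```

It would then be wired in through the entity declaration, e.g. `<entity name="page" transformer="TrimTransformer" ...>` (other attributes as in the config above).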
