Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
enhance wikipedia example to show off use of $skipDoc

------------------------------------------------------------------------------
  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
-     <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/data/enwiki-20080724-pages-articles.xml">
+     <entity name="page"
+             processor="XPathEntityProcessor"
+             stream="true"
+             forEach="/mediawiki/page/"
+             url="/data/enwiki-20080724-pages-articles.xml"
+             transformer="RegexTransformer,DateFormatTransformer"
+             >
-       <field column="id" xpath="/mediawiki/page/id" />
+       <field column="id" xpath="/mediawiki/page/id" />
-       <field column="title" xpath="/mediawiki/page/title" />
+       <field column="title" xpath="/mediawiki/page/title" />
-       <field column="revision" xpath="/mediawiki/page/revision/id" />
+       <field column="revision" xpath="/mediawiki/page/revision/id" />
-       <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
+       <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
-       <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
+       <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
-       <field column="text" xpath="/mediawiki/page/revision/text" />
+       <field column="text" xpath="/mediawiki/page/revision/text" />
-       <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+       <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+       <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
-     </entity>
+     </entity>
    </document>
  </dataConfig>
  }}}
  The relevant portion of schema.xml is below:
  {{{
- <field name="id" type="integer" indexed="true" stored="true" required="true"/>
+ <field name="id" type="integer" indexed="true" stored="true" required="true"/>
- <field name="title" type="string" indexed="true" stored="false"/>
+ <field name="title" type="string" indexed="true" stored="false"/>
- <field name="revision" type="sint" indexed="true" stored="true"/>
+ <field name="revision" type="sint" indexed="true" stored="true"/>
- <field name="user" type="string" indexed="true" stored="true"/>
+ <field name="user" type="string" indexed="true" stored="true"/>
- <field name="userId" type="integer" indexed="true" stored="true"/>
+ <field name="userId" type="integer" indexed="true" stored="true"/>
- <field name="text" type="text" indexed="true" stored="false"/>
+ <field name="text" type="text" indexed="true" stored="false"/>
- <field name="timestamp" type="date" indexed="true" stored="true"/>
+ <field name="timestamp" type="date" indexed="true" stored="true"/>
- <field name="titleText" type="text" indexed="true" stored="true"/>
+ <field name="titleText" type="text" indexed="true" stored="true"/>
  ...
  <uniqueKey>id</uniqueKey>
  <copyField source="title" dest="titleText"/>
  }}}
- Time taken was around 2 hours 40 minutes to index 7278241 articles with peak memory usage at around 4GB.
+ Time taken was around 2 hours 40 minutes to index 7278241 articles with peak memory usage at around 4GB. Note that many articles are merely redirects to other articles. The use of $skipDoc allows those articles to be ignored.

  == Using delta-import command ==
- The only !EntityProcessor which supports delta is !SqlEntityProcessor! The X!PathEntityProcessor has not implemented it yet. So, unfortunately, there is no delta support for XML at this thime.
+ The only !EntityProcessor which supports delta is !SqlEntityProcessor! The X!PathEntityProcessor has not implemented it yet. So, unfortunately, there is no delta support for XML at this time.
  If you want to implement those methods in X!PathEntityProcessor: The methods are explained in !EntityProcessor.java.

  = Indexing Emails =
