Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Added wikipedia example

------------------------------------------------------------------------------
  You can use this feature for indexing from REST APIs such as RSS/Atom feeds, XML data feeds, other Solr servers, or even well-formed XHTML documents. Our XPath support has its limitations (no wildcards, only full paths, etc.), but we have tried to make sure that common use-cases are covered, and since it is based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XML files. It does not support namespaces, but it can handle XML with namespaces: when you provide the XPath, just drop the namespace and give the rest (e.g. if the tag is `'<dc:subject>'`, the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
- /!\ Note: Unlike with a database, it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the XPaths declared in the fields to identify what to extract from the XML.
+ /!\ Note: Unlike with a database, it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the XPaths declared in the fields to identify what to extract from the XML.
+ 
+ [[Anchor(wikipedia)]]
+ == Example: Indexing wikipedia ==
+ The following data-config.xml was used to index a full (en-articles, recent only) [http://download.wikimedia.org/enwiki/20080724/ wikipedia dump]. The file downloaded from wikipedia was pages-articles.xml.bz2, which when uncompressed is around 18GB on disk.
+ 
+ {{{
+ <dataConfig>
+   <dataSource type="FileDataSource" encoding="UTF-8" />
+   <document>
+     <entity name="page"
+             processor="XPathEntityProcessor"
+             stream="true"
+             forEach="/mediawiki/page/"
+             url="/data/enwiki-20080724-pages-articles.xml">
+       <field column="id" xpath="/mediawiki/page/id" />
+       <field column="title" xpath="/mediawiki/page/title" />
+       <field column="revision" xpath="/mediawiki/page/revision/id" />
+       <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
+       <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
+       <field column="text" xpath="/mediawiki/page/revision/text" />
+       <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+     </entity>
+   </document>
+ </dataConfig>
+ }}}
+ The relevant portion of schema.xml is below:
+ {{{
+ <field name="id" type="integer" indexed="true" stored="true" required="true"/>
+ <field name="title" type="string" indexed="true" stored="false"/>
+ <field name="revision" type="sint" indexed="true" stored="true"/>
+ <field name="user" type="string" indexed="true" stored="true"/>
+ <field name="userId" type="integer" indexed="true" stored="true"/>
+ <field name="text" type="text" indexed="true" stored="false"/>
+ <field name="timestamp" type="date" indexed="true" stored="true"/>
+ <field name="titleText" type="text" indexed="true" stored="true"/>
+ ...
+ <uniqueKey>id</uniqueKey>
+ <copyField source="title" dest="titleText"/>
+ }}}
+ 
+ Time taken was around 2 hours 40 minutes to index 7278241 articles, with peak memory usage at around 4GB.
+ 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to meet every user's needs through XML configuration alone, so we expose a few abstract classes which can be implemented by the user to enhance the functionality.
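As a minimal sketch of what such an extension can look like, here is a hypothetical custom transformer for DataImportHandler. It assumes the DIH transformer contract (a `transformRow` method that receives the row as a `Map` of column names to values, as declared in data-config.xml); the class name `TrimTransformer` and the trimming behavior are illustrative, not part of Solr itself.

```java
import java.util.Map;

// Hypothetical DIH transformer: trims whitespace from the "title" column.
// DIH looks up a transformRow(Map) method by reflection, so a custom
// transformer does not strictly need to extend the Transformer base class.
public class TrimTransformer {

    // Called once per row. Mutate and return the row to keep it;
    // returning null would drop the row from the index.
    public Object transformRow(Map<String, Object> row) {
        Object title = row.get("title"); // matches <field column="title" .../>
        if (title instanceof String) {
            row.put("title", ((String) title).trim());
        }
        return row;
    }
}
```

It would then be wired in through the entity declaration, e.g. `<entity name="page" transformer="TrimTransformer" ...>` (other attributes as in the config above).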
