[Solr Wiki] Update of "DataImportHandler" by ShalinMangar

Apache Wiki Sun, 30 Mar 2008 10:42:54 -0700

The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Added explanation on the RSS indexing example

------------------------------------------------------------------------------
  The data-config for this example looks like this:
  {{{
  <dataConfig>
- 
        <document>
                <entity name="slashdot"
                                pk="link"
@@ -371, +370 @@

  </dataConfig>
  }}}
  
+ This data-config is the interesting part. If you look at the structure of the 
Slashdot RSS feed, it has a few header elements such as title, link and 
subject. Those are mapped to the SOLR fields source, source-link and subject 
respectively using XPath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items.
+ 
+ The ''forEach'' attribute in the slashdot ''entity'' contains an XPath 
expression which tells DataImportHandler "What are the records that need to be 
converted into SOLR documents?". As you can see in the data-config, 
forEach="/RDF/channel | /RDF/item" specifies two kinds of records separated by 
'|' (the union operator in standard XPath). The first one says "Create a SOLR 
document for each ''channel'' element". The second one says "Create a SOLR 
document for each ''item'' element".
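+ 
+ To see what the '|' union selects, here is a minimal sketch using the standard 
javax.xml.xpath API from the JDK. It is not the streaming XPath implementation 
that DataImportHandler uses, and the sample XML is a made-up, namespace-free 
stand-in for the real feed; the class and method names are hypothetical too.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ForEachUnionSketch {
    // Hypothetical sample: one header element and two news items.
    static final String SAMPLE =
        "<RDF>"
      + "<channel><title>Slashdot</title></channel>"
      + "<item><title>Story one</title><link>http://example.com/1</link></item>"
      + "<item><title>Story two</title><link>http://example.com/2</link></item>"
      + "</RDF>";

    // Returns the element name of every node matched by the expression.
    static String[] selectedNames(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            String[] names = new String[nodes.getLength()];
            for (int i = 0; i < nodes.getLength(); i++) {
                names[i] = nodes.item(i).getNodeName();
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Each matched node corresponds to one candidate record.
        for (String name : selectedNames(SAMPLE, "/RDF/channel | /RDF/item")) {
            System.out.println(name);
        }
    }
}
```

With this sample the union matches three nodes, one ''channel'' and two ''item'' 
elements, so three candidate records reach the next step.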
+ 
+ But of course, it doesn't make sense to create a SOLR document containing 
only the header elements, right? That's what we thought too, which is why we 
have the ''pk'' attribute in the slashdot ''entity''. The ''pk=link'' tells 
DataImportHandler to create a SOLR document for a record only if the ''link'' 
field is present in it, and otherwise to move on to the next record. The 
Slashdot RSS feed has only one ''/RDF/channel'' element, so there is only one 
record containing the source, source-link and subject fields. Since this record 
does not contain the ''link'' field (our pk), no SOLR document is created for 
it and the !EntityProcessor just moves on.
+ 
+ But we did want to store those header fields, right? Yes, and we can do that 
by adding the ''commonField=true'' attribute to the header fields (source, 
source-link and subject). The ''commonField=true'' says "store the values for 
these fields and add them to each SOLR document created". Therefore, when the 
processor comes to the records for ''/RDF/item'' elements, which contain our 
pk, it creates a SOLR document for each of them and adds the header fields to 
it.
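+ 
+ The interplay of ''pk'' and ''commonField'' described above can be sketched in 
plain Java. This is only a simulation of the behaviour as explained here, not 
DataImportHandler's actual code; records are modelled as simple maps, and all 
class, method and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PkCommonFieldSketch {
    // Fields assumed to be flagged commonField=true, as in the RSS example.
    static final List<String> COMMON_FIELDS =
        Arrays.asList("source", "source-link", "subject");

    // Hypothetical records: one from /RDF/channel, two from /RDF/item.
    static List<Map<String, String>> sampleRecords() {
        Map<String, String> channel = new LinkedHashMap<>();
        channel.put("source", "Slashdot");
        channel.put("source-link", "http://slashdot.org/");
        channel.put("subject", "Technology");

        Map<String, String> first = new LinkedHashMap<>();
        first.put("title", "Story one");
        first.put("link", "http://example.com/1");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("title", "Story two");
        second.put("link", "http://example.com/2");

        return Arrays.asList(channel, first, second);
    }

    static List<Map<String, String>> toDocuments(
            List<Map<String, String>> records, String pk) {
        Map<String, String> common = new LinkedHashMap<>();
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> record : records) {
            // Remember common-field values seen in this record.
            for (String name : COMMON_FIELDS) {
                if (record.containsKey(name)) {
                    common.put(name, record.get(name));
                }
            }
            // Without a pk value the record yields no document.
            if (!record.containsKey(pk)) {
                continue;
            }
            // Start from the stored common fields, then add the record's own.
            Map<String, String> doc = new LinkedHashMap<>(common);
            doc.putAll(record);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        for (Map<String, String> doc : toDocuments(sampleRecords(), "link")) {
            System.out.println(doc);
        }
    }
}
```

For the three sample records this yields two documents, one per item, and each 
carries the source, source-link and subject values taken from the channel 
record.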
+ 
+ What about the ''transformer=!DateFormatTransformer'' attribute in the 
entity? Date representation is always a problem when importing data. Each data 
source uses its own format for representing dates, but the value has to be 
parsed and converted into a java.util.Date object for SOLR to index it into a 
date field. Therefore, we supply a transformer called !DateFormatTransformer: 
you supply the input format for the date string and it does the rest. It uses 
the java.text.!SimpleDateFormat class internally, so the syntax for the 
dateTimeFormat attribute is the same as you'd write if you were using the 
!SimpleDateFormat class directly.
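+ 
+ As a rough illustration of the pattern syntax, the sketch below parses a date 
string with java.text.!SimpleDateFormat directly. The pattern, the sample date 
string and the choice of UTC are assumptions made for this example and are not 
taken from the data-config; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    // Assumed input pattern; use whatever matches your data source.
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    // Parses the text with PATTERN, interpreting it as UTC.
    static Date parseUtc(String text) {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false); // reject values such as month 13
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        Date parsed = parseUtc("2008-03-30T10:42:54");
        // Milliseconds since the epoch for the parsed instant.
        System.out.println(parsed.getTime());
    }
}
```

The same pattern string is what you would place in the dateTimeFormat 
attribute, provided it matches the format your source actually emits.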
+ 
+ You can use this feature for indexing from REST APIs such as RSS/Atom feeds, 
other SOLR servers, XML data feeds or Last.FM user profiles! The possibilities 
are endless. Our XPath support has its limitations, but we have tried to make 
sure that common use-cases are covered, and since it's based on a streaming 
parser, it is extremely fast and consumes a constant amount of memory even for 
large XML files. Easy, isn't it? And you didn't need to write one line of code! 
Enjoy :)
+ 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to have 
all user needs met by an XML configuration alone. So we expose a few interfaces 
which can be implemented by the user to enhance the functionality.
