[Solr Wiki] Update of "DataImportHandler" by NoblePaul

Apache Wiki Mon, 31 Mar 2008 00:07:55 -0700

The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

------------------------------------------------------------------------------
  </dataConfig>
  }}}
  
- This data-config is the interesting part. If you read the structure of the 
Slashdot RSS, it has a few header elements such as title, link and subject. 
Those are mapped to the SOLR fields source, source-link and subject 
respectively using xpath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items.
+ This data-config is the interesting part. If you read the structure of the 
Slashdot RSS, it has a few header elements such as title, link and subject. 
Those are mapped to the SOLR fields source, source-link and subject 
respectively using xpath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items. So what we wish to do is create a 
document in SOLR for each ''item''. 
  
- The ''forEach'' attribute in the slashdot ''entity'' contains xpath which 
tells DataImportHandler "What are the records that need to be converted into 
SOLR documents?". As you can see in the data-config, the forEach="/RDF/channel 
| /RDF/item" specifies two kinds of records separated by '|' (OR in standard 
xpath lexicon). The first one says "Create a SOLR document for each ''channel'' 
element". The second one says "Create a SOLR document for each ''item'' 
element".
+ The X!PathEntityProcessor is designed to stream the XML row by row (think of 
a row as the set of fields in one XML element). It uses the ''forEach'' 
attribute to identify a 'row'. In this example forEach has the value 
`'/RDF/channel | /RDF/item'`. This says that the XML has two types of rows 
(it uses the xpath syntax for OR, and there can be more than one type of row). 
After it encounters a row, it tries to read as many fields as there are in the 
field declarations. So in this case, when it reads the row `'/RDF/channel'` it 
may get 3 fields: 'source', 'source-link' and 'source-subject'. After it 
processes the row it realizes that it does not have any value for the 'pk' 
field, so it does not try to create a SOLR document for this row (even if it 
tried, the add might fail in Solr). But all 3 of these fields are marked as 
`commonField="true"`, so it keeps the values handy for subsequent rows.
  
- But of course, it doesn't make sense to create a SOLR document containing 
only the header elements, right? That's what we thought too, therefore we have 
the ''pk'' attribute in the slashdot ''entity''. The ''pk=link'' tells 
DataImportHandler to create a SOLR document for a record only if the ''link'' 
field is present in that record. Otherwise, it just moves on to the next one. 
The Slashdot RSS feed has only one ''/RDF/channel'' element, therefore there 
is only one record containing the source, source-link and subject fields. 
Since this record does not contain the ''link'' field (our pk), no SOLR 
document is created for this record and the !EntityProcessor just moves on.
+ It then moves ahead, encounters `/RDF/item` and processes those rows one by 
one. It gets the values for all the fields except for the 3 fields in the 
header. But as they were marked as common fields, the processor puts those 
fields into the record just before creating the document.
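
The row handling described above can be sketched in Java roughly as follows. This is not DataImportHandler's actual code: the class `RowStreamSketch`, its `importDocs` method, the hard-coded field table, the `title`/`link` item columns and the use of `subject` as the third header column are assumptions made for illustration, and only simple absolute element paths are matched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not the real XPathEntityProcessor.
class RowStreamSketch {
    // Paths that start a new "row" (the two alternatives of forEach).
    static final Set<String> FOR_EACH =
            new HashSet<>(Arrays.asList("/RDF/channel", "/RDF/item"));
    // Columns declared with commonField="true".
    static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("source", "source-link", "subject"));
    static final String PK = "link";
    // Element path -> column name, standing in for the <field> declarations.
    static final Map<String, String> FIELDS = new HashMap<>();
    static {
        FIELDS.put("/RDF/channel/title", "source");
        FIELDS.put("/RDF/channel/link", "source-link");
        FIELDS.put("/RDF/channel/subject", "subject");
        FIELDS.put("/RDF/item/title", "title"); // assumed item column
        FIELDS.put("/RDF/item/link", "link");   // assumed item column (the pk)
    }

    static List<Map<String, String>> importDocs(String xml) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> common = new LinkedHashMap<>();
        Map<String, String> row = null;
        StringBuilder text = new StringBuilder();
        String path = "";
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path = path + "/" + r.getLocalName();
                    text.setLength(0);
                    if (FOR_EACH.contains(path)) {
                        row = new LinkedHashMap<>(); // a new row begins
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(r.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String column = FIELDS.get(path);
                    if (row != null && column != null) {
                        row.put(column, text.toString().trim());
                    }
                    if (row != null && FOR_EACH.contains(path)) {
                        // Row finished: remember common fields for later rows.
                        for (Map.Entry<String, String> e : row.entrySet()) {
                            if (COMMON.contains(e.getKey())) {
                                common.put(e.getKey(), e.getValue());
                            }
                        }
                        // Only rows that carry the pk become documents.
                        if (row.containsKey(PK)) {
                            Map<String, String> doc = new LinkedHashMap<>(common);
                            doc.putAll(row);
                            docs.add(doc);
                        }
                        row = null;
                    }
                    path = path.substring(0, path.lastIndexOf('/'));
                    text.setLength(0);
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return docs;
    }

    public static void main(String[] args) {
        String xml = "<RDF><channel><title>Slashdot</title>"
                + "<link>http://slashdot.org/</link></channel>"
                + "<item><title>A story</title>"
                + "<link>http://example.org/a</link></item></RDF>";
        System.out.println(importDocs(xml));
    }
}
```

In this sketch the `channel` row yields no document because it has no `link` column, while each `item` row yields one document that also carries the remembered header values.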
  
- But, we did want to store those header fields, right? Yes, we can do that by 
adding ''commonField=true'' attribute to the header fields (source, source-link 
and subject). The ''commonField=true'' says that "store the values for these 
fields and add them to each SOLR document created". Therefore, when the 
processor comes to records of ''/RDF/item'' elements which contain our pk, it 
creates a SOLR document for them and adds the header fields to each such 
document.
+ What about this ''transformer=!DateFormatTransformer'' attribute in the 
entity? This is a built-in utility transformer that helps the user parse date 
strings in a custom format into 'Date' objects. Note the field `<field 
column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />`. 
The transformer only applies to a field which has the attribute 
'dateTimeFormat', and it uses the syntax of 
[http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html Java's 
!SimpleDateFormat].
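
To make the `dateTimeFormat` pattern concrete, here is a small Java sketch that parses a date string with `java.text.SimpleDateFormat` using the pattern from the field above. The class `DateFormatSketch`, its `parse` helper, the sample timestamp and the UTC time zone are assumptions for illustration; this is not the transformer's own code. One detail of `SimpleDateFormat` worth knowing: `hh` is the 12-hour clock field and `HH` the 24-hour one, so the two patterns can give different results, for example for an input hour of 12.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper; shows only how a SimpleDateFormat pattern is applied.
class DateFormatSketch {
    static Date parse(String value, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        // Assumption: treat inputs as UTC so results do not depend on the host.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse date: " + value, e);
        }
    }

    public static void main(String[] args) {
        // Same pattern as the dateTimeFormat attribute in the field above.
        Date date = parse("2008-03-31T07:07:55", "yyyy-MM-dd'T'hh:mm:ss");
        System.out.println(date.getTime());
    }
}
```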
  
- What about this ''transformer=!DateFormatTransformer'' attribute in the 
entity? Date representation is always a problem when getting data. Each data 
source decides to use its own format for representing dates but you need to 
parse it and convert it into a java.util.Date object for SOLR to index into a 
date field. Therefore, we supply a transformer called !DateFormatTransformer 
which needs you to supply the input format for the date string and we'll do the 
rest. It uses java.text.!SimpleDateFormat class internally, so the syntax for 
dateTimeFormat attribute is the same as you'd write if you were using 
!SimpleDateFormat class.
  
+ You can use this feature for indexing from REST APIs such as RSS/Atom feeds, 
XML data feeds, other SOLR servers or even well-formed XHTML documents. Our 
XPath support has its limitations, but we have tried to make sure that common 
use-cases are covered, and since it is based on a streaming parser, it is 
extremely fast and consumes a constant amount of memory even for large XML 
files. Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
- You can use this feature for indexing from REST APIs such as RSS/Atom feeds, 
other SOLR servers, XML data feeds or Last.FM user profiles! The possibilities 
are endless. Our XPath support has its limitations, but we have tried to make 
sure that common use-cases are covered, and since it is based on a streaming 
parser, it is extremely fast and consumes a constant amount of memory even for 
large XML files. Easy, isn't it? And you didn't need to write one line of code! 
Enjoy :)
- 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to have 
all user needs met by an XML configuration alone. So we expose a few interfaces 
which can be implemented by the user to enhance the functionality.
