Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The following page has been changed by HossMan:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
SOLR->Solr and some intro tweaks

------------------------------------------------------------------------------
+ = Data Import Request Handler =
- <!> ["Solr1.3"] 
+ <!> ["Solr1.3"]
+ 
+ Most applications store data in relational databases or XML files, and 
searching over such data is a common use-case. The !DataImportHandler is a Solr 
contrib that provides a configuration-driven way to import this data into Solr, 
supporting both full builds and incremental delta imports.
  
  [[TableOfContents]]
  
  = Overview =
- 
- == Motivation ==
- Most applications store data in relational databases and searching over such 
data is a common use-case. However, there is no standard way to import this 
data into SOLR index requiring custom tools external to SOLR. Another common 
use case is data available in REST datasources (eg: RSS)  , xml files etc
  
  == Goals ==
   * Read data residing in relational databases 
-  * Build SOLR documents by aggregating data from multiple columns and tables 
according to configuration
+  * Build Solr documents by aggregating data from multiple columns and tables 
according to configuration
-  * Update SOLR with such documents
+  * Update Solr with such documents
   * Provide ability to do full imports according to configuration
   * Detect insert/update deltas (changes) and do delta imports (we assume a 
last-modified timestamp column for this to work)
   * Schedule full imports and delta imports
@@ -43, +43 @@

  * Define a data-config.xml and specify the location of this file in 
solrconfig.xml under the DataImportHandler section
   * Give connection information (if you choose to put the datasource 
information in solrconfig)    
   * Open the DataImportHandler page to verify if everything is in order 
[http://localhost:8983/solr/dataimport]
-  * Use full-import command to do a full import from the database and add to 
SOLR index
+  * Use the full-import command to do a full import from the database and add 
it to the Solr index
-  * Use delta-import command to do a delta import (get new inserts/updates) 
and add to SOLR index
+  * Use the delta-import command to do a delta import (get new inserts/updates) 
and add them to the Solr index
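  A minimal sketch of the registration mentioned in the first step; the handler 
name `/dataimport` and the config filename are illustrative choices, not the 
only possible ones:
  
  {{{
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
  }}}
  
  The full-import and delta-import commands are then just the URL above with 
`command=full-import` or `command=delta-import` appended.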
  
  [[Anchor(dsconfig)]]
  
@@ -89, +89 @@

  Any extra attributes put into the tag are passed on directly to the JDBC 
driver.
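  For example, a datasource element for a MySQL database might look like the 
following sketch; the driver class, URL and credentials are placeholders, and 
any extra attributes would be passed through to the driver as described above:
  
  {{{
<dataSource type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://localhost/dbname"
    user="db_username"
    password="db_password"/>
  }}}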
  
  == Configuration in data-config.xml ==
- A SOLR document can be considered as a de-normalized schema having fields 
whose values come from multiple tables.
+ A Solr document can be considered a de-normalized record whose field values 
come from multiple tables.
  
- The data-config.xml starts by defining a `document` element. A `document` 
represents one kind of document.  A document contains one or more root 
entities. A root entity can contain multiple sub-entities which in turn can  
contain other entities. An entity is a table/view in a relational database. 
Each entity can contain multiple fields. Each field corresponds to a column in 
the resultset returned by the ''query'' in the entity. For each field, mention 
the column name in the resultset. If the column name is different from the solr 
field name, then another attribute ''name'' should be given. Rest of the 
required attributes such as ''type'' will be inferred directly from the SOLR 
schema.xml. (Can be overridden)
+ The data-config.xml starts by defining a `document` element. A `document` 
represents one kind of document. A document contains one or more root 
entities. A root entity can contain multiple sub-entities, which in turn can 
contain other entities. An entity is a table/view in a relational database. 
Each entity can contain multiple fields. Each field corresponds to a column in 
the resultset returned by the ''query'' in the entity. For each field, specify 
the column name in the resultset. If the column name is different from the Solr 
field name, an additional attribute ''name'' should be given. The rest of the 
required attributes, such as ''type'', are inferred directly from the Solr 
schema.xml (and can be overridden).
  
  In order to get data from the database, our design philosophy revolves around 
'templatized SQL' entered by the user for each entity. This gives users the 
entire power of SQL if they need it. The root entity is the central table whose 
columns can be used to join this table with other child entities.
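  As a sketch, a root entity with one child entity might look like the 
following; the table and column names here are hypothetical, and `${item.ID}` 
shows how the child query is templatized on the parent row:
  
  {{{
<document>
  <entity name="item" query="select * from item">
    <field column="NAME" name="name"/>
    <!-- the child query is resolved once per parent row -->
    <entity name="feature"
        query="select description from feature where item_id='${item.ID}'">
      <field column="description" name="features"/>
    </entity>
  </entity>
</document>
  }}}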
  
@@ -138, +138 @@

  
  inline:example-schema.png
  
- This is a relational model of the same schema that SOLR currently ships with. 
We will use this as an example to build a data-config.xml for 
DataImportHandler. We've created a sample database with this schema in HSQLDB.  
To run it, do the following steps:
+ This is a relational model of the same schema that Solr currently ships with. 
We will use this as an example to build a data-config.xml for 
DataImportHandler. We've created a sample database with this schema in HSQLDB.  
To run it, do the following steps:
  
   1. Download attachment:example-solr-home.jar and use ''jar -xvf 
example-solr-home.jar'' to extract it to your local drive. This jar file 
contains a complete solr home with all the configuration you need to execute 
this as well as the RSS example (given later in this page). It also contains an 
example hsqldb schema (in hsqldb folder)
-  2. In the example-solr-home, there is a ''solr.war''. Copy this war file to 
your tomcat/jetty webapps folder. In addition to the 
[http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] patch, this war file 
also contains the JDBC driver for hsqldb needed to execute this example. If you 
want to deploy it with your existing solr installation, just drop in the 
'dataimport.jar' (find it in the jar) to WEB-INF/lib of your deployed SOLR 
webapp.
+  2. In the example-solr-home, there is a ''solr.war''. Copy this war file to 
your tomcat/jetty webapps folder. In addition to the 
[http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] patch, this war file 
also contains the JDBC driver for hsqldb needed to execute this example. If you 
want to deploy it with your existing Solr installation, just drop the 
'dataimport.jar' (found inside the jar) into WEB-INF/lib of your deployed Solr 
webapp.
   3. Use the ''solr'' folder inside the ''example-data-config'' folder as your 
Solr home.
   4. Hit [http://localhost:8983/solr/dataimport] with a browser to verify the 
configuration.
   5. Hit [http://localhost:8983/solr/dataimport?command=full-import] to do a 
full import.
  
- The ''solr'' folder given in the above jar is a MultiCore SOLR home. It has 
two cores, one for the DB example (this one) and one for an RSS example (new 
feature).
+ The ''solr'' folder given in the above jar is a MultiCore Solr home. It has 
two cores, one for the DB example (this one) and one for an RSS example (new 
feature).
  
   * The data-config.xml used for this example is:
  
@@ -284, +284 @@

  }}}
  
  Here we have three queries specified for each entity except the root (which 
has only two).
-  * The ''query'' gives us the data needed to populate fields of the SOLR 
document
+  * The ''query'' gives us the data needed to populate fields of the Solr 
document
   * The ''deltaQuery'' gives the primary keys of the rows in the current 
entity which have changed since the last index time
   * The ''parentDeltaQuery'' uses the changed rows of the current table 
(fetched with deltaQuery) to give the changed rows in the parent table. This is 
necessary because whenever a row in the child table changes, we need to 
re-generate the document which contains that row.
  
  Let us reiterate on the findings:
   * For each row given by ''query'', the query of the child entity is executed 
once.
   * For each row given by ''deltaQuery'', the parentDeltaQuery is executed.
-  * If any row in the root/child entity changes, we regenerate the complete 
SOLR document which contained that row.
+  * If any row in the root/child entity changes, we regenerate the complete 
Solr document which contained that row.
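  Putting the three queries together, a child entity might be sketched as 
follows (the table and column names are hypothetical; 
`${dataimporter.last_index_time}` is the variable holding the last index time):
  
  {{{
<entity name="feature" pk="ITEM_ID"
    query="select description from feature where item_id='${item.ID}'"
    deltaQuery="select item_id from feature
                where last_modified > '${dataimporter.last_index_time}'"
    parentDeltaQuery="select ID from item where ID='${feature.ITEM_ID}'">
  <field column="description" name="features"/>
</entity>
  }}}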
  
  = Usage with XML/HTTP Datasource =
  DataImportHandler can be used to index data from HTTP-based data sources. 
This includes indexing from REST/XML APIs as well as from RSS/Atom feeds.
@@ -320, +320 @@

  
  
  The fields can have the following attributes (over and above the default 
attributes):
-  * '''`xpath`''' (required) : The xpath expression of the field to be mapped 
as a column in the record . It can be omitted if the column does not come from 
an xml attribute. That means it can be a synthetic field created by a 
transformer. If a field is marked as multivalued in the schema and in a given 
row if the xpath finds multiple values it is handled automaticallly by the 
X!PathEntityProcessor. No extra configuration is required
+  * '''`xpath`''' (required) : The xpath expression of the field to be mapped 
as a column in the record. It can be omitted if the column does not come from 
an xml attribute; that is, it can be a synthetic field created by a 
transformer. If a field is marked as multivalued in the schema and the xpath 
finds multiple values in a given row, this is handled automatically by the 
X!PathEntityProcessor. No extra configuration is required.
  
   * '''`commonField`''' : can be (true|false). If true, this field, once 
encountered in a record, will be copied to other records before creating a Solr 
document
  
@@ -365, +365 @@

  </dataConfig>
  }}}
  
- This data-config is where the action is. If you read the structure of the 
Slashdot RSS, it has a few header elements such as title, link and subject. 
Those are mapped to the SOLR fields source, source-link and subject 
respectively using xpath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items. So, what we wish to do is , create a 
document in SOLR for each 'item'. 
+ This data-config is where the action is. If you read the structure of the 
Slashdot RSS, it has a few header elements such as title, link and subject. 
Those are mapped to the Solr fields source, source-link and subject 
respectively using xpath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items. So, what we wish to do is create a 
document in Solr for each 'item'.
  
- The X!PathEntityprocessor is designed to stream the xml, row by row (Think of 
a row as various fields in a xml element ). It uses the ''forEach'' attribute 
to identify a 'row'. In this example forEach has the value `'/RDF/channel | 
/RDF/item'` . This says that this xml has two types of rows (This uses the 
xpath syntax for OR and there can be more than one type of rows) . After it 
encounters a row , it tries to read as many fields are there in the field 
declarations. So in this case, when it reads the row `'/RDF/channel'` it may 
get 3 fields 'source', 'source-link' , 'source-subject' . After it processes 
the row it realizes that it does not have any value for the 'pk' field so it 
does not try to create a SOLR document for this row (Even if it tries it may 
fail in solr). But all these 3 fields are marked as `commonField="true"` . So 
it keeps the values handy for subsequent rows.
+ The X!PathEntityProcessor is designed to stream the xml, row by row (think of 
a row as the various fields in an xml element). It uses the ''forEach'' 
attribute to identify a 'row'. In this example forEach has the value 
`'/RDF/channel | /RDF/item'`. This says that this xml has two types of rows 
(this uses the xpath syntax for OR, and there can be more than one type of 
row). After it encounters a row, it tries to read as many fields as there are 
in the field declarations. So in this case, when it reads the row 
`'/RDF/channel'` it may get 3 fields: 'source', 'source-link', 
'source-subject'. After it processes the row it realizes that it does not have 
any value for the 'pk' field, so it does not try to create a Solr document for 
this row (even if it tried, it would fail in Solr). But all these 3 fields are 
marked as `commonField="true"`, so it keeps the values handy for subsequent 
rows.
  
  It moves ahead and encounters `/RDF/item` and processes the rows one by one. 
It gets the values for all the fields except the 3 fields in the header. But as 
those were marked as common fields, the processor puts them into the record 
just before creating the document.
  
  What about the ''transformer=!DateFormatTransformer'' attribute in the 
entity? See the [#DateFormatTransformer DateFormatTransformer] section for 
details.
  
- You can use this feature for indexing from REST API's such as rss/atom feeds, 
XML data feeds , other SOLR servers or even well formed xhtml documents . Our 
XPath support has its limitations (no wildcards , only fullpath etc) but we 
have tried to make sure that common use-cases are covered and since it's based 
on a streaming parser, it is extremely fast and consumes constant amount of 
memory even for large XMLs. It does not support namespaces , but it can handle 
xmls with namespaces . When you provide the xpath, just drop the namespace and 
give the rest (eg if the tag is `'<dc:subject>'` the mapping should just 
contain `'subject'`).Easy, isn't it? And you didn't need to write one line of 
code! Enjoy :)
+ You can use this feature for indexing from REST APIs such as rss/atom feeds, 
XML data feeds, other Solr servers, or even well-formed xhtml documents. Our 
XPath support has its limitations (no wildcards, only full paths, etc.), but we 
have tried to make sure that common use-cases are covered, and since it is 
based on a streaming parser, it is extremely fast and consumes a constant 
amount of memory even for large XMLs. It does not support namespaces, but it 
can handle xmls with namespaces: when you provide the xpath, just drop the 
namespace and give the rest (e.g. if the tag is `'<dc:subject>'` the mapping 
should just contain `'subject'`). Easy, isn't it? And you didn't need to write 
one line of code! Enjoy :)
  
  /!\ Note: Unlike with a database, it is not possible to omit the field 
declarations if you are using X!PathEntityProcessor. It relies on the xpaths 
declared in the fields to identify what to extract from the xml.
  = Extending the tool with APIs =
@@ -774, +774 @@

  inline:interactive-dev-dataimporthandler.PNG
  
  = Where to find it? =
- DataImportHandler is a new addition to SOLR. You can either:
+ DataImportHandler is a new addition to Solr. You can either:
   * Download a nightly build of Solr from [http://lucene.apache.org/solr/ Solr 
website], or
   * Use the steps given in Full Import Example to try it out.
  
- For a history of development discussion related to DataImportHandler, please 
see [http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] in the SOLR JIRA.
+ For a history of development discussion related to DataImportHandler, please 
see [http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] in the Solr JIRA.
  
  Please help us by giving your comments, suggestions and/or code contributions 
on this new feature.
  
