[Solr Wiki] Update of "DataImportHandler" by ShalinMangar

Apache Wiki Sun, 10 Feb 2008 08:37:19 -0800

The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Updated to remove schema creation step as per SOLR-469

------------------------------------------------------------------------------
   * Define a data-config.xml and specify the location of this file in 
solrconfig.xml under DataImportHandler section
   * Give connection information such as JDBC Driver, JDBC URL, DB Username and 
password in solrconfig.xml under DataImportHandler section
   * Open the DataImportHandler page to verify if everything is in order 
[http://localhost:8983/solr/dataimport]
-  * Use the DataImportHandler's create-schema command to generate a SOLR 
schema out of the data-config.xml
   * Use full-import command to do a full import from the database and add to 
SOLR index
   * Use delta-import command to do a delta import (get new inserts/updates) 
and add to SOLR index
  
@@ -51, +50 @@

  == Configuration in data-config.xml ==
  A SOLR document can be considered as a de-normalized schema having fields 
whose values come from multiple tables.
  
- The data-config.xml starts by defining a "document" element which contains 
'''one root entity'''. The root entity can contain multiple sub-entities. An 
entity corresponds to a table in a relational database. Each entity can contain 
multiple fields. Each field can correspond to a column in its parent's table. 
Alternatively, a field can also be a copyField which can get data from multiple 
columns. For each field, write the same attributes as you would write in a SOLR 
schema.xml; when you use DataImportHandler to create the schema, the 
SOLR-specific attributes will be copied directly into the generated schema.
+ The data-config.xml starts by defining a "document" element which contains 
'''one root entity'''. The root entity can contain multiple sub-entities. An 
entity corresponds to a table in a relational database. Each entity can contain 
multiple fields. Each field can correspond to a column in its parent's table. 
Alternatively, a field can also be a copyField which can get data from multiple 
columns. For each field, write only the column name from which the value for 
this field should come. If the column name is different from the field name, 
then another attribute ''name'' should be given. The rest of the required 
attributes, such as ''type'', will be read directly from the SOLR schema.xml.
  
  In order to get data from the database, our design philosophy revolves around 
templatized 'sql' entered by the user for each entity. This gives users the 
full power of SQL if they need it. The root entity is the central table whose 
primary key can be used to join this table with other child entities.
  
@@ -63, +62 @@

  
  {{{
  <dataConfig>
-     <document name="products" defaultSearchField="text">
+     <document name="products">
          <entity name="item" pk="id" query="select * from item">
-             <field column="id" type="string" indexed="false" stored="true"/>
-             <field column="name" type="text" indexed="true" stored="true"/>
-             <field column="name" name="nameSort" type="string" indexed="true" 
stored="false"/>
-             <field column="name" name="alphaNameSort" type="alphaOnlySort" 
indexed="true" stored="false"/>
-             <field column="manu" type="text" indexed="true" stored="true" 
omitNorms="true"/>
-             <field column="weight" type="sfloat" indexed="true" 
stored="true"/>
-             <field column="price" type="sfloat" indexed="true" stored="true"/>
-             <field column="popularity" type="sint" indexed="true" 
stored="true"/>
-             <field column="inStock" type="boolean" indexed="true" 
stored="true"/>
+             <field column="id" />
+             <field column="name" />
+             <field column="name" name="nameSort" />
+             <field column="name" name="alphaNameSort" />
+             <field column="manu" />
+             <field column="weight" />
+             <field column="price" />
+             <field column="popularity" />
+             <field column="inStock" />
  
              <entity name="feature"
                      query="select description from feature where 
item_id='${item.id}'">
-                 <field name="feature" column="description" type="text" 
indexed="true" stored="true" multiValued="true"/>
+                 <field name="feature" column="description" />
              </entity>
              <entity name="item_category"
                      query="select category_id from item_category where 
item_id='${item.id}'">
                  <entity name="category"
                          query="select description from category where id = 
'${item_category.category_id}'">
-                     <field column="description" name="cat" type="text_ws" 
indexed="true" stored="true" multiValued="true" omitNorms="true" 
termVectors="true" />
+                     <field column="description" name="cat" />
                  </entity>
              </entity>
          </entity>
-         <field name="text">
-             <copyFrom>cat</copyFrom>
-             <copyFrom>name</copyFrom>
-             <copyFrom>manu</copyFrom>
-             <copyFrom>features</copyFrom>
-         </field>
      </document>
  </dataConfig>
  
@@ -102, +95 @@

  {{{
     <entity name="feature"
                      query="select description from feature where 
item_id='${item.id}'">
-                 <field name="feature" column="description" type="text" 
indexed="true" stored="true" multiValued="true"/>
+                 <field name="feature" column="description" />
              </entity> 
  }}}
  The ''item_id'' foreign key in the ''feature'' table is joined with the ''id'' 
primary key in ''item'' to retrieve the matching rows for each row in ''item''. 
In a similar fashion, we join ''item'' and ''category'' (which is a many-to-many 
relationship). Notice how we join these two tables using the intermediate table 
''item_category'', again using templated SQL.
@@ -111, +104 @@

                      query="select category_id from item_category where 
item_id='${item.id}'">
                  <entity name="category"
                          query="select description from category where id = 
'${item_category.category_id}'">
-                     <field column="description" name="cat" type="text_ws" 
indexed="true" stored="true" multiValued="true" omitNorms="true" 
termVectors="true" />
+                     <field column="description" name="cat" />
                  </entity>
              </entity>
  }}}
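To make the nested-entity idea concrete, here is a minimal Python sketch of how templated queries like the ones above could be resolved: each child query is run once per parent row, with ${entity.column} replaced by the value from that row. This is an illustration only, not the DataImportHandler implementation. The dictionary layout of CONFIG, the helper names and the sample table contents are all assumptions made for the example; only the query strings and the field mappings come from the data-config.xml above. Plain string substitution is used to mirror the templates and would not be safe for untrusted values.

```python
import re
import sqlite3

# Matches placeholders of the form ${entity.column}, as used in the queries above.
PLACEHOLDER = re.compile(r"\$\{(\w+)\.(\w+)\}")

# Hypothetical dict form of the example data-config.xml (column -> field name).
CONFIG = {
    "name": "item",
    "query": "select * from item",
    "fields": {"id": "id", "name": "name"},
    "children": [
        {"name": "feature",
         "query": "select description from feature where item_id='${item.id}'",
         "fields": {"description": "feature"}},
        {"name": "item_category",
         "query": "select category_id from item_category where item_id='${item.id}'",
         "children": [
             {"name": "category",
              "query": "select description from category "
                       "where id = '${item_category.category_id}'",
              "fields": {"description": "cat"}}]},
    ],
}


def resolve(query, context):
    """Replace each ${entity.column} with the value from that entity's current row."""
    return PLACEHOLDER.sub(lambda m: str(context[m.group(1)][m.group(2)]), query)


def fetch_rows(conn, query):
    """Run a query and return its rows as a list of column-name -> value dicts."""
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, values)) for values in cursor.fetchall()]


def collect(conn, entity, context, doc):
    """Run a child entity's query for the current parent row and add its fields to doc."""
    for row in fetch_rows(conn, resolve(entity["query"], context)):
        for column, field in entity.get("fields", {}).items():
            doc.setdefault(field, []).append(row[column])
        for child in entity.get("children", []):
            collect(conn, child, {**context, entity["name"]: row}, doc)


def build_documents(conn, root):
    """Create one document per row of the root entity, then descend into sub-entities."""
    docs = []
    for row in fetch_rows(conn, root["query"]):
        doc = {field: [row[column]] for column, field in root["fields"].items()}
        for child in root.get("children", []):
            collect(conn, child, {root["name"]: row}, doc)
        docs.append(doc)
    return docs


def make_demo_db():
    """Build an in-memory SQLite database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table item (id text primary key, name text);
        create table feature (item_id text, description text);
        create table category (id text primary key, description text);
        create table item_category (item_id text, category_id text);
        insert into item values ('A1', 'Disk'), ('B2', 'Cable');
        insert into feature values ('A1', 'fast'), ('A1', 'quiet');
        insert into category values ('c1', 'hardware'), ('c2', 'storage');
        insert into item_category values ('A1', 'c1'), ('A1', 'c2'), ('B2', 'c1');
    """)
    return conn
```

Calling build_documents(make_demo_db(), CONFIG) yields one dictionary per ''item'' row, with the ''feature'' and ''cat'' values gathered from the child tables. The number of queries grows with the number of parent rows, which is the trade-off against the single-entity SQL join variant.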
@@ -119, +112 @@

  '''NOTE'''
   * The data-config in this example can also be written with only one entity 
''item'' using SQL joins. In that case columns in ''category'' and ''feature'' 
tables can directly be read from the ''item'' entity.
   * This example does not use the delta features. We will add more examples 
soon.
- 
- == Using create-schema command ==
- Once you have your data-config.xml setup. Start SOLR and use the 
''create-schema'' to generate a SOLR schema.xml file according to the 
data-config.xml
- The command can be executed by hitting the URL 
[http://localhost:8983/solr/dataimport?command=create-schema] with a browser.
- 
- The newly created schema will be placed as ''conf/schema.xml.new'' Stop SOLR, 
rename it to schema.xml and again start SOLR. With the database connection 
information already in place in solrconfig.xml, we should be good to go ahead 
with full-import operations now.
  
  == Using full-import command ==
  A full-import operation can be started by hitting the URL 
[http://localhost:8983/solr/dataimport?command=full-import]. This operation 
will be started in a new thread and the ''status'' attribute in the response 
should now show ''busy''. Depending on the size of your data set, this 
operation may take some time. At any time, you can hit 
[http://localhost:8983/solr/dataimport] to see the status flag.
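For scripted use, the same URLs can be requested from code instead of a browser. The following is a rough Python sketch under a few assumptions: the helper names are made up for this example, the base URL is the one shown on this page, and the busy check simply looks for the word "busy" in the response body, because the exact response format is not described here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Handler URL as given on this page; adjust host and port for your installation.
BASE_URL = "http://localhost:8983/solr/dataimport"


def command_url(command=None, base_url=BASE_URL):
    """Return the handler URL, with ?command=... appended when a command is given."""
    if command is None:
        return base_url
    return base_url + "?" + urlencode({"command": command})


def fetch(url, timeout=10):
    """GET the URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def looks_busy(body):
    """Assumption: a running import is reported with the word 'busy' in the response."""
    return "busy" in body.lower()
```

A typical session would call fetch(command_url("full-import")) once to start the import and then call looks_busy(fetch(command_url())) at intervals until it returns False.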
@@ -136, +123 @@

  
  When the delta-import command is executed, it reads the start time stored in 
''conf/dataimport.properties''. It uses that timestamp to run delta queries 
(TODO: Example) and, after completion, updates the timestamp in 
''conf/dataimport.properties''.
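The read, run, update cycle described above can be sketched in a few lines of Python. This only illustrates the bookkeeping, not the handler's code. The property name last_index_time, the timestamp format and the choice to store the time at which the run started are all assumptions for this example; the page does not specify them. The delta queries themselves are passed in as a callback because no example exists yet.

```python
from datetime import datetime

KEY = "last_index_time"        # hypothetical property name
FORMAT = "%Y-%m-%d %H:%M:%S"   # hypothetical timestamp format


def read_timestamp(path):
    """Return the stored timestamp, or None if the key is not present."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.strip().partition("=")
            if sep and key.strip() == KEY:
                return datetime.strptime(value.strip(), FORMAT)
    return None


def write_timestamp(path, moment):
    """Overwrite the file with a single key=value line holding the timestamp."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"{KEY}={moment.strftime(FORMAT)}\n")


def delta_import(path, run_delta_queries, now=datetime.now):
    """Read the last timestamp, run the delta queries, then store the new timestamp."""
    since = read_timestamp(path)
    started = now()  # assumption: record the start time so changes during the run are kept
    changed = run_delta_queries(since)
    write_timestamp(path, started)  # only reached if the queries completed without error
    return changed
```

Here run_delta_queries stands for whatever selects the rows inserted or updated since the given timestamp. If it raises an exception, the stored timestamp is left unchanged, so the next run covers the same period again.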
  
+ = Where to find it? =
+ DataImportHandler is not in SOLR right now. It exists as a patch in 
[http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] in the SOLR JIRA. 
Please help us by giving your comments, suggestions and/or code contributions 
on this new feature.
+ 
  We hope to expand this documentation with more examples that show off the 
power of this tool. Keep checking back.
  
  ----
