Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Updated to remove schema creation step as per SOLR-469

------------------------------------------------------------------------------
   * Define a data-config.xml and specify the location this file in solrconfig.xml under DataImportHandler section
   * Give connection information such as JDBC Driver, JDBC URL, DB Username and password in solrconfig.xml under DataImportHandler section
   * Open the DataImportHandler page to verify if everything is in order [http://localhost:8983/solr/dataimport]
-  * Use the DataImportHandler's create-schema command to generate a SOLR schema out of the data-config.xml
   * Use full-import command to do a full import from the database and add to SOLR index
   * Use delta-import command to do a delta import (get new inserts/updates) and add to SOLR index

@@ -51, +50 @@

  == Configuration in data-config.xml ==
  A SOLR document can be considered as a de-normalized schema having fields whose values come from multiple tables.

- The data-config.xml starts by defining a "document" element which contains '''one root entity'''. The root entity can contain multiple sub-entities. An entity corresponds to a table in a relational database. Each entity can contain multiple fields. Each field can correspond to a column in it's parent's table. Alternately, a field can also be a copyField which can get data from multiple columns. For each field, write the same attributes as you would write in a SOLR schema.xml, when you use DataImportHandler to create the schema, the SOLR-specifc attributes will be copied directly into the generated schema.
+ The data-config.xml starts by defining a "document" element which contains '''one root entity'''. The root entity can contain multiple sub-entities. An entity corresponds to a table in a relational database. Each entity can contain multiple fields. Each field can correspond to a column in it's parent's table. Alternately, a field can also be a copyField which can get data from multiple columns. For each field, write only the column name from which the value for this field should come. If the column name is different from the field name, then another attribute ''name'' should be given. Rest of the required attributes such as ''type'' will be read directly from the SOLR schema.xml.

  In order to get data from the database, our design philosophy revolves around templatized 'sql' entered by the user for each entity. This gives the user the entire power of SQL if he needs it. The root entity is the central table whose primary key can be used to join this table with other child entities.

@@ -63, +62 @@

  {{{
  <dataConfig>
-     <document name="products" defaultSearchField="text">
+     <document name="products">
          <entity name="item" pk="id" query="select * from item">
-             <field column="id" type="string" indexed="false" stored="true"/>
-             <field column="name" type="text" indexed="true" stored="true"/>
-             <field column="name" name="nameSort" type="string" indexed="true" stored="false"/>
-             <field column="name" name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
-             <field column="manu" type="text" indexed="true" stored="true" omitNorms="true"/>
-             <field column="weight" type="sfloat" indexed="true" stored="true"/>
-             <field column="price" type="sfloat" indexed="true" stored="true"/>
-             <field column="popularity" type="sint" indexed="true" stored="true"/>
-             <field column="inStock" type="boolean" indexed="true" stored="true"/>
+             <field column="id" />
+             <field column="name" />
+             <field column="name" name="nameSort" />
+             <field column="name" name="alphaNameSort" />
+             <field column="manu" />
+             <field column="weight" />
+             <field column="price" />
+             <field column="popularity" />
+             <field column="inStock" />
              <entity name="feature" query="select description from feature where item_id='${item.id}'">
-                 <field name="feature" column="description" type="text" indexed="true" stored="true" multiValued="true"/>
+                 <field name="feature" column="description" />
              </entity>
              <entity name="item_category" query="select category_id from item_category where item_id='${item.id}'">
                  <entity name="category" query="select description from category where id = '${item_category.category_id}'">
-                     <field column="description" name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
+                     <field column="description" name="cat" />
                  </entity>
              </entity>
          </entity>
-         <field name="text">
-             <copyFrom>cat</copyFrom>
-             <copyFrom>name</copyFrom>
-             <copyFrom>manu</copyFrom>
-             <copyFrom>features</copyFrom>
-         </field>
      </document>
  </dataConfig>

@@ -102, +95 @@

  {{{
      <entity name="feature" query="select description from feature where item_id='${item.id}'">
-         <field name="feature" column="description" type="text" indexed="true" stored="true" multiValued="true"/>
+         <field name="feature" column="description" />
      </entity>
  }}}
  The ''item_id'' foreign key in feature table is joined together with ''id'' primary key in ''item'' to retrieve rows for each row in ''item''. In a similar fashion, we join ''item'' and 'category' (which is a many-to-many relationship). Notice how we join these two tables using the intermediate table ''item_category'' again using templated SQL.

@@ -111, +104 @@

      query="select category_id from item_category where item_id='${item.id}'">
          <entity name="category" query="select description from category where id = '${item_category.category_id}'">
-             <field column="description" name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
+             <field column="description" name="cat" />
          </entity>
      </entity>
  }}}

@@ -119, +112 @@

  '''NOTE'''
   * The data-config in this example can also be written with only one entity ''item'' using SQL joins. In that case columns in ''category'' and ''feature'' tables can directly be read from the ''item'' entity.
   * This example does not use the delta features. We will add more examples soon.
-
- == Using create-schema command ==
- Once you have your data-config.xml setup. Start SOLR and use the ''create-schema'' to generate a SOLR schema.xml file according to the data-config.xml
- The command can be executed by hitting the URL [http://localhost:8983/solr/dataimport?command=create-schema] with a browser.
-
- The newly created schema will be placed as ''conf/schema.xml.new'' Stop SOLR, rename it to schema.xml and again start SOLR. With the database connection information already in place in solrconfig.xml, we should be good to go ahead with full-import operations now.

  == Using full-import command ==
  Full Import operation can be started by hitting the URL [http://localhost:8983/solr/dataimport?command=full-import]. This operation will be started in a new thread and the ''status'' attribute in the response should be shown ''busy'' now. Depending on the size of your data set, this operation may take some time. At any time, you can hit [http://localhost:8983/solr/dataimport] to see the status flag.

@@ -136, +123 @@

  When delta-import command is executed, it reads the start time stored in ''conf/dataimport.properties''. It uses that timestamp to run delta queries (TODO: Example) and after completion, updates the timestamp in ''conf/dataimport.properties''.

+ = Where to find it? =
+ DataImportHandler is not in SOLR right now. It exists as a patch in [http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] in the SOLR JIRA. Please help us by giving your comments, suggestions and/or code contributions on this new feature.
+
  We hope to expand this documentation even more by adding more and more examples showing off the power of this tool. Keep checking back.

  ----
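
The import commands described in the diff above are plain HTTP GET requests, so they can also be issued from a script instead of a browser. The following is only a minimal sketch, not part of the wiki page: the helper names `command_url` and `run` are hypothetical, the base URL is the localhost example used on the page, and the response body is returned unparsed because its layout is not specified here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the page above; adjust host and port for your install.
DATAIMPORT_URL = "http://localhost:8983/solr/dataimport"

# Commands the page describes after this change (create-schema was removed).
KNOWN_COMMANDS = ("full-import", "delta-import")


def command_url(command=None, base=DATAIMPORT_URL):
    """Return the handler URL; with no command it is the status page."""
    if command is None:
        return base
    if command not in KNOWN_COMMANDS:
        raise ValueError("unknown command: %s" % command)
    return base + "?" + urlencode({"command": command})


def run(command=None, base=DATAIMPORT_URL, timeout=10):
    """Send a GET request and return the raw response body as text.

    Assumption: the response is text that decodes as UTF-8. It is
    returned as-is so the caller can look for the status flag.
    """
    with urlopen(command_url(command, base), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With a running SOLR instance, `run("full-import")` would start the import and a later `run()` would fetch the status page; both calls need the handler to be reachable at the base URL.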
