Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by HossMan: http://wiki.apache.org/solr/DataImportHandler

The comment on the change is: SOLR->Solr and some intro tweaks

------------------------------------------------------------------------------
= Data Import Request Handler =

<!> ["Solr1.3"]

Most applications store data in relational databases or XML files, and searching over such data is a common use-case. The !DataImportHandler is a Solr contrib that provides a configuration-driven way to import this data into Solr, both as "full builds" and as incremental delta imports.

[[TableOfContents]]

= Overview =

== Goals ==
 * Read data residing in relational databases
 * Build Solr documents by aggregating data from multiple columns and tables according to configuration
 * Update Solr with such documents
 * Provide the ability to do full imports according to configuration
 * Detect insert/update deltas (changes) and do delta imports (we assume a last-modified timestamp column for this to work)
 * Schedule full imports and delta imports

@@ -43, +43 @@

 * Define a data-config.xml and specify the location of this file in solrconfig.xml under the DataImportHandler section
 * Give connection information (if you choose to put the datasource information in solrconfig)
 * Open the DataImportHandler page to verify that everything is in order: [http://localhost:8983/solr/dataimport]
 * Use the full-import command to do a full import from
the database and add it to the Solr index
 * Use the delta-import command to do a delta import (get new inserts/updates) and add them to the Solr index

[[Anchor(dsconfig)]]

@@ -89, +89 @@

Any extra attributes put into the tag are passed directly on to the JDBC driver.

== Configuration in data-config.xml ==

A Solr document can be considered a de-normalized schema having fields whose values come from multiple tables.

The data-config.xml starts by defining a `document` element. A `document` represents one kind of document. A document contains one or more root entities. A root entity can contain multiple sub-entities, which in turn can contain other entities. An entity is a table/view in a relational database. Each entity can contain multiple fields. Each field corresponds to a column in the resultset returned by the ''query'' in the entity. For each field, mention the column name in the resultset. If the column name is different from the Solr field name, then another attribute, ''name'', should be given.
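A minimal data-config.xml following this structure might look like the sketch below. This is illustrative only: the datasource attributes and the table/column names (`item`, `descr`) are hypothetical, not part of the shipped example described later on this page.

{{{
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example/ex" user="sa" />
  <document name="products">
    <!-- root entity: each row returned by 'query' becomes one Solr document -->
    <entity name="item" query="select id, descr from item">
      <!-- column matches the Solr field name, so nothing else is needed -->
      <field column="id" />
      <!-- column differs from the Solr field name, so 'name' maps it -->
      <field column="descr" name="description" />
    </entity>
  </document>
</dataConfig>
}}}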
The rest of the required attributes, such as ''type'', are inferred directly from the Solr schema.xml (and can be overridden).

In order to get data from the database, our design philosophy revolves around 'templatized SQL' entered by the user for each entity. This gives the user the entire power of SQL if he needs it. The root entity is the central table, whose columns can be used to join this table with other child entities.

@@ -138, +138 @@

inline:example-schema.png

This is a relational model of the same schema that Solr currently ships with. We will use this as an example to build a data-config.xml for DataImportHandler. We've created a sample database with this schema in HSQLDB. To run it, do the following steps:

 1. Download attachment:example-solr-home.jar and use ''jar -xvf example-solr-home.jar'' to extract it to your local drive. This jar file contains a complete Solr home with all the configuration you need to execute this as well as the RSS example (given later in this page). It also contains an example HSQLDB schema (in the hsqldb folder).
 2. In the example-solr-home, there is a ''solr.war''. Copy this war file to your tomcat/jetty webapps folder.
In addition to the [http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] patch, this war file also contains the JDBC driver for HSQLDB needed to execute this example. If you want to deploy it with your existing Solr installation, just drop the 'dataimport.jar' (find it in the jar) into WEB-INF/lib of your deployed Solr webapp.
 3. Use the ''solr'' folder inside the ''example-data-config'' folder as your Solr home.
 4. Hit [http://localhost:8983/solr/dataimport] with a browser to verify the configuration.
 5. Hit [http://localhost:8983/solr/dataimport?command=full-import] to do a full import.

The ''solr'' folder given in the above jar is a MultiCore Solr home. It has two cores, one for the DB example (this one) and one for an RSS example (a new feature).

 * The data-config.xml used for this example is:

@@ -284, +284 @@

}}}

Here we have three queries specified for each entity except the root (which has only two).

 * The ''query'' gives us the data needed to populate fields of the Solr document
 * The ''deltaQuery'' gives the primary keys of the current entity which have changed since the last index time
 * The ''parentDeltaQuery'' uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.

Let us reiterate the findings:

 * For each row given by ''query'', the query of the child entity is executed once.
 * For each row given by ''deltaQuery'', the parentDeltaQuery is executed.
 * If any row in the root/child entity changes, we regenerate the complete Solr document which contained that row.

= Usage with XML/HTTP Datasource =

DataImportHandler can be used to index data from HTTP-based data sources. This includes indexing from REST/XML APIs as well as from RSS/ATOM feeds.

@@ -320, +320 @@

The fields can have the following attributes (over and above the default attributes):

 * '''`xpath`''' (required): The XPath expression of the field to be mapped as a column in the record. It can be omitted if the column does not come from an XML attribute; that means it can be a synthetic field created by a transformer. If a field is marked as multivalued in the schema and the xpath finds multiple values in a given row, this is handled automatically by the X!PathEntityProcessor. No extra configuration is required.
 * '''`commonField`''': can be (true|false). If true, this field, once encountered in a record, will be copied to other records before creating a Solr document.

@@ -365, +365 @@

</dataConfig>
}}}

This data-config is where the action is.
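As a trimmed sketch of how these attributes combine in a single entity (the URL and field names here are illustrative, in the style of the Slashdot feed discussed in the surrounding text):

{{{
<entity name="slashdot"
        pk="link"
        url="http://rss.slashdot.org/Slashdot/slashdot"
        processor="XPathEntityProcessor"
        forEach="/RDF/channel | /RDF/item">
  <!-- header field: read once from /RDF/channel, then copied into every item row -->
  <field column="source" xpath="/RDF/channel/title" commonField="true" />
  <!-- per-item fields: one Solr document is created per /RDF/item row -->
  <field column="title" xpath="/RDF/item/title" />
  <field column="link"  xpath="/RDF/item/link" />
</entity>
}}}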
If you read the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. Those are mapped to the Solr fields source, source-link and subject respectively, using XPath syntax. The feed also has multiple ''item'' elements, which contain the actual news items. So what we wish to do is create a document in Solr for each 'item'.

The X!PathEntityProcessor is designed to stream the XML, row by row (think of a row as various fields in an XML element). It uses the ''forEach'' attribute to identify a 'row'. In this example, forEach has the value `'/RDF/channel | /RDF/item'`. This says that this XML has two types of rows (this uses the XPath syntax for OR, and there can be more than one type of row). After it encounters a row, it tries to read as many fields as there are in the field declarations. So in this case, when it reads the row `'/RDF/channel'` it may get the 3 fields 'source', 'source-link' and 'source-subject'.
After it processes the row, it realizes that it does not have any value for the 'pk' field, so it does not try to create a Solr document for this row (even if it tried, it might fail in Solr). But all these 3 fields are marked as `commonField="true"`, so it keeps the values handy for subsequent rows.

It moves ahead and encounters `/RDF/item` and processes the rows one by one. It gets the values for all the fields except the 3 fields in the header. But as they were marked as common fields, the processor puts those fields into the record just before creating the document.

What about this ''transformer=!DateFormatTransformer'' attribute in the entity? See the [#DateFormatTransformer DateFormatTransformer] section for details.

You can use this feature for indexing from REST APIs such as RSS/Atom feeds, XML data feeds, other Solr servers, or even well-formed XHTML documents. Our XPath support has its limitations (no wildcards, only full paths, etc.), but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle XMLs with namespaces.
When you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is `'<dc:subject>'`, the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)

/!\ Note: Unlike with a database, it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the XML.

= Extending the tool with APIs =

@@ -774, +774 @@

inline:interactive-dev-dataimporthandler.PNG

= Where to find it? =

DataImportHandler is a new addition to Solr. You can either:

 * Download a nightly build of Solr from the [http://lucene.apache.org/solr/ Solr website], or
 * Use the steps given in the Full Import Example to try it out.

For a history of development discussion related to DataImportHandler, please see [http://issues.apache.org/jira/browse/SOLR-469 SOLR-469] in the Solr JIRA.

Please help us by giving your comments, suggestions and/or code contributions on this new feature.
