Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Adjusting page for new URLDataSource and the deprecation of HTTPdataSource

------------------------------------------------------------------------------
   * The datasource configuration can be done in solr config xml [#solrconfigdatasource also]
   * The attribute 'type' specifies the implementation class. It is optional. The default value is `'JdbcDataSource'`
   * The attribute 'name' can be used if there are [#multipleds multiple datasources] used by multiple entities
-  * All other attributes in the <dataSource> tag are arbitrary. It is decided by the !DataSource implementation. [#jdbcdatasource See here] for attributes used by !JdbcDataSource and [#httpds see here] for !HttpDataSource
+  * All other attributes in the <dataSource> tag are arbitrary. It is decided by the !DataSource implementation. [#jdbcdatasource See here] for attributes used by !JdbcDataSource and [#httpds see here] for !URLDataSource
   * [#datasource See here] for plugging in your own
  [[Anchor(multipleds)]]
  === Multiple DataSources ===
@@ -316, +316 @@
  = Usage with XML/HTTP Datasource =
  DataImportHandler can be used to index data from HTTP based data sources. This includes using indexing from REST/XML APIs as well as from RSS/ATOM Feeds.
+
  [[Anchor(httpds)]]
- == Configuration of HttpDataSource ==
+ == Configuration of !URLDataSource ==
- A sample configuration in for !HttpdataSource in data config xml looks like this
+ A sample configuration in for !URLDataSource in data config xml looks like this
  {{{
- <dataSource type="HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
+ <dataSource type="URLDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
  }}}
  ''' The attributes are '''
@@ -358, +359 @@
  }}}
- == HttpDataSource Example ==
+ == URLDataSource Example ==
  Download the full import example given in the DB section to try this out. We'll try indexing the [http://rss.slashdot.org/Slashdot/slashdot Slashdot RSS feed] for this example.
@@ -366, +367 @@
  The data-config for this example looks like this:
  {{{
  <dataConfig>
-     <dataSource type="HttpDataSource" />
+     <dataSource type="URLDataSource" />
      <document>
          <entity name="slashdot"
                  pk="link"
@@ -714, +715 @@
  == EntityProcessor ==
  Each entity is handled by a default Entity processor called !SqlEntityProcessor. This works well for systems which use RDBMS as a datasource. For other kind of datasources like REST or Non Sql datasources you can choose to extend this abstract class `org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to Stream rows one by one from an entity. The simplest way to implement your own !EntityProcessor is to extend !EntityProcessorBase and override the `public Map<String,Object> nextRow()` method.
  '!EntityProcessor' rely on the !DataSource for fetching data. The return type of the !DataSource is important for an !EntityProcessor. The built-in ones are,
+
  === SqlEntityProcessor ===
  This is the defaut. The !DataSource must be of type `DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used with this.
+
  === XPathEntityProcessor ===
- Used for XML type datasource. The !DataSource must be of type `DataSourec<Reader>` . !HttpDataSource or !FileDataSource can be used with this.
+ Used when indexing XML type data. The !DataSource must be of type `DataSourec<Reader>` . !URLDataSource or !FileDataSource is commonly used with !XPathEntityProcessor.
+
  === FileListEntityProcessor ===
  A simple one which can be used to enumerate the list of files from a File System based on some criteria. It does not use a !DataSource. The entity attributes are:
   *'''`fileName`''' :(required) A regex pattern to identify files
@@ -801, +805 @@
  {{{
  public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
  }}}
- It is designed to iterate rows in DB one by one. A row is represented as a Map.
+
- === HttpDataSource ===
+ === URLDataSource ===
- This is used by X!PathEntityProcessor to fetch content from HttpDataSources. See the documentation [#httpds here] . The signature is as follows
+ This datasource is often used with X!PathEntityProcessor to fetch content from an underlying file:// or http:// location. See the documentation [#httpds here] . The signature is as follows
  {{{
- public class HttpDataSource extends DataSource<Reader>
+ public class URLDataSource extends DataSource<Reader>
  }}}
+
+ === HTTPDataSource ===
+ This datasource now deprecated in favor of !URLDataSource. There is no change in functionality between !URLDataSource and !HTTPDataSource, only a name change.
+
  === FileDataSource ===
- This can be used like an !HttpDataSource but used to fetch content from files on disk. The signature is as follows
+ This can be used like an !URLDataSource but used to fetch content from files on disk. The only difference from !URLDataSource, when accessing disk files, is how a pathname is specified. The signature is as follows
  {{{
  public class FileDataSource extends DataSource<Reader>
  }}}
@@ -821, +829 @@
  === FieldReaderDataSource ===
  <!> ["Solr1.4"]
- This can be used like an !HttpDataSource . The signature is as follows
+ This can be used like an !URLDataSource . The signature is as follows
  {{{
  public class FieldReaderDataSource extends DataSource<Reader>
  }}}
- This can be useful for users who has a DB field containing xml and wish to use a nested X!PathEntityProcessor
+ This can be useful for users who have a DB field containing XML and wish to use a nested X!PathEntityProcessor
  to process the fields contents. The datasouce may be configured as follows
  {{{
  <datasource name="f" type="FieldReaderDataSource" />
@@ -888, +896 @@
  There are 3 datasources two RDBMS (jdbc1,jdbc2) and one xml/http (B)
   * `jdbc1` and `jdbc2` are instances of type `JdbcDataSource` which are configured in the solrconfig.xml.
-  * `http` is an instance of type `HttpDataSource`
+  * `http` is an instance of type `URLDataSource`
   * The root entity starts with a table called 'A' and uses 'jdbc1' as the datasource . The entity is conveniently named as the table itself
   * Entity 'A' has 2 sub-entities 'B' and 'C' . 'B' uses the datasource instance 'http' and 'C' uses the datasource instance 'jdbc2'
   * On doing a `command=full-import` The root-entity (A) is executed first
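
As a rough illustration of the rename described in this change (this is not code from Solr itself), the Java sketch below uses hypothetical stand-in classes, `UrlLikeDataSource` and `HttpLikeDataSource`, to show how an old name can survive as an empty deprecated subclass while any consumer that expects a `DataSource<Reader>` keeps working unchanged. The `DataSource<T>` class here is a simplified assumption with a single `getData` method, and the canned-response map stands in for real network or file access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class DataSourceSketch {

    // Hypothetical, simplified stand-in for a generic data source type.
    // It is NOT the Solr class; it only models "a source that returns T".
    abstract static class DataSource<T> {
        abstract T getData(String query);
    }

    // Stand-in for a Reader-producing source. Instead of opening a URL,
    // it serves canned text registered with put().
    static class UrlLikeDataSource extends DataSource<Reader> {
        private final Map<String, String> canned = new HashMap<>();

        void put(String url, String body) {
            canned.put(url, body);
        }

        @Override
        Reader getData(String query) {
            String body = canned.get(query);
            if (body == null) {
                throw new IllegalArgumentException("no data registered for " + query);
            }
            return new StringReader(body);
        }
    }

    // The old name kept as a deprecated subclass that adds no behaviour,
    // mirroring a "name change only" deprecation.
    @Deprecated
    static class HttpLikeDataSource extends UrlLikeDataSource {
    }

    // A consumer that only depends on the return type, DataSource<Reader>,
    // so it accepts either class above.
    static String firstLine(DataSource<Reader> source, String query) {
        try (BufferedReader reader = new BufferedReader(source.getData(query))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://host:port/feed"; // placeholder, never fetched

        UrlLikeDataSource current = new UrlLikeDataSource();
        current.put(url, "<rss/>\n<more/>");

        HttpLikeDataSource legacy = new HttpLikeDataSource();
        legacy.put(url, "<rss/>\n<more/>");

        System.out.println(firstLine(current, url));
        System.out.println(firstLine(legacy, url));
    }
}
```

Running `main` prints `<rss/>` twice, once through each class name. The consumer is typed against `DataSource<Reader>` rather than a concrete class, which is why swapping the `type` attribute in a config would not require any change on the consuming side in a design like this.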
