[Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Apache Wiki Wed, 29 Apr 2009 10:02:17 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Adjusting page for new URLDataSource and the deprecation of HTTPdataSource

------------------------------------------------------------------------------
   * The datasource configuration can be done in solr config xml 
[#solrconfigdatasource also]
   * The attribute 'type' specifies the implementation class. It is optional. 
The default value is `'JdbcDataSource'`
   * The attribute 'name' can be used if there are [#multipleds multiple 
datasources] used by multiple entities
-  * All other attributes in the <dataSource> tag are arbitrary. It is decided 
by the !DataSource implementation. [#jdbcdatasource See here] for attributes 
used by !JdbcDataSource and [#httpds see here] for !HttpDataSource
+  * All other attributes in the <dataSource> tag are arbitrary. It is decided 
by the !DataSource implementation. [#jdbcdatasource See here] for attributes 
used by !JdbcDataSource and [#httpds see here] for !URLDataSource
   * [#datasource See here] for plugging in your own
  [[Anchor(multipleds)]]
  === Multiple DataSources ===
@@ -316, +316 @@

  
  = Usage with XML/HTTP Datasource =
  DataImportHandler can be used to index data from HTTP based data sources. 
This includes indexing from REST/XML APIs as well as from RSS/ATOM feeds.
+ 
  [[Anchor(httpds)]]
- == Configuration of HttpDataSource ==
+ == Configuration of !URLDataSource ==
  
- A sample configuration in for !HttpdataSource in data config xml looks like 
this
+ A sample configuration for !URLDataSource in data config xml looks like 
this:
  {{{
- <dataSource type="HttpDataSource" baseUrl="http://host:port/" 
encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
+ <dataSource type="URLDataSource" baseUrl="http://host:port/" encoding="UTF-8" 
connectionTimeout="5000" readTimeout="10000"/>
  }}}
  ''' The attributes are '''
  
@@ -358, +359 @@

  }}}
  
  
- == HttpDataSource Example ==
+ == URLDataSource Example ==
  
  Download the full import example given in the DB section to try this out. 
We'll try indexing the [http://rss.slashdot.org/Slashdot/slashdot Slashdot RSS 
feed] for this example.
  
@@ -366, +367 @@

  The data-config for this example looks like this:
  {{{
  <dataConfig>
-         <dataSource type="HttpDataSource" />
+         <dataSource type="URLDataSource" />
        <document>
                <entity name="slashdot"
                                pk="link"
@@ -714, +715 @@

  == EntityProcessor ==
  Each entity is handled by a default entity processor called 
!SqlEntityProcessor. This works well for systems which use an RDBMS as a 
datasource. For other kinds of datasources, such as REST or non-SQL datasources, 
you can choose to extend the abstract class 
`org.apache.solr.handler.dataimport.EntityProcessor`. This is designed to 
stream rows one by one from an entity. The simplest way to implement your own 
!EntityProcessor is to extend !EntityProcessorBase and override the `public 
Map<String,Object> nextRow()` method.
  An !EntityProcessor relies on the !DataSource for fetching data. The return 
type of the !DataSource is important for an !EntityProcessor. The built-in ones 
are:
+ 
  === SqlEntityProcessor ===
  This is the default. The !DataSource must be of type 
`DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used with 
this.
+ 
  === XPathEntityProcessor ===
- Used for XML type datasource. The !DataSource must be of type 
`DataSourec<Reader>` . !HttpDataSource or !FileDataSource can be used with this.
+ Used when indexing XML type data. The !DataSource must be of type 
`DataSource<Reader>` . !URLDataSource or !FileDataSource is commonly used with 
!XPathEntityProcessor.
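  As a rough sketch of how the two fit together in a data config, the entity 
below pairs a !URLDataSource with !XPathEntityProcessor. The feed URL is the 
Slashdot RSS feed used elsewhere on this page; the `forEach` and `xpath` 
expressions and the chosen columns are assumptions about the feed layout, not 
values confirmed here.
  {{{
  <dataConfig>
          <dataSource type="URLDataSource" />
          <document>
                  <!-- forEach and xpath values are illustrative assumptions -->
                  <entity name="slashdot"
                          pk="link"
                          url="http://rss.slashdot.org/Slashdot/slashdot"
                          processor="XPathEntityProcessor"
                          forEach="/RDF/item">
                          <field column="title" xpath="/RDF/item/title" />
                          <field column="link"  xpath="/RDF/item/link" />
                  </entity>
          </document>
  </dataConfig>
  }}}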
+ 
  === FileListEntityProcessor ===
  A simple one which can be used to enumerate the list of files from a File 
System based on some criteria. It does not use a !DataSource. The entity 
attributes are:
   *'''`fileName`''' :(required) A regex pattern to identify files
@@ -801, +805 @@

  {{{
  public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
  }}}
- 
  It is designed to iterate rows in DB one by one. A row is represented as a 
Map.
+ 
- === HttpDataSource ===
+ === URLDataSource ===
- This is used by X!PathEntityProcessor to fetch content from HttpDataSources. 
See the documentation [#httpds here] . The signature is as follows
+ This datasource is often used with X!PathEntityProcessor to fetch content 
from an underlying file:// or http:// location. See the documentation [#httpds 
here] . The signature is as follows
  {{{
- public class HttpDataSource extends DataSource<Reader>
+ public class URLDataSource extends DataSource<Reader>
  }}}
+ 
+ === HTTPDataSource ===
+ This datasource is now deprecated in favor of !URLDataSource. There is no 
change in functionality between !URLDataSource and !HTTPDataSource, only a name 
change.
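  Since only the name changes, migrating an existing data config should amount 
to editing the `type` attribute. The fragment below is a sketch that reuses the 
placeholder attribute values from the sample configuration on this page; the 
old type name is spelled as it appears in that sample.
  {{{
  <!-- Deprecated name (shown for comparison only):
  <dataSource type="HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  -->
  <!-- Preferred name, same attributes: -->
  <dataSource type="URLDataSource" baseUrl="http://host:port/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  }}}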
+ 
  === FileDataSource ===
- This can be used like an !HttpDataSource but used to fetch content from files 
on disk. The signature is as follows
+ This can be used like a !URLDataSource, but it fetches content from files 
on disk. The only difference from !URLDataSource, when accessing disk files, is 
how a pathname is specified. The signature is as follows
  {{{
  public class FileDataSource extends DataSource<Reader>
  }}}
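  To illustrate the difference in how the pathname is written, the sketch 
below reads the same file once through a !URLDataSource and once through a 
!FileDataSource. The datasource names, entity names, file path, `forEach` 
expression and field are all hypothetical and chosen only for this example.
  {{{
  <dataConfig>
          <dataSource name="viaUrl"  type="URLDataSource" />
          <dataSource name="viaFile" type="FileDataSource" />
          <document>
                  <!-- URLDataSource: the location is written as a file:// URL -->
                  <entity name="fromUrl" dataSource="viaUrl"
                          processor="XPathEntityProcessor"
                          url="file:///data/docs/sample.xml"
                          forEach="/docs/doc">
                          <field column="title" xpath="/docs/doc/title" />
                  </entity>
                  <!-- FileDataSource: the location is written as a plain path -->
                  <entity name="fromFile" dataSource="viaFile"
                          processor="XPathEntityProcessor"
                          url="/data/docs/sample.xml"
                          forEach="/docs/doc">
                          <field column="title" xpath="/docs/doc/title" />
                  </entity>
          </document>
  </dataConfig>
  }}}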
@@ -821, +829 @@

  === FieldReaderDataSource ===
  <!> ["Solr1.4"]
  
- This can be used like an !HttpDataSource . The signature is as follows
+ This can be used like a !URLDataSource . The signature is as follows
  {{{
  public class FieldReaderDataSource extends DataSource<Reader>
  }}}
- This can be useful for users who has a DB field containing xml and wish to 
use a nested X!PathEntityProcessor
+ This can be useful for users who have a DB field containing XML and wish to 
use a nested X!PathEntityProcessor to process the field's contents.
  The datasource may be configured as follows
  {{{
    <datasource name="f" type="FieldReaderDataSource" />
@@ -888, +896 @@

  There are 3 datasources: two RDBMS (jdbc1, jdbc2) and one xml/http (B)
  
   * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are 
configured in the solrconfig.xml.
-  * `http` is an instance of type `HttpDataSource`
+  * `http` is an instance of type `URLDataSource`
   * The root entity starts with a table called 'A' and uses 'jdbc1' as the 
datasource . The entity is conveniently named as the table itself
   * Entity 'A' has 2 sub-entities 'B' and 'C' . 'B' uses the datasource 
instance  'http' and 'C' uses the datasource instance 'jdbc2'
   * On doing a `command=full-import`, the root entity (A) is executed first
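  The arrangement described in the list above could be sketched as the 
following data config. This is an illustration only: the queries, the `id` 
column, the URL and the `forEach` expression are assumptions, and `jdbc1` and 
`jdbc2` are not declared here because they are configured in solrconfig.xml.
  {{{
  <dataConfig>
          <dataSource name="http" type="URLDataSource" />
          <document>
                  <!-- Root entity, named after table A, read through jdbc1 -->
                  <entity name="A" dataSource="jdbc1" query="select * from A">
                          <!-- Sub-entity B: XML over HTTP, one request per row of A -->
                          <entity name="B" dataSource="http"
                                  processor="XPathEntityProcessor"
                                  url="http://host:port/b?id=${A.id}"
                                  forEach="/rows/row">
                                  <field column="b_value" xpath="/rows/row/value" />
                          </entity>
                          <!-- Sub-entity C: second database, read through jdbc2 -->
                          <entity name="C" dataSource="jdbc2"
                                  query="select * from C where a_id='${A.id}'" />
                  </entity>
          </document>
  </dataConfig>
  }}}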
