Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
extending the explanation of how transformers are chained

------------------------------------------------------------------------------
  = Overview =
  
  == Goals ==
-  * Read data residing in relational databases 
+  * Read data residing in relational databases
   * Build Solr documents by aggregating data from multiple columns and tables 
according to configuration
   * Update Solr with such documents
   * Provide ability to do full imports according to configuration
   * Detect inserts/update deltas (changes) and do delta imports (we assume a 
last-modified timestamp column for this to work)
   * Schedule full imports and delta imports
-  * Read and Index data from xml/(http/file) based on configuration 
+  * Read and Index data from xml/(http/file) based on configuration
   * Make it possible to plug in any kind of datasource (ftp, scp etc) and any other format of user choice (JSON, csv etc)
  
  = Design Overview =
@@ -26, +26 @@

  {{{
    <requestHandler name="/dataimport" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
-       <str name="config">/home/username/data-config.xml</str>      
+       <str name="config">/home/username/data-config.xml</str>
      </lst>
    </requestHandler>
  }}}
@@ -36, +36 @@

  
   * solrconfig.xml : The data config file location is added here
   * The datasource also can be added here. Or it can be put directly into the 
data-config.xml
-  * data-config.xml 
+  * data-config.xml
     * How to fetch data (queries,url etc)
     * What to read ( resultset columns, xml fields etc)
-    * How to process (modify/add/remove fields)   
+    * How to process (modify/add/remove fields)
  = Usage with RDBMS =
  In order to use this handler, the following steps are required.
   * Define a data-config.xml and specify the location of this file in solrconfig.xml under the DataImportHandler section
-  * Give connection information (if you choose to put the datasource 
information in solrconfig)    
+  * Give connection information (if you choose to put the datasource 
information in solrconfig)
   * Open the DataImportHandler page to verify if everything is in order 
[http://localhost:8983/solr/dataimport]
   * Use full-import command to do a full import from the database and add to 
Solr index
   * Use delta-import command to do a delta import (get new inserts/updates) 
and add to Solr index
@@ -57, +57 @@

  }}}
   * The datasource configuration can be done in solr config xml 
[#solrconfigdatasource also]
   * The attribute 'type' specifies the implementation class. It is optional. 
The default value is `'JdbcDataSource'`
-  * The attribute 'name' can be used if there are [#multipleds multiple 
datasources] used by multiple entities   
+  * The attribute 'name' can be used if there are [#multipleds multiple 
datasources] used by multiple entities
-  * All other attributes in the <dataSource> tag are arbitrary. It is decided 
by the !DataSource implementation. [#jdbcdatasource See here] for attributes 
used by !JdbcDataSource and [#httpds see here] for !HttpDataSource 
+  * All other attributes in the <dataSource> tag are arbitrary and are interpreted by the !DataSource implementation. [#jdbcdatasource See here] for attributes used by !JdbcDataSource and [#httpds see here] for !HttpDataSource
-  * [#datasource See here] for plugging in your own 
+  * [#datasource See here] for plugging in your own
  [[Anchor(multipleds)]]
  === Multiple DataSources ===
- It is possible to have more than one datasources for a configuration. To 
configure an extra datasource , just keep an another 'dataSource'  tag . There 
is an implicit attribute "name" for a datasource. If there are more than one, 
each extra datasource must be identified by a unique name  
`'name="datasource-2"'` . 
+ It is possible to have more than one datasource for a configuration. To configure an extra datasource, just add another 'dataSource' tag. There is an implicit attribute "name" for a datasource. If there are more than one, each extra datasource must be identified by a unique name, e.g. `'name="datasource-2"'`.
  
  eg:
  {{{
@@ -98, +98 @@

  In order to get data from the database, our design philosophy revolves around 'templatized sql' entered by the user for each entity. This gives the user the entire power of SQL if needed. The root entity is the central table whose columns can be used to join this table with other child entities.
  
  === Schema for the data config ===
-   The dataconfig does not have a rigid schema. The attributes in the 
entity/field are arbitrary and depends on the `processor` and `transformer`. 
+   The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary and depend on the `processor` and `transformer`.
  The default attributes for an entity are:
   * '''`name`''' (required) : A unique name used to identify an entity
   * '''`processor`''' : Required only if the datasource is not an RDBMS. (The default value is `SqlEntityProcessor`)
   * '''`transformer`'''  : Transformers to be applied on this entity. (See the 
transformer section)
-  * '''`dataSource`''' : The name of a datasource as put in the the datasource 
.(Used if there are multiple datasources) 
+  * '''`dataSource`''' : The name of a datasource as defined in the datasource declaration. (Used if there are multiple datasources)
   * '''`pk`''' : The primary key for the entity. It is '''optional''' and only 
needed when using delta-imports. It has no relation to the uniqueKey defined in 
schema.xml but they both can be the same.
   * '''`rootEntity`''' : By default the entities falling under the document 
are root entities. If it is set to false , the entity directly falling under 
that entity will be treated as the root entity (so on and so forth). For every 
row returned by the root entity a document is created in Solr
   * '''`onError`''' : (abort|skip|continue). The default value is 'abort'. 'skip' skips the current document. 'continue' continues as if the error did not happen. <!> ["Solr1.4"]
-  * '''`preImportDeleteQuery`''' : before full-import this will be used to 
cleanup the index instead of using '*:*' .This is honored only on an entity 
that is an immediete sub-child of <document> <!> ["Solr1.4"].
+  * '''`preImportDeleteQuery`''' : before full-import this will be used to clean up the index instead of using '*:*'. This is honored only on an entity that is an immediate sub-child of <document> <!> ["Solr1.4"].
-  * '''`postImportDeleteQuery`''' : after full-import this will be used to 
cleanup the index <!>. This is honored only on an entity that is an immediete 
sub-child of <document> ["Solr1.4"].
+  * '''`postImportDeleteQuery`''' : after full-import this will be used to clean up the index <!>. This is honored only on an entity that is an immediate sub-child of <document> ["Solr1.4"].
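A hedged sketch pulling several of these entity attributes together (the table, column and query values are invented for illustration; onError and preImportDeleteQuery need Solr1.4):

```xml
<!-- illustrative only: names and queries are made up -->
<document>
    <entity name="item" pk="ID" rootEntity="true" onError="skip"
            preImportDeleteQuery="supplier:obsolete"
            query="select * from item">
        <field column="NAME" name="name"/>
    </entity>
</document>
```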
  For !SqlEntityProcessor the entity attributes are :
  
   * '''`query`''' (required) : The sql string using which to query the db
@@ -123, +123 @@

  The handler exposes its entire API as http requests. The following are the possible operations
  
   * '''full-import''' : Full Import operation can be started by hitting the 
URL `http://<host>:<port>/solr/dataimport?command=full-import`
-   * This operation will be started in a new thread and the ''status'' 
attribute in the response should be shown ''busy'' now. 
+   * This operation will be started in a new thread and the ''status'' attribute in the response should now show ''busy''.
    * The operation may take some time depending on size of dataset.
    * When full-import command is executed, it stores the start time of the 
operation in a file located at ''conf/dataimport.properties''
    * This stored timestamp is used when a delta-import operation is executed.
    * Queries to Solr are not blocked during full-imports.
    * It takes in extra parameters
-    * '''entity''' : Name of an entity directly under the <document> tag. Use 
this to execute one or more entities selectively. Multple 'entity' parameters 
can be passed on to run multiple entities at once. If nothing is passed , all 
entites are executed
+    * '''entity''' : Name of an entity directly under the <document> tag. Use 
this to execute one or more entities selectively. Multiple 'entity' parameters 
can be passed on to run multiple entities at once. If nothing is passed , all 
entities are executed
     * '''clean''' : (default 'true'). Tells whether to clean up the index 
before the indexing is started
     * '''commit''': (default 'true'). Tells whether to commit after the 
operation
     * '''optimize''': (default 'true'). Tells whether to optimize after the 
operation
@@ -137, +137 @@

      * Please note that in debug mode, documents are never committed 
automatically. If you want to run debug mode and commit the results too, add 
'commit=true' as a request parameter.
   * '''delta-import''' : For incremental imports and change detection run the command `http://<host>:<port>/solr/dataimport?command=delta-import` . It supports the same clean, commit, optimize and debug parameters as the full-import command.
   * '''status''' : To know the status of the current command, hit the URL `http://<host>:<port>/solr/dataimport` . It gives elaborate statistics on the number of docs created, deleted, queries run, rows fetched, status etc
-  * '''reload-config''' : If the data-config is changed and you wish to reload 
the file without restarting Solr. run the command 
`http://<host>:<port>/solr/dataimport?command=reload-config` 
+  * '''reload-config''' : If the data-config is changed and you wish to reload the file without restarting Solr, run the command `http://<host>:<port>/solr/dataimport?command=reload-config`
   * '''abort''' : Abort an ongoing operation by hitting the url 
`http://<host>:<port>/solr/dataimport?command=abort`
  
  == Full Import Example ==
@@ -189, +189 @@

  {{{
     <entity name="feature" query="select description from feature where 
item_id='${item.id}'">
         <field name="feature" column="description" />
-    </entity> 
+    </entity>
  }}}
  The ''item_id'' foreign key in the feature table is joined with the ''id'' primary key in ''item'' to retrieve rows for each row in ''item''. In a similar fashion, we join ''item'' and ''category'' (which is a many-to-many relationship). Notice how we join these two tables using the intermediate table ''item_category'', again using templated SQL.
  
@@ -199, +199 @@

                      <field column="description" name="cat" />
                  </entity>
              </entity>
- }}} 
+ }}}
  [[Anchor(shortconfig)]]
  === A shorter data-config ===
  In the above example, there are mappings of fields to Solr fields. It is possible to totally avoid the field entries in entities if the names of the fields are the same (case does not matter) as those in the Solr schema. You may need to add a field entry if any of the built-in Transformers are used (see Transformer section)
@@ -209, +209 @@

  <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" 
url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document>
-         <entity name="item" query="select * from item">                    
+         <entity name="item" query="select * from item">
-             <entity name="feature" query="select description as features from 
feature where item_id='${item.ID}'"/>            
+             <entity name="feature" query="select description as features from 
feature where item_id='${item.ID}'"/>
              <entity name="item_category" query="select CATEGORY_ID from 
item_category where item_id='${item.ID}'">
-                 <entity name="category" query="select description as cat from 
category where id = '${item_category.CATEGORY_ID}'"/>                        
+                 <entity name="category" query="select description as cat from 
category where id = '${item_category.CATEGORY_ID}'"/>
              </entity>
          </entity>
      </document>
@@ -234, +234 @@

      <dataSource driver="org.hsqldb.jdbcDriver" 
url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document name="products">
            <entity name="item" pk="ID" query="select * from item"
-               deltaQuery="select id from item where last_modified > 
'${dataimporter.last_index_time}'">           
+               deltaQuery="select id from item where last_modified > 
'${dataimporter.last_index_time}'">
  
-             <entity name="feature" pk="ITEM_ID" 
+             <entity name="feature" pk="ITEM_ID"
-                     query="select description as features from feature where 
item_id='${item.ID}'">                
+                     query="select description as features from feature where 
item_id='${item.ID}'">
              </entity>
              <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                      query="select CATEGORY_ID from item_category where 
ITEM_ID='${item.ID}'">
                  <entity name="category" pk="ID"
-                        query="select description as cat from category where 
id = '${item_category.CATEGORY_ID}'">                    
+                        query="select description as cat from category where 
id = '${item_category.CATEGORY_ID}'">
                  </entity>
              </entity>
          </entity>
@@ -253, +253 @@

  Pay attention to the ''deltaQuery'' attribute which has an SQL statement 
capable of detecting changes in the ''item'' table. Note the variable 
{{{${dataimporter.last_index_time}}}}
  The DataImportHandler exposes a variable called ''last_index_time'' which is 
a timestamp value denoting the last time ''full-import'' ''''or'''' 
''delta-import'' was run. You can use this variable anywhere in the SQL you 
write in data-config.xml and it will be replaced by the value during processing.
  
- /!\ Note 
+ /!\ Note
   * The deltaQuery in the above example only detects changes in ''item'' but not in other tables. You can detect the changes to all child tables in one SQL query as specified below. Figuring out its details is an exercise for the user :)
  {{{
        deltaQuery="select id from item where id in
                                (select item_id as id from feature where 
last_modified > '${dataimporter.last_index_time}')
-                               or id in 
+                               or id in
-                               (select item_id as id from item_category where 
item_id in 
+                               (select item_id as id from item_category where 
item_id in
                                    (select id as item_id from category where 
last_modified > '${dataimporter.last_index_time}')
                                or last_modified > 
'${dataimporter.last_index_time}')
                                or last_modified > 
'${dataimporter.last_index_time}'"
@@ -271, +271 @@

      <document>
            <entity name="item" pk="ID" query="select * from item"
                deltaQuery="select id from item where last_modified > 
'${dataimporter.last_index_time}'">
-                 <entity name="feature" pk="ITEM_ID" 
+                 <entity name="feature" pk="ITEM_ID"
                    query="select DESCRIPTION as features from FEATURE where 
ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID from FEATURE where last_modified 
> '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ID from item where 
ID=${feature.ITEM_ID}"/>
-                 
-           
+ 
+ 
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where 
ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID, CATEGORY_ID from item_category 
where last_modified > '${dataimporter.last_index_time}'"
@@ -308, +308 @@

  
  A sample configuration for !HttpDataSource in the data config xml looks like this
  {{{
- <dataSource type="HttpDataSource" baseUrl="http://host:port/"; 
encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>  
+ <dataSource type="HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
  }}}
  ''' The attributes are '''
  
   * '''`baseUrl`''' (optional): you should use it when the host/port changes 
between Dev/QA/Prod environments. Using this attribute isolates the changes to 
be made to the solrconfig.xml
   * '''`encoding`'''(optional): By default the encoding in the response header 
is used. You can use this property to override the default encoding.
-  * '''`connectionTimeout`''' (optional):The default value is 5000ms 
+  * '''`connectionTimeout`''' (optional): The default value is 5000ms
   * '''`readTimeout`''' (optional): the default value is 10000ms
-  
+ 
  == Configuration in data-config.xml ==
  
  The entity for an xml/http data source can have the following attributes over 
and above the default attributes
   * '''`processor`''' (required) : The value must be `"XPathEntityProcessor"`
   * '''`url`''' (required) : The url used to invoke the REST API. (Can be templatized). If the data source is a file this must be the file location
   * '''`stream`''' (optional) : set this to true , if the xml is really big
-  * '''`forEach`'''(required) : The xpath expression which demarcates a 
record. If there are mutiple types of record separate them with ''" |  "'' 
(pipe) . If  useSolrAddSchema is set to 'true' this can be omitted.
+  * '''`forEach`'''(required) : The xpath expression which demarcates a record. If there are multiple types of records, separate them with ''" | "'' (pipe). If useSolrAddSchema is set to 'true' this can be omitted.
   * '''`xsl`'''(optional): This will be used as a preprocessor for applying 
the XSL transformation. Provide the full path in the filesystem or a url.
   * '''`useSolrAddSchema`'''(optional): Set its value to 'true' if the xml that is fed into this processor has the same schema as that of the solr add xml. No need to mention any fields if it is set to true.
   * '''`flatten`''' (optional) : If this is set to true, text from under all the tags is extracted into one field, irrespective of the tag name. <!> ["Solr1.4"]
@@ -334, +334 @@

  
   * '''`commonField`''' : can be (true| false) . If true, this field once 
encountered in a record will be copied to other records before creating a Solr 
document
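Putting the attributes above together, a minimal entity for an xml source might look like this (the url and xpaths are invented for illustration):

```xml
<!-- illustrative only: url and xpaths are made up -->
<entity name="rec" processor="XPathEntityProcessor"
        url="http://host/feed.xml" stream="true"
        forEach="/records/header | /records/record">
    <field column="source" xpath="/records/header/name" commonField="true"/>
    <field column="title"  xpath="/records/record/title"/>
</entity>
```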
  
- If an API supports chunking (when the dataset is too large) multiple calls 
need to be made to complete the process. 
+ If an API supports chunking (when the dataset is too large) multiple calls 
need to be made to complete the process.
  !XPathEntityProcessor supports this with a transformer. If the transformer returns a row which contains a field '''`$hasMore`''' with the value `"true"` the Processor makes another request with the same url template (The actual value is recomputed before invoking). A transformer can also pass a totally new url for the next call by returning a row which contains a field '''`$nextUrl`''' whose value must be the complete url for the next call.
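A minimal sketch of this contract, written in the plain-method style described later in this page (this is not the real DIH API, and `nextPageUrl` is an invented field name):

```java
import java.util.Map;

// Illustrative sketch only -- not the actual DIH Transformer API.
// If a (hypothetical) "nextPageUrl" value is present in the fetched row,
// signal the processor to make one more request at that url.
public class PagingTransformer {
    public Object transformRow(Map<String, Object> row) {
        Object next = row.get("nextPageUrl");
        if (next != null) {
            row.put("$hasMore", "true");           // ask for another chunk
            row.put("$nextUrl", next.toString());  // complete url of next call
        }
        return row;
    }
}
```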
  
  The !XPathEntityProcessor implements a streaming parser which supports a subset of xpath syntax. Complete xpath syntax is not supported but most of the common use cases are covered as follows:
@@ -361, +361 @@

                                processor="XPathEntityProcessor"
                                forEach="/RDF/channel | /RDF/item"
                                transformer="DateFormatTransformer">
-                               
+ 
                        <field column="source" xpath="/RDF/channel/title" 
commonField="true" />
                        <field column="source-link" xpath="/RDF/channel/link" 
commonField="true" />
                        <field column="subject" xpath="/RDF/channel/subject" 
commonField="true" />
-                       
+ 
                        <field column="title" xpath="/RDF/item/title" />
                        <field column="link" xpath="/RDF/item/link" />
                        <field column="description" 
xpath="/RDF/item/description" />
@@ -380, +380 @@

  </dataConfig>
  }}}
  
- This data-config is where the action is. If you read the structure of the 
Slashdot RSS, it has a few header elements such as title, link and subject. 
Those are mapped to the Solr fields source, source-link and subject 
respectively using xpath syntax. The feed also has multiple ''item'' elements 
which contain the actual news items. So, what we wish to do is , create a 
document in Solr for each 'item'. 
+ This data-config is where the action is. If you read the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. Those are mapped to the Solr fields source, source-link and subject respectively using xpath syntax. The feed also has multiple ''item'' elements which contain the actual news items. So, what we wish to do is create a document in Solr for each 'item'.
  
  The !XPathEntityProcessor is designed to stream the xml, row by row (Think of a row as various fields in a xml element). It uses the ''forEach'' attribute to identify a 'row'. In this example forEach has the value `'/RDF/channel | /RDF/item'`. This says that this xml has two types of rows (This uses the xpath syntax for OR and there can be more than one type of row). After it encounters a row, it tries to read as many fields as there are in the field declarations. So in this case, when it reads the row `'/RDF/channel'` it may get 3 fields 'source', 'source-link', 'source-subject'. After it processes the row it realizes that it does not have any value for the 'pk' field so it does not try to create a Solr document for this row (Even if it tries it may fail in solr). But all these 3 fields are marked as `commonField="true"`. So it keeps the values handy for subsequent rows.
  
@@ -410, +410 @@

                  <field column="timestamp" 
xpath="/mediawiki/page/revision/timestamp" />
          </entity>
          </document>
- </dataConfig> 
+ </dataConfig>
  }}}
  The relevant portion of schema.xml is below:
  {{{
@@ -438, +438 @@

  
  [[Anchor(transformer)]]
  == Transformer ==
- Every row that is fetched from the DB can be either consumed directly or it 
can be massaged to create a totally new set of fields or it can even return 
more than one row of data. The configuration must be done on an entity level as 
follows.
+ Every set of fields fetched by the entity can either be consumed directly by the indexing process or massaged using transformers to create a totally new set of fields; a transformer can even return more than one row of data. The transformers must be configured on an entity level as follows.
  {{{
  <entity name="foo" transformer="com.foo.Foo" ... />
  }}}
- /!\ Note -- The trasformer value has to be fully qualified classname .If the 
class package is `'org.apache.solr.handler.dataimport'` the package name can be 
omitted. The solr.<classname> also works if the class belongs to one of the 
'solr' packages . This rule applies for all the pluggable classes like 
!DataSource , !EntityProcessor and Evaluator.
+ /!\ Note -- The transformer value has to be a fully qualified class name. If the class package is `'org.apache.solr.handler.dataimport'` the package name can be omitted. The solr.<classname> form also works if the class belongs to one of the 'solr' packages. This rule applies for all the pluggable classes like !DataSource, !EntityProcessor and Evaluator.
  
  The class 'Foo' must extend the abstract class `org.apache.solr.handler.dataimport.Transformer` The class has only one abstract method.
  
- The transformer attribute can take in multiple transformers (`say 
transformer="foo.X,foo.Y"`) separated by comma. The transformers are chained in 
this case and they are applied one after the other in the order in which they 
are specified.
+ The transformer attribute can consist of a comma-separated list of transformers (say `transformer="foo.X,foo.Y"`). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns is processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so, the transformer does its thing! When all of the listed entity columns have been scanned the process is repeated using the next transformer in the list.
+ 
+ A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. If the action of the transformer fails, say a regex fails to match, then an existing field will be unaltered and an undefined field will remain undefined. The chaining effect described above allows a column's value to be altered again and again by successive transformers. A transformer may make use of other entity fields in the course of massaging a column's value.
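For example, the chaining described above with two of the built-in transformers (the column names and patterns here are invented): RegexTransformer scans every listed column first, then DateFormatTransformer does the same.

```xml
<!-- illustrative only: columns and patterns are made up -->
<entity name="person" transformer="RegexTransformer,DateFormatTransformer"
        query="select full_name, dob from person">
    <field column="firstName" sourceColName="full_name" regex="(\w+) .*"/>
    <field column="dob" dateTimeFormat="yyyy-MM-dd"/>
</entity>
```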
  
  {{{
  public abstract class Transformer {
@@ -464, +467 @@

  }}}
  
  
- The Context is the abstract class that provides the contextual information 
that may be necessary to process the data. 
+ The Context is the abstract class that provides the contextual information 
that may be necessary to process the data.
  
  Alternately the class `Foo` may choose NOT TO implement this abstract class 
and just write a method with this signature
  {{{
@@ -472, +475 @@

  }}}
  
  So there is no compile-time dependency on the !DataImportHandler API
-  
  
+ 
- The configuration has a 'flexible' schema. It lets the user provide arbitrary 
attributes in an 'entity' tag  and 'field' tags. The tool reads the data and 
hands it over to the implementation class as it is. If the 'Transformer' needs 
extra information to be provided on a per entity/field basis it can get them 
from the context. 
+ The configuration has a 'flexible' schema. It lets the user provide arbitrary 
attributes in an 'entity' tag  and 'field' tags. The tool reads the data and 
hands it over to the implementation class as it is. If the 'Transformer' needs 
extra information to be provided on a per entity/field basis it can get them 
from the context.
  
  === RegexTransformer ===
  
@@ -483, +486 @@

  
  example:
  {{{
- <entity name="foo" transformer="RegexTransformer"  
+ <entity name="foo" transformer="RegexTransformer"
  query="select full_name , emailids from foo">
     <field column="full_name"/>
@@ -496, +499 @@

  ==== Attributes ====
  !RegexTransformer applies only on the fields with an attribute 'regex' or 'splitBy'. All other fields are left as is.
   * '''`regex`''' : The regular expression that is used to match . This or 
`splitBy` must be present for each field. If not, that field is not touched by 
the transformer . If `replaceWith` is absent, each ''group'' is taken as a 
value and a list of values is returned
-  * '''`sourceColName`''' : The column on which the regex is to be applied. If 
this is absent source and target are same 
+  * '''`sourceColName`''' : The column on which the regex is to be applied. If this is absent, source and target are the same
   * '''`splitBy`''' : Used if the `regex` is meant to split a String to obtain multiple values
   * '''`replaceWith`''' : Used along with `regex`. It is equivalent to the method `new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>)`
  Here the attributes 'regex' and 'sourceColName' are custom attributes used by the transformer. It reads the field 'full_name' from the resultset and transforms it into two target fields, 'firstName' and 'lastName'. So even though the query returned only one column, 'full_name', in the resultset, the solr document gets two extra fields 'firstName' and 'lastName' which are 'derived' fields.
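The two behaviours above can be sketched with plain java.util.regex; the sample strings and patterns below are invented for illustration, not taken from the transformer itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the two RegexTransformer behaviours described above.
public class RegexDemo {

    // replaceWith mode: equivalent to
    // new String(sourceColVal).replaceAll(regex, replaceWith)
    static String replaceWith(String sourceColVal, String regex, String replaceWith) {
        return sourceColVal.replaceAll(regex, replaceWith);
    }

    // group mode: with no replaceWith, each regex group becomes one value
    static List<String> groupValues(String sourceColVal, String regex) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(sourceColVal);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                values.add(m.group(i));
            }
        }
        return values;
    }
}
```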
@@ -566, +569 @@

  ==== Attributes ====
  !DateFormatTransformer applies only on the fields with an attribute 'dateTimeFormat'. All other fields are left as is.
   * '''`dateTimeFormat`''' : The format used for parsing this field. This must 
comply with the syntax of java 
[http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html 
SimpleDateFormat].
-  * '''`sourceColName`''' : The column on which the dateFormat is to be 
applied. If this is absent source and target are same 
+  * '''`sourceColName`''' : The column on which the dateFormat is to be applied. If this is absent, source and target are the same
  
- The above field definition is used in the RSS example to parse the publish 
date of the RSS feed item. 
+ The above field definition is used in the RSS example to parse the publish 
date of the RSS feed item.
  
  === NumberFormatTransformer ===
  Can be used to parse a number from a String. Uses the !NumberFormat class in 
java
@@ -583, +586 @@

  }}}
  
  ==== Attributes ====
- !NumberFormatTransformer applies only on the fields with an attribute 
'formatStyle' . 
+ !NumberFormatTransformer applies only on the fields with an attribute 'formatStyle'.
   * '''`formatStyle`''' : The format used for parsing this field. The value of the attribute must be one of (number|percent|integer|currency). This uses the semantics of java [http://java.sun.com/j2se/1.4.2/docs/api/java/text/NumberFormat.html NumberFormat].
   * '''`sourceColName`''' : The column on which the !NumberFormat is to be 
applied. If this is absent, source and target are same.
   * '''`locale`''' : The locale to be used for parsing the strings. If this is 
absent, the system's default locale is used. It must be specified as 
language-country. For example en-US.
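These attributes map directly onto java.text.NumberFormat; a minimal sketch of how a 'number' formatStyle with locale en-US would parse a value (the sample input is invented):

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Sketch: parsing a grouped number string the way a 'number'
// formatStyle with an explicit locale would.
public class NumberParseDemo {
    static long parseNumber(String value, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(value.trim()).longValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable number: " + value, e);
        }
    }
}
```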
@@ -661, +664 @@

      <document>
          <entity name="f" processor="FileListEntityProcessor" fileName=".*xml" 
newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="null">
              <entity name="x" processor="XPathEntityProcessor" 
forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
-                 <field column="full_name" xpath="/field/xpath"/> 
+                 <field column="full_name" xpath="/field/xpath"/>
              </entity>
          </entity>
      <document>
@@ -672, +675 @@

  === CachedSqlEntityProcessor ===
  [[Anchor(cached)]]
  
- This is an extension of the !SqlEntityProcessor.  This !EntityProcessor helps 
reduce the no: of DB queries executed by caching the rows. It does not help to 
use it in the root most entity because only one sql is run for the entity. 
+ This is an extension of the !SqlEntityProcessor. This !EntityProcessor helps reduce the number of DB queries executed by caching the rows. It does not help to use it in the root-most entity because only one sql is run for the entity.
-  
+ 
  Example 1.
  {{{
  <entity name="x" query="select * from x">
@@ -695, +698 @@

  The difference with the previous one is the 'where' attribute. In this case the query fetches all the rows from the table and stores all the rows in the cache. The magic is in the 'where' value. The cache stores the values with the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time the entity has to be run and the value is looked up in the cache and the rows are returned.
  
  In the 'where' attribute, the lhs (the part before '=') is the column in y and the rhs (the part after '=') is the value to be computed for looking up the cache.
-  
+ 
  === PlainTextEntityProcessor ===
  [[Anchor(plaintext)]]
  <!> ["Solr1.4"]
@@ -713, +716 @@

  
  == DataSource ==
  [[Anchor(datasource)]]
- A class can extend `org.apache.solr.handler.dataimport.DataSource` . 
[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup
 See source] 
+ A class can extend `org.apache.solr.handler.dataimport.DataSource` . 
[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup
 See source]
  
  and can be used as a !DataSource. It must be configured in the dataSource 
definition
  {{{
@@ -724, +727 @@

 === JdbcDataSource ===
 This is the default. See the [#jdbcdatasource example]. The signature is as 
follows
  {{{
- public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>> 
+ public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
  }}}
  
 It is designed to iterate over the rows in the DB one by one. A row is represented as a 
Map.
@@ -736, +739 @@

  === FileDataSource ===
 This can be used like an !HttpDataSource. The signature is as follows
  {{{
- public class FileDataSource extends DataSource<Reader>  
+ public class FileDataSource extends DataSource<Reader>
  }}}
  
  The attributes are:
@@ -748, +751 @@

  
 This can be used like an !HttpDataSource. The signature is as follows
  {{{
- public class FieldReaderDataSource extends DataSource<Reader>  
+ public class FieldReaderDataSource extends DataSource<Reader>
  }}}
 This can be useful for users who have a DB field containing xml and wish to 
use a nested !XPathEntityProcessor.
 The datasource may be configured as follows
@@ -765, +768 @@

  == Boosting , Skipping documents ==
  It is possible to decide in the runtime to skip or boost a particular 
document.
  
- Write a custom Transformer to add a value '''$skipDoc''' with a value 'true' 
to skip that document. To boost a document with a given value add 
'''$docBoost''' with the boost value 
+ Write a custom Transformer to add a field '''$skipDoc''' with a value of 'true' 
to skip that document. To boost a document with a given value, add 
'''$docBoost''' with the boost value
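 The row-manipulation logic such a Transformer carries out can be sketched as 
below. The real class would extend `org.apache.solr.handler.dataimport.Transformer`; 
that base class and the Context parameter are omitted here so the sketch compiles 
stand-alone, and the 'status' column and its values are illustrative, not part 
of any real schema.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the body of a custom Transformer's transformRow() method.
// $skipDoc and $docBoost are the special keys DataImportHandler looks for.
public class SkipOrBoost {
    public static Map<String, Object> transformRow(Map<String, Object> row) {
        Object status = row.get("status");          // hypothetical column
        if ("deleted".equals(status)) {
            row.put("$skipDoc", "true");            // document is skipped
        } else if ("featured".equals(status)) {
            row.put("$docBoost", 2.0f);             // whole document is boosted
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("status", "deleted");
        System.out.println(transformRow(row).get("$skipDoc")); // prints: true
    }
}
```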
  
  == Adding datasource in solrconfig.xml ==
  [[Anchor(solrconfigdatasource)]]
@@ -793, +796 @@

 The use case is as follows:
 There are 3 datasources: two RDBMS (jdbc1, jdbc2) and one xml/http ('http')
  
-  * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are 
configured in the solrconfig.xml. 
+  * `jdbc1` and `jdbc2` are instances of type `JdbcDataSource`, which are 
configured in the solrconfig.xml.
-  * `http` is an instance of type `HttpDataSource` 
+  * `http` is an instance of type `HttpDataSource`
  * The root entity starts with a table called 'A' and uses 'jdbc1' as the 
datasource. The entity is conveniently named after the table itself
  * Entity 'A' has 2 sub-entities 'B' and 'C'. 'B' uses the datasource 
instance 'http' and 'C' uses the datasource instance 'jdbc2'
-  * On doing a `command=full-import` The root-entity (A) is executed first 
+  * On doing a `command=full-import` The root-entity (A) is executed first
  * Each row emitted by the 'query' in entity 'A' is fed into its sub-entities 
B and C
  * B and C use a column from 'A' to construct their queries, using 
placeholders like `${A.a}`
-    * B has a url  (B is an xml/http datasource) 
+    * B has a url (B is an xml/http datasource)
     * C has a query
-  * C has two transformers ('f' and 'g' )   
+  * C has two transformers ('f' and 'g' )
  * Each row that comes out of C is fed into 'f' and 'g' sequentially 
(transformers are chained). Each transformer can change the input. Note that 
the transformer 'g' produces 2 output rows for an input row `f(C.1)`
   * The end output of each entity is combined together to construct a document
     * Note that the intermediate rows from C i.e `C.1, C.2, f(C.1) , f(C1)` 
are ignored
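 Declared in data-config.xml, such a chain looks roughly like this (the 
transformer class names are hypothetical stand-ins for 'f' and 'g'):
 {{{
 <entity name="C" dataSource="jdbc2" transformer="com.example.F,com.example.G"
         query="select * from C where c='${A.a}'"/>
 }}}
 Each row emitted by the query passes through F first, then F's output rows 
pass through G.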
@@ -817, +820 @@

  == A VariableResolver ==
 A !VariableResolver is the component which replaces all those placeholders 
such as `${<name>}`. It is a multilevel Map. Each namespace is a Map and 
namespaces are separated by periods (.). eg if there is a placeholder 
${item.ID}, 'item' is a namespace (which is a map) and 'ID' is a value in 
that namespace. It is possible to nest namespaces like ${item.x.ID} where x 
could be another Map. A reference to the current !VariableResolver can be 
obtained from the Context. Or the object can be directly consumed by using 
${<name>} in 'query' for RDBMS queries or 'url' in Http.
  === Custom formatting in query and url using Functions ===
- While the namespace concept is useful , the user may want to put some 
computed value into the query or url for example there is a Date object and 
your datasource  accepts Date in some custom format . There are a few functions 
provided by the !DataImportHandler which can do some of these. 
+ While the namespace concept is useful, the user may want to put a 
computed value into the query or url; for example, there is a Date object and 
your datasource accepts Dates in some custom format. The !DataImportHandler 
provides a few functions which can do some of these.
  * ''formatDate'' : It is used like this 
`'${dataimporter.functions.formatDate(item.ID, 'yyyy-MM-dd HH:mm')}'`. The 
first argument can be a valid value from the !VariableResolver and the second 
value can be a format string (it uses !SimpleDateFormat). The first argument 
can also be a computed value eg: `'${dataimporter.functions.formatDate('NOW-3DAYS', 
'yyyy-MM-dd HH:mm')}'`, which uses the syntax of the datemath parser in Solr 
(note that it must be enclosed in single quotes). <!> Note: this syntax was 
changed in 1.4; the second parameter was not enclosed in single quotes 
earlier, but it will continue to work without single quotes as well.
  * ''escapeSql'' : Use this to escape special sql characters. eg: 
`'${dataimporter.functions.escapeSql(item.ID)}'`. Takes only one argument, which 
must be a valid value in the !VariableResolver.
  * ''encodeUrl'' : Use this to encode urls. eg: 
`'${dataimporter.functions.encodeUrl(item.ID)}'`. Takes only one argument, which 
must be a valid value in the !VariableResolver
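 Combining the above, a query can embed a date computed with datemath (table 
and column names here are illustrative):
 {{{
 <entity name="item"
         query="select * from item where last_modified > 
'${dataimporter.functions.formatDate('NOW-3DAYS', 'yyyy-MM-dd HH:mm')}'"/>
 }}}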
@@ -829, +832 @@

 This is a cool and powerful new feature in the tool. It helps you build a 
data-config.xml with the UI. It can be accessed from 
http://host:port/solr/admin/dataimport.jsp . The features are
   * A UI with two panels . RHS takes in the input and LHS shows the output
   * When you hit the button 'debug now' it runs the configuration and shows 
the documents created
-  * You can configure the start and rows parameters to debug documents say 115 
to 118 . 
+  * You can configure the start and rows parameters to debug documents, say 115 
to 118.
-  * Choose the 'verbose' option to get detailed information about the 
intermediete steps. What was emitted by the query and what went into the 
Transformer and what was the output. 
+  * Choose the 'verbose' option to get detailed information about the 
intermediate steps: what was emitted by the query, what went into the 
Transformer, and what the output was.
   * If an exception occurred during the run, the stacktrace is shown right 
there
  * The fields produced by the Entities and Transformers may not be visible in 
documents if the fields are either not present in the schema.xml or there is an 
explicit <field> declaration
  

Reply via email to