[Solr Wiki] Update of "DataImportHandler" by PeterTyrrell

Apache Wiki Fri, 14 Dec 2012 09:39:05 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The "DataImportHandler" page has been changed by PeterTyrrell:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=330&rev2=331

  <<Anchor(commands)>> The handler exposes its entire API as HTTP requests. The 
following operations are available:
  
   * '''full-import''' : Full Import operation can be started by hitting the 
URL `http://<host>:<port>/solr/dataimport?command=full-import`
- 
    * This operation is started in a new thread, and the ''status'' 
attribute in the response should now show ''busy''.
    * The operation may take some time, depending on the size of the dataset.
-   * When full-import command is executed, it stores the start time of the 
operation in a file located at ''conf/dataimport.properties''  ([[#Configuring 
The Property Writer|this file is configurable]])
+   * When full-import command is executed, it stores the start time of the 
operation in a file located at ''conf/dataimport.properties''  
([[#Configuring_The_Property_Writer|this file is configurable]])
    * This stored timestamp is used when a delta-import operation is executed.
    * Queries to Solr are not blocked during full-imports.
    * It takes in extra parameters:
     * '''entity''' : Name of an entity directly under the <document> tag. Use 
this to execute one or more entities selectively. Multiple 'entity' parameters 
can be passed on to run multiple entities at once. If nothing is passed, all 
entities are executed.
     * '''clean''' : (default 'true'). Tells whether to clean up the index 
before the indexing is started.
     * '''commit''' : (default 'true'). Tells whether to commit after the 
operation.
-    * '''optimize''' : (default 'true' up to Solr 3.6, 'false' afterwards). 
Tells whether to optimize after the operation. Please note: this can be a very 
expensive operation and usually does not make sense for delta-imports. 
+    * '''optimize''' : (default 'true' up to Solr 3.6, 'false' afterwards). 
Tells whether to optimize after the operation. Please note: this can be a very 
expensive operation and usually does not make sense for delta-imports.
     * '''debug''' : (default 'false'). Runs in debug mode. It is used by the 
interactive development mode ([[#interactive|see here]]).
- 
      * Please note that in debug mode, documents are never committed 
automatically. If you want to run debug mode and commit the results too, add 
'commit=true' as a request parameter.
   * '''delta-import''' : For incremental imports and change detection, run the 
command `http://<host>:<port>/solr/dataimport?command=delta-import` . It 
supports the same clean, commit, optimize and debug parameters as the 
full-import command.
   * '''status''' : To check the status of the current command, hit the URL 
`http://<host>:<port>/solr/dataimport` . It returns detailed statistics on the 
number of docs created and deleted, queries run, rows fetched, status, etc.
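
As a rough illustration of how the parameters above combine into a request, the following Java sketch assembles a command URL. The base URL `http://localhost:8983/solr/dataimport`, the entity name `item`, and the class and method names are assumptions made for this example; only the parameter names (`command`, `entity`, `clean`, `commit`) come from the list above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: builds a DIH command URL from request parameters.
class DihCommandUrl {

    // Appends command=<command> plus each extra parameter, URL-encoding names and values.
    static String build(String handlerUrl, String command, Map<String, String> params) {
        StringBuilder url = new StringBuilder(handlerUrl)
                .append("?command=").append(encode(command));
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append('&').append(encode(p.getKey())).append('=').append(encode(p.getValue()));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("entity", "item");  // assumed entity name
        params.put("clean", "false");  // keep the existing index
        params.put("commit", "true");  // commit after the operation
        System.out.println(build("http://localhost:8983/solr/dataimport", "full-import", params));
    }
}
```

A `Map` holds one value per name, so passing several 'entity' parameters at once would need a list of name/value pairs instead.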
@@ -339, +337 @@

   . {{{
   deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataView}"
  }}}
- 
    . Changed to:
- 
   {{{
   deltaQuery="SELECT MAX(did) AS did FROM ${dataimporter.request.dataView}"
  }}}
  
  = Configuring The Property Writer =
- <!> [[Solr4.1]] 
- Add the tag 'propertyWriter' directly under the 'dataConfig' tag.  The 
property "last_index_time" is converted to text and stored in the properties 
file and is available for the next import as the variable 
'${dih.last_index_time}' . This tag gives control over how this properties file 
is written.
+ <!> [[Solr4.1]]  Add the tag 'propertyWriter' directly under the 'dataConfig' 
tag.  The property "last_index_time" is converted to text and stored in the 
properties file and is available for the next import as the variable 
'${dih.last_index_time}' . This tag gives control over how this properties file 
is written.
  
  {{{
  <propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" 
type="SimplePropertiesWriter" directory="data" filename="my_dih.properties" 
locale="en_US" />
  }}}
-  * This tag is optional, resulting in the default locale,directory and 
filename.  The 'type' will default to SimplePropertiesWriter for non-SolrCloud 
installations.  For SolrCloud, ZKPropertiesWriter is default. 
+  * This tag is optional; omitting it results in the default locale, directory 
and filename.  The 'type' will default to SimplePropertiesWriter for 
non-SolrCloud installations.  For SolrCloud, ZKPropertiesWriter is default.
   * 'type' - the implementation class.  This is required unless 
<propertyWriter /> is omitted entirely.
   * 'filename' - (SimplePropertiesWriter) The default is the name of the 
request handler followed by ".properties", for instance, dataimport.properties
   * 'directory' -(SimplePropertiesWriter) The default is "conf".
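
To make the role of the stored timestamp more concrete, here is a small Java sketch that reads "last_index_time" back from properties text using the 'dateFormat' shown above. It assumes the file is in plain `java.util.Properties` format; the class name, method name and sample value are invented for the example and are not DIH code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Properties;

// Hypothetical reader, not DIH code: parses last_index_time from properties text.
class LastIndexTimeReader {

    static Date read(String propertiesText, String dateFormat) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
            String stored = props.getProperty("last_index_time");
            if (stored == null) {
                throw new IllegalArgumentException("last_index_time is missing");
            }
            return new SimpleDateFormat(dateFormat, Locale.ROOT).parse(stored);
        } catch (IOException | ParseException e) {
            throw new IllegalArgumentException("unreadable last_index_time", e);
        }
    }

    public static void main(String[] args) {
        // java.util.Properties escapes ':' as '\:' when it stores a value.
        String text = "last_index_time=2012-12-14 09\\:39\\:05\n";
        System.out.println(read(text, "yyyy-MM-dd HH:mm:ss"));
    }
}
```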
@@ -532, +527 @@

  A transformer can be used to alter the value of a field fetched from the 
datasource or to populate an undefined field. If the action of the transformer 
fails, say a regex fails to match, then an existing field will be unaltered and 
an undefined field will remain undefined. The chaining effect described above 
allows a column's value to be altered again and again by successive 
transformers. A transformer may make use of other entity fields in the course 
of massaging a column's value.
  
  === RegexTransformer ===
- There is an built-in transformer called '!RegexTransfromer' provided with 
DIH. It helps in extracting or manipulating values from fields (from the 
source) using Regular Expressions. The actual class name is 
`org.apache.solr.handler.dataimport.RegexTransformer`. But as it belongs to the 
default package the package-name can be omitted.
+ There is a built-in transformer called '!RegexTransformer' provided with 
DIH. It helps in extracting or manipulating values from fields (from the 
source) using Regular Expressions. The actual class name is 
`org.apache.solr.handler.dataimport.RegexTransformer`. Because it belongs to 
the default package, the package name can be omitted.
  
  '''Attributes'''
  
- !RegexTransfromer is only activated for fields with an attribute of 'regex' 
or 'splitBy'. Other fields are ignored.
+ !RegexTransformer is only activated for fields with an attribute of 'regex' 
or 'splitBy'. Other fields are ignored.
  
   * '''`regex`''' : The regular expression that is used to match against the 
column or sourceColName's value(s). If `replaceWith` is absent, each regex 
''group'' is taken as a value and a list of values is returned.
   * '''`sourceColName`''' : The column on which the regex is to be applied. If 
this is absent, the source and target are the same.
@@ -562, +557 @@

  In this example the attributes 'regex' and 'sourceColName' are custom 
attributes used by the transformer. It reads the field 'full_name' from the 
resultset and transforms it to two new target fields 'firstName' and 
'lastName'. So even though the query returned only one column 'full_name' in 
the resultset the solr document gets two extra fields 'firstName' and 
'lastName' which are 'derived' fields. These new fields are only created if the 
regexp matches.
  
  The 'emailids' field in the table can be a comma separated value, so it 
yields one or more email ids, and we expect the 'mailId' to be a 
multivalued field in Solr.
+ 
+ The regular expression matching is case-sensitive by default. Use the (?i) 
and/or (?u) embedded flags (u enables Unicode case-folding, i is US-ASCII only) 
to indicate that all or a portion of the expression should be case-insensitive. 
Other flags and behaviours can be set according to Java's regex flavour, cf. 
`java.util.regex`.
+ 
+ {{{
+ <!-- matches Apples and apples -->
+ <field column="just_apples" regex="(?iu)(apples)" />
+ }}}
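
The difference between the two flags can be checked directly against `java.util.regex`. The sketch below is plain Java rather than DIH code, and the class and method names are made up for the example. It uses the letter é (written as the escape `\u00e9`) to show that (?i) alone folds only US-ASCII letters, while (?iu) also folds non-ASCII ones.

```java
import java.util.regex.Pattern;

// Demonstrates Java's embedded case-insensitivity flags; not DIH code.
class RegexFlagDemo {

    // True if the pattern is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("(apples)", "Apples"));       // false: case-sensitive by default
        System.out.println(found("(?i)(apples)", "Apples"));   // true: ASCII letters are folded
        System.out.println(found("(?i)\u00e9", "\u00c9"));     // false: (?i) alone ignores non-ASCII case
        System.out.println(found("(?iu)\u00e9", "\u00c9"));    // true: (?u) adds Unicode case-folding
    }
}
```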
  
  <!> Note that this transformer can either be used to split a string into 
tokens based on a '''`splitBy`''' pattern, or to perform a string substitution 
as per '''`replaceWith`''', or it can assign groups within a pattern to a list 
of '''`groupNames`'''. It decides what it is to do based upon the above 
attributes '''`splitBy`''', '''`replaceWith`''' and  '''`groupNames`''' which 
are looked for in order. The first one found is acted upon and other unrelated 
attributes are ignored.
  
@@ -640, +642 @@

  {{{
  <field column="price" formatStyle="number" />
  }}}
+ By default, !NumberFormat uses the system's default locale to parse the given 
string.  Optionally, specify the Locale to use as shown (see java.util.Locale 
javadoc for more information):
- By default, !NumberFormat uses the system's default locale to parse the given 
string. 
- Optionally, specify the Locale to use as shown (see java.util.Locale javadoc 
for more information):
  
  {{{
  <field column="price" formatStyle="number" locale="de-DE" />
  }}}
- 
  '''Attributes'''
  
  !NumberFormatTransformer applies only on the fields with an attribute 
'formatStyle' .
  
   * '''`formatStyle`''' : The format used for parsing this field. The value of 
the attribute must be one of (number|percent|integer|currency). This uses the 
semantics of Java 
[[http://java.sun.com/j2se/1.4.2/docs/api/java/text/NumberFormat.html|NumberFormat]].
   * '''`sourceColName`''' : The column on which the !NumberFormat is to be 
applied. If this is absent, the source and target are the same.
-  * '''`locale`''' : The locale to be used for parsing the strings. If no 
Locale is specified, Solr4.1 and later defaults to the ROOT Locale (Versions 
prior to Solr4.1 use the current machine's default Locale.)  
+  * '''`locale`''' : The locale to be used for parsing the strings. If no 
Locale is specified, Solr4.1 and later defaults to the ROOT Locale (Versions 
prior to Solr4.1 use the current machine's default Locale.)
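
Why the locale matters can be seen with `java.text.NumberFormat` itself. The following is a minimal Java sketch, not the transformer's implementation; the class and method names are hypothetical, and `Locale.GERMANY` is used on the assumption that it corresponds to the "de-DE" value in the example above.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Shows locale-dependent number parsing with java.text.NumberFormat; not DIH code.
class LocaleNumberParse {

    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number for " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        // German: '.' separates groups and ',' is the decimal mark.
        System.out.println(parse("1.234,56", Locale.GERMANY));
        // ROOT: ',' separates groups and '.' is the decimal mark.
        System.out.println(parse("1,234.56", Locale.ROOT));
    }
}
```

Both calls yield the same numeric value, which is why a source that stores German-formatted prices needs the locale set explicitly once the default is the ROOT Locale.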
  
  === TemplateTransformer ===
  Can be used to overwrite or modify any existing Solr field or to create new 
Solr fields. The value assigned to the field is based on a static template 
string, which can contain DIH variables. If a template string contains 
placeholders or variables they must be defined when the transformer is being 
evaluated. An undefined variable causes the entire template instruction to be 
ignored. eg:
@@ -863, +863 @@

  In the 'where' attribute, the lhs (the part before '=') is the column in y 
and the rhs (the part after '=') is the value to be computed for looking up the 
cache.
  
  An alternate syntax to Example 2 above uses the "cacheKey" and "cacheLookup" 
parameters:
+ 
  {{{
  <entity name="x" query="select * from x">
      <entity name="y" query="select * from y" 
processor="CachedSqlEntityProcessor" cacheKey="xid" cacheLookup="x.id">
@@ -1062, +1063 @@

   * On running `command=full-import`, the root entity (A) is executed first
   * Each row emitted by the 'query' in entity 'A' is fed into its sub 
entities B, C
   * The queries in B and C use a column in 'A' to construct their queries 
using placeholders like `${A.a}`
- 
    * B has a url  (B is an xml/http datasource)
    * C has a query
   * C has two transformers ('f' and 'g' )
@@ -1087, +1087 @@

  While the namespace concept is useful, the user may want to put some 
computed value into the query or url. For example, there may be a Date object 
while your datasource accepts dates only in some custom format.
  
  === formatDate ===
-  Use this to format dates as strings.  It takes three parameters (prior to 
Solr 4.1, it takes two):
+  . Use this to format dates as strings.  It takes three parameters (prior to 
Solr 4.1, it takes two):
    1. A variable that refers to a date, or a datemath expression.
-   2. A date format string.  See java.text.SimpleDateFormat javadoc for valid 
date formats. (Solr 4.1 and later, this must be enclosed in single quotes.  
Solr 1.4 - 4.0, quotes are optional.  Prior to Solr 1.4, this must not be 
enclosed in single quotes)
+   1. A date format string.  See java.text.SimpleDateFormat javadoc for valid 
date formats. (Solr 4.1 and later, this must be enclosed in single quotes.  
Solr 1.4 - 4.0, quotes are optional.  Prior to Solr 1.4, this must not be 
enclosed in single quotes)
-   3. <!> [[Solr4.1]] (optional)  The locale code to use when formatting 
dates, enclosed in single quotes. See java.util.Locale javadoc for details.  If 
omitted, this defaults to the ROOT Locale. (Note: prior to Solr 4.1, formatDate 
would always use the current machine's default locale.)
+   1. <!> [[Solr4.1]] (optional)  The locale code to use when formatting 
dates, enclosed in single quotes. See java.util.Locale javadoc for details.  If 
omitted, this defaults to the ROOT Locale. (Note: prior to Solr 4.1, formatDate 
would always use the current machine's default locale.)
- 
  
   * example using a variable:  `'${dataimporter.functions.formatDate(item.ID, 
'yyyy-MM-dd HH:mm')}'`
   * example using a datemath expression:  
`'${dataimporter.functions.formatDate('NOW-3DAYS', 'yyyy-MM-dd HH:mm')}'`
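
For readers unfamiliar with `java.text.SimpleDateFormat`, the sketch below shows what the pattern 'yyyy-MM-dd HH:mm' produces when combined with the ROOT Locale. It is plain Java, not the DIH evaluator: the class and method names are hypothetical, the UTC time zone is an assumption chosen to make the output predictable, and 'NOW-3DAYS' is only approximated as 72 hours before the current time.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustrates the SimpleDateFormat pattern used in the examples; not the DIH evaluator.
class DateFormatDemo {

    static String format(Date date, String pattern, Locale locale, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, locale);
        fmt.setTimeZone(zone);
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Rough stand-in for the datemath expression NOW-3DAYS.
        Date threeDaysAgo = new Date(System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000);
        System.out.println(format(threeDaysAgo, "yyyy-MM-dd HH:mm",
                Locale.ROOT, TimeZone.getTimeZone("UTC")));
    }
}
```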
@@ -1116, +1115 @@

    </document>
  </dataConfig>
  }}}
- The implementation of !LowerCaseFunctionEvaluator 
+ The implementation of !LowerCaseFunctionEvaluator
  
  <!> [[Solr4.1]] this example depends on API modifications made in Solr 4.1
+ 
  {{{
    public class LowerCaseFunctionEvaluator extends Evaluator{
      public String evaluate(String expression, Context context) {
@@ -1136, +1136 @@

  <<Anchor(interactive)>>
  
  = Interactive Development Mode =
- 
  /!\ '''NOTE:''' The Interactive 'debug' mode only exists in Solr 3.x.  It has 
not yet been implemented in Solr 4.x (see 
[[https://issues.apache.org/jira/browse/SOLR-4151|SOLR-4151]])
  
  This is a new cool and powerful feature in the tool. It helps you build a 
dataconfig.xml with the UI. It can be accessed from 
http://host:port/solr/admin/dataimport.jsp . The features are
@@ -1266, +1265 @@

   * uses ''HTTPPostScheduler'', 
[[http://download.oracle.com/javase/6/docs/api/java/util/Timer.html|java.util.Timer]]
 and context attribute map to facilitate periodic method invocation (scheduling)
   * Timer is essentially a facility for threads to schedule tasks for future 
execution in a background thread.
   * Don't forget to add the following listener declaration to Solr's 
web.xml:<<BR>>
+ 
  {{{
   <listener>
     
<listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
   </listener>
  }}}
   * To make the Scheduler classes available to DIH, place the downloaded jar 
file in your solr.war's web-inf\lib folder (you can either alter the war 
archive before deploying it or place the jar file in the deployed, unpacked 
{{{lib}}} folder under your web server's (typically) {{{webapps}}} folder)
+ 
  {{{
  package org.apache.solr.handler.dataimport.scheduler;
  
@@ -1585, +1586 @@

  
  = Troubleshooting =
   * If you are having trouble indexing international characters, try setting 
the '''encoding''' attribute to "UTF-8" on the dataSource element (example 
below). This should ensure that international character data (stored in UTF8) 
ingested by the given source will be preserved.
- 
    . {{{
     <dataSource type="FileDataSource" encoding="UTF-8"/>
  }}}
