[Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Apache Wiki Sat, 28 Feb 2009 00:27:27 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Improving the documentation on transformers

------------------------------------------------------------------------------
  <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" 
url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document name="products">
-           <entity name="item" pk="ID" 
+           <entity name="item" pk="ID"
                  query="select * from item"
                  deltaImportQuery="select * from item where 
ID='${dataimporter.delta.id}'"
                deltaQuery="select id from item where last_modified > 
'${dataimporter.last_index_time}'">
@@ -445, +445 @@

  
  [[Anchor(transformer)]]
  == Transformer ==
- Every set of fields fetched by the entity can be either consumed directly by 
the indexing process or they can be massaged using transformers to create a 
totally new set of fields or it can even return more than one row of data. The 
transformers must be configured on an entity level as follows.
+ Every set of fields fetched by the entity can either be consumed directly by 
the indexing process or be massaged using transformers to modify a field or 
create a totally new set of fields; a transformer can even return more than one 
row of data. Transformers must be configured at the entity level as follows.
  {{{
  <entity name="foo" transformer="com.foo.Foo" ... />
  }}}
@@ -453, +453 @@

  
  The class 'Foo' must extend the abstract class 
`org.apache.solr.handler.dataimport.Transformer`. The class has only one 
abstract method.
  
- The transformer attribute can consist of a comma separated list of 
transformers (`say transformer="foo.X,foo.Y"`). The transformers are chained in 
this case and they are applied one after the other in the order in which they 
are specified. What this means is that after the fields are fetched from the 
datasource, the list of entity columns are processed one at a time in the order 
listed inside the entity tag and scanned by the first transformer to see if any 
of that transformers attributes are present. If so the transformer does it's 
thing! When all of the listed entity columns have been scanned the process is 
repeated using the next transformer in the list.
+ The entity transformer attribute can consist of a comma-separated list of 
transformers (say `transformer="foo.X,foo.Y"`). In this case the transformers 
are chained and applied one after the other in the order in which they are 
specified. This means that after the fields are fetched from the datasource, 
the entity columns are processed one at a time, in the order listed inside the 
entity tag, and scanned by the first transformer to see whether any of that 
transformer's attributes are present. If so, the transformer acts on the 
column. When all of the listed entity columns have been scanned, the process is 
repeated using the next transformer in the list.
  
  A transformer can be used to alter the value of a field fetched from the 
datasource or to populate an undefined field. If the action of the transformer 
fails, say a regex fails to match, then an existing field will be unaltered and 
an undefined field will remain undefined. The chaining effect described above 
allows a column's value to be altered again and again by successive 
transformers. A transformer may make use of other entity fields in the course 
of massaging a column's value.
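
The chaining and failure behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the DataImportHandler implementation: the `RowTransformer` interface, the `regexReplace` and `upperCase` helpers, and the column names are all hypothetical, and only the `transformRow(Map<String, Object> row)` shape is taken from this page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TransformerChainSketch {

    /** Hypothetical stand-in for a transformer: receives a row and returns a row. */
    interface RowTransformer {
        Map<String, Object> transformRow(Map<String, Object> row);
    }

    /**
     * Sets 'target' from a regex replacement on 'source'.
     * If the source is undefined or the regex does not match, the row is left as it was.
     */
    static RowTransformer regexReplace(String source, String target,
                                       String regex, String replaceWith) {
        Pattern pattern = Pattern.compile(regex);
        return row -> {
            Object value = row.get(source);
            if (value == null) {
                return row; // undefined source column: nothing to do
            }
            Matcher m = pattern.matcher(value.toString());
            if (m.find()) {
                // replaceAll resets the matcher before replacing
                row.put(target, m.replaceAll(replaceWith));
            }
            return row;
        };
    }

    /** Upper-cases a column if it is defined. */
    static RowTransformer upperCase(String column) {
        return row -> {
            Object value = row.get(column);
            if (value != null) {
                row.put(column, value.toString().toUpperCase(Locale.ROOT));
            }
            return row;
        };
    }

    /** Applies each transformer in order; each one sees the output of the previous one. */
    static Map<String, Object> applyChain(Map<String, Object> row, List<RowTransformer> chain) {
        Map<String, Object> current = row;
        for (RowTransformer t : chain) {
            current = t.transformRow(current);
        }
        return current;
    }

    static Map<String, Object> demo() {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("path", "/docs/a/report.xml");
        List<RowTransformer> chain = Arrays.asList(
                regexReplace("path", "dir", "(.*)/.*", "$1"),   // populates the undefined 'dir'
                regexReplace("path", "year", "(\\d{4})", "$1"), // no match: 'year' stays undefined
                upperCase("dir"));                              // alters 'dir' a second time
        return applyChain(row, chain);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the second step fails to match, so `year` is never created, while `dir` is first populated and then altered again by a later transformer in the chain.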
  
- {{{
- public abstract class Transformer {
-   /**
-    * The input is a row of data and the output has to be a new row.
-    *
-    * @param context The current context
-    * @param row     A row of data
-    * @return The changed data. It must be a Map<String, Object> if it returns
-    *         only one row or if there are multiple rows to be returned it must
-    *         be a List<Map<String, Object>>
-    */
-   public abstract Object transformRow(Map<String, Object> row, Context 
context);
- }
- }}}
  
- 
- The Context is the abstract class that provides the contextual information 
that may be necessary to process the data.
- 
- Alternately the class `Foo` may choose NOT TO implement this abstract class 
and just write a method with this signature
- {{{
- public Object transformRow(Map<String, Object> row)
- }}}
- 
- So there is no compile-time dependency on the !DataImportHandler API
- 
- 
- The configuration has a 'flexible' schema. It lets the user provide arbitrary 
attributes in an 'entity' tag  and 'field' tags. The tool reads the data and 
hands it over to the implementation class as it is. If the 'Transformer' needs 
extra information to be provided on a per entity/field basis it can get them 
from the context.
  
  === RegexTransformer ===
  
@@ -641, +615 @@

  ==== Attributes ====
   * '''`clob`''' : Boolean value to signal if !ClobTransformer should process 
this field or not.
  
+ [[Anchor(example-transformers)]]
+ === Transformers Example ===
+ The following example shows transformer chaining in action along with 
extensive reuse of variables. An invariant is defined in the solrconfig.xml and 
reused within some transforms. Column names from both entities are also used in 
transforms.
+ 
+ Imagine we have XML documents, each of which describes a set of images. The 
images are stored in an images subdirectory alongside the XML document. An 
attribute storing an image's filename is accompanied by a brief caption and a 
relative link to another document holding a longer description of the image. 
Finally, the image name prefixed with an 's' gives the name of a smaller, 
icon-sized version of the image, which is always a png. We want Solr to store 
fields containing the absolute link to the image, its icon and the full 
description. The following shows one way we could configure solrconfig.xml and 
DIH's data-config.xml to index this data.
+ 
+ {{{
+   <requestHandler name="/dataimport" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
+     <lst name="defaults">
+        <str name="config">data-config.xml</str>
+     </lst>
+     <lst name="invariants">
+        <!-- Pass through the prefix which needs to be stripped from
+             an absolute disk path to give an absolute web path  -->
+        <str name="img_installdir">/usr/local/apache2/htdocs</str>
+     </lst>
+   </requestHandler>
+ }}}
+ 
+ 
+ {{{
+  <dataConfig>
+  <dataSource name="myfilereader" type="FileDataSource"/>
+    <document>
+      <entity name="jc" rootEntity="false" dataSource="null"
+            processor="FileListEntityProcessor"
+            fileName="^.*\.xml$" recursive="true"
+            baseDir="/usr/local/apache2/htdocs/imagery">
+        <entity name="x" rootEntity="true"
+              dataSource="myfilereader"
+              processor="XPathEntityProcessor"
+              url="${jc.fileAbsolutePath}"
+              stream="false" forEach="/mediaBlock"
+              
transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
+ 
+          <field column="fileAbsPath"     template="${jc.fileAbsolutePath}" />
+ 
+          <field column="fileWebPath"     template="${x.fileAbsPath}"
+                                          
regex="${dataimporter.request.img_installdir}(.*)" replaceWith="$1"/>
+ 
+          <field column="fileWebDir"      regex="(.*)/.*" replaceWith="$1" 
sourceColName="fileWebPath"/>
+ 
+          <field column="imgFilename"     xpath="/mediaBlock/@url" />
+          <field column="imgCaption"      xpath="/mediaBlock/caption"  />
+          <field column="imgSrcArticle"   xpath="/mediaBlock/source"
+                                          
template="${x.fileWebDir}/${x.imgSrcArticle}/"/>
+ 
+          <field column="uid"             regex="(.*)" 
replaceWith="$1#${x.imgFilename}" sourceColName="fileWebPath"/>
+ 
+          <!-- if imgFilename is not defined all the following will also not 
be defined -->
+          <field column="imgWebPathFULL"  
template="${x.fileWebDir}/images/${x.imgFilename}"/>
+          <field column="imgWebPathICON"  regex="(.*)\.\w+$" 
replaceWith="${x.fileWebDir}/images/s$1.png"
+                                          sourceColName="imgFilename"/>
+ 
+        </entity>
+      </entity>
+    </document>
+   </dataConfig>
+ }}}
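
To make the field derivations in this configuration easier to follow, here is a rough Java sketch that applies the same regular expressions by hand to one sample row. It assumes that a `regex`/`replaceWith` pair behaves like `String.replaceAll`, and the sample XML path and the image name `sunset.jpg` are invented for illustration; the class and method names are hypothetical and no Solr classes are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;

class ImagePathSketch {

    /**
     * Derives the web-facing fields from an absolute file path, the install
     * directory prefix and an image filename, mirroring the field order of
     * the configuration above.
     */
    static Map<String, String> derive(String fileAbsPath, String installDir, String imgFilename) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("fileAbsPath", fileAbsPath);

        // fileWebPath: strip the install directory prefix (used as a regex as-is,
        // as in the config, so it is assumed to contain no regex metacharacters)
        String fileWebPath = fileAbsPath.replaceAll(installDir + "(.*)", "$1");
        row.put("fileWebPath", fileWebPath);

        // fileWebDir: everything before the last '/'
        String fileWebDir = fileWebPath.replaceAll("(.*)/.*", "$1");
        row.put("fileWebDir", fileWebDir);

        // imgWebPathFULL: plain template substitution
        row.put("imgWebPathFULL", fileWebDir + "/images/" + imgFilename);

        // imgWebPathICON: drop the extension, prefix the name with 's', force .png
        String replacement = Matcher.quoteReplacement(fileWebDir) + "/images/s$1.png";
        row.put("imgWebPathICON", imgFilename.replaceAll("(.*)\\.\\w+$", replacement));
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = derive(
                "/usr/local/apache2/htdocs/imagery/news/story.xml", // hypothetical file
                "/usr/local/apache2/htdocs",                        // img_installdir invariant
                "sunset.jpg");                                      // hypothetical image name
        row.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

For this sample input, `fileWebDir` comes out as `/imagery/news` and `imgWebPathICON` as `/imagery/news/images/ssunset.png`.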
+ 
  [[Anchor(custom-transformers)]]
- == Writing Custom Transformers ==
+ === Writing Custom Transformers ===
- [:DIHCustomTransformer:see here]
+ It is simple to add your own transformers; this is documented on the page 
[:DIHCustomTransformer:DIHCustomTransformer]
  
  [[Anchor(entityprocessor)]]
  == EntityProcessor ==
