The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Improving the documentation on transformers

------------------------------------------------------------------------------
  <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document name="products">
-         <entity name="item" pk="ID"
+         <entity name="item" pk="ID"
                  query="select * from item"
                  deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'"
                  deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
@@ -445, +445 @@
  [[Anchor(transformer)]]
  == Transformer ==
- Every set of fields fetched by the entity can be either consumed directly by the indexing process or they can be massaged using transformers to create a totally new set of fields or it can even return more than one row of data. The transformers must be configured on an entity level as follows.
+ Every set of fields fetched by the entity can be either consumed directly by the indexing process or they can be massaged using transformers to modify a field or create a totally new set of fields; a transformer can even return more than one row of data. The transformers must be configured on an entity level as follows.
  {{{
  <entity name="foo" transformer="com.foo.Foo" ... />
  }}}
@@ -453, +453 @@
  The class 'Foo' must extend the abstract class `org.apache.solr.handler.dataimport.Transformer`. The class has only one abstract method.
- The transformer attribute can consist of a comma separated list of transformers (say `transformer="foo.X,foo.Y"`). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns is processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so the transformer does its thing! When all of the listed entity columns have been scanned the process is repeated using the next transformer in the list.
+ The entity transformer attribute can consist of a comma separated list of transformers (say `transformer="foo.X,foo.Y"`). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns is processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so the transformer does its thing! When all of the listed entity columns have been scanned the process is repeated using the next transformer in the list. A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. If the action of the transformer fails, say a regex fails to match, then an existing field will be unaltered and an undefined field will remain undefined. The chaining effect described above allows a column's value to be altered again and again by successive transformers. A transformer may make use of other entity fields in the course of massaging a column's value.
- {{{
- public abstract class Transformer {
-   /**
-    * The input is a row of data and the output has to be a new row.
-    *
-    * @param context The current context
-    * @param row A row of data
-    * @return The changed data. It must be a Map<String, Object> if it returns
-    *         only one row or if there are multiple rows to be returned it must
-    *         be a List<Map<String, Object>>
-    */
-   public abstract Object transformRow(Map<String, Object> row, Context context);
- }
- }}}
-
- The Context is the abstract class that provides the contextual information that may be necessary to process the data.
-
- Alternately the class `Foo` may choose NOT TO implement this abstract class and just write a method with this signature
- {{{
- public Object transformRow(Map<String, Object> row)
- }}}
-
- So there is no compile-time dependency on the !DataImportHandler API
-
-
- The configuration has a 'flexible' schema. It lets the user provide arbitrary attributes in an 'entity' tag and 'field' tags. The tool reads the data and hands it over to the implementation class as it is. If the 'Transformer' needs extra information to be provided on a per entity/field basis it can get them from the context.
  === RegexTransformer ===
@@ -641, +615 @@
  ==== Attributes ====
   * '''`clob`''' : Boolean value to signal if !ClobTransformer should process this field or not.
+ [[Anchor(example-transformers)]]
+ === Transformers Example ===
+ The following example shows transformer chaining in action along with extensive reuse of variables. An invariant is defined in the solrconfig.xml and reused within some transforms. Column names from both entities are also used in transforms.
+
+ Imagine we have XML documents, each of which describes a set of images. The images are stored in an images subdirectory alongside the XML document. An attribute storing an image's filename is accompanied by a brief caption and a relative link to another document holding a longer description of the image. Finally, the image name, if preceded by an 's', links to a smaller icon-sized version of the image, which is always a png. We want Solr to store fields containing the absolute link to the image, its icon and the full description. The following shows one way we could configure solrconfig.xml and DIH's data-config.xml to index this data.
+
+ {{{
+ <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
+   <lst name="defaults">
+     <str name="config">data-config.xml</str>
+   </lst>
+   <lst name="invariants">
+     <!-- Pass through the prefix which needs to be stripped from
+          an absolute disk path to give an absolute web path -->
+     <str name="img_installdir">/usr/local/apache2/htdocs</str>
+   </lst>
+ </requestHandler>
+ }}}
+
+
+ {{{
+ <dataConfig>
+   <dataSource name="myfilereader" type="FileDataSource"/>
+   <document>
+     <entity name="jc" rootEntity="false" dataSource="null"
+             processor="FileListEntityProcessor"
+             fileName="^.*\.xml$" recursive="true"
+             baseDir="/usr/local/apache2/htdocs/imagery">
+       <entity name="x" rootEntity="true"
+               dataSource="myfilereader"
+               processor="XPathEntityProcessor"
+               url="${jc.fileAbsolutePath}"
+               stream="false" forEach="/mediaBlock"
+               transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
+
+         <field column="fileAbsPath" template="${jc.fileAbsolutePath}" />
+
+         <field column="fileWebPath" template="${x.fileAbsolutePath}"
+                regex="${dataimporter.request.img_installdir}(.*)" replaceWith="$1"/>
+
+         <field column="fileWebDir" regex="(.*)/.*" replaceWith="$1" sourceColName="fileWebPath"/>
+
+         <field column="imgFilename" xpath="/mediaBlock/@url" />
+         <field column="imgCaption" xpath="/mediaBlock/caption" />
+         <field column="imgSrcArticle" xpath="/mediaBlock/source"
+                template="${x.fileWebDir}/${x.imgSrcArticle}/"/>
+
+         <field column="uid" regex="(.*)" replaceWith="$1#${x.imgFilename}" sourceColName="fileWebPath"/>
+
+         <!-- if imgFilename is not defined all the following will also not be defined -->
+         <field column="imgWebPathFULL" template="${x.fileWebDir}/images/${x.imgFilename}"/>
+         <field column="imgWebPathICON" regex="(.*)\.\w+$" replaceWith="${x.fileWebDir}/images/s$1.png"
+                sourceColName="imgFilename"/>
+
+       </entity>
+     </entity>
+   </document>
+ </dataConfig>
+ }}}
+ [[Anchor(custom-transformers)]]
- == Writing Custom Transformers ==
+ === Writing Custom Transformers ===
- [:DIHCustomTransformer:see here]
+ It is simple to add your own transformers; this is documented on the page [:DIHCustomTransformer:DIHCustomTransformer]
  [[Anchor(entityprocessor)]]
  == EntityProcessor ==
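
A note on the chaining behaviour described in the Transformer section of this change: the rule that each step in the chain sees the output of the previous one, and that a failed regex leaves an existing field unaltered and an undefined field undefined, can be illustrated with a small standalone sketch. This is not DIH code and does not use the DataImportHandler API. The class `TransformerChainSketch` and the methods `regexStep` and `run` are hypothetical names invented for illustration, and the sample path is an assumption modelled on the example configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: none of these names belong to the DataImportHandler API.
final class TransformerChainSketch {

    /** Builds a step that reads sourceCol and, only on a full regex match, writes targetCol. */
    static UnaryOperator<Map<String, Object>> regexStep(
            String targetCol, String sourceCol, String regex, String replaceWith) {
        Pattern pattern = Pattern.compile(regex);
        return row -> {
            Object source = row.get(sourceCol);
            if (source != null) {
                Matcher m = pattern.matcher(source.toString());
                if (m.matches()) {
                    row.put(targetCol, m.replaceFirst(replaceWith));
                }
            }
            // No source value or no match: the row is returned unchanged.
            return row;
        };
    }

    /** Applies three chained steps in order; each step sees the previous step's output. */
    static Map<String, Object> run(Map<String, Object> input) {
        List<UnaryOperator<Map<String, Object>>> chain = new ArrayList<>();
        chain.add(regexStep("fileWebPath", "fileAbsPath", "/usr/local/apache2/htdocs(.*)", "$1"));
        chain.add(regexStep("fileWebDir", "fileWebPath", "(.*)/.*", "$1"));
        chain.add(regexStep("imgWebPathICON", "imgFilename", "(.*)\\.\\w+$", "s$1.png"));

        Map<String, Object> row = new LinkedHashMap<>(input);
        for (UnaryOperator<Map<String, Object>> step : chain) {
            row = step.apply(row);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("fileAbsPath", "/usr/local/apache2/htdocs/imagery/a/doc.xml");
        System.out.println(run(row));
    }
}
```

In this sketch the second step can only populate `fileWebDir` because the first step has already populated `fileWebPath`, and `imgWebPathICON` stays undefined because `imgFilename` was never set. How the real RegexTransformer evaluates its attributes may differ in detail.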
