DataImportHandler reverted to revision 222 on Solr Wiki

Apache Wiki Wed, 09 Dec 2009 19:51:47 -0800

Dear wiki user,

You have subscribed to a wiki page "Solr Wiki" for change notification.


The page DataImportHandler has been reverted to revision 222 by NoblePaul.
The comment on this change is: data loss.
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=223&rev2=224

--------------------------------------------------

   * '''`excludes`''' : A Regex pattern of excluded file names
   * '''`newerThan`''' : A date param. Use the format (`yyyy-MM-dd HH:mm:ss`). 
It can also be a datemath string, e.g. ('NOW-3DAYS'); the single quotes are 
necessary. Or it can be a valid variableresolver format like (${var.name})
   * '''`olderThan`''' : A date param. Same rules as above
-  * '''`rootEntity`''' :It must be false for this (Unless you wish to just 
index filenames) An entity directly under the <document> is a root entity. That 
means that for each row emitted by the root entity one document is created in 
Solr/Lucene. But as in this case we do not wish to make one document per file. 
We wish to make one document per row emitted by the following entity 'x'. 
Because the entity 'f' has rootEntity=false the entExCachedSqlEntityProcessor"  
where="xid=x.id">
+  * '''`rootEntity`''' : It must be false for this (unless you wish to just 
index filenames). An entity directly under the <document> is a root entity, 
which means that one document is created in Solr/Lucene for each row emitted 
by the root entity. In this case we do not wish to make one document per file; 
we wish to make one document per row emitted by the following entity 'x'. 
Because the entity 'f' has rootEntity=false, the entity directly under it 
becomes a root entity automatically and each row emitted by that becomes a 
document.
+  * '''`dataSource`''' : If you use Solr1.3 it must be set to "null" because 
this does not use any DataSource. There is no need to specify that in Solr1.4. 
It just means that we won't create a DataSource instance. (In most cases there 
is only one !DataSource (a !JdbcDataSource) and all entities just use it. In 
the case of !FileListEntityProcessor a !DataSource is not necessary.)
+ 
+ example:
+ {{{
+ <dataConfig>
+     <dataSource type="FileDataSource" />
+     <document>
+         <entity name="f" processor="FileListEntityProcessor" 
baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" 
recursive="true" rootEntity="false" dataSource="null">
+             <entity name="x" processor="XPathEntityProcessor" 
forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
+                 <field column="full_name" xpath="/field/xpath"/>
+             </entity>
+         </entity>
+     </document>
+ </dataConfig>
+ }}}
+ Do not miss the `rootEntity` attribute. The implicit fields generated by the 
!FileListEntityProcessor are `fileAbsolutePath, fileSize, fileLastModified, 
fileName` and these are available for use within the entity 'x' as shown above. 
Note that !FileListEntityProcessor returns a list of pathnames, and that the 
subsequent entity must use the !FileDataSource to fetch the files' content.
+ 
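The file-listing behaviour described above can be sketched in plain Java. This is an illustration under stated assumptions, not the !FileListEntityProcessor source: the class `FileListSketch`, the method `listFiles`, and the choice to let the regex match anywhere in the file name are all hypothetical. Only the four implicit field names come from the text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical illustration only; not the FileListEntityProcessor implementation.
class FileListSketch {

    // Returns one row per regular file under baseDir whose name matches the regex
    // and whose last-modified time is after newerThan.
    static List<Map<String, Object>> listFiles(Path baseDir, String fileNameRegex,
                                               Instant newerThan, boolean recursive)
            throws IOException {
        Pattern pattern = Pattern.compile(fileNameRegex);
        List<Map<String, Object>> rows = new ArrayList<>();
        int depth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(baseDir, depth)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue;
                }
                String name = p.getFileName().toString();
                Instant modified = Files.getLastModifiedTime(p).toInstant();
                // Assumption: the pattern may match anywhere in the file name.
                if (!pattern.matcher(name).find() || !modified.isAfter(newerThan)) {
                    continue;
                }
                // The four implicit fields named in the text above.
                Map<String, Object> row = new LinkedHashMap<>();
                row.put("fileAbsolutePath", p.toAbsolutePath().toString());
                row.put("fileSize", Files.size(p));
                row.put("fileLastModified", modified);
                row.put("fileName", name);
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dih-sketch");
        Files.writeString(dir.resolve("a.xml"), "<record/>");
        Files.writeString(dir.resolve("b.txt"), "ignored");
        Instant threeDaysAgo = Instant.now().minus(Duration.ofDays(3));
        for (Map<String, Object> row : listFiles(dir, ".*xml", threeDaysAgo, true)) {
            System.out.println(row.get("fileName") + " -> " + row.get("fileAbsolutePath"));
        }
    }
}
```

Each returned row plays the role of one row emitted by entity 'f'; a nested entity would then open `fileAbsolutePath` itself to read the file's content.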
+ === CachedSqlEntityProcessor ===
+ <<Anchor(cached)>>
+ 
+ This is an extension of the !SqlEntityProcessor.  This !EntityProcessor helps 
reduce the number of DB queries executed by caching the rows. It does not help 
to use it in the root-most entity because only one SQL query is run for that 
entity.
+ 
+ Example 1.
+ {{{
+ <entity name="x" query="select * from x">
+     <entity name="y" query="select * from y where xid=${x.id}" 
processor="CachedSqlEntityProcessor">
      </entity>
  </entity>
  }}}
  
- The difference with the previous one is the 'where' attribute. In this case 
the query fetches all the rows from the table and stores all the rows in the 
cache. The magic is in the 'where' value. The cache stores the values with the 
'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time the 
entity has to be run and the value is looked up in the cache and the rows are 
returned.
+ The usage is exactly the same as the other one. When a query is run, the 
results are stored; if the same query is run again, they are fetched from the 
cache and returned.
  
- In the where the lhs (the part before '=') is the column in y and the rhs 
(the part after '=') is the value to be computed for looking up the cache.
- 
- === PlainTextEntityProcessor ===
- <<Anchor(plaintext)>>
- <!> [[Solr1.4]]
- 
- This !EntityProcessor reads all content from the data source into a single 
implicit field called 'plainText'. The content is not parsed in any way, 
however you may add transformers to manipulate the data within 'plainText' as 
needed or to create other additional fields.
- 
- example:
+ Example 2:
  {{{
+ <entity name="x" query="select * from x">
+     <entity name="y" query="select * from y" 
processor="CachedSqlEntityProcessor"  where="xid=x.id">
- <entity processor="PlainTextEntityProcessor" name="x" 
url="http://abc.com/a.txt" dataSource="data-source-name">
-    <!-- copies the text to a field called 'text' in Solr-->
-   <field column="plainText" name="text"/>
- </entity>
- }}}
- 
- Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, 
URL!DataSource)
- 
- === LineEntityProcessor ===
- <<Anchor(LineEntityProcessor)>>
- <!> [[Solr1.4]]
- 
- This !EntityProcessor reads all content from the data source on a line by 
line basis, a field called 'rawLine' is returned for each line read. The 
content is not parsed in any way, however you may add transformers to 
manipulate the data within 'rawLine' or to create other additional fields.
- 
- The lines read can be filtered by two regular expressions 
'''acceptLineRegex''' and '''omitLineRegex'''.
- This entity's additional attributes are:
-  * '''`url`''' : a required attribute that specifies the location of the 
input file in a way that is compatible with the configured datasource. If this 
value is relative and you are using !FileDataSource or URL!DataSource, it is 
assumed to be relative to '''baseLoc'''.
-  * '''`acceptLineRegex`''' :an optional attribute that if present discards 
any line which does not match the regExCachedSqlEntityProcessor"  
where="xid=x.id">
      </entity>
  </entity>
  }}}
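
The two caching styles described in this section (results keyed by the query text, and all rows indexed once by the column named in 'where') can be sketched as follows. This is a minimal illustration, not the !CachedSqlEntityProcessor code: the class `RowCacheSketch`, its methods, and the in-memory stand-in for the database are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not the CachedSqlEntityProcessor implementation.
class RowCacheSketch {

    // Example 1 style: results are cached under the fully resolved query string.
    private final Map<String, List<Map<String, Object>>> byQuery = new HashMap<>();
    int queriesExecuted = 0;

    List<Map<String, Object>> rowsForQuery(String sql,
            Function<String, List<Map<String, Object>>> db) {
        return byQuery.computeIfAbsent(sql, q -> {
            queriesExecuted++; // only reached on a cache miss
            return db.apply(q);
        });
    }

    // Example 2 style: fetch every row once, then group the rows by the key
    // column (the left-hand side of the 'where' attribute, here "xid").
    static Map<Object, List<Map<String, Object>>> indexBy(
            List<Map<String, Object>> allRows, String keyColumn) {
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> row : allRows) {
            index.computeIfAbsent(row.get(keyColumn), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Builds one row of the stand-in table 'y'.
    static Map<String, Object> row(int xid, String name) {
        Map<String, Object> r = new HashMap<>();
        r.put("xid", xid);
        r.put("name", name);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> y = List.of(row(1, "a"), row(1, "b"), row(2, "c"));

        RowCacheSketch cache = new RowCacheSketch();
        cache.rowsForQuery("select * from y where xid=1", q -> y);
        cache.rowsForQuery("select * from y where xid=1", q -> y);
        System.out.println("queries executed: " + cache.queriesExecuted);

        // Look up the rows of y for the current value of x.id (here 1).
        Map<Object, List<Map<String, Object>>> index = indexBy(y, "xid");
        System.out.println("rows for x.id=1: " + index.get(1).size());
    }
}
```

The first style only helps when the same resolved query repeats; the second runs a single query for the whole table and answers every parent row from memory.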
@@ -980, +982 @@

   * ''encodeUrl'' : Use this to encode URLs, e.g. 
`'${dataimporter.functions.encodeUrl(item.ID)}'` . Takes only one argument, 
which must be a valid value in the !VariableResolver
  
  ==== Custom Functions ====
- [[DIHCustomFunctions]]
+ It is possible to plug custom functions into DIH. Implement an 
[[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/Evaluator.java?view=markup|Evaluator]]
 and specify it in the data-config.xml . Following is an example of an 
evaluator which does a 'toLowerCase' on a String.
+ {{{
+ <dataConfig>
+    <function name="toLowerCase" class="foo.LowerCaseFunctionEvaluator"/>
+    <document>
+    <entity query="select * from table where 
name='${dataimporter.functions.toLowerCase(dataimporter.request.user)}'">
+     <!-- ......field declarations...... -->
+    </entity>
+    </document>
+ </dataConfig>
+ }}}
+ 
+ The implementation of !LowerCaseFunctionEvaluator
+ {{{
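+   import java.util.List;
+   import org.apache.solr.handler.dataimport.Context;
+   import org.apache.solr.handler.dataimport.Evaluator;
+   import org.apache.solr.handler.dataimport.EvaluatorBag;
+ 
+   // Evaluator is an abstract class, so it is extended rather than implemented
+   public class LowerCaseFunctionEvaluator extends Evaluator {
+     public String evaluate(String expression, Context context) {
+       // Resolve the function's arguments against the current variables
+       List l = EvaluatorBag.parseParams(expression, 
context.getVariableResolver());
+ 
+       if (l.size() != 1) {
+           throw new RuntimeException("'toLowerCase' must have only one 
parameter");
+       }
+       return l.get(0).toString().toLowerCase();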
+ 
+     }
+ 
+   }
+ }}}
  
  === Accessing request parameters ===
  All HTTP request parameters sent to Solr when using the dataimporter can be 
accessed using the 'request' namespace, e.g. `'${dataimporter.request.command}'` 
will return the command that was run.
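
How a `${...}` placeholder might be expanded, both for plain 'request' variables and for a function call such as `toLowerCase`, can be sketched in self-contained Java. This does not use or reproduce DIH's own !VariableResolver; the class `TemplateSketch`, its two maps, and the regular expressions are hypothetical and handle only single-argument functions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not DIH's VariableResolver or Evaluator machinery.
class TemplateSketch {

    // Matches ${...} and captures the expression between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");
    // Matches dataimporter.functions.NAME(ARG) with a single argument.
    private static final Pattern FUNCTION_CALL =
            Pattern.compile("dataimporter\\.functions\\.(\\w+)\\((.*)\\)");

    final Map<String, String> variables = new HashMap<>();
    final Map<String, Function<String, String>> functions = new HashMap<>();

    // Replaces every ${...} placeholder in the template with its evaluated value.
    String resolve(String template) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(evaluate(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    private String evaluate(String expression) {
        Matcher call = FUNCTION_CALL.matcher(expression);
        if (call.matches()) {
            Function<String, String> fn = functions.get(call.group(1));
            if (fn == null) {
                throw new IllegalArgumentException("Unknown function: " + call.group(1));
            }
            // The argument is itself a variable name and is looked up first.
            return fn.apply(lookup(call.group(2).trim()));
        }
        return lookup(expression);
    }

    // Assumption: unknown variables resolve to the empty string.
    private String lookup(String name) {
        return variables.getOrDefault(name, "");
    }

    public static void main(String[] args) {
        TemplateSketch t = new TemplateSketch();
        t.variables.put("dataimporter.request.command", "full-import");
        t.variables.put("dataimporter.request.user", "ALICE");
        t.functions.put("toLowerCase", String::toLowerCase);

        System.out.println(t.resolve("${dataimporter.request.command}"));
        System.out.println(t.resolve("select * from table where "
                + "name='${dataimporter.functions.toLowerCase(dataimporter.request.user)}'"));
    }
}
```

In this sketch the request parameter is substituted into the SQL text as-is, so a real deployment would need to consider how untrusted parameter values are handled.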
