[Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Apache Wiki Fri, 01 May 2009 04:53:03 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Add a description of the new LineEntityProcessor

------------------------------------------------------------------------------
  A simple entity processor which can be used to enumerate the list of files 
from a File System based on some criteria. It does not use a !DataSource. The 
entity attributes are:
   * '''`fileName`''' : (required) A regex pattern to identify files
   * '''`baseDir`''' : (required) The base directory (absolute path)
-  * '''`recursive`''' : Recursive listing or not.default is 'false '
+  * '''`recursive`''' : Recursive listing or not. Default is 'false'
   * '''`excludes`''' : A Regex pattern of excluded file names
   * '''`newerThan`''' : A date param. Use the format (`yyyy-MM-dd HH:mm:ss`). 
It can also be a datemath string, e.g. ('NOW-3DAYS'); the single quotes are 
necessary. Or it can be a valid variable resolver format like (${var.name})
   * '''`olderThan`''' : A date param. Same rules as above
@@ -796, +796 @@

  [[Anchor(plaintext)]]
  <!> ["Solr1.4"]
  
- This works mostly like an X!PathEntityProcessor. The only difference is that 
it does not parse the content. It just gives out the whole content as one big 
String . It produces one implicit field called 'plainText' .
+ This !EntityProcessor reads all content from the data source into a single 
implicit field called 'plainText'. The content is not parsed in any way; 
however, you may add transformers to manipulate the data within 'plainText' as 
needed or to create additional fields. 
  
  example:
  {{{
@@ -806, +806 @@

  <entity>
  }}}
  
+ === LineEntityProcessor ===
+ [[Anchor(LineEntityProcessor)]]
+ <!> ["Solr1.4"]
+ 
+ This !EntityProcessor reads all content from the data source line by line 
and returns a field called 'rawLine' for each line read. The content is not 
parsed in any way; however, you may add transformers to manipulate the data 
within 'rawLine' or to create additional fields. 
+ 
+ The lines read can be filtered by two regular expressions, 
'''acceptLineRegex''' and '''omitLineRegex'''.
+ This entity's additional attributes are:
+  * '''`url`''' : a required attribute that specifies the location of the 
input file in a way that is compatible with the configured datasource. If this 
value is relative and you are using !FileDataSource or URL!DataSource, it is 
assumed to be relative to '''baseLoc'''.
+  * '''`acceptLineRegex`''' : an optional attribute that, if present, discards 
any line which does not match the regular expression.
+  * '''`omitLineRegex`''' : an optional attribute that is applied after any 
acceptLineRegex and discards any line which matches the regular expression.
+ example:
+ {{{
+ <entity name="jc"
+         processor="LineEntityProcessor"
+         acceptLineRegex="^.*\.xml$"
+         omitLineRegex="/obsolete"
+         url="file:///Volumes/ts/files.lis"
+         rootEntity="false"
+         dataSource="myURIreader1"
+         transformer="RegexTransformer,DateFormatTransformer"
+         >
+    ...
+ }}}
+ While there are use cases where you might need to create a Solr document per 
line read from a file, it is expected that in most cases the lines read will 
each consist of a pathname which is in turn consumed by another !EntityProcessor 
such as X!PathEntityProcessor. 
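
The filtering order described for `acceptLineRegex` and `omitLineRegex` can be illustrated with a minimal standalone Java sketch. This is not the Solr implementation: the class `LineFilterSketch`, the method `filterLines`, and the sample pathnames are hypothetical, and the sketch assumes that "match" means the pattern is found anywhere in the line (`Matcher.find()`) rather than having to cover the whole line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LineFilterSketch {

    // Hypothetical helper, not part of Solr. Either regex may be null,
    // mirroring the fact that both attributes are optional.
    static List<String> filterLines(List<String> lines,
                                    String acceptLineRegex,
                                    String omitLineRegex) {
        Pattern accept = acceptLineRegex == null ? null : Pattern.compile(acceptLineRegex);
        Pattern omit = omitLineRegex == null ? null : Pattern.compile(omitLineRegex);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Step 1: discard any line the accept pattern is not found in.
            // (Assumption: find() semantics, not a whole-line match.)
            if (accept != null && !accept.matcher(line).find()) {
                continue;
            }
            // Step 2: applied after the accept check; discard any line
            // the omit pattern is found in.
            if (omit != null && omit.matcher(line).find()) {
                continue;
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of a file list.
        List<String> lines = Arrays.asList(
                "/docs/a.xml",
                "/docs/obsolete/b.xml",
                "/docs/readme.txt");
        // Same regular expressions as in the wiki example above.
        System.out.println(filterLines(lines, "^.*\\.xml$", "/obsolete"));
        // prints [/docs/a.xml]
    }
}
```

With the regular expressions from the wiki example, only the `.xml` path outside `/obsolete` survives. Each surviving line is what would be exposed as 'rawLine' to transformers or to a nested entity.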
  
  == DataSource ==
  [[Anchor(datasource)]]
