The wiki has eaten up a lot of documentation

On Tue, Oct 6, 2009 at 1:54 PM, Apache Wiki <wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change
> notification.
>
> The "DataImportHandler" page has been changed by FergusMcMenemie:
> http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=212&rev2=213
>
> xpath="/a/b/subje...@qualifier='fullTitle']"
> xpath="/a/b/subject/@qualifier"
> xpath="/a/b/c"
> + }}}
> + <!> new for [[Solr1.4]]
> + {{{
> + xpath="//a/..."
> + xpath="/a//b..."
> }}}
>
>
> @@ -768, +773 @@
>
> <document>
> <entity name="f" processor="FileListEntityProcessor"
> baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'"
> recursive="true" rootEntity="false" dataSource="null">
> <entity name="x" processor="XPathEntityProcessor"
> forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
> + <field column="full_name" xpat0Aand can be used as a
> !DataSource. It must be3A//abc.com/a.txt" dataSource="data-source-name">
> + <!-- copies the text to a field called 'text' in Solr-->
> + <field column="plainText" name="text"/>
> - <field column="full_name" xpath="/field/xpath"/>
> - </entity>
> - </entity>
> - </document>
> - </dataConfig>
> - }}}
> - Do not miss the `rootEntity` attribute. The implicit fields generated by
> the !FileListEntityProcessor are `fileAbsolutePath, fileSize,
> fileLastModified, fileName` and these are available for use within the entity
> X as shown above. It should be noted that !FileListEntityProcessor returns a
> list of pathnames and that the subsequent entity must use the !FileDataSource
> to fetch the files content.
> -
> - === CachedSqlEntityProcessor ===
> - <<Anchor(cached)>>
> -
> - This is an extension of the !SqlEntityProcessor. This !EntityProcessor
> helps reduce the no: of DB queries executed by caching the rows. It does not
> help to use it in the root most entity because only one sql is run for the
> entity.
> -
> - Example 1.
> - {{{
> - <entity name="x" query="select * from x">
> - <entity name="y" query="select * from y where xid=${x.id}"
> processor="CachedSqlEntityProcessor">
> - </entity>
> - <entity>
> + </entity>
> }}}
>
> - The usage is exactly same as the other one. When a query is run the results
> are stored and if the same query is run again it is fetched from the cache
> and returned
> + Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource,
> URL!DataSource)
>
> - Example 2:
> - {{{
> - <entity name="x" query="select * from x">
> - <entity name="y" query="select * from y"
> processor="CachedSqlEntityProcessor" where="xid=x.id">
> - </entity>
> - <entity>
> - }}}
> -
> - The difference with the previous one is the 'where' attribute. In this case
> the query fetches all the rows from the table and stores all the rows in the
> cache. The magic is in the 'where' value. The cache stores the values with
> the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every
> time the entity has to be run and the value is looked up in the cache an the
> rows are returned.
> -
> - In the where the lhs (the part before '=') is the column in y and the rhs
> (the part after '=') is the value to be computed for looking up the cache.
> -
> - === PlainTextEntityProcessor ===
> + === LineEntityProcessor ===
> - <<Anchor(plaintext)>>
> + <<Anchor(LineEntityProcessor)>>
> <!> [[Solr1.4]]
>
> - This !EntityProcessor reads all content from the data source into an single
> implicit field called 'plainText'. The content is not parsed in any way,
> however you may add transformers to manipulate the data within 'plainText' as
> needed or to create other additional fields.
> + This !EntityProcessor reads all content from the data source on a line by
> line basis, a field called 'rawLine' is returned for each line read. The
> content is not parsed in any way, however you may add transformers to
> manipulate the data within 'rawLine' or to create other additional fields.
>
> + The lines read can be filtered by two regular expressions
> '''acceptLineRegex''' and '''omitLineRegex'''.
> + This entities additional attributes are:
> + * '''`url`''' : a required attribute that specifies the location of the
> input file in a way that is compatible with the configured datasource. If
> this value is relative and you are using !FileDataSource or URL!DataSource,
> it assumed to be relative to '''baseLoc'''.
> + * '''`acceptLineRegex`''' :an optional attribute that if present discards
> any line which does not match the regExp.
> + * '''`omitLineRegex`''' : an optional attribute that is applied after any
> acceptLineRegex and discards any line which matches this regExp.
> example:
> {{{
> - <entity processor="PlainTextEntityProcessor" name="x"
> url="http://abc.com/a.txt" dataSource="data-source-name">
> + <entity name="jc"
> + processor="LineEntityProcessor"
> + acceptLineRegex="^.*\.xml$"
> + omitLineRegex="/obsolete"
> + url="file:///Volumes/ts/files.lis"
> + rootEntity="false"
> + dataSource="myURIreader1"
> + transformer="RegexTransformer,DateFormatTransformer"
> + >
> + ...
> + }}}
> + While there are use cases where you might need to create a solr document
> per line read from a file, it is expected that in most cases that the lines
> read will consist of a pathname which is in turn consumed by another
> !EntityProcessor
> + such as X!PathEntityProcessor.
> +
> + == DataSource ==
> + <<Anchor(datasource)>>
> + A class can extend `org.apache.solr.handler.dataimport.DataSource` .
> [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See
> source]]
> +
> + and can be used as a !DataSource. It must be3A//abc.com/a.txt"
> dataSource="data-source-name">
> <!-- copies the text to a field called 'text' in Solr-->
> <field column="plainText" name="text"/>
> </entity>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com