Re: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Noble Paul നോബിള്‍ नोब्ळ् Tue, 06 Oct 2009 02:20:13 -0700

The wiki has eaten up a lot of documentation

On Tue, Oct 6, 2009 at 1:54 PM, Apache Wiki <wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
> notification.
>
> The "DataImportHandler" page has been changed by FergusMcMenemie:
> http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=212&rev2=213
>
>     xpath="/a/b/subject[@qualifier='fullTitle']"
>     xpath="/a/b/subject/@qualifier"
>     xpath="/a/b/c"
> + }}}
> + <!> new for [[Solr1.4]]
> + {{{
> +    xpath="//a/..."
> +    xpath="/a//b..."
>  }}}
>
>
> @@ -768, +773 @@
>
>      <document>
>          <entity name="f" processor="FileListEntityProcessor" 
> baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" 
> recursive="true" rootEntity="false" dataSource="null">
>              <entity name="x" processor="XPathEntityProcessor" 
> forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
> +                 <field column="full_name" xpat0Aand can be used as a 
> !DataSource. It must be3A//abc.com/a.txt" dataSource="data-source-name">
> +    <!-- copies the text to a field called 'text' in Solr-->
> +   <field column="plainText" name="text"/>
> -                 <field column="full_name" xpath="/field/xpath"/>
> -             </entity>
> -         </entity>
> -     </document>
> - </dataConfig>
> - }}}
> - Do not miss the `rootEntity` attribute. The implicit fields generated by 
> the !FileListEntityProcessor are `fileAbsolutePath, fileSize, 
> fileLastModified, fileName` and these are available for use within the entity 
> x as shown above. Note that !FileListEntityProcessor returns a list of 
> pathnames, so the subsequent entity must use the !FileDataSource to fetch 
> each file's content.
> -
> - === CachedSqlEntityProcessor ===
> - <<Anchor(cached)>>
> -
> - This is an extension of the !SqlEntityProcessor.  This !EntityProcessor 
> helps reduce the number of DB queries executed by caching the rows. It does 
> not help to use it in the root-most entity because only one SQL query is run 
> for that entity.
> -
> - Example 1.
> - {{{
> - <entity name="x" query="select * from x">
> -     <entity name="y" query="select * from y where xid=${x.id}" 
> processor="CachedSqlEntityProcessor">
> -     </entity>
> - </entity>
> + </entity>
>  }}}
>
> - The usage is exactly the same as for !SqlEntityProcessor. When a query is 
> run, the results are stored; if the same query is run again, the rows are 
> fetched from the cache and returned.
> + Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, 
> URL!DataSource)
>
> - Example 2:
> - {{{
> - <entity name="x" query="select * from x">
> -     <entity name="y" query="select * from y" 
> processor="CachedSqlEntityProcessor"  where="xid=x.id">
> -     </entity>
> - </entity>
> - }}}
> -
> - The difference from the previous example is the 'where' attribute. In this 
> case the query fetches all the rows from the table and stores them in the 
> cache. The magic is in the 'where' value. The cache stores the rows with the 
> 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time 
> the entity has to be run, the value is looked up in the cache, and the 
> matching rows are returned.
> -
> - In the 'where' attribute, the lhs (the part before '=') is the column in y 
> and the rhs (the part after '=') is the value to be computed for looking up 
> the cache.
> -
> - === PlainTextEntityProcessor ===
> + === LineEntityProcessor ===
> - <<Anchor(plaintext)>>
> + <<Anchor(LineEntityProcessor)>>
>  <!> [[Solr1.4]]
>
> - This !EntityProcessor reads all content from the data source into a single 
> implicit field called 'plainText'. The content is not parsed in any way; 
> however, you may add transformers to manipulate the data within 'plainText' 
> as needed or to create additional fields.
> + This !EntityProcessor reads all content from the data source line by line; 
> a field called 'rawLine' is returned for each line read. The content is not 
> parsed in any way; however, you may add transformers to manipulate the data 
> within 'rawLine' or to create additional fields.
>
> + The lines read can be filtered by two regular expressions, 
> '''acceptLineRegex''' and '''omitLineRegex'''.
> + This entity's additional attributes are:
> +  * '''`url`''' : a required attribute that specifies the location of the 
> input file in a way that is compatible with the configured datasource. If 
> this value is relative and you are using !FileDataSource or URL!DataSource, 
> it is assumed to be relative to '''baseLoc'''.
> +  * '''`acceptLineRegex`''' : an optional attribute that, if present, 
> discards any line which does not match the regExp.
> +  * '''`omitLineRegex`''' : an optional attribute that is applied after any 
> acceptLineRegex and discards any line which matches this regExp.
>  example:
>  {{{
> - <entity processor="PlainTextEntityProcessor" name="x" 
> url="http://abc.com/a.txt" dataSource="data-source-name">
> + <entity name="jc"
> +         processor="LineEntityProcessor"
> +         acceptLineRegex="^.*\.xml$"
> +         omitLineRegex="/obsolete"
> +         url="file:///Volumes/ts/files.lis"
> +         rootEntity="false"
> +         dataSource="myURIreader1"
> +         transformer="RegexTransformer,DateFormatTransformer"
> +         >
> +    ...
> + }}}
> + While there are use cases where you might need to create a Solr document 
> per line read from a file, it is expected that in most cases the lines read 
> will consist of a pathname which is in turn consumed by another 
> !EntityProcessor
> + such as X!PathEntityProcessor.
> +
> + == DataSource ==
> + <<Anchor(datasource)>>
> + A class can extend `org.apache.solr.handler.dataimport.DataSource` . 
> [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See
>  source]]
> +
> + and can be used as a !DataSource. It must be3A//abc.com/a.txt" 
> dataSource="data-source-name">
>     <!-- copies the text to a field called 'text' in Solr-->
>    <field column="plainText" name="text"/>
>  </entity>
>
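
For readers following the quoted LineEntityProcessor text, the nesting it describes (each line is a pathname consumed by an XPathEntityProcessor) might look roughly like the sketch below. This is an illustration under assumptions, not the wiki's own example: the inner data source name `myFileReader` and both `type` values are hypothetical, the `forEach` and field `xpath` values are the placeholders from the FileListEntityProcessor example in the diff, and `${jc.rawLine}` assumes the 'rawLine' field is referenced the same way `${f.fileAbsolutePath}` is.

```xml
<dataConfig>
    <!-- Assumption: a URL-based reader for the list file, named as in the quoted example -->
    <dataSource name="myURIreader1" type="URLDataSource"/>
    <!-- Hypothetical name: a file reader used to open each pathname found in the list -->
    <dataSource name="myFileReader" type="FileDataSource"/>
    <document>
        <!-- Outer entity: one row per accepted line; rootEntity="false" so that
             lines themselves do not become Solr documents -->
        <entity name="jc"
                processor="LineEntityProcessor"
                acceptLineRegex="^.*\.xml$"
                omitLineRegex="/obsolete"
                url="file:///Volumes/ts/files.lis"
                rootEntity="false"
                dataSource="myURIreader1">
            <!-- Inner entity: treats each line as the pathname of an XML file to parse -->
            <entity name="x"
                    processor="XPathEntityProcessor"
                    forEach="/the/record/xpath"
                    url="${jc.rawLine}"
                    dataSource="myFileReader">
                <field column="full_name" xpath="/field/xpath"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

With this layout a line is kept only if it matches acceptLineRegex and does not match omitLineRegex, as the attribute list in the diff describes.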




-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
