[jira] Updated: (SOLR-1358) Integration of Tika and DataImportHandler

Noble Paul (JIRA) Tue, 08 Dec 2009 07:26:43 -0800

     [ https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Noble Paul updated SOLR-1358:
-----------------------------

    Comment: was deleted

(was: Configuration with attribute to select format of emitted content:

{code:xml} 
<dataConfig>
  <!-- Use any data source of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- 'emitFormat' can be one of: text | html | xml -->
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
            url="${some.var.goes.here}" emitFormat="xml">
      <!-- Map fields as appropriate; meta="true" marks a metadata field -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor; map it appropriately -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code} 

With 'emitFormat', different EntityProcessors can be chained. For example, 
setting it to "xml" allows chaining XPathEntityProcessor with 
TikaEntityProcessor for further custom processing.)
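
A minimal sketch of such a chain, assuming the 'emitFormat' attribute proposed 
in the comment above and using DIH's FieldReaderDataSource to pass the emitted 
XML to a nested XPathEntityProcessor. The entity names, data source names, and 
XPath expressions are hypothetical, and the XPaths assume that Tika emits XHTML 
with paragraphs under /html/body:

{code:xml}
<dataConfig>
  <!-- Binary source read by TikaEntityProcessor -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <!-- Reads XML from a field of the parent entity (names are illustrative) -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            url="${some.var.goes.here}" emitFormat="xml">
      <field column="text"/>
      <!-- Assumption: 'tika.text' holds the XML emitted by the outer entity -->
      <entity name="body" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html/body">
        <field column="paragraph" xpath="/html/body/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

Whether the nested XPaths match depends on the markup Tika actually produces 
for a given document type, so they would need to be checked against real output.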

> Integration of Tika and DataImportHandler
> -----------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>            Assignee: Noble Paul
>         Attachments: SOLR-1358.patch, SOLR-1358.patch
>
>
> At the moment, it's impossible to configure Solr such that it builds up 
> documents using data that comes from both PDF documents and database table 
> columns. Currently, to accomplish this, the user has to add a 
> preprocessing step that converts PDF files into plain text files. Therefore, I 
> would like to see an integration of Solr Cell into DIH that makes that 
> preprocessing obsolete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
