[jira] Issue Comment Edited: (SOLR-1358) Integration of Tika and DataImportHandler

Noble Paul (JIRA) Tue, 08 Dec 2009 07:30:41 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855
 ]


Noble Paul edited comment on SOLR-1358 at 12/8/09 3:29 PM:
-----------------------------------------------------------

Let us provide a new TikaEntityProcessor:

{code:xml}
<dataConfig>
  <!-- Use any data source of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- The value of format can be text|xml|html. The implicit field 'text'
         will have that format. The default is 'text' if not specified. -->
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
            url="${some.var.goes.here}" format="text">
      <!-- Do the appropriate mapping here. meta="true" means it is a
           metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it
           appropriately. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

With format=xml or format=html, an XPathEntityProcessor can be nested inside 
the TikaEntityProcessor entity. This may help users extract nested data from 
a file. It is even possible to create multiple documents from a single file.
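
A minimal sketch of that nesting, not taken from the patch. It assumes the 
implicit 'text' column of the outer entity can be handed to a nested entity 
through a FieldReaderDataSource, and that the XML emitted by Tika has an 
/html/body root with div and p children. The entity names, the forEach path 
and the 'sectionText' column are illustrative only.

{code:xml}
<dataConfig>
  <!-- Binary source for Tika; field-backed source for the nested entity -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <!-- rootEntity="false": Solr documents are created by the nested entity -->
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            tikaConfig="tikaconfig.xml" url="${some.var.goes.here}"
            format="xml" rootEntity="false">
      <!-- Assumption: tika.text holds the XML produced by the outer entity -->
      <entity name="section" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html/body/div">
        <!-- Hypothetical mapping; adjust the xpath to the actual output -->
        <field column="sectionText" xpath="/html/body/div/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

Under these assumptions, each element matched by forEach would yield one 
document, which is how a single file could produce several documents.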

      was (Author: noble.paul):
    Let us provide a new TikaEntityProcessor:

{code:xml}
<dataConfig>
  <!-- Use any data source of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
            url="${some.var.goes.here}">
      <!-- Do the appropriate mapping here. meta="true" means it is a
           metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it
           appropriately. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

This would most likely need a BinURLDataSource or BinContentStreamDataSource, 
because Tika consumes binary input.

My suggestion is that TikaEntityProcessor live in the extraction contrib so 
that managing dependencies is easier. But we will have to make extraction have 
a compile-time dependency on DIH. 

Grant, what do you think?
  
> Integration of Tika and DataImportHandler
> -----------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>            Assignee: Noble Paul
>         Attachments: SOLR-1358.patch, SOLR-1358.patch
>
>
> At the moment, it's impossible to configure Solr such that it builds up 
> documents by using data that comes from both PDF documents and database table 
> columns. Currently, to accomplish this task, it's up to the user to add some 
> preprocessing that converts PDF files into plain text files. Therefore, I 
> would like to see an integration of Solr Cell into DIH that makes that 
> preprocessing obsolete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
