[
https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855
]
Noble Paul edited comment on SOLR-1358 at 12/9/09 4:48 AM:
-----------------------------------------------------------
Let us provide a new TikaEntityProcessor
{code:xml}
<dataConfig>
<!-- use any of type DataSource<InputStream> -->
<dataSource type="BinURLDataSource"/>
<document>
<!-- The value of format can be text|xml|html|none. This is the format in
which the body (the implicit 'text' field) is emitted.
The default value is 'text' (if not specified). format="none" means the
body is not emitted. -->
<entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
url="${some.var.goes.here}" format="text">
<!-- Do the appropriate mapping here. meta="true" means it is a metadata
field. -->
<field column="Author" meta="true" name="author"/>
<field column="title" meta="true" name="docTitle"/>
<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it
appropriately. -->
<field column="text"/>
</entity>
</document>
</dataConfig>
{code}
With format="xml" or format="html", an XPathEntityProcessor can be nested
inside the TikaEntityProcessor entity. This may help users extract more nested
data from a file. It is even possible to create multiple documents from a
single file.
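A minimal sketch of such nesting follows; it is an illustration, not part of
the attached patch. It assumes the emitted body can be handed to the inner
entity through DIH's FieldReaderDataSource and its dataField attribute. The
entity names ("tika", "sections"), the data source names ("bin", "fld") and
the XPath expressions are hypothetical and depend on the XHTML that Tika
actually emits for a given file.
{code:xml}
<dataConfig>
<dataSource name="bin" type="BinURLDataSource"/>
<dataSource name="fld" type="FieldReaderDataSource"/>
<document>
<!-- rootEntity="false" makes the rows of the nested entity the documents -->
<entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" format="xml"
rootEntity="false">
<!-- assumption: the implicit 'text' field is read as tika.text -->
<entity name="sections" processor="XPathEntityProcessor" dataSource="fld"
dataField="tika.text" forEach="/html/body/div">
<field column="section" xpath="/html/body/div"/>
</entity>
</entity>
</document>
</dataConfig>
{code}
If forEach matches several elements in the emitted body, each match should
yield one row, which is how a single file could produce multiple documents.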
was (Author: noble.paul):
Let us provide a new TikaEntityProcessor
{code:xml}
<dataConfig>
<!-- use any of type DataSource<InputStream> -->
<dataSource type="BinURLDataSource"/>
<document>
<!-- The value of format can be text|xml|html. The implicit field 'text'
will have that format.
The default value is 'text' (if not specified). -->
<entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml"
url="${some.var.goes.here}" format="text">
<!-- Do the appropriate mapping here. meta="true" means it is a metadata
field. -->
<field column="Author" meta="true" name="author"/>
<field column="title" meta="true" name="docTitle"/>
<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it
appropriately. -->
<field column="text"/>
</entity>
</document>
</dataConfig>
{code}
With format="xml" or format="html", an XPathEntityProcessor can be nested
inside the TikaEntityProcessor entity. This may help users extract more nested
data from a file. It is even possible to create multiple documents from a
single file.
> Integration of Tika and DataImportHandler
> -----------------------------------------
>
> Key: SOLR-1358
> URL: https://issues.apache.org/jira/browse/SOLR-1358
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Reporter: Sascha Szott
> Assignee: Noble Paul
> Attachments: SOLR-1358.patch, SOLR-1358.patch, SOLR-1358.patch
>
>
> At the moment, it is impossible to configure Solr so that it builds
> documents from data that comes from both PDF documents and database table
> columns. Currently, to accomplish this task, it is up to the user to add a
> preprocessing step that converts PDF files into plain text files. Therefore, I
> would like to see an integration of Solr Cell into DIH that makes this
> preprocessing obsolete.