Store tika extracted result as xhtml

Andy Lam Yin Cong Sat, 17 Oct 2009 08:00:16 -0700

Dear All,

I have a field defined in schema.xml as below,
<fieldtype name="string"  class="solr.StrField" sortMissingLast="true" 
indexed="true" stored="true" multiValued="false" omitNorms="true"/>
<field name="original"     type="string" indexed="false"  />


and in the solrconfig.xml
<str name="fmap.content">original</str>

Basically, when I upload the document via the command below
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&commit=true' -F "fi...@mccm.pdf"

and try to display the field via a query, it shows

Take A Chance On Me      
Take A Chance On Me
Monte Carlo Condensed Matter
A very brief guide to Monte Carlo simulation.
An explanation of what I do.
A chance for far too many ABBA puns
.......
The above is not XHTML(!)

However, if I run the command below with extractOnly=true
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&extractOnly=true' -F "fi...@mccm.pdf"

I get the result
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title&gt;Take A Chance On Me&lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;div&gt;
.........
which is XHTML output.

My objective is to store it as XHTML in the field and retrieve it as
cached output.
Since Tika is already giving XHTML output, I wonder why Solr saves it as
plain text. (Maybe I missed something in the configuration?)

Also, I will be using SolrJ as the application layer, so as a workaround, if
there is any way to get the XHTML result, maybe I can store it
somewhere else outside of Solr.
Any advice on this will be highly appreciated.
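
To illustrate the workaround idea, here is a rough sketch only, not tested against Solr. It assumes the extractOnly=true response delivers the XHTML as an entity-escaped string, as in the output quoted above, and that the client has already pulled that string out of the response. The helper name unescape_extract_only is hypothetical, and the sample is just the truncated fragment from this message.

```python
import html

def unescape_extract_only(escaped: str) -> str:
    """Turn entity-escaped text (&lt; &gt; &amp; ...) back into markup.

    Hypothetical helper: it assumes the caller already has the escaped
    XHTML string taken from the extractOnly response.
    """
    return html.unescape(escaped)

# Truncated fragment copied from the extractOnly output in this message;
# the rest of the document is not reproduced here.
sample = (
    '&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n'
    '&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;\n'
    '    &lt;head&gt;\n'
    '        &lt;title&gt;Take A Chance On Me&lt;/title&gt;\n'
    '    &lt;/head&gt;\n'
)

xhtml = unescape_extract_only(sample)
print(xhtml)
```

The unescaped string could then be kept in a store outside of Solr, keyed by the same value passed as literal.url.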

 Many Thanks & Kind Regards
Andy
