Hi,
I have uploaded a sample EML file here:
https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
This is what is indexed in the content:
"content":" font-size: 14pt; font-family: book antiqua,
palatino, serif; Hi There, <br><br> font-size: 14pt; font-family:
book antiqua, palatino, serif; My client owns the domain name “
font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
TravelInsuranceEurope.com font-size: 14pt; font-family: book
antiqua, palatino, serif; ” and is considering putting it in market.
It is keyword rich domain with good search volume,adword bidding and
type-in-traffic. <br><br> font-size: 14pt; font-family: book
antiqua, palatino, serif; Based on our extensive study, we strongly
feel that you should consider buying this domain name to improve the
SEO, Online visibility, brand image, authority and type-in-traffic for
your business. We also do provide free 1 year hosting and unlimited
emails along with domain name. <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif; Besides this, if you need
any other domain name, web and app designing services and digital
marketing services (SEO, PPC and SMO) at reasonable charges, feel free
to contact us. <br><br> font-size: 14pt; font-family: book antiqua,
palatino, serif; Best Regards, <br><br> font-size: 14pt;
font-family: book antiqua, palatino, serif; Josh <br><br>",
As you can see, this is taken from the Content-Type: text/html part.
The Content-Type: text/plain part, however, is clean, and that is what we
want indexed.
How can we configure Tika to prefer the Content-Type: text/plain part
over the Content-Type: text/html part?
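
To make the desired behaviour concrete, here is a minimal sketch of picking the text/plain part first and falling back to text/html only when no plain part exists. It is not a Tika or Solr setting; it assumes the EML is pre-processed outside Tika with Python's standard email module. The function name extract_preferred_text and the SAMPLE_EML message are hypothetical and exist only for illustration.

```python
from email import message_from_bytes, policy


def extract_preferred_text(raw_eml: bytes) -> str:
    """Return the text/plain body of an EML message, else the text/html body."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # Try the plain part first; use the HTML part only as a fallback.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    return part.get_content()


# Hypothetical two-part message, standing in for the uploaded sample.
SAMPLE_EML = (
    b"MIME-Version: 1.0\r\n"
    b"Subject: Domain offer\r\n"
    b'Content-Type: multipart/alternative; boundary="sep"\r\n'
    b"\r\n"
    b"--sep\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Hi There,\r\n"
    b"--sep\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b'<p style="font-size: 14pt;">Hi There,</p>\r\n'
    b"--sep--\r\n"
)

if __name__ == "__main__":
    print(extract_preferred_text(SAMPLE_EML))
```

The returned string could then be sent to Solr as a regular field value instead of passing the raw EML through the extracting handler, though that is a workaround rather than the Tika configuration asked about here.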
Regards,
Edwin
On Mon, 14 Jan 2019 at 11:30, Zheng Lin Edwin Yeo <[email protected]>
wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files,
> whereby the content is being extracted from the Content-type=text/html
> instead of Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains a lot of
> strings like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
> all of these get indexed in Solr as well. This makes the content very
> cluttered and also affects search: when we search for a word like "font",
> all of these documents are returned because of it.
>
> I would like to ask the following:
> 1. Why did Tika not pick the text part (text/plain)? Is there any way to
> configure Tika in Solr to prefer the text part (text/plain) over the
> HTML part (text/html)?
> 2. If that is not possible, the extracted content is, as you can see,
> not clean. How can we get clean text when Tika extracts it?
>
> Regards,
> Edwin
>
>