[jira] [Closed] (JAMES-2910) HTML could be indexed directly in ElasticSearch

Benoit Tellier (Jira) Mon, 14 Oct 2019 20:23:27 -0700


     [ 
https://issues.apache.org/jira/browse/JAMES-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Benoit Tellier closed JAMES-2910.
---------------------------------
    Resolution: Fixed

https://github.com/linagora/james-project/pull/2739 solved this

> HTML could be indexed directly in ElasticSearch
> -----------------------------------------------
>
>                 Key: JAMES-2910
>                 URL: https://issues.apache.org/jira/browse/JAMES-2910
>             Project: James Server
>          Issue Type: Improvement
>          Components: elasticsearch, guice
>    Affects Versions: 3.4.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.5.0
>
>
> When Tika is disabled, the DefaultTextExtractor is used, which does not perform 
> HTML text extraction.
> This reduces search precision in that configuration (the index is polluted by 
> HTML markup) and also inflates the index size considerably.
> Proposal:
> CassandraGuice should default to JsoupTextExtractor when Tika is disabled.
> This will allow HTML text extraction to actually happen.
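
To illustrate the proposal above, here is a minimal, stdlib-only sketch of the fallback selection. It is not the James implementation: `SimpleTextExtractor`, `PASS_THROUGH`, `HTML_AWARE` and `chooseExtractor` are hypothetical names, and the regex-based tag stripping is only a naive stand-in for what a Jsoup-based extractor would do with a real HTML parser.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a text extractor contract.
// This is NOT the actual James TextExtractor interface.
interface SimpleTextExtractor {
    String extract(String content, String contentType);
}

class TextExtractionSketch {
    // Matches whole <script>...</script> and <style>...</style> elements.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Mirrors the behaviour reported in the ticket: the content is handed
    // to the index unchanged, markup included.
    static final SimpleTextExtractor PASS_THROUGH = (content, contentType) -> content;

    // Naive stand-in for Jsoup-based extraction (assumption: regexes are good
    // enough for a demo). Strips markup from text/html content only and does
    // not decode HTML entities.
    static final SimpleTextExtractor HTML_AWARE = (content, contentType) -> {
        if (contentType == null
                || !contentType.toLowerCase(Locale.ROOT).startsWith("text/html")) {
            return content;
        }
        String withoutScripts = SCRIPT_OR_STYLE.matcher(content).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    };

    // The selection rule proposed in the ticket: when Tika is disabled,
    // fall back to an HTML-aware extractor rather than the pass-through one.
    static SimpleTextExtractor chooseExtractor(boolean tikaEnabled,
                                               SimpleTextExtractor tikaExtractor) {
        return tikaEnabled ? tikaExtractor : HTML_AWARE;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>James</b></p></body></html>";
        // Pass-through: the markup itself would end up in the index.
        System.out.println(PASS_THROUGH.extract(html, "text/html"));
        // Tika disabled: the HTML-aware fallback is chosen; prints "Hello James".
        System.out.println(chooseExtractor(false, PASS_THROUGH).extract(html, "text/html"));
    }
}
```

With the pass-through extractor, tag names and attributes become indexed terms, which is the index pollution and size growth described in the ticket. The fallback keeps only the visible text.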



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

