Ok, thanks for providing the information.
Regards,
Edwin
On Fri, 18 Jan 2019 at 00:33, Tim Allison wrote:
> Y, I tracked this down within Solr. This is a feature, not a bug. I
> found a solution (set {{captureAttr}} to {{true}}):
>
>
Y, I tracked this down within Solr. This is a feature, not a bug. I
found a solution (set {{captureAttr}} to {{true}}):
https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
Please, though, for
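For readers landing on this thread later, here is a rough sketch of how the captureAttr setting mentioned above might be passed as a request parameter when posting a file to Solr's ExtractingRequestHandler. The base URL, the core name "mycore", the document id, and the /update/extract handler path are all assumptions for illustration; adjust them to your own setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, capture_attr=True):
    """Build an extract request URL; base_url, core and doc_id are placeholders."""
    params = {
        "literal.id": doc_id,
        # captureAttr is the setting named in this thread
        "captureAttr": "true" if capture_attr else "false",
        "commit": "true",
    }
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and id, for illustration only
    print(build_extract_url("http://localhost:8983/solr", "mycore", "sample-eml-1"))
```

The same parameter could presumably also be set as a default on the handler in solrconfig.xml instead of on every request.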
Based on the discussion in Tika and also on the Jira (TIKA-2814), it was
said that the issue could be with Solr's ExtractingRequestHandler, in
which the HTMLParser is either not being applied, or is somehow not
stripping the content of elements. Straight Tika app is able to do
the right
Hi Alex,
Thanks for the suggestions.
Yes, I have posted it in the Tika mailing list too.
Regards,
Edwin
On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch
wrote:
> I think asking this question on Tika mailing list may give you better
> answers. Then, if the conclusion is that the behavior is
Using 6.6.0, I am able to index EML files just fine. The trick is, when
indexing files containing .eml, add "-filetypes eml" to the command line
(note the plural filetypes).
Terry Steichen
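As a small illustration of the tip above, this sketch assembles a bin/post command line with the -filetypes option. The collection name "mycore" and the path "/data/mail" are made-up placeholders, and the script only prints the command rather than running it.

```python
import shlex


def post_command(collection, path, filetypes=("eml",)):
    """Return the argv for Solr's bin/post tool; collection and path are placeholders."""
    # -filetypes takes a comma-separated list of extensions (note the plural)
    return ["bin/post", "-c", collection, "-filetypes", ",".join(filetypes), path]


if __name__ == "__main__":
    # Hypothetical collection and directory, for illustration only
    print(shlex.join(post_command("mycore", "/data/mail")))
```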
On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
>
I think asking this question on Tika mailing list may give you better
answers. Then, if the conclusion is that the behavior is configurable,
you can see how to do it in Solr. It may be, however, that you need to
do the parsing outside of Solr with standalone Tika. Standalone Tika
is a production
Hi,
I have uploaded a sample EML file here:
https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
This is what is indexed in the content:
"content":" font-size: 14pt; font-family: book antiqua,
palatino, serif; Hi There,font-size: 14pt; font-family:
Hi,
I am using Solr 7.5.0 with Tika 1.18.
Currently I am facing a situation during the indexing of EML files, whereby
the content is being extracted from the Content-type=text/html instead of
Content-type=text/plain.
The problem with Content-type=text/html is that it contains a lot of words
like
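
To make the text/plain versus text/html distinction concrete, here is a minimal sketch using only Python's standard email module (not Solr or Tika) that picks the text/plain alternative out of a multipart message. The sample message is invented for illustration and is not the EML file from this thread.

```python
from email import message_from_string, policy

# Invented sample message with both a plain-text and an HTML alternative
SAMPLE_EML = """\
From: sender@example.com
To: recipient@example.com
Subject: Test
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Hi There,
--BOUNDARY
Content-Type: text/html; charset="utf-8"

<p style="font-size: 14pt; font-family: book antiqua, palatino, serif;">Hi There,</p>
--BOUNDARY--
"""


def plain_text_body(raw_eml):
    """Return the text/plain body of an EML string, or None if there is none."""
    msg = message_from_string(raw_eml, policy=policy.default)
    # Ask only for the plain-text alternative, ignoring the HTML part
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else None


if __name__ == "__main__":
    print(plain_text_body(SAMPLE_EML))
```

Extracting the plain part this way before indexing would be one possible workaround outside Solr; whether it suits a given pipeline is a separate question.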