Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-19 Thread Zheng Lin Edwin Yeo
Ok, thanks for providing the information. Regards, Edwin On Fri, 18 Jan 2019 at 00:33, Tim Allison wrote: > Y, I tracked this down within Solr. This is a feature, not a bug. I > found a solution (set {{captureAttr}} to {{true}}): > >

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-17 Thread Tim Allison
Y, I tracked this down within Solr. This is a feature, not a bug. I found a solution (set {{captureAttr}} to {{true}}): https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263 Please, though, for
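
For readers who want to try the captureAttr suggestion, here is a minimal sketch of building a request to Solr's ExtractingRequestHandler (the /update/extract endpoint) with captureAttr set to true. The host, the collection name "mail", and the document id are assumptions made for illustration; none of them come from the thread. The post_eml helper is likewise hypothetical and is not called here, since it needs a running Solr instance.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, collection, doc_id, capture_attr=True):
    """Build a URL for Solr's /update/extract handler.

    base_url, collection and doc_id are placeholders chosen for this sketch.
    """
    params = {
        "literal.id": doc_id,  # unique key for the indexed document
        "captureAttr": "true" if capture_attr else "false",
        "commit": "true",
    }
    return f"{base_url.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


def post_eml(url, path):
    """Hypothetical helper: POST the raw EML bytes to the URL built above."""
    with open(path, "rb") as fh:
        req = Request(url, data=fh.read(),
                      headers={"Content-Type": "message/rfc822"},
                      method="POST")
    with urlopen(req) as resp:
        return resp.status


# Assumed local Solr instance and collection name.
url = build_extract_url("http://localhost:8983/solr", "mail", "sample-eml-1")
print(url)
```

The same parameter can also be set as a default on the request handler in solrconfig.xml instead of being passed with every request.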

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-16 Thread Zheng Lin Edwin Yeo
Based on the discussion in Tika and also on the Jira (TIKA-2814), it was said that the issue could be with Solr's ExtractingRequestHandler, in which the HTMLParser is either not being applied, or is somehow not stripping the content of elements. Straight Tika app is able to do the right

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Zheng Lin Edwin Yeo
Hi Alex, Thanks for the suggestions. Yes, I have posted it in the Tika mailing list too. Regards, Edwin On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch wrote: > I think asking this question on Tika mailing list may give you better > answers. Then, if the conclusion is that the behavior is

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Terry Steichen
Using 6.6.0, I am able to index EML files just fine. The trick is, when indexing files containing .eml, add "-filetypes eml" to the command line (note the plural filetypes). Terry Steichen On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote: > Hi, > > I am using Solr 7.5.0 with Tika 1.18. > >

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Alexandre Rafalovitch
I think asking this question on Tika mailing list may give you better answers. Then, if the conclusion is that the behavior is configurable, you can see how to do it in Solr. It may be, however, that you need to do the parsing outside of Solr with standalone Tika. Standalone Tika is a production

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-13 Thread Zheng Lin Edwin Yeo
Hi, I have uploaded a sample EML file here: https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing This is what is indexed in the content: "content":" font-size: 14pt; font-family: book antiqua, palatino, serif; Hi There,font-size: 14pt; font-family:

Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-13 Thread Zheng Lin Edwin Yeo
Hi, I am using Solr 7.5.0 with Tika 1.18. Currently I am facing a situation during the indexing of EML files, whereby the content is being extracted from the Content-type=text/html instead of Content-type=text/plain. The problem with Content-type=text/html is that it contains a lot of words like
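
To illustrate the text/plain versus text/html distinction described above, here is a small sketch that uses Python's standard email package (not Tika or Solr) to pick the text/plain alternative from a multipart message. The sample message is made up for illustration and only loosely mirrors the styled "Hi There," content quoted in the thread; it is not the uploaded sample file.

```python
from email import message_from_string, policy

# Made-up multipart/alternative message with a plain and an HTML version.
RAW = """\
From: a@example.com
To: b@example.com
Subject: Test
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="utf-8"

Hi There,

--BOUND
Content-Type: text/html; charset="utf-8"

<p style="font-size: 14pt; font-family: book antiqua, palatino, serif;">Hi There,</p>

--BOUND--
"""


def preferred_body(raw):
    """Return the body part, preferring text/plain over text/html."""
    msg = message_from_string(raw, policy=policy.default)
    return msg.get_body(preferencelist=("plain", "html"))


part = preferred_body(RAW)
text = part.get_content() if part is not None else ""
print(part.get_content_type())
print(text)
```

Extracting the plain-text part before indexing is one way to do the parsing outside of Solr; whether it suits a given pipeline depends on how many messages carry only an HTML body.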