Re: Metadata and HTML ending up in searchable text

2016-06-02 Thread Simon Blandford
I have investigated different Solr versions and found that 4.10.3 is the last version that completely strips the HTML to text as expected. 4.10.4 starts introducing some HTML comments and JavaScript, and anything over 5.0 is full of mangled HTML and attribute artefacts such as

Re: Metadata and HTML ending up in searchable text

2016-06-01 Thread Simon Blandford
Thanks Timothy,
Will give the DIH a try. I have submitted a bug report.
Regards,
Simon
On 31/05/16 13:22, Allison, Timothy B. wrote:
>> From the same page, extractFormat=text only applies when extractOnly
>> is true, which just shows the output from tika without indexing the document.
> Y, sorry.

RE: Metadata and HTML ending up in searchable text

2016-05-31 Thread Allison, Timothy B.
>> From the same page, extractFormat=text only applies when extractOnly
>> is true, which just shows the output from tika without indexing the document.
Y, sorry. I just looked through the source code. You're right. If you use DIH (TikaEntityProcessor) instead of Solr Cell
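
The extractOnly point above can be illustrated with a request URL. This is a minimal sketch, assuming a local Solr at http://localhost:8983/solr and a core named "mycore" (both hypothetical, not taken from the thread) and the conventional /update/extract handler path. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def extract_only_url(base_url: str, core: str) -> str:
    """Build a Solr Cell URL that returns Tika's output without indexing."""
    params = {
        "extractOnly": "true",    # show Tika's output, do not index
        "extractFormat": "text",  # only honoured when extractOnly is true
        "wt": "json",
    }
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = extract_only_url("http://localhost:8983/solr", "mycore")
print(url)
```

Posting a file to this URL (for example with curl -F) would show what Tika extracts, which may help separate Tika's behaviour from whatever Solr does at index time.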

Re: Metadata and HTML ending up in searchable text

2016-05-31 Thread Simon Blandford
Hi Alex, That sounds similar. I am puzzled by what I am seeing because it looks like a major bug and I am following the docs for curl as closely as possible, but hardly anyone else seems to have noticed it. To me it is a show-stopper. If I convert the docs to txt with html2text first then I
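
The workaround of converting documents to plain text before indexing can be sketched without html2text. This is a rough stand-in using only the Python standard library, not the html2text tool mentioned above, and the names TextOnly and strip_html are hypothetical. It drops tags, comments, and the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup: str) -> str:
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>Hello <b>world</b></p><script>var a = 1;</script>"))
```

Because text nodes are joined with spaces, a word split across inline tags would come out as two words; a real converter handles that more carefully.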

Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Alexandre Rafalovitch
I think Solr's layer above Tika was merging in metadata and text all together without a way (that I could see) to separate them. That's all I remember of my examination of this issue when I ran into something similar. Not very helpful, I know. Regards, Alex.

Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Simon Blandford
Hi Timothy, Thanks for responding.
java -jar tika-app-1.13.jar -t "/home/user/Documents/library/UsingMailingLists.txt"
...gives a clean result with no CSS or other nasties in the output. So it looks like the latest version of tika itself is OK. I was basing the test case on this doc page as

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
Of course, for greater control over indexing (and for more robust handling of exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ: http://searchhub.org/2012/02/14/indexing-with-solrj/
-----Original Message-----
From: Simon Blandford

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According to [0]...the default is xhtml which will include a bunch of the metadata.
2) Is there an attr_* dynamic field in your index with type="ignored"? This would strip out the attr_
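
The second question concerns how unknown Tika metadata fields are routed at index time. This is a minimal sketch of the indexing parameters involved, assuming the Solr Cell parameters literal.id and uprefix and a hypothetical document id "doc1". Whether attr_* fields are then kept or discarded depends on how the attr_* dynamic field is defined in the schema, which this sketch does not inspect.

```python
from urllib.parse import urlencode

# Index-time parameters for Solr Cell (values are illustrative).
index_params = {
    "literal.id": "doc1",  # hypothetical unique key for the document
    "uprefix": "attr_",    # prefix unknown metadata fields so they match attr_*
    "commit": "true",
}

query_string = urlencode(index_params)
print(query_string)
```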