Of course, for greater control over indexing, and for more robust handling of the exceedingly rare but real infinite loops and OOM errors caused by Tika, consider SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/

-----Original Message-----
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. The JSON output of a search result shows the indexed text starting with...

body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By...."

And then, once it gets to the actual text, I get CSS class names appearing that were in <p> or <div> tags etc., e.g. "....the power of calibre3 silence calibre2 and....", where "calibre3" etc. are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching for.

Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command:

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command:

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: http://localhost:8983/solr/mycore/select?q=especially&wt=json

Result:

For the txt file, I get the following JSON for the document...

{
  id: "doc1",
  attr_stream_size: [ "8107" ],
  attr_x_parsed_by: [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.txt.TXTParser"
  ],
  attr_stream_content_type: [ "text/plain" ],
  attr_stream_name: [ "UsingMailingLists.txt" ],
  attr_stream_source_info: [ "content/UsingMailingLists.txt" ],
  attr_content_encoding: [ "ISO-8859-1" ],
  attr_content_type: [ "text/plain; charset=ISO-8859-1" ],
  body_txt_en: " stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
  _version_: 1535398235801124900
}

For the HTML file, I get the following JSON for the document...

{
  id: "doc2",
  attr_stream_size: [ "20440" ],
  attr_x_parsed_by: [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.html.HtmlParser"
  ],
  attr_stream_content_type: [ "text/html" ],
  attr_stream_name: [ "UsingMailingLists.html" ],
  attr_stream_source_info: [ "content/UsingMailingLists.html" ],
  attr_dc_title: [ "UsingMailingLists - Solr Wiki" ],
  attr_content_encoding: [ "UTF-8" ],
  attr_robots: [ "index,nofollow" ],
  attr_title: [ "UsingMailingLists - Solr Wiki" ],
  attr_content_type: [ "text/html; charset=utf-8" ],
  body_txt_en: " stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8 projection /wiki/modernized/css/projection.css alternate Solr Wiki: UsingMailingLists /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 application/rss+xml Start /solr/FrontPage Alternate Wiki Markup /solr/UsingMailingLists?action=raw Alternate print Print View /solr/UsingMailingLists?action=print Search /solr/FindPage Index /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html stream_name UsingMailingLists.html stream_source_info...etc",
  _version_: 1535398408383103000
}
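
For anyone trying to reproduce the report above, here is a minimal Python sketch (not part of the original message) that rebuilds the extract URL from steps 4 and 5 and checks a returned body string for the Tika metadata keys seen in the results. The host, core name, and request parameters are copied from the message; the function names `build_extract_url` and `find_metadata_markers` and the marker list are hypothetical helpers chosen for illustration, and the check is a plain substring match, so it can misfire on documents that legitimately contain those words.

```python
from urllib.parse import urlencode

# Host and core name as given in the steps of the message above.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"

# Metadata keys that show up inside body_txt_en in the reported results.
# This list is an assumption based only on the output quoted above.
METADATA_MARKERS = (
    "stream_size",
    "X-Parsed-By",
    "stream_content_type",
    "stream_name",
    "stream_source_info",
)


def build_extract_url(doc_id, content_field="body_txt_en", prefix="attr_"):
    """Build the /update/extract URL with the parameters used in steps 4 and 5."""
    params = {
        "literal.id": doc_id,           # unique key for the document
        "uprefix": prefix,              # prefix for fields not in the schema
        "fmap.content": content_field,  # map extracted content to this field
        "commit": "true",
    }
    return SOLR_CORE_URL + "/update/extract?" + urlencode(params)


def find_metadata_markers(body_text):
    """Return the known metadata keys that occur in an indexed body string."""
    return [marker for marker in METADATA_MARKERS if marker in body_text]


if __name__ == "__main__":
    print(build_extract_url("doc1"))
    sample = " stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser"
    print(find_metadata_markers(sample))
```

Running `find_metadata_markers` over the `body_txt_en` value of each search result gives a quick way to tell whether a configuration change has stopped the metadata from leaking into the searchable text.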