RE: Metadata and HTML ending up in searchable text

Allison, Timothy B. Fri, 27 May 2016 05:33:38 -0700

Of course, for greater control over indexing (and for more robust handling of 
the exceedingly rare but real infinite loops and OOMs that Tika can cause), 
consider SolrJ:

http://searchhub.org/2012/02/14/indexing-with-solrj/

-----Original Message-----
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net] 
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. The JSON output of a 
search result shows the indexed text starting with:

body_txt_en: " stream_size 36499 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By...."

And then once it gets to the actual text I get CSS class names appearing that 
were in <p> or <div> tags etc.
e.g. "....the power of calibre3 silence calibre2 and....", where "calibre3" etc 
are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching 
for.

Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command:

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command:

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: 
http://localhost:8983/solr/mycore/select?q=especially&wt=json
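
Steps 4 and 5 differ only in the document id and the file, so they can be 
wrapped in a small script. This is a minimal sketch, assuming the same core 
(mycore) and host as above; SOLR_URL, extract_url and index_file are 
hypothetical names chosen for illustration, not part of Solr. It does not 
change what Solr indexes, it only makes the reproduction easier to repeat.

```shell
#!/bin/sh
# Sketch only: wraps the curl calls from steps 4 and 5.
# SOLR_URL and the helper names are assumptions for illustration.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/mycore}"

# Build the /update/extract URL for a given document id ($1),
# using the same parameters as in the steps above.
extract_url() {
    printf '%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=body_txt_en&commit=true' \
        "$SOLR_URL" "$1"
}

# Post one file to the extracting request handler.
# $1 = document id, $2 = path to the file
index_file() {
    curl "$(extract_url "$1")" -F "content/$(basename "$2")=@$2"
}

# Usage (requires a running Solr):
#   index_file doc1 /home/user/Documents/library/UsingMailingLists.txt
#   index_file doc2 /home/user/Documents/library/UsingMailingLists.html
```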

Result:

For the txt file, I get the following JSON for the document...

{
     id: "doc1",
     attr_stream_size: [
         "8107"
     ],
     attr_x_parsed_by: [
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.txt.TXTParser"
     ],
     attr_stream_content_type: [
         "text/plain"
     ],
     attr_stream_name: [
         "UsingMailingLists.txt"
     ],
     attr_stream_source_info: [
         "content/UsingMailingLists.txt"
     ],
     attr_content_encoding: [
         "ISO-8859-1"
     ],
     attr_content_type: [
         "text/plain; charset=ISO-8859-1"
     ],
     body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By 
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name 
UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt 
Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: 
[value ] [Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** * 
FrontPage * RecentChanges...etc",
     _version_: 1535398235801124900
}

For the HTML file, I get the following JSON for the document...

{
     id: "doc2",
     attr_stream_size: [
         "20440"
     ],
     attr_x_parsed_by: [
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.html.HtmlParser"
     ],
     attr_stream_content_type: [
         "text/html"
     ],
     attr_stream_name: [
         "UsingMailingLists.html"
     ],
     attr_stream_source_info: [
         "content/UsingMailingLists.html"
     ],
     attr_dc_title: [
         "UsingMailingLists - Solr Wiki"
     ],
     attr_content_encoding: [
         "UTF-8"
     ],
     attr_robots: [
         "index,nofollow"
     ],
     attr_title: [
         "UsingMailingLists - Solr Wiki"
     ],
     attr_content_type: [
         "text/html; charset=utf-8"
     ],
     body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki: 
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type 
text/html stream_name UsingMailingLists.html stream_source_info...etc",
     _version_: 1535398408383103000
}
