Hi,
if a fetched HTML page (using SimplePostTool: -Ddata=web) contains <script> and
<style> tags inside the <body> tag (not in the <head> tag), the inner text of
these tags (i.e. ECMAScript/JS code and CSS styles) remains part of the document
text in the "content"/"_text_" field of the indexed documents.

So when I search in _text_ for "push(arguments)", for example, I get a result :(
Any idea how to remove this unwanted content?
Using: Solr 6.6.0.
solrconfig.xml:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">plaintext</str>
  </lst>
</requestHandler>
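
One stopgap that might work is to strip the <script> and <style> content from
each page before it is sent to Solr. Below is a minimal sketch, assuming
pre-processing outside Solr is acceptable and using only Python's standard
html.parser module. The names ScriptStyleStripper and visible_text are
hypothetical helpers for illustration, not part of Solr or Tika.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Hypothetical helper: collects page text, skipping <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments that should be kept

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML page without script and style content."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    # Made-up sample page that mimics the problem described above.
    page = (
        "<html><body><p>Hello</p>"
        "<script>q.push(arguments);</script>"
        "<style>p { color: red; }</style>"
        "<p>World</p></body></html>"
    )
    print(visible_text(page))
```

This only works around the problem outside Solr; a way to do the same inside
the extraction handler itself would still be preferable.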
Thanks in advance
Daniel
