Hi Daniel,

HTMLStripCharFilterFactory in your index analyzer should do the trick. Besides stripping tags, it also removes the contents of <script> and <style> elements: <https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.HTMLStripCharFilterFactory>
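
As a rough sketch of what that could look like in the schema: the field type name "text_html_stripped" below is made up for illustration, and the tokenizer and lowercase filter are only placeholder choices, so adapt them to whatever analyzer the field that receives the page text currently uses. This also assumes the <script> and <style> markup is still present in the value that reaches the field; if the tags have already been removed during extraction, the char filter has nothing to match.

```xml
<!-- Hypothetical field type; the name and the analysis chain are assumptions, not taken from your schema. -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters run before the tokenizer; this one drops markup plus the contents of script and style elements. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Query text is not HTML, so no char filter is applied on this side. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a char filter only changes the indexed terms. The stored value of the field, which is what comes back in search results, is left as it was sent in.
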
--
Steve
www.lucidworks.com

> On Aug 10, 2017, at 4:13 AM, Daniel von der Helm <d.vonderh...@neumueller.com> wrote:
>
> Hi,
> if a fetched HTML page (using SimplePostTool: -Ddata=web) contains <script>
> and <style> tags inside the <body> tag (not in the <head> tag), the innerText
> (i.e. ECMA/JS scripts and CSS styles) of these tags remains as part of the
> document text inside the "content"/"_text_" field in indexed documents.
>
> So when I search in _text_ for "push(arguments)", for example, I get a result :(
>
> Any idea how to remove this unwanted content?
>
> Using: Solr 6.6.0.
>
> Solrconfig.xml:
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>     <str name="captureAttr">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">plaintext</str>
>   </lst>
> </requestHandler>
>
> Thanks in advance
> Daniel