I index full HTML pages with many lines, not just a single string containing the body tag. It doesn't work with proper HTML files, even though I took all the newlines out.

HTML file (the body text is German for "this is all I want to see"):

<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

Solr update debug output:

"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example in
> 4.4:
>
> <field name="body" type="text_html_body" indexed="true" stored="true" />
>
> <fieldType name="text_html_body" class="solr.TextField"
>     positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>         pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> That char filter retains only text between <body> and </body>. Is that what
> you wanted?
>
> Indexing this data:
>
> curl 'localhost:8983/solr/update?commit=true' \
>   -H 'Content-type:application/json' -d '
> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>
> And querying with these commands:
>
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
> Shows all data
>
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
> Shows the body text
>
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
> Shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
> Shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
> Shows nothing, HTML tag stripped
>
> In your original query, you didn't show us what your default field (the df
> parameter) was.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> Yes, but that filters all HTML and not the specific tag I want.
>
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>
>> Not quite the <body>, perhaps, but might it help?
>>
>>
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>
>>> OK, I have HTML pages with <html>.....<!--body-->content i
>>> want....<!--/body-->.....</html>. I want to extract (index, store) only
>>> what is between the body comments. I thought RegexTransformer would be
>>> best because XPath doesn't work in Tika and I can't nest an
>>> XPathEntityProcessor to use XPath. I have also found that Tika's HTML
>>> parser cuts my body comments out and tries to produce well-formed HTML,
>>> which I would like to switch off.
>>>
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> I've managed to get it working if I use the RegexTransformer and the
>>>>> string is on a single line in my Tika entity. But when the string spans
>>>>> multiple lines it doesn't work, even though I tried ?s to set the
>>>>> DOTALL flag.
>>>>>
>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>     dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>     transformer="RegexTransformer">
>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>       replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>> </entity>
>>>>>
>>>>> Then I tried it like this and I get a stack overflow:
>>>>>
>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>     replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>
>>>>> In JavaScript this works, but maybe only because I used a small string.
>>>>
>>>> Sounds like we've got an XY problem here.
>>>>
>>>> http://people.apache.org/~hossman/#xyproblem
>>>>
>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>> and then we can find a solution for you?
>>>>
>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>> tags out. Perhaps the HTMLStripCharFilter?
>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>> be able to search for individual words on your HTML input. The entire
>>>> input string is treated as a single token, and therefore ONLY exact
>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>
>>>> Note that no matter what you do to your data with the analysis chain,
>>>> Solr will always return the text that was originally indexed in search
>>>> results. If you need to affect what gets stored as well, perhaps you
>>>> need an Update Processor.
>>>>
>>>> Thanks,
>>>> Shawn
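
A note on the multi-line case discussed in this thread: the `^.*<body>(.*)</body>.*$` pattern only matches when the whole input sits on one line, because `.` does not match a newline unless the DOTALL flag is on. The sketch below illustrates that behaviour. It uses Python's `re` module as a stand-in for the Java regex engine that Solr uses. That is an assumption made purely for illustration: the two engines agree that `.` skips `\n` by default and that `(?s)` turns DOTALL on, but they differ in other details. The sample page and the `extract_body` helper are hypothetical, not taken from the thread.

```python
import re

# Hypothetical multi-line page, loosely modelled on the example in the thread.
PAGE = (
    "<html>\r\nnav-content\r\n"
    "<body>\r\nonly this\r\nshould be indexed\r\n</body>\r\n"
    "footer-content\r\n</html>"
)

# Pattern as quoted in the thread: '.' cannot cross a newline here.
SINGLE_LINE = re.compile(r"^.*<body>(.*)</body>.*$")

# Same pattern with the inline (?s) flag, so '.' also matches line breaks.
DOTALL = re.compile(r"(?s)^.*<body>(.*)</body>.*$")


def extract_body(pattern, text):
    """Return the text captured between <body> and </body>, or None if no match."""
    match = pattern.match(text)
    return match.group(1) if match else None


one_line = "abc <body>A test.</body> def"
print(extract_body(SINGLE_LINE, one_line))  # A test.
print(extract_body(SINGLE_LINE, PAGE))      # None: the pattern never gets past line 1
print(extract_body(DOTALL, PAGE))           # the body text, line breaks included
```

In a Solr charFilter the equivalent change would presumably be to prefix the pattern with `(?s)`. This only addresses the regex. It does not change what Tika hands over (in the debug output above, the nav and footer text already sit inside `<body>`), and as Shawn points out, the analysis chain does not affect the stored value.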