Re: charfilter doesn't do anything

Andreas Owen Mon, 09 Sep 2013 14:29:13 -0700

I index HTML pages that have many lines, not just a single string containing 
the body tag. It doesn't work with proper HTML files, even though I took all 
the newlines out.


html-file (the German sample text means "I only want to see this"):
<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:
"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; 
charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will 
ich sehenfooter-content</body></html>"]
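
A minimal sketch of one way the pattern can end up doing nothing on multi-line input. It uses Python's re module as a stand-in for the Java regex engine that Solr uses, so treat it as an illustration only; the sample strings and the keep_body helper are made up for this example. In both engines "." does not match "\n" by default, so "^.*<body>" cannot reach a <body> tag that sits on a later line unless the DOTALL flag, written (?s) at the start of the pattern, is set.

```python
import re

# Pattern from the quoted fieldType, with the XML entities decoded.
PATTERN = r"^.*<body>(.*)</body>.*$"

# Hypothetical inputs: one single-line document, one with CRLF line breaks.
single_line = "abc <body>A test.</body> def"
multi_line = "<html>\r\n<title></title>\r\n<body>only this</body></html>"


def keep_body(text, pattern):
    """Replace the whole match with capture group 1.

    If the pattern does not match, the text comes back unchanged,
    which looks like the char filter "doing nothing".
    """
    return re.sub(pattern, r"\1", text)


# Single line: the pattern matches and only the body text remains.
one = keep_body(single_line, PATTERN)

# Multiple lines, no flag: "." stops at "\n", so nothing matches.
unchanged = keep_body(multi_line, PATTERN)

# Same input with the inline DOTALL flag: "." now crosses line breaks.
dotall = keep_body(multi_line, "(?s)" + PATTERN)

print(repr(one))
print(repr(unchanged))
print(repr(dotall))
```

Two things this sketch does not cover: in the update debug output above, nav-content and footer-content already appear inside <body>, so even a matching pattern would keep them; and, as Shawn notes in the quoted message below, the analysis chain only changes what is indexed, not the stored text.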



On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example in 
> 4.4:
> 
> <field name="body" type="text_html_body" indexed="true" stored="true" />
> 
> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> That char filter retains only text between <body> and </body>. Is that what 
> you wanted?
> 
> Indexing this data:
> 
> curl 'localhost:8983/solr/update?commit=true' \
>   -H 'Content-type:application/json' \
>   -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
> 
> And querying with these commands:
> 
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
> Shows all data.
> 
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
> Shows the body text.
> 
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
> Shows nothing (outside of body).
> 
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
> Shows nothing (outside of body).
> 
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
> Shows nothing (HTML tag stripped).
> 
> In your original query, you didn't show us what your default field (the df 
> parameter) was.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> Yes, but that strips all HTML rather than extracting the specific tag I want.
> 
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
> 
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>> 
>> Not quite the <body>, perhaps, but might it help?
>> 
>> 
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>> 
>>> OK, I have HTML pages of the form <html>.....<!--body-->content I
>>> want....<!--/body-->.....</html>. I want to extract (index and store) only
>>> the part between the body comments. I thought RegexTransformer would be
>>> the best option because XPath doesn't work in Tika and I can't nest an
>>> XPathEntityProcessor to use XPath. I have also found that Tika's HTML
>>> parser cuts my body comments out and tries to produce well-formed HTML,
>>> which I would like to switch off.
>>> 
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>> 
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> I've managed to get it working when I use the RegexTransformer and the
>>>>> string is on a single line in my Tika entity. But when the string spans
>>>>> multiple lines it doesn't work, even though I tried ?s to set the DOTALL
>>>>> flag.
>>>>> 
>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity"
>>>>>         format="html" transformer="RegexTransformer">
>>>>>    <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>           replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>> </entity>
>>>>> 
>>>>> Then I tried it like this, and I get a stack overflow:
>>>>> 
>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>> 
>>>>> In JavaScript this works, but maybe only because I used a small string.
>>>> 
>>>> Sounds like we've got an XY problem here.
>>>> 
>>>> http://people.apache.org/~hossman/#xyproblem
>>>> 
>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>> and then we can find a solution for you?
>>>> 
>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>> 
>>>> 
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>> be able to search for individual words on your HTML input.  The entire
>>>> input string is treated as a single token, and therefore ONLY exact
>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>> 
>>>> 
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>> 
>>>> Note that no matter what you do to your data with the analysis chain,
>>>> Solr will always return the text that was originally indexed in search
>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>> need an Update Processor.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>
