One problem is that your profanity list never stops growing, and with
each new word you will want to re-scrub the index.

We had a thousand-word NOT clause in every query (a filter query that
would be true for 99% of the index) until we switched to another
arrangement.

Another small problem was that I knew of many more perversions than my
co-workers, but did not wish to display my vast erudition in the
seamier side of life :)

On Fri, Feb 12, 2010 at 4:26 PM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:
>
> : Otherwise, I'd do it via copy fields.  Your first field is your main
> : field and is analyzed as before.  Your second field does the profanity
> : detection and simply outputs a single token at the end, safe/unsafe.
>
> you don't even need custom code for this ... copyField all your text into
> a 'has_profanity' field where you use a suitable Tokenizer followed by the
> KeepWordsTokenFilter that only keeps profane words and then a
> PatternReplaceTokenFilter that matches .* and replaces it with "HELL_YEA"
> ... now a search for "has_profanity:HELL_YEA" finds all profane docs, with
> the added bonus that the scores are based on how many profane words occur
> in the doc.
>
> it could be used as a filter query (probably negated) as needed.
>
>
>
> -Hoss
>
>
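For anyone wanting to try Hoss's suggestion, a minimal schema.xml sketch of that analysis chain might look like the following. The source field name ("text"), the word-list filename, and the tokenizer choice are assumptions, not anything from this thread:

```xml
<!-- Illustrative sketch only: field/file names are made up for this example. -->
<fieldType name="profanity_flag" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep only tokens that appear in the profanity word list
         (one word per line in profanity.txt). -->
    <filter class="solr.KeepWordFilterFactory" words="profanity.txt"
            ignoreCase="true"/>
    <!-- Collapse every surviving token to a single marker value. -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*"
            replacement="HELL_YEA"/>
  </analyzer>
</fieldType>

<field name="has_profanity" type="profanity_flag" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="text" dest="has_profanity"/>
```

A query of has_profanity:HELL_YEA then matches any document containing at least one listed word, or fq=-has_profanity:HELL_YEA filters them out. Note the caveat at the top of this message still applies: growing the word list means reindexing.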



-- 
Lance Norskog
goks...@gmail.com
