RE: [PHP] Filter vulger / controversial words - need word source
Don't forget that people are very good at coming up with easy-to-read displays that a computer will have trouble processing. I'll include a lightly sanitized (so I don't get in trouble at work) example below:

- - - - -
W d y f m S h 0 o s e c y n u c * ' k in n t comethorpe!
- - - - -

Most people on this list will have little (if any) trouble reading the above sentence, but find me an algorithm that'll flag it, and you'll have my undying respect.

- Theo

-Original Message-
From: Jon Haworth [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 11, 2002 9:14 AM
To: 'Christopher Raymond'; [EMAIL PROTECTED]
Subject: RE: [PHP] Filter vulger / controversial words - need word source

Hi Christopher,

> I'm wondering if someone has a great source for a master-list
> of controversial and vulger words that I can use on my site.
> I would like to pattern match input text against this master-list
> in order to prevent vulger and controversial words from appearing
> on my site.

Before you spend ages finding a good list, get the routine working.

Once you've got the routine working, post it here, because there are many people who would like to know how to do this properly.

The problems that others have experienced in the past are:
- what happens with "mis"spellings, e.g. "fsck"?
- what happens with dodgy formatting, e.g. "f s c k"?
- what happens with words like "Scunthorpe"?

Additionally, from my experience of the mail content filters we use here at $WORKPLACE, you will also need to be careful not to cause offence by catching people's names. We have a Chinese gentleman as a client with a surname that could be mistaken for an offensive word - he was not best pleased to receive a bounce message telling him that his email hadn't been delivered because he was using profanity.

May I suggest, rather than picking your way through this minefield, you provide a "report abusive comment" link instead?

Cheers
Jon

-- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Filter vulger / controversial words - need word source
> there simply is no definitive list of words

I agree. In Amateur Radio we faced this problem in the UK in the mid-90s on the ax25/tcpip network, when the Home Office made the sysops responsible for content. Filtering messages off for human reading if any words on the "list" occurred was the only practical solution. The sysops gradually built up a "list" and circulated it.

The rules state something like the sysop must "take all reasonable precautions" to prevent offensive material circulating. So a sysop who filtered and read the flagged messages was OK if an offensive message got through, whereas one who did nothing lost their licence.

This still leaves the problem of the English-only speaking sysop getting smtp/nntp in a different language!

Regards
Chris

PS Sorry about this being OT, my one and only post on this topic.
Re: [PHP] Filter vulger / controversial words - need word source
Jason,

> > there simply is no definitive list of words
> The fact is content filtering does not work without a heavy dose of human
> intervention.
> It is quite shocking that large numbers of well known corporations deploy
> misconfigured content-filtering software which rejects perfectly innocent
> email.

Earlier contributions highlight this admirably.

I was at an "Information" show last week where a stand displayed the good work of the EU organisation in the field. I spoke to a Brussels(*) wonk/weenie/suit about such legislation (proposed and awaiting national enactment). I suggested that it would be unfair to 'bring to justice' anyone for apparently offending some employee/user's sensitivities without first defining WHAT would cause offence (eg the requested list of "vulgar words"). Otherwise the first you might know about it is when a court decides (against you) that is unacceptable in polite company.

Accordingly I suggested that his department publish such a list (in all of the languages/cultures of the EU???), but observed that he would have a serious problem distributing it without his own office prosecuting itself! As is to be expected, he failed to see the humor (and failed to see the sense/requirement to do so)...

The joke is on our Indonesian friends @upv.pertamina.co.id whose 'filter' simply bounces messages containing "Dirty Words", because as you say there is no human involvement, so they can't even benefit from Jon's observations. This policy means that every contact/contract with Sc*nthorp that they lose is deservedly so, and an unfortunate advertisement not to use that country for out-sourcing if the culture-gap is so great!?

Summary attitude: hey, I'll code it if you want it/pay me to do so, but what are you going to do when you meet the rest of society as soon as you come off the email system?

NB it has 'always' been illegal to use such language in (British, and many others) phone conversations, but who does that stop/what filters are in place there?

=dn

*for the benefit of more distant members: Brussels is the home of many European Union (EU) offices, and the source of much bureaucratic 'stupidity' such as the legislation mentioned earlier.
RE: [PHP] Filter vulger / controversial words - need word source
> > > if you want a partial list of offensive terms - try looking
> > > at the meta keywords on a few porn sites ...
> >
> > Excellent idea!
> > Unfortunately I'd have to explain that to my boss... "No,
> > really, I'm doing some research..."
>
> but won't your gateway/web server's filter prevent access to
> such sites anyway?

Nope, only the really offensive ones like http://www.thisisscunthorpe.co.uk/index.jsp ;-)

Cheers
Jon
Re: [PHP] Filter vulger / controversial words - need word source
> > if you want a partial list of offensive terms - try looking
> > at the meta keywords on a few porn sites ...
>
> Excellent idea!
> Unfortunately I'd have to explain that to my boss... "No, really, I'm doing
> some research..."

=guess monopolising the color printer for a whole afternoon would give you away, huh?

=but won't your gateway/web server's filter prevent access to such sites anyway?

=dn
Re: [PHP] Filter vulger / controversial words - need word source
On Wednesday 11 December 2002 23:12, Sean Burlington wrote:

> > there simply is no definitive list of words
>
> you can't stop people talking about pussy cats, turkey breasts or even
> shag pile carpets (and words have different meanings from one place to
> the next)
>
> automated filters can be a useful aid to human moderation - flagging up
> messages for review.
>
> if you want a partial list of offensive terms - try looking at the
> meta keywords on a few porn sites ...

This last paragraph is about the most useful comment on this whole subject :)

The fact is content filtering does not work without a heavy dose of human intervention.

It is quite shocking that large numbers of well known corporations deploy misconfigured content-filtering software which rejects perfectly innocent email.

It is particularly amusing when some 'security-conscious' person in one of these organisations subscribes to a list discussing, for example, the latest virus threats. Said 'security-conscious' person may find that they receive nothing of importance from the list, because a message containing just *the name* of a known virus is enough for the misconfigured content-filtering software to step in and reject it.

--
Jason Wong -> Gremlins Associates -> www.gremlins.biz
Open Source Software Systems Integrators
* Web Design & Hosting * Internet & Intranet Applications Development *

/*
I finally went to the eye doctor. I got contacts. I only need them to read, so I got flip-ups. -- Steven Wright
*/
RE: [PHP] Filter vulger / controversial words - need word source
Hi Sean,

> if you want a partial list of offensive terms - try looking
> at the meta keywords on a few porn sites ...

Excellent idea!

Unfortunately I'd have to explain that to my boss... "No, really, I'm doing some research..."

Cheers
Jon
Re: [PHP] Filter vulger / controversial words - need word source
DL Neil wrote:

> Hi Jon,
> [SNIP]
> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
>
> Most sensible! The employment of a technological solution to a social
> problem is somewhat shooting the messenger.
>
> However some countries are now legislating responsibility that
> ISPs/employers must discharge (shooting the person who shoes the horses that
> the Pony Express messenger is riding!?)

I've worked on several projects where the client initially wanted filtering - in some cases we have implemented some kind of filtering - but it has always ended up being a human that does the real work.

there simply is no definitive list of words

you can't stop people talking about pussy cats, turkey breasts or even shag pile carpets (and words have different meanings from one place to the next)

automated filters can be a useful aid to human moderation - flagging up messages for review.

if you want a partial list of offensive terms - try looking at the meta keywords on a few porn sites ...

--
Sean
RE: [PHP] Filter vulger / controversial words - need word source
Hi,

> I think we've seen this discussion on the list before
> (so Christopher, check the archives!)

Quite :-)

> > The problems that others have experienced in the past are:
> > - what happens with "mis"spellings, e.g. "fsck"?
> > - what happens with dodgy formatting, e.g. "f s c k"?
> > - what happens with words like "Scunthorpe"?
>
> Problem 1: add likely/popular mis-spellings to the list of
> vulger/vulgar language

So when I'm giving a Linux user advice on how to recover from a disk crash, my "run fsck" comment will get trapped. The problem here is that context is *everything*. You just can't know, by seeing the word "fsck" without any of the surrounding text, whether I'm swearing at another geek or helping them out :-)

There will also be problems with slang and idiom - e.g. "fag" in .uk is a cigarette, but it's something quite different on the other side of the pond. Again, this can only be judged from the context.

Finally, the more words you have in your list (to cover common misspellings), the more likely you are to get a false positive (again, context) - and you *will* cause offense if you trap someone's name, for example.

> Problem 2: (contrived) very few single-letter words exist so remove
> intervening white space prior to analysis

Yup, also line breaks, dashes, asterisks, plus signs, etc etc :-)

> Problem 3: Scunthorpe contains an unfortunate series of letters (amongst the
> town's many disadvantages) however the critical four are not a word in and
> of their own right so employ whitespace (\s) in the RegEx or token analysis.

That's a good solution, but it's something that obviously is being missed by many developers of this sort of algorithm... see the couple of followups I made immediately after my original response.

> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
>
> However some countries are now legislating responsibility that
> ISPs/employers must discharge

Whoops, forgot about that...

> In this case perhaps one could analyse the incoming text and place an
> embargo on its publication on the web site until it has been reviewed by a
> human editor?

Looks like the best solution possible.

If the OP is interested I will see if I can get our content filter word list from the network manager here... no promises though.

Cheers
Jon
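To make the "Problem 3" exchange above concrete, here is a minimal sketch of the difference between substring matching and whole-word matching. It is written in Python purely for illustration; the `\b` word-boundary assertion it relies on is also available in the PCRE patterns PHP's `preg_match()` accepts. The blocklist, the function names and the sample sentences are all assumptions made up for this example (with "fsck" and "hell" standing in for real entries), not anything posted in the thread.

```python
import re

# Hypothetical blocklist for illustration only; "fsck" and "hell" are
# stand-ins for whatever a real list would contain.
BLOCKLIST = ["fsck", "hell"]

def substring_hits(text):
    """Naive check: a term matches anywhere, even inside a longer word."""
    lowered = text.lower()
    return [term for term in BLOCKLIST if term in lowered]

def whole_word_hits(text):
    """Stricter check: a term matches only when it stands as its own word."""
    hits = []
    for term in BLOCKLIST:
        # \b asserts a word boundary on each side of the escaped term.
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append(term)
    return hits

place = "Greetings from Shellharbour"
print(substring_hits(place))                    # ['hell']
print(whole_word_hits(place))                   # []
print(whole_word_hits("run fsck on the disk"))  # ['fsck']
```

The last call shows the limit Jon describes: word boundaries remove the place-name false positive, but "run fsck" is still caught, because nothing in a word list can tell advice from abuse.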
Re: [PHP] Filter vulger / controversial words - need word source
Hi Jon,

I think we've seen this discussion on the list before (so Christopher, check the archives!)

> > I'm wondering if someone has a great source for a master-list
> > of controversial and vulger words that I can use on my site.
> > I would like to pattern match input text against this master-list
> > in order to prevent vulger and controversial words from appearing
> > on my site.
> Once you've got the routine working, post it here, because there are many
> people who would like to know how to do this properly.
> The problems that others have experienced in the past are:
> - what happens with "mis"spellings, e.g. "fsck"?
> - what happens with dodgy formatting, e.g. "f s c k"?
> - what happens with words like "Scunthorpe"?

Problem 1: add likely/popular mis-spellings to the list of vulger/vulgar language

Problem 2: (contrived) very few single-letter words exist so remove intervening white space prior to analysis

Problem 2a: (the more popular f*ck - someone suffering the misapprehension that (s)he is somehow NOT guilty of using bad language/being offensive when (s)he plainly is not only doing so but attempting to be deceptive as well...) see response to Problem 1 (the probable habit would be to replace/remove vowels)

Problem 3: Scunthorpe contains an unfortunate series of letters (amongst the town's many disadvantages) however the critical four are not a word in and of their own right so employ whitespace (\s) in the RegEx or token analysis.

> May I suggest, rather than picking your way through this minefield, you
> provide a "report abusive comment" link instead?

Most sensible! The employment of a technological solution to a social problem is somewhat shooting the messenger.

However some countries are now legislating responsibility that ISPs/employers must discharge (shooting the person who shoes the horses that the Pony Express messenger is riding!?)

In this case perhaps one could analyse the incoming text and place an embargo on its publication on the web site until it has been reviewed by a human editor?

If we were talking about filtering incoming email, then perhaps the original message could be forwarded/wrapped with a message from the EmailAdmin/System pointing out that a message has arrived from xyz (etc) and has been flagged for a stated reason (but that there is room for interpretation within the mechanical observation) and that the message should not be opened by anyone fearing offence. (This is similar to 'security' gateways that don't allow msgs with attachments unless the 'employee' first authorises a 'pass-through'.)

Euro 0.02's worth?
=dn
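The two suggestions above (collapse spaced-out letters before analysis, then embargo a flagged message for a human editor rather than reject it) could be sketched roughly as follows. This is a Python illustration only, not a tested filter: the blocklist, the separator set, the "held"/"published" labels and the function names are all hypothetical, and "fsck" is again just the stand-in used in the thread.

```python
import re

# Hypothetical blocklist; "fsck" is the stand-in used in this thread.
BLOCKLIST = {"fsck"}

# Three or more single letters separated by whitespace (including line
# breaks), dashes, asterisks or plus signs, e.g. "f s c k" or "f-s-c-k".
SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z][\s\-*+]+){2,}[A-Za-z]\b")

def collapse_spaced_letters(text):
    """Join spaced-out single letters so "f s c k" becomes "fsck"."""
    return SPACED_LETTERS.sub(
        lambda m: re.sub(r"[\s\-*+]+", "", m.group(0)), text
    )

def review_status(text):
    """Return ("held", terms) for human review, or ("published", [])."""
    words = re.findall(r"[a-z]+", collapse_spaced_letters(text).lower())
    hits = sorted(set(words) & BLOCKLIST)
    # A hit never deletes or bounces the message; it only queues it.
    return ("held", hits) if hits else ("published", [])

print(collapse_spaced_letters("please run f s c k now"))
print(review_status("f-s-c-k"))
print(review_status("I am a cat"))
```

Only runs of single letters are collapsed, so ordinary sentences keep their word breaks and the whole-word comparison still applies afterwards. A sketch like this handles just the simplest disguise; it would not catch a layout like the one in Theo's message, which is one more reason the final decision is left to a person.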
RE: [PHP] Filter vulger / controversial words - need word source
> Following up to my own post

And again...

> I wonder if it was "Shorpe" I suppose I'll
> find out when/if I get another bounce :-)

I got another bounce :-)

Whoever is running this filter obviously doesn't want to do business with any of the 70,000-odd people who live in a particular town in the north of England.

Cheers
Jon
RE: [PHP] Filter vulger / controversial words - need word source
Following up to my own post

> > Once you've got the routine working, post it here,
> > because there are many people who would like to know
> > how to do this properly.

I didn't use any profanity I was aware of in this post, but I still received this a few minutes later:

> Trend SMEX Content Filter has detected sensitive content.
> Place = 'Christopher Raymond'; [EMAIL PROTECTED];
> Sender = Jon Haworth
> Subject = RE: [PHP] Filter vulger / controversial words
> - need word source
> Delivery Time = December 11, 2002 (Wednesday) 22:13:00
> Policy = Dirty Words
> Action on this mail = Delete message

I wonder if it was "Scunthorpe"... I suppose I'll find out when/if I get another bounce :-)

Cheers
Jon
RE: [PHP] Filter vulger / controversial words - need word source
Hi Christopher,

> I'm wondering if someone has a great source for a master-list
> of controversial and vulger words that I can use on my site.
> I would like to pattern match input text against this master-list
> in order to prevent vulger and controversial words from appearing
> on my site.

Before you spend ages finding a good list, get the routine working.

Once you've got the routine working, post it here, because there are many people who would like to know how to do this properly.

The problems that others have experienced in the past are:
- what happens with "mis"spellings, e.g. "fsck"?
- what happens with dodgy formatting, e.g. "f s c k"?
- what happens with words like "Scunthorpe"?

Additionally, from my experience of the mail content filters we use here at $WORKPLACE, you will also need to be careful not to cause offence by catching people's names. We have a Chinese gentleman as a client with a surname that could be mistaken for an offensive word - he was not best pleased to receive a bounce message telling him that his email hadn't been delivered because he was using profanity.

May I suggest, rather than picking your way through this minefield, you provide a "report abusive comment" link instead?

Cheers
Jon