RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Brinkman, Theodore
Don't forget that people are very good at coming up with easy to read
displays that a computer will have trouble processing.  I'll include a
lightly sanitized (so I don't get in trouble at work) example below:

- - - - -

W   d   y   f   m   S
h   0   o   s   e   c
y   n   u   c   *
'   k   in  n
t   comethorpe!

- - - - -

Most people on this list will have little (if any) trouble reading the above
sentence, but find me an algorithm that'll flag it, and you'll have my
undying respect.

- Theo

-Original Message-
From: Jon Haworth [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 11, 2002 9:14 AM
To: 'Christopher Raymond'; [EMAIL PROTECTED]
Subject: RE: [PHP] Filter vulger / controversial words - need word
source


Hi Christopher,

> I'm wondering if someone has a great source for a master-list 
> of controversial and vulger words that I can use on my site. 
> I would like to pattern match input text against this master-list 
> in order to prevent vulger and controversial words from appearing 
> on my site.

Before you spend ages finding a good list, get the routine working. 

Once you've got the routine working, post it here, because there are many
people who would like to know how to do this properly.

The problems that others have experienced in the past are:

  - what happens with "mis"spellings, e.g. "fsck"?
  - what happens with dodgy formatting, e.g "f s c k"?
  - what happens with words like "Scunthorpe"?

Additionally, from my experience of the mail content filters we use here at
$WORKPLACE, you will also need to be careful not to cause offence by
catching people's names. We have a Chinese gentleman as a client with a
surname that could be mistaken for an offensive word - he was not best
pleased to receive a bounce message telling him that his email hadn't been
delivered because he was using profanity.

May I suggest, rather than picking your way through this minefield, you
provide a "report abusive comment" link instead?


Cheers
Jon


  

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php




Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Chris Hewitt



> there simply is no definitive list of words


I agree. In Amateur Radio we faced this problem in the UK in the mid-90s 
on the ax25/tcpip network when the Home Office made the sysops 
responsible for content. Filtering messages off for human reading if any 
words on the "list" occurred was the only practical solution. The sysops 
gradually built up a "list" and circulated it. The rules state something 
like the sysop must "take all reasonable precautions" to prevent 
offensive material circulating. So a sysop filtering/human reading is OK 
if an offensive message got through, whereas one not doing anything lost 
their licence.

This still leaves the problem of the English-only speaking sysop getting 
smtp/nntp in a different language!

Regards

Chris
PS Sorry about this being OT, my one and only post on this topic.






Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread DL Neil
Jason,

> > there simply is no definitive list of words

> The fact is content filtering does not work without a heavy dose of human
> intervention.
> It is quite shocking that large numbers of well known corporations deploy
> misconfigured content-filtering software which rejects perfectly innocent
> email.

Earlier contributions highlight this admirably.

I was at an "Information" show last week where a stand displayed the good
work of the EU organisation in the field. I spoke to a Brussels(*)
wonk/weenie/suit about such legislation (proposed and awaiting national
enactment). I suggested that it would be unfair to 'bring to justice' anyone
for apparently offending some employee/user's sensitivities without first
defining WHAT would cause offence (eg the requested list of "vulgar words").
Otherwise the first you might know about it is when a court decides (against
you) that [word] is unacceptable in polite company. Accordingly I suggested
that his department publish such a list (in all of the languages/cultures of
the EU???), but observed that he would have a serious problem being able to
distribute it without his own office prosecuting itself! As is to be
expected, he failed to see the humor (and failed to see the
sense/requirement to do so)...

The joke is on our Indonesian friends @upv.pertamina.co.id whose 'filter' simply
bounces messages containing "Dirty Words", because as you say there is no
human involvement so they can't even benefit from Jon's observations. This
policy means that every contact/contract with Sc*nthorp that they lose, is
deservedly so, and an unfortunate advertisement not to use that country for
out-sourcing if the culture-gap is so great!?

Summary attitude: hey I'll code it if you want it/pay me to do so, but what
are you going to do when you meet the rest of society as soon as you come
off the email system? NB it has 'always' been illegal to use such language
in (British, and many others) phone conversations, but who does that
stop/what filters are in place there?
=dn

*for the benefit of more distant members: Brussels is the home of many
European Union (EU) offices, and the source of much bureaucratic 'stupidity'
such as the legislation mentioned earlier.







RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
> > > if you want a partial list of offensive terms - try looking
> > > at the meta keywords on a few porn sites ...
> >
> > Excellent idea!
> > Unfortunately I'd have to explain that to my boss... "No, 
> > really, I'm doing some research..."
> 
> but won't your gateway/web server's filter prevent access to 
> such sites anyway?

Nope, only the really offensive ones like
http://www.thisisscunthorpe.co.uk/index.jsp ;-)

Cheers
Jon





Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread DL Neil
> > if you want a partial list of offensive terms - try looking
> > at the meta keywords on a few porn sites ...
>
> Excellent idea!
> Unfortunately I'd have to explain that to my boss... "No, really, I'm doing
> some research..."


=guess monopolising the color printer for a whole afternoon would give you
away, huh?

=but won't your gateway/web server's filter prevent access to such sites
anyway?

=dn






Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jason Wong
On Wednesday 11 December 2002 23:12, Sean Burlington wrote:

>
> there simply is no definitive list of words
>
> you can't stop people talking about pussy cats, turkey breasts or even
> shag pile carpets (and words have different meanings from one place to
> the next)
>
>
> automated filters can be a useful aid to human moderation - flagging up
> messages for review.
>
> if you want a partial list of offensive terms - try looking at the
> meta keywords on a few porn sites ...

This last paragraph is about the most useful comment on this whole subject :)

The fact is content filtering does not work without a heavy dose of human 
intervention.

It is quite shocking that large numbers of well known corporations deploy 
misconfigured content-filtering software which rejects perfectly innocent 
email.

It is particularly amusing when some 'security-conscious' person in one of 
these organisations subscribes to a list discussing, for example, the latest 
virus threats. Said 'security-conscious' person may then find that they 
receive nothing of importance from the list, because a message containing just 
*the name* of a known virus is enough for the misconfigured content-filtering 
software to step in and reject it.

-- 
Jason Wong -> Gremlins Associates -> www.gremlins.biz
Open Source Software Systems Integrators
* Web Design & Hosting * Internet & Intranet Applications Development *

/*
I finally went to the eye doctor.  I got contacts.  I only need them to
read, so I got flip-ups.
-- Steven Wright
*/






RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
Hi Sean,

> if you want a partial list of offensive terms - try looking 
> at the meta keywords on a few porn sites ...

Excellent idea!

Unfortunately I'd have to explain that to my boss... "No, really, I'm doing
some research..."

Cheers
Jon





Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Sean Burlington
DL Neil wrote:

> Hi Jon,
>
> [SNIP]
>
> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
>
> Most sensible! The employment of a technological solution to a social
> problem is somewhat shooting the messenger. However some countries are now
> legislating responsibility that ISPs/employers must discharge (shooting the
> person who shoes the horses that the Pony Express messenger is riding!?)



I've worked on several projects where the client initially wanted 
filtering - in some cases we have implemented some kind of filtering - 
but it has always ended up being a human that does the real work.

there simply is no definitive list of words

you can't stop people talking about pussy cats, turkey breasts or even 
shag pile carpets (and words have different meanings from one place to 
the next)


automated filters can be a useful aid to human moderation - flagging up 
messages for review.

if you want a partial list of offensive terms - try looking at the
meta keywords on a few porn sites ...


--

Sean





RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
Hi,

> I think we've seen this discussion on the list before
> (so Christopher, check the archives!)

Quite :-)

> > The problems that others have experienced in the past are:
> > - what happens with "mis"spellings, e.g. "fsck"?
> > - what happens with dodgy formatting, e.g "f s c k"?
> > - what happens with words like "Scunthorpe"?
> 
> Problem 1: add likely/popular mis-spellings to the list of 
> vulger/vulgar language

So when I'm giving a Linux user advice on how to recover from a disk crash,
my "run fsck" comment will get trapped. The problem here is that context
is *everything*. You just can't know, by seeing the word "fsck" without any
of the surrounding text, whether I'm swearing at another geek or helping
them out :-)

There will also be problems with slang and idiom - e.g. "fag" in .uk is a
cigarette, but it's something quite different on the other side of the pond.
Again, this can only be judged from the context.

Finally, the more words you have in your list (to cover common
misspellings), the more likely you are to get a false positive (again,
context) - and you *will* cause offense if you trap someone's name, for
example.

> Problem 2: (contrived) very few single-letter words exist so remove
> intervening white space prior to analysis

Yup, also line breaks, dashes, asterisks, plus signs, etc etc :-)
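
A minimal sketch of that separator-stripping step, in Python purely for illustration. The blocklist entries are harmless stand-ins ("fsck" from this thread, "scun" as an arbitrary fragment), and the set of separator characters is a guess rather than anything agreed on the list. Note the side effect in the last call: collapsing the text also glues neighbouring words together, which creates false positives of its own.

```python
import re

# Hypothetical stand-in blocklist; not a real word list.
BLOCKLIST = ["fsck", "scun"]

# Characters treated as padding between letters: whitespace (including
# line breaks), dashes, asterisks, plus signs, dots and underscores.
SEPARATORS = re.compile(r"[\s\-*+._]+")

def collapse(text):
    """Lower-case the text and delete every run of separator characters."""
    return SEPARATORS.sub("", text.lower())

def flag_collapsed(text):
    """Return the blocklist entries found in the collapsed text."""
    squashed = collapse(text)
    return [word for word in BLOCKLIST if word in squashed]

print(flag_collapsed("f s c k"))                        # spaced-out spelling is caught
print(flag_collapsed("f-s*c+k"))                        # so is mixed punctuation
print(flag_collapsed("Leave the disc under the desk"))  # but so is this innocent sentence
```

The third call is flagged because "disc under" collapses to "discunder", which contains the stand-in fragment.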

> Problem 3: Scunthorpe contains an unfortunate series of letters (amongst the
> town's many disadvantages) however the critical four are not a word in and
> of their own right so employ whitespace (\s) in the RegEx or token analysis.

That's a good solution, but it's something that obviously is being missed by
many developers of this sort of algorithm... see the couple of followups I
made immediately after my original response.

> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
> 
> However some countries are now legislating responsibility that 
> ISPs/employers must discharge 

Whoops, forgot about that... 

> In this case perhaps one could analyse the incoming text and place an
> embargo on its publication on the web site until it has been reviewed by a
> human editor?

Looks like the best solution possible.

If the OP is interested I will see if I can get our content filter word list
from the network manager here... no promises though.

Cheers
Jon





Re: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread DL Neil
Hi Jon,
I think we've seen this discussion on the list before
(so Christopher, check the archives!)


> > I'm wondering if someone has a great source for a master-list
> > of controversial and vulger words that I can use on my site.
> > I would like to pattern match input text against this master-list
> > in order to prevent vulger and controversial words from appearing
> > on my site.
> Once you've got the routine working, post it here, because there are many
> people who would like to know how to do this properly.

> The problems that others have experienced in the past are:
>   - what happens with "mis"spellings, e.g. "fsck"?
>   - what happens with dodgy formatting, e.g "f s c k"?
>   - what happens with words like "Scunthorpe"?


Problem 1: add likely/popular mis-spellings to the list of vulger/vulgar
language

Problem 2: (contrived) very few single-letter words exist so remove
intervening white space prior to analysis

Problem 2a: (the more popular f*ck - someone suffering the misapprehension
that (s)he is somehow NOT guilty of using bad language/being offensive when
(s)he plainly is not only doing so but attempting to be deceptive as
well...) see response to Problem 1 (the probable habit would be to
replace/remove vowels)

Problem 3: Scunthorpe contains an unfortunate series of letters (amongst the
town's many disadvantages) however the critical four are not a word in and
of their own right so employ whitespace (\s) in the RegEx or token analysis.
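
One way the whole-word idea in Problem 3 might look, sketched in Python for illustration only. It uses the regex word-boundary anchor \b rather than a literal \s, so that a word followed by punctuation is still treated as a whole word; that substitution is my assumption, not something specified above. The blocklist holds harmless stand-ins ("fsck" from this thread, "scun" as an arbitrary fragment).

```python
import re

# Hypothetical stand-in blocklist; not a real word list.
BLOCKLIST = ["fsck", "scun"]

# Match a blocklist entry only when it stands alone as a word.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def flag_whole_words(text):
    """Return blocklist entries that appear as complete words in text."""
    return [match.group(1).lower() for match in PATTERN.finditer(text)]

print(flag_whole_words("Greetings from Scunthorpe"))  # fragment inside a longer word: not flagged
print(flag_whole_words("Run fsck now."))              # whole word: flagged, whatever the intent
```

This deals with the embedded-letters case, but the second call shows that it does nothing about context: the technical advice is flagged exactly as an insult would be.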

> May I suggest, rather than picking your way through this minefield, you
> provide a "report abusive comment" link instead?

Most sensible! The employment of a technological solution to a social
problem is somewhat shooting the messenger. However some countries are now
legislating responsibility that ISPs/employers must discharge (shooting the
person who shoes the horses that the Pony Express messenger is riding!?)

In this case perhaps one could analyse the incoming text and place an
embargo on its publication on the web site until it has been reviewed by a
human editor?
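
A rough sketch of that embargo idea, in Python for illustration. The class name, method names and the stand-in blocklist are all hypothetical; the point is only that a match holds the text for a human instead of rejecting it outright.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in blocklist; not a real word list.
BLOCKLIST = ["fsck"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

@dataclass
class ModerationQueue:
    """Publishes clean submissions and holds flagged ones for human review."""
    published: list = field(default_factory=list)
    held: list = field(default_factory=list)  # (text, matched entries) pairs

    def submit(self, text):
        reasons = [match.group(1).lower() for match in PATTERN.finditer(text)]
        if reasons:
            self.held.append((text, reasons))  # embargoed, not deleted
            return "held"
        self.published.append(text)
        return "published"

    def approve(self, index):
        """A human editor releases a held submission."""
        text, _reasons = self.held.pop(index)
        self.published.append(text)

queue = ModerationQueue()
print(queue.submit("Nice weather today"))
print(queue.submit("Try running fsck on the disk"))
queue.approve(0)  # the editor decides it was technical advice
print(queue.published)
```

Because nothing is discarded, a false positive costs a delay rather than a lost message.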

If we were talking about filtering incoming email, then perhaps the original
message could be forwarded/wrapped with a message from the EmailAdmin/System
pointing out that a message has arrived from xyz (etc) and has been flagged
for a stated reason (but that there is room for interpretation within the
mechanical observation) and that the message should not be opened by anyone
fearing offence. (this similar to 'security' gateways that don't allow msgs
with attachments unless the 'employee' first authorises a 'pass-through')
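
For the email case, the wrapping could be sketched with Python's standard email package as below. The function name, the "[FLAGGED]" subject prefix, the addresses and the wording of the notice are all made up for illustration; only the general shape (a notice with the untouched original attached) comes from the paragraph above.

```python
from email.message import EmailMessage

def wrap_flagged(original, reason, admin_addr="postmaster@example.com"):
    """Build a notice from the mail admin with the original message attached."""
    wrapper = EmailMessage()
    wrapper["From"] = admin_addr
    wrapper["To"] = original["To"]
    wrapper["Subject"] = "[FLAGGED] " + (original["Subject"] or "(no subject)")
    wrapper.set_content(
        f"A message from {original['From']} was flagged by an automated filter.\n"
        f"Stated reason: {reason}\n"
        "The check is mechanical and open to interpretation.\n"
        "The original is attached; do not open it if you fear offence.\n"
    )
    # Attaching a message object stores it as a message/rfc822 part.
    wrapper.add_attachment(original)
    return wrapper

original = EmailMessage()
original["From"] = "xyz@example.org"
original["To"] = "staff@example.com"
original["Subject"] = "Disk recovery"
original.set_content("Try running fsck on the disk.\n")

wrapped = wrap_flagged(original, "matched list entry 'fsck'")
print(wrapped["Subject"])
print(wrapped.get_content_type())
```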

Euro 0.02's worth?
=dn






RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
> Following up to my own post

And again...

> I wonder if it was "Shorpe" I suppose I'll 
> find out when/if I get another bounce :-)

I got another bounce :-)

Whoever is running this filter obviously doesn't want to do business with
any of the 70,000 odd people who live in a particular town in the north of
England.

Cheers
Jon





RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
Following up to my own post

> > Once you've got the routine working, post it here, 
> > because there are many people who would like to know 
> > how to do this properly.

I didn't use any profanity I was aware of in this post, but I still received
this a few minutes later:

> Trend SMEX Content Filter has detected sensitive content. 
> Place = 'Christopher Raymond'; [EMAIL PROTECTED];
> Sender = Jon Haworth 
> Subject = RE: [PHP] Filter vulger / controversial words 
> - need word source 
> Delivery Time = December 11, 2002 (Wednesday) 22:13:00 
> Policy = Dirty Words 
> Action on this mail = Delete message 

I wonder if it was "Scunthorpe". I suppose I'll find out when/if I get
another bounce :-)

Cheers
Jon





RE: [PHP] Filter vulger / controversial words - need word source

2002-12-11 Thread Jon Haworth
Hi Christopher,

> I'm wondering if someone has a great source for a master-list 
> of controversial and vulger words that I can use on my site. 
> I would like to pattern match input text against this master-list 
> in order to prevent vulger and controversial words from appearing 
> on my site.

Before you spend ages finding a good list, get the routine working. 

Once you've got the routine working, post it here, because there are many
people who would like to know how to do this properly.

The problems that others have experienced in the past are:

  - what happens with "mis"spellings, e.g. "fsck"?
  - what happens with dodgy formatting, e.g "f s c k"?
  - what happens with words like "Scunthorpe"?
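
To make the three problems concrete, here is a minimal sketch of the obvious first attempt, a case-insensitive substring match, written in Python for illustration. The blocklist entries are harmless stand-ins ("fsck" as above, "scun" as an arbitrary fragment), not a real word list.

```python
# Hypothetical stand-in blocklist; not a real word list.
BLOCKLIST = ["fsck", "scun"]

def naive_flag(text):
    """Return blocklist entries found anywhere in text (substring match)."""
    lowered = text.lower()
    return [word for word in BLOCKLIST if word in lowered]

print(naive_flag("f s c k"))                       # dodgy formatting slips straight through
print(naive_flag("Greetings from Scunthorpe"))     # an innocent place name is flagged
print(naive_flag("Try running fsck on the disk"))  # so is legitimate technical advice
```

The first call misses what it should catch, and the other two catch what they should not.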

Additionally, from my experience of the mail content filters we use here at
$WORKPLACE, you will also need to be careful not to cause offence by
catching people's names. We have a Chinese gentleman as a client with a
surname that could be mistaken for an offensive word - he was not best
pleased to receive a bounce message telling him that his email hadn't been
delivered because he was using profanity.

May I suggest, rather than picking your way through this minefield, you
provide a "report abusive comment" link instead?


Cheers
Jon


  
