Re: "Approximate" hashes

2004-09-06 Thread Len Sassaman
On Wed, 1 Sep 2004, Marcel Popescu wrote:

> Hence my question: is there some "approximate" hash function (which I could
> use instead of SHA-1) which can verify that a text hashes "very close" to a
> value? So that if I change, say, tabs into spaces, I won't get exactly the
> same value, but I would get a "good enough"?

Hi Marcel,

You may wish to look at Cmeclax's nilsimsa. It has been used to detect
slightly-modified message floods in anonymous remailer systems, and was
also used in Spamassassin at some point.

http://lexx.shinn.net/cmeclax/nilsimsa.html



-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


RE: "Approximate" hashes

2004-09-06 Thread Bill Stewart

> On Behalf Of Marcel Popescu
...
> My problem is that I don't know what happens with the email in transit
> (this, I believe, is an observation in the hashcash FAQ). I
> am worried that some mail server might dislike ASCII characters with 
>
> Hence my question: is there some "approximate" hash function
> (which I could > use instead of SHA-1) which can verify that a
> text hashes  "very close" to a value?
At 12:27 PM 9/1/2004, Keith Ray wrote:
nilsimsa
Computes nilsimsa codes of messages and compares the codes and finds
clusters of similar messages so as to trash spam.
Check out Vipul's Razor, which uses an approach similar to this.
You'll find information at Cloudmark and on Sourceforge.
There are several different kinds of differences to work around -
- damage in transit, as noted, though it's the least of your worries
in spite of Unicode, MS Codesets, and 8-bit-uncleanness
- different mail headers getting added or subtracted or mimed
(Some people include relevant parts in their message indexes, some 
don't.)
- deliberate differences introduced in the message to discourage detection,
ranging from the simple "Dear Alice"/"Dear Bob" to
removal addresses than encode each spam victim's info,
to different random word-scramble that's also there to
discourage Bayesian spam-detectors.
This one's really common these days, especially as mail systems
have decreased the number of users they'll send to
in a given SMTP session / envelope because of spamming -
if you can only spam 5-10 recipients per TCP session,
might as well make each session somewhat different
so you only get hit by local detectors, not global indexers.

Vipul's Razor and related approaches try to calculate a unique id
for each message so that if a human detects that a message is spam,
the id can be published so everybody else trashes it.
This usually needs more than one human rating something as spam
to prevent abuse, and there's some tuning, but it's a good start.
-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: "Approximate" hashes

2004-09-06 Thread Jon Callas
Hence my question: is there some "approximate" hash function (which I 
could
use instead of SHA-1) which can verify that a text hashes "very close" 
to a
value? So that if I change, say, tabs into spaces, I won't get exactly 
the
same value, but I would get a "good enough"?

What you want is what's called a canonicalization function. You want to 
hash not T, but F(T), and F can de-tabify, and so on.

As has been mentioned, OpenPGP has a canonicalization for text and 
cleartext signatures (and we debate what it should be, even). XML 
Digital Signatures has one.

The major other issue you have to deal with is whether the 
canonicalization is an interpretation, a conversion, or an assurance.

If it's an interpretation, then you are hashing F(T), and there's 
always some slim chance that there's a collision between two texts that 
canonicalize to different values, but hash to the same. I wouldn't 
worry about it, myself. Much. (Assuming a decent canonicalizer, of 
course.)

If it's a conversion, then you're replacing the text with the 
canonicalized text. This puts the canonicalization in the face of the 
users, but removes the problems handwaved at above.

If it's an assurance, then if the text is not canonicalized, then it's 
not a valid message if it doesn't meet the requirements. A message with 
a tab, for example, just isn't a valid message. Don't even hash it, 
return an error. Or alert that it was an invalid message.

Jon
-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


RE: "Approximate" hashes

2004-09-01 Thread Jerrold Leichter
| nilsimsa
| Computes nilsimsa codes of messages and compares the codes and finds
| clusters of similar messages so as to trash spam.
|
| What's a nilsimsa code?
|
| A nilsimsa code is something like a hash, but unlike hashes, a small change
| in the message results in a small change in the nilsimsa code.
|
| http://lexx.shinn.net/cmeclax/nilsimsa.html
I had a look at the code (which isn't easy to follow).  This appears to be a
new application of Bloom filters.
-- Jerry

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


RE: "Approximate" hashes

2004-09-01 Thread Keith Ray
> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Marcel Popescu
> Sent: Wednesday, September 01, 2004 9:56 AM
> To: [EMAIL PROTECTED]
> Subject: "Approximate" hashes
> 
> I am trying to build a Windows anti-spam thingy; it's 
> supposed to "sit" in
> between the mail client and the outer world, and indicate through mail
> headers whether the incoming mail has a valid hashcash
> http://www.hashcash.org/ "coin" (and, of course, to automatically add
> hashcash to outgoing emails).
> 
> My problem is that I don't know what happens with the email in transit
> (this, I believe, is an observation in the hashcash FAQ). I 
> am worried that
> some mail server might dislike ASCII characters with the high 
> bit set, or
> that a client uses some encoding which for some reason 
> doesn't make it to
> the destination unchanged.
> 
> Hence my question: is there some "approximate" hash function 
> (which I could
> use instead of SHA-1) which can verify that a text hashes 
> "very close" to a
> value? So that if I change, say, tabs into spaces, I won't 
> get exactly the
> same value, but I would get a "good enough"?
> 
> I don't know if this is possible. But if it is, I though this 
> would be a
> good place to find out about it.

nilsimsa

Computes nilsimsa codes of messages and compares the codes and finds
clusters of similar messages so as to trash spam.

What's a nilsimsa code?

A nilsimsa code is something like a hash, but unlike hashes, a small change
in the message results in a small change in the nilsimsa code.

http://lexx.shinn.net/cmeclax/nilsimsa.html

 --
Keith Ray <[EMAIL PROTECTED]> -- OpenPGP Key: 0x79269A12

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: "Approximate" hashes

2004-09-01 Thread David Honig
At 06:02 PM 9/1/04 +0300, Marcel Popescu wrote:
>From: "Marcel Popescu" <[EMAIL PROTECTED]>
>
>> Hence my question: is there some "approximate" hash function (which I
>could
>> use instead of SHA-1) which can verify that a text hashes "very close" to
>a
>> value? So that if I change, say, tabs into spaces, I won't get exactly the
>> same value, but I would get a "good enough"?

This is completely what secure hashes are supposed to prevent.  *Any* change
in the input will flip half the hash-bits on average.  Just like a block
cipher.  There is no input "nearness" preservation.  That's part of the point
of the algorithm.


>I just had an idea. Would this work?
>
>- let S be the input string, whose hash I want to verify
>- make S uppercase
>- remove everything but A-Z, 0-9, and common punctuation (!;:'",.?)
>- calculate the SHA1 hash of the result
>
>This should keep any insignificant changes out of the final result. 

You can encode your message in some format which is not subject
to mangling, and use a hash of that encoding.  You can then
decode the encoding back into unicode or whatever funky but
net-fragile character set you like.  This is somewhat like
ascii-armoring of PGP-encrypted messages.





=
36 Laurelwood Dr
Irvine CA 92620-1299

VOX: (714) 544-9727 (home) mnemonic: P1G JIG WRAP

ICBM: -117.7621, 33.7275
HTTP: http://68.5.216.23:81 (back up, but not 99.999% reliable)
PGP PUBLIC KEY: by arrangement

Send plain ASCII text not HTML lest ye be misquoted

--

"Don't 'sir' me, young man, you have no idea who you're dealing with"
Tommy Lee Jones, MIB



No, you're not 'tripping', that is an emu ---Hank R. Hill

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: "Approximate" hashes

2004-09-01 Thread Marcel Popescu
From: ""Hal Finney"" <[EMAIL PROTECTED]>

> As you are probably aware, existing hashcash implementations do not base
> the stamp on the message content.  Instead they only lock the stamp to
> the receiver's email address.  Then the receiver keeps a list of the
> hashcash stamps he has seen recently, to prevent reuse.  I'm not sure
> what you hope to gain by locking the stamp to the message content.

Me dumb, sorry. I was actually thinking of some other thing I wanted to do
about spam (which involved the whole content), and haven't re-read the
hashcash site in about a week.

You are perfectly right, of course. Windows spam victims, here I come! 

Thanks,
Marcel

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: "Approximate" hashes

2004-09-01 Thread C. Scott Ananian
On Wed, 1 Sep 2004, Marcel Popescu wrote:

> My problem is that I don't know what happens with the email in transit
> some mail server might dislike ASCII characters with the high bit set, or
> Hence my question: is there some "approximate" hash function (which I could

PGP has this issue with 'clear-signed' output.  An excerpt from the pgp
man page:
If CLEARSIG is enabled, then when signing and ASCII-armoring a text
file, PGP uses a different format that includes the plaintext in
human-readable form.  Lines beginning with "-" are quoted with "- ".
To cope with some of the stupider mailers in the world, lines
beginning with "From"  are also quoted, and trailing whitespace on
lines is stripped.  PGP will remove the quoting if you use it to
decrypt the message, but the trailing whitespace is not recovered.
This is still useful enough to be enabled by default.
You might find more details about typical mailer-munging from the PGP docs
or source.
 --scott

GRALLSPICE Moscow MKSEARCH affinity group KUDESK Nazi operative Clinton
   Register to vote!  http://www.yourvotematters.org/VerifiedVoting
 ( http://cscott.net/ )

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


Re: "Approximate" hashes

2004-09-01 Thread Marcel Popescu
From: "Marcel Popescu" <[EMAIL PROTECTED]>

> Hence my question: is there some "approximate" hash function (which I
could
> use instead of SHA-1) which can verify that a text hashes "very close" to
a
> value? So that if I change, say, tabs into spaces, I won't get exactly the
> same value, but I would get a "good enough"?

I just had an idea. Would this work?

- let S be the input string, whose hash I want to verify
- make S uppercase
- remove everything but A-Z, 0-9, and common punctuation (!;:'",.?)
- calculate the SHA1 hash of the result

This should keep any insignificant changes out of the final result. Does
anyone know of a mail transformation which could upset it? Can anyone see a
way to "attack" this by letting a significantly different message collide on
the same hash? (I'm ignoring the recent discoveries - they're not that
practical, I'm only trying to fight spam, not the government.)

Thanks,
Marcel

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]


"Approximate" hashes

2004-09-01 Thread Marcel Popescu
I am trying to build a Windows anti-spam thingy; it's supposed to "sit" in
between the mail client and the outer world, and indicate through mail
headers whether the incoming mail has a valid hashcash
http://www.hashcash.org/ "coin" (and, of course, to automatically add
hashcash to outgoing emails).

My problem is that I don't know what happens with the email in transit
(this, I believe, is an observation in the hashcash FAQ). I am worried that
some mail server might dislike ASCII characters with the high bit set, or
that a client uses some encoding which for some reason doesn't make it to
the destination unchanged.

Hence my question: is there some "approximate" hash function (which I could
use instead of SHA-1) which can verify that a text hashes "very close" to a
value? So that if I change, say, tabs into spaces, I won't get exactly the
same value, but I would get a "good enough"?

I don't know if this is possible. But if it is, I though this would be a
good place to find out about it.

Thanks,
Marcel

-
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]