Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 01:41, Richard Baron Penman wrote:
 I found the MD5 and SHA hashes slow to calculate.

Slow? For URLs? Are you kidding? How many URLs per second do you want to
calculate?

 The builtin hash is fast but I was concerned about collisions. What
 rate of collisions could I expect?

MD5 has 16 bytes (128 bit), SHA1 has 20 bytes (160 bit). Using the
birthday paradox and some approximations, I can tell you that when using
the full MD5 you'd need around 2.609e16 hashes in the same namespace to
get a one-in-a-million chance of a collision. That is, about 26
quadrillion filenames.

For SHA1 this number rises even further: you'd need around 1.71e21
hashes in one namespace for the one-in-a-million.

I really have no clue how many URLs you want to hash, but it seems to
be LOTS, since the speed of MD5 is an issue for you. Let me assume you
want to calculate a million hashes per second: with MD5, you'd then
have about 827 years before the namespace fills up enough to reach the
one-in-a-million.
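For reference, these figures follow from the standard birthday-bound approximation; a small sketch (the function name is mine, not from this thread):

```python
import math

def hashes_for_collision_prob(bits, p):
    """Approximate number of random hashes needed in a 2**bits namespace
    for a probability p of at least one collision.

    Uses the birthday-bound approximation p ~= n**2 / (2 * 2**bits),
    which is accurate while p is small."""
    return math.sqrt(2 * (2.0 ** bits) * p)

print("%.3e" % hashes_for_collision_prob(128, 1e-6))  # MD5:   ~2.609e+16
print("%.3e" % hashes_for_collision_prob(160, 1e-6))  # SHA-1: ~1.710e+21
```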

If you need even more hashes (say a million million per second), I'd
suggest you go with SHA-1, giving you 54 years to get the one-in-a-million.

Then again, if you went for a million million hashes per second, Python
would probably not be the language of your choice.

Best regards,
Johannes

-- 
 Where exactly did you predict the earthquake again?
 At least not publicly!
Ah, the newest and to this day most ingenious trick of our great
cosmologist: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Generate unique ID for URL

2012-11-14 Thread Richard
thanks for perspective!


Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 02:39, Roy Smith wrote:

 The next step is to reduce the number of bits you are encoding.  You 
 said in another post that 1 collision in 10 million hashes would be 
 tolerable.  So you need:
 
 math.log(10*1000*1000, 2)
 23.25349666421154
 
 24 bits worth of key. 

Nope :-)

 Base64 encoded, that's only 4 characters.  
 Actually, I probably just proved that I don't really understand how 
 probabilities work, so maybe what you really need is 32 or 48 or 64 
 bits.

:-))

When doing these calculations, it's important to keep the birthday
paradox in mind (this is kind of counter-intuitive): the chance of a
collision rises dramatically when we're looking for *any* arbitrary
two hashes colliding within a certain namespace. The probability you've
calculated is the pre-image probability (which you also need to
multiply by a factor of two, because when trying to collide one given
hash, in the mean case you'll only have to search *half* the namespace
before finding a collision).

There are three things you need to know before you can give an estimate:
1. The permissible probability of a collision (1e-7 in this case)
2. The hash size
3. The worst-case number of elements in the namespace

You neglected 3 completely -- but knowing it is really important. This
becomes obvious when looking at the extreme cases: say you have a hash
of arbitrary size but only ever hash one element; the chance of a
collision is *always* zero. Or take a hash with 2^n possible values and
put 2^n + 1 elements into the namespace; the chance of a collision is
*always* one.

Doing the calculations (the formulas can be found in the Wikipedia
article on the birthday problem), you can come up with the following
bit lengths of the hash for a 1e-7 probability of collision, with
respect to the worst-case number of elements:

10k elements: 49 bit
100k elements: 56 bit
1e6 elements: 63 bit
100e6 elements: 76 bit
1e9 elements: 83 bit
1e12 elements: 102 bit
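The table can be reproduced by inverting the same birthday-bound approximation; a sketch (the function name is mine):

```python
import math

def bits_needed(n, p=1e-7):
    # invert the birthday bound p ~= n**2 / 2**(b+1) for the bit length b
    return math.ceil(math.log2(n * n / (2 * p)))

for n in (10**4, 10**5, 10**6, 10**8, 10**9, 10**12):
    print("%d elements: %d bit" % (n, bits_needed(n)))
```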

Best regards,
Johannes



Re: Generate unique ID for URL

2012-11-14 Thread Dave Angel
On 11/14/2012 06:29 AM, Johannes Bauer wrote:
 snip

 When doing these calculations, it's important to keep the birthday
 paradox in mind (this is kind of counter-intuitive): the chance of a
 collision rises dramatically when we're looking for *any* arbitrary
 two hashes colliding within a certain namespace. The probability you've
 calculated is the pre-image probability (which you also need to
 multiply by a factor of two, because when trying to collide one given
 hash, in the mean case you'll only have to search *half* the namespace
 before finding a collision).
 snip

The birthday paradox could have been important had the OP stated his goal
differently.  What he said was:

Ideally I would want to avoid collisions altogether. But if that means 
significant extra CPU time then 1 collision in 10 million hashes would be 
tolerable.

That means that he's willing to do the necessary overhead of collision
resolution, once in every 10 million lookups.  That's not the same as
saying that he wants only one chance in 10 million of having ANY
collisions among his data items.



-- 

DaveA



Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 13:33, Dave Angel wrote:

 Te birthday paradox could have been important had the OP stated his goal
 differently.  What he said was:
 
 Ideally I would want to avoid collisions altogether. But if that means 
 significant extra CPU time then 1 collision in 10 million hashes would be 
 tolerable.
 
 That means that he's willing to do the necessary overhead of collision
 resolution, once in every 10 million lookups.  That's not the same as
 saying that he wants only one chance in 10 million of having ANY
 collisions among his data items.

Since he stated in a later post that he actually went with MD5, the
calculations are indeed relevant. They give the number of bits a perfect
hash needs to have in order to reach the desired low probability of
collision resolutions. And for that, the birthday-paradox probability
must be considered instead of the (much lower) pre-image probability.

In any case, it appeared to me as if the OP was rather looking for ideas
and wasn't sure himself what approach to take -- so I find it quite
appropriate to give suggestions one way or another (even if they might
not fit the exact phrasing of one of his postings).

Best regards,
Johannes



Re: Generate unique ID for URL

2012-11-13 Thread John Gordon
In 0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com Richard 
richar...@gmail.com writes:

 I want to create a URL-safe unique ID for URL's.
 Currently I use:
 url_id = base64.urlsafe_b64encode(url)

  >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
  'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'

 I would prefer more concise ID's. 
 What do you recommend? - Compression?

Does the ID need to contain all the information necessary to recreate the
original URL?

-- 
John Gordon   A is for Amy, who fell down the stairs
gor...@panix.com  B is for Basil, assaulted by bears
-- Edward Gorey, The Gashlycrumb Tinies



Re: Generate unique ID for URL

2012-11-13 Thread Richard
Good point - one way encoding would be fine.

Also this is performed millions of times so ideally efficient.


On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote:
 Does the ID need to contain all the information necessary to recreate the
 original URL?



Re: Generate unique ID for URL

2012-11-13 Thread Chris Kaynor
One option would be using a hash: Python's built-in hash(), a 32-bit
CRC, 128-bit MD5, 256-bit SHA, or one of the many others that exist,
depending on your needs. Higher bit counts reduce the odds of
accidental collisions; use a cryptographically secure one if outside
attacks matter. In that case, you'd have to roll your own means of
mapping the hash back to the original string if you ever need that for
debugging, and there is always the possibility of collisions. A
similar solution would be a pseudo-random GUID using the URL as
the seed.

You could use a counter if all IDs are generated by a single process
(and even in other cases with some work).

If you want to be able to go both ways, using base64 encoding is
probably your best bet, though you might get benefits by using
compression.
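A rough sketch of those options (note that on CPython, `hash()` of str/bytes is randomized per process since 3.3, so it is not stable between runs or machines):

```python
import hashlib
import zlib

url = b"docs.python.org/library/uuid.html"

ids = {
    "builtin": hash(url),                       # fast, but randomized per process
    "crc32":   zlib.crc32(url),                 # 32-bit checksum
    "md5":     hashlib.md5(url).hexdigest(),    # 128-bit
    "sha256":  hashlib.sha256(url).hexdigest(), # 256-bit, cryptographic
}
print(ids["md5"])
```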
Chris


On Tue, Nov 13, 2012 at 3:56 PM, Richard richar...@gmail.com wrote:
 Good point - one way encoding would be fine.

 Also this is performed millions of times so ideally efficient.


 snip


Re: Generate unique ID for URL

2012-11-13 Thread Richard Baron Penman
I found the MD5 and SHA hashes slow to calculate.
The builtin hash is fast but I was concerned about collisions. What
rate of collisions could I expect?

Outside attacks not an issue and multiple processes would be used.


On Wed, Nov 14, 2012 at 11:26 AM, Chris Kaynor ckay...@zindagigames.com wrote:
 One option would be using a hash. Python's built-in hash, a 32-bit
 CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist,
 depending on the needs. Higher bit counts will reduce the odds of
 accidental collisions; cryptographically secure ones if outside
 attacks matter. In such a case, you'd have to roll your own means of
 converting the hash back into the string if you ever need it for
 debugging, and there is always the possibility of collisions. A
 similar solution would be using a pseudo-random GUID using the url as
 the seed.

 You could use a counter if all IDs are generated by a single process
 (and even in other cases with some work).

 If you want to be able to go both ways, using base64 encoding is
 probably your best bet, though you might get benefits by using
 compression.
 Chris


 snip


Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
On 14.11.2012 01:26, Chris Kaynor wrote:
 One option would be using a hash. Python's built-in hash, a 32-bit
 CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist,
 depending on the needs. Higher bit counts will reduce the odds of
 accidental collisions; cryptographically secure ones if outside
 attacks matter. In such a case, you'd have to roll your own means of
 converting the hash back into the string if you ever need it for
 debugging, and there is always the possibility of collisions. A
 similar solution would be using a pseudo-random GUID using the url as
 the seed.

A hash is the wrong answer to the issue, as a plain hash is open to
attack vectors like length-extension attacks. If Richard needs to
ensure any kind of collision resistance, then he needs a MAC, for
example an HMAC with a secret key.

If he needs some kind of persistent identifier, then something like a
URN or DOI may be a better answer.

Christian


Re: Generate unique ID for URL

2012-11-13 Thread Richard
These URL ID's would just be used internally for quick lookups, not exposed 
publicly in a web application.

Ideally I would want to avoid collisions altogether. But if that means 
significant extra CPU time then 1 collision in 10 million hashes would be 
tolerable.


Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
On 14.11.2012 01:41, Richard Baron Penman wrote:
 I found the MD5 and SHA hashes slow to calculate.
 The builtin hash is fast but I was concerned about collisions. What
 rate of collisions could I expect?

Seriously? It takes about 1-5 ms to sha1() one MB of data on a modern
CPU, 1.5 ms on my box. The OpenSSL-backed variants of Python's hash
code release the GIL, so you can use the power of all cores.
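That claim is easy to check with `timeit` (numbers will of course vary by machine):

```python
import hashlib
import timeit

data = b"x" * 2**20  # 1 MB of data
per_call = timeit.timeit(lambda: hashlib.sha1(data).digest(), number=100) / 100
print("sha1 of 1 MB: %.2f ms" % (per_call * 1000))
```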
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
On 14.11.2012 01:50, Richard wrote:
 These URL ID's would just be used internally for quick lookups, not exposed 
 publicly in a web application.
 
 Ideally I would want to avoid collisions altogether. But if that means 
 significant extra CPU time then 1 collision in 10 million hashes would be 
 tolerable.

Are you storing the URLs in any kind of database, like a SQL database?
A proper index on the data column will avoid full table scans, giving
you almost O(1) complexity on lookups and O(n) worst-case complexity
for collisions.
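As a sketch of that suggestion with sqlite3 (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, content BLOB)")
# the index turns URL lookups into indexed probes instead of full scans
conn.execute("CREATE UNIQUE INDEX idx_pages_url ON pages (url)")
conn.execute("INSERT INTO pages VALUES (?, ?)",
             ("docs.python.org/library/uuid.html", b"<html>...</html>"))
row = conn.execute("SELECT content FROM pages WHERE url = ?",
                   ("docs.python.org/library/uuid.html",)).fetchone()
print(row[0])
```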




Re: Generate unique ID for URL

2012-11-13 Thread Richard
I found md5 / sha 4-5 times slower than hash. And base64 a lot slower.

No database or else I would just use their ID.


On Wednesday, November 14, 2012 11:59:55 AM UTC+11, Christian Heimes wrote:
 snip


Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article 0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com,
 Richard richar...@gmail.com wrote:

 Hello,
 
 I want to create a URL-safe unique ID for URL's.
 Currently I use:
 url_id = base64.urlsafe_b64encode(url)
 
  base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'
 
 I would prefer more concise ID's. 
 What do you recommend? - Compression?

If you're generating random id strings, there are only two ways to make 
them shorter.  Either encode fewer bits of information, or encode them 
more compactly.

Let's start with the second one.  You're already using base64, so you're 
getting 6 bits per character.  You can do a little better than that, but 
not much.  The set of URL-safe characters is the 96-ish printable ascii 
set, minus a few pieces of punctuation.  Maybe you could get it up to 
6.3 or 6.4 bits per character, but that's about it.  For the complexity 
this would add it's probably not worth it.

The next step is to reduce the number of bits you are encoding.  You 
said in another post that 1 collision in 10 million hashes would be 
tolerable.  So you need:

>>> math.log(10*1000*1000, 2)
23.25349666421154

24 bits worth of key.  Base64 encoded, that's only 4 characters.  
Actually, I probably just proved that I don't really understand how 
probabilities work, so maybe what you really need is 32 or 48 or 64 
bits.  Certainly not the 264 bits you're encoding with your example 
above.

So, something like:

import base64
import hashlib

hash = hashlib.md5('docs.python.org/library/uuid.html'.encode()).digest()
hash64 = base64.urlsafe_b64encode(hash)
id = hash64[:8]  # or 12, or whatever

But, I still don't really understand your use case.  You've already 
mentioned the following requirements:

just be used internally for quick lookups, not exposed publicly
URL-safe
unique
1 collision in 10 million hashes would be tolerable
one way encoding would be fine
performed millions of times so ideally efficient

but haven't really explained what it is that you're trying to do.

If they're not going to be exposed publicly, why do you care if they're 
URL-safe?

What's wrong with just using the URLs directly as dictionary keys and 
not worrying about it until you've got some hard data showing that this 
is not sufficient?


Re: Generate unique ID for URL

2012-11-13 Thread Richard
I am dealing with URL's rather than integers


Re: Generate unique ID for URL

2012-11-13 Thread Richard
So the use case - I'm storing webpages on disk and want a quick retrieval 
system based on URL.
I can't store the files in a single directory because of OS limitations, so I 
have been using a sub-folder structure.
For example, data for the URL abc is stored at a/b/c/index.html.
This data is also viewed locally through a web app.

If you can suggest a better approach I would welcome it. 
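One common variant of this scheme shards on a prefix of the URL's digest rather than on the URL's own characters, which bounds directory sizes regardless of URL shape. A sketch (`url_to_path` and the layout are my own illustration, not what the OP uses):

```python
import hashlib
import os

def url_to_path(url, root="cache", depth=2):
    # shard on the first hex characters of the digest so that no single
    # directory accumulates too many files
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = [digest[i] for i in range(depth)]
    return os.path.join(root, *parts, digest + ".html")

print(url_to_path("docs.python.org/library/uuid.html"))
```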


Re: Generate unique ID for URL

2012-11-13 Thread Richard
 The next step is to reduce the number of bits you are encoding.  You
 said in another post that 1 collision in 10 million hashes would be
 tolerable.  So you need:

 >>> math.log(10*1000*1000, 2)
 23.25349666421154


I think a difficulty would be finding a hash algorithm that maps evenly across 
those bits.


Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article 1ce88f36-bfc7-4a55-89f8-70d1645d2...@googlegroups.com,
 Richard richar...@gmail.com wrote:

 So the use case - I'm storing webpages on disk and want a quick retrieval 
 system based on URL. 
 I can't store the files in a single directory because of OS limitations so 
 have been using a sub folder structure.
 For example to store data at URL abc: a/b/c/index.html
 This data is also viewed locally through a web app.
 
 If you can suggest a better approach I would welcome it. 

Ah, so basically, you're reinventing Varnish?

Maybe do what Varnish (and MongoDB, and a few other things) do?  Bypass 
the file system entirely.  Juar mmap() a chunk of memory large enough to 
hold everything and let the OS figure out how to page things to disk.


Re: Generate unique ID for URL

2012-11-13 Thread Richard
thanks for pointer to Varnish. 

I found MongoDB had a lot of storage overhead, so it ended up using about 
4x the size of the data stored.


Re: Generate unique ID for URL

2012-11-13 Thread Chris Angelico
On Wed, Nov 14, 2012 at 2:25 PM, Richard richar...@gmail.com wrote:
 So the use case - I'm storing webpages on disk and want a quick retrieval 
 system based on URL.
 I can't store the files in a single directory because of OS limitations so 
 have been using a sub folder structure.
 For example to store data at URL abc: a/b/c/index.html
 This data is also viewed locally through a web app.

 If you can suggest a better approach I would welcome it.

The cost of a crypto hash on the URL will be completely dwarfed by the
cost of storing/retrieving on disk. You could probably do some
arithmetic and figure out exactly how many URLs (at an average length
of, say, 100 bytes) you can hash in the time of one disk seek.
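Rough arithmetic along those lines, assuming a ~10 ms seek for a spinning disk (both numbers are machine-dependent):

```python
import hashlib
import timeit

url = b"x" * 100  # stand-in for a ~100-byte URL
per_hash = timeit.timeit(lambda: hashlib.md5(url).digest(),
                         number=100000) / 100000
seek_time = 0.010  # assumed rotational-disk seek time
print("about %d MD5 hashes per disk seek" % (seek_time / per_hash))
```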

ChrisA


Re: Generate unique ID for URL

2012-11-13 Thread Richard
yeah good point - I have gone with md5 for now.


On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote:
 snip