Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 01:41, Richard Baron Penman wrote: I found the MD5 and SHA hashes slow to calculate. Slow? For URLs? Are you kidding? How many URLs per second do you want to calculate? The builtin hash is fast but I was concerned about collisions. What rate of collisions could I expect? MD5

Re: Generate unique ID for URL

2012-11-14 Thread Richard
thanks for perspective! -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 02:39, Roy Smith wrote: The next step is to reduce the number of bits you are encoding. You said in another post that 1 collision in 10 million hashes would be tolerable. So you need: math.log(10*1000*1000, 2) 23.25349666421154 24 bits worth of key. Nope :-) Base64

Re: Generate unique ID for URL

2012-11-14 Thread Dave Angel
On 11/14/2012 06:29 AM, Johannes Bauer wrote: snip When doing these calculations, it's important to keep the birthday paradox in mind (this is kind of counter-intuitive): The chance of a collission raises tremendously when we're looking for *any* arbitrary two hashes colliding within a

Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 13:33, Dave Angel wrote: Te birthday paradox could have been important had the OP stated his goal differently. What he said was: Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be

Generate unique ID for URL

2012-11-13 Thread Richard
Hello, I want to create a URL-safe unique ID for URL's. Currently I use: url_id = base64.urlsafe_b64encode(url) base64.urlsafe_b64encode('docs.python.org/library/uuid.html') 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' I would prefer more concise ID's. What do you recommend? - Compression?

Re: Generate unique ID for URL

2012-11-13 Thread John Gordon
In 0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com Richard richar...@gmail.com writes: I want to create a URL-safe unique ID for URL's. Currently I use: url_id = base64.urlsafe_b64encode(url) base64.urlsafe_b64encode('docs.python.org/library/uuid.html')

Re: Generate unique ID for URL

2012-11-13 Thread Richard
Good point - one way encoding would be fine. Also this is performed millions of times so ideally efficient. On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote: In 0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com Richard richar...@gmail.com writes: I want to

Re: Generate unique ID for URL

2012-11-13 Thread Chris Kaynor
One option would be using a hash. Python's built-in hash, a 32-bit CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, depending on the needs. Higher bit counts will reduce the odds of accidental collisions; cryptographically secure ones if outside attacks matter. In such a case,

Re: Generate unique ID for URL

2012-11-13 Thread Richard Baron Penman
I found the MD5 and SHA hashes slow to calculate. The builtin hash is fast but I was concerned about collisions. What rate of collisions could I expect? Outside attacks not an issue and multiple processes would be used. On Wed, Nov 14, 2012 at 11:26 AM, Chris Kaynor ckay...@zindagigames.com

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:26, schrieb Chris Kaynor: One option would be using a hash. Python's built-in hash, a 32-bit CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, depending on the needs. Higher bit counts will reduce the odds of accidental collisions; cryptographically secure

Re: Generate unique ID for URL

2012-11-13 Thread Richard
These URL ID's would just be used internally for quick lookups, not exposed publicly in a web application. Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable. --

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:41, schrieb Richard Baron Penman: I found the MD5 and SHA hashes slow to calculate. The builtin hash is fast but I was concerned about collisions. What rate of collisions could I expect? Seriously? It takes about 1-5msec to sha1() one MB of data on a modern CPU, 1.5 on my box.

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:50, schrieb Richard: These URL ID's would just be used internally for quick lookups, not exposed publicly in a web application. Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be

Re: Generate unique ID for URL

2012-11-13 Thread Richard
I found md5 / sha 4-5 times slower than hash. And base64 a lot slower. No database or else I would just use their ID. On Wednesday, November 14, 2012 11:59:55 AM UTC+11, Christian Heimes wrote: Am 14.11.2012 01:41, schrieb Richard Baron Penman: I found the MD5 and SHA hashes slow to

Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article 0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com, Richard richar...@gmail.com wrote: Hello, I want to create a URL-safe unique ID for URL's. Currently I use: url_id = base64.urlsafe_b64encode(url) base64.urlsafe_b64encode('docs.python.org/library/uuid.html')

Re: Generate unique ID for URL

2012-11-13 Thread Richard
I am dealing with URL's rather than integers -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-13 Thread Richard
So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. For example to store data at URL abc: a/b/c/index.html This data is also viewed

Re: Generate unique ID for URL

2012-11-13 Thread Richard
The next step is to reduce the number of bits you are encoding. You said in another post that 1 collision in 10 million hashes would be tolerable. So you need: math.log(10*1000*1000, 2) 23.25349666421154 I think a difficulty would be finding a hash algorithm that maps

Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article 1ce88f36-bfc7-4a55-89f8-70d1645d2...@googlegroups.com, Richard richar...@gmail.com wrote: So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. I can't store the files in a single directory because of OS limitations so have been using a

Re: Generate unique ID for URL

2012-11-13 Thread Richard
thanks for pointer to Varnish. I found MongoDB had a lot of size overhead so that it ended up using 4x the data stored. -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-13 Thread Chris Angelico
On Wed, Nov 14, 2012 at 2:25 PM, Richard richar...@gmail.com wrote: So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. For example

Re: Generate unique ID for URL

2012-11-13 Thread Richard
yeah good point - I have gone with md5 for now. On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote: On Wed, Nov 14, 2012 at 2:25 PM, Richard richar...@gmail.com wrote: So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL.