Re: hashing strings to integers
On 2014-05-27, Steven D'Aprano wrote: > On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote: >> Well, here's the way it works in my mind: >> >>I can store a set of a zillion strings (or a dict with a zillion >>string keys), but every time I test "if new_string in seen_strings", >>the computer hashes the new_string using some kind of "short hash", >>checks the set for matching buckets (I'm assuming this is how python >>tests set membership --- is that right?), > > So far so good. That applies to all objects, not just strings. > > >>then checks any >>hash-matches for string equality. Testing string equality is slower >>than integer equality, and strings (unless they are really short) >>take up a lot more memory than long integers. > > But presumably you have to keep the string around anyway. It's going to > be somewhere, you can't just throw the string away and garbage collect > it. The dict doesn't store a copy of the string, it stores a reference to > it, and extra references don't cost much. In the case where I did something like that, I wasn't keeping copies of the strings in memory after hashing (& otherwise processing them). I know that putting the strings' pointers in the set is a light memory load. [snipping the rest because...] You've convinced me. Thanks. -- I heard that Hans Christian Andersen lifted the title for "The Little Mermaid" off a Red Lobster Menu. [Bucky Katt] -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers
On 2014-05-28, Dan Sommers wrote: > On Tue, 27 May 2014 17:02:50 +, Steven D'Aprano wrote: > >> - rather than "zillions" of them, there are few enough of them that >> the chances of an MD5 collision are insignificant; > >> (Any MD5 collision is going to play havoc with your strategy of >> using hashes as a proxy for the real string.) > >> - and you can arrange matters so that you never need to MD5 hash a >> string twice. > > Hmmm... I'll use the MD5 hashes of the strings as a key, and the > strings as the value (to detect MD5 collisions) ... Hey, I'm not *that* stupid. -- In the 1970s, people began receiving utility bills for -£999,999,996.32 and it became harder to sustain the myth of the infallible electronic brain. (Verity Stob) -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers
On Tue, 27 May 2014 17:02:50 +, Steven D'Aprano wrote: > - rather than "zillions" of them, there are few enough of them that > the chances of an MD5 collision are insignificant; > (Any MD5 collision is going to play havoc with your strategy of > using hashes as a proxy for the real string.) > - and you can arrange matters so that you never need to MD5 hash a > string twice. Hmmm... I'll use the MD5 hashes of the strings as a key, and the strings as the value (to detect MD5 collisions) ... (But I'm sure that Steven was just waiting for someone to take that bait...) -- https://mail.python.org/mailman/listinfo/python-list
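For completeness, Dan's tongue-in-cheek scheme would look something like the sketch below. Note that keeping the strings as values gives up exactly the memory saving that motivated hashing in the first place, which is Steven's point:

```python
import hashlib

seen = {}  # digest -> original string, so any collision is detectable

def already_seen(s):
    """Record s keyed by its MD5 digest; complain loudly on a collision."""
    d = hashlib.md5(s.encode("utf-8")).digest()
    if d in seen:
        if seen[d] != s:
            raise RuntimeError("MD5 collision: %r vs %r" % (seen[d], s))
        return True
    seen[d] = s
    return False
```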
Re: hashing strings to integers
On Wed, May 28, 2014 at 3:02 AM, Steven D'Aprano wrote: > But I know that Python is a high-level language with > lots of high-level data structures like dicts which trade-off time and > memory for programmer convenience, and that I'd want to see some real > benchmarks proving that my application was too slow before giving up that > convenience with a complicated strategy like this. And they trade off that time and memory on the basis of X years of development expertise. A while ago (about ten years or so, now... wow, that's quite a while) I wrote a C++ program that needed an ever-growing array; for simplicity, I went with a very basic system of doubling the size every time, from a base of something like 1024 or 8192. (Note that it was storing and moving around only pointers, so it's comparable to Python's list.) That means it has an average 25% slack space at all times, more until its first allocation, and every now and then it has a huge job of copying a pile of pointers into a new array. (Imagine it's now at 16777216 and it needs to add the 16,777,217th string to the array. Bye-bye CPU caches.) These boundaries became *user-visible pauses*, fortunately at predictable points, but on a not-terrible computer it could cause a >1s pause just copying heaps of pointers. How do you think a Python list will perform, under the same workload (periodic appending of single strings or small groups of strings, very frequent retrieval based on index - probably about a 20:1 read:write ratio)? Not only would it be far more convenient, it's probably going to outperform my hand-rolled code - unless I'm so brilliant (or lucky) that I can stumble to something as good as can be achieved with years of dedicated development. Yeah, I don't think so. Same goes for hashing algorithms. I can at least boast that I've never written one of those... although I have several times written a binary tree of one form or another, in order to provide comparable features to a dict. 
And once again, now that I know how convenient and performant high level languages can be (which may or may not have been true 15 years ago, but I certainly didn't know it), I don't rewrite what can be done better by just tossing in a couple of curly braces and saying "here, map this to that". ChrisA -- https://mail.python.org/mailman/listinfo/python-list
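Chris's overallocation story can be watched directly in CPython: a list grows its backing array in chunks, so the vast majority of appends never reallocate at all. A quick sketch (the exact sizes and growth pattern vary by interpreter build):

```python
import sys

lst = []
sizes = []
for i in range(10_000):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

# Count how often the backing array actually grew during 10,000 appends.
reallocs = sum(1 for a, b in zip(sizes, sizes[1:]) if b != a)
print(f"{len(lst)} appends, ~{reallocs} reallocations")
```

On a typical CPython build the count is well under a hundred, which is the amortized-growth behaviour Chris is describing.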
Re: hashing strings to integers
On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote: > On 2014-05-23, Chris Angelico wrote: > >> On Fri, May 23, 2014 at 8:27 PM, Adam Funk >> wrote: >>> I've also used hashes of strings for other things involving >>> deduplication or fast lookups (because integer equality is faster than >>> string equality). I guess if it's just for deduplication, though, a >>> set of byte arrays is as good as a set of int? >> >> Want a better way to do that? Use a set or dict and let Python do the >> hashing. It'll be every bit as fast as explicit hashing, plus you get >> the bonus of simplicity. > > Well, here's the way it works in my mind: > >I can store a set of a zillion strings (or a dict with a zillion >string keys), but every time I test "if new_string in seen_strings", >the computer hashes the new_string using some kind of "short hash", >checks the set for matching buckets (I'm assuming this is how python >tests set membership --- is that right?), So far so good. That applies to all objects, not just strings. >then checks any >hash-matches for string equality. Testing string equality is slower >than integer equality, and strings (unless they are really short) >take up a lot more memory than long integers. But presumably you have to keep the string around anyway. It's going to be somewhere, you can't just throw the string away and garbage collect it. The dict doesn't store a copy of the string, it stores a reference to it, and extra references don't cost much. As for string equality, in pseudo-code it looks something like this:

def __eq__(self, other):
    # Consider only the case where other is a string as well.
    if self is other:  # object identity
        return True
    if len(self) != len(other):  # checking the length is fast
        return False
    # iterate over both strings in lock-step
    for a, b in zip(self, other):
        if a != b:
            return False
    return True

only written in fast C code.
So the equality test bails out as quickly as possible, and the only time you have to look at every character is if the two strings are equal but not the same object. MD5 hashing, on the other hand, *always* has to look at every character, and perform quite a complicated set of computations. It's expensive. > So for that kind of thing, I tend to store MD5 hashes or something > similar. Is that wrong? First rule of optimization: Don't do it. Second rule of optimization (for experts only): Don't do it yet. You're making the cardinal mistake of optimization: optimizing not what is slow, but what you *assume* will be slow. (Or, substitute memory consumption for speed if you prefer.) Python is not C, and your intuitions for what's fast and slow (low and high memory consumption) are likely to be way off. Consider trying to store an MD5 hash instead of the string "Hello World!". If I try this in Python 2.7, I get this:

py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30

I save a lousy 3 bytes. But in order to save those 3 bytes, I have to generate an md5 object (32 bytes) and a hex string (53 bytes), both of which take time to create and then time to garbage collect. Actually, I don't even save those 3 bytes, since for nearly all reasonable applications, I'll need to keep the string around somewhere. So instead of saving three bytes, it's costing me thirty bytes for the hash, plus who-knows-what needed to manage the bookkeeping to link that hash to the actual string. Ah, but maybe we can save time hashing them?

py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191

Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string, and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate the MD5 hash in the first place, that takes *much* longer:

py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188

but merely to perform a lightweight hash on the long int so as to find which bucket of the dict it belongs to. So, yes, chances are **very** good that your supposed optimizations in this area are actually pessimizations. If you have not measured where the bottlenecks in your code are, you have no idea which parts of the code need to be optimized. Now, it is conceivable that there may be some circumstances where your strategy makes sense. I'm guessing you would need something like this before you even come close to saving memory and/or time:

- rather than short keys like "Hello World!", you are dealing with keys that are typically many tens of thousands of characters long;
- most of the strings are the same length
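For anyone re-running Steven's numbers on Python 3 (where the old md5 module is gone), a rough hashlib equivalent follows; the exact times and object sizes will differ by platform and build, but the relative ordering should hold:

```python
import hashlib
import sys
from timeit import timeit

s = "Hello World!"
n = int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

# The int proxy is barely smaller than the short string it stands for.
print(sys.getsizeof(s), sys.getsizeof(n))

# The bucket hash a dict/set computes costs about the same either way...
t_str = timeit("hash(s)", globals={"s": s}, number=100_000)
t_int = timeit("hash(n)", globals={"n": n}, number=100_000)
# ...while computing the MD5 digest in the first place is far more work.
t_md5 = timeit("hashlib.md5(s.encode()).hexdigest()",
               "import hashlib; s = 'Hello World!'", number=100_000)
print(t_str, t_int, t_md5)
```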
Re: hashing strings to integers
On 2014-05-23, Terry Reedy wrote: > On 5/23/2014 6:27 AM, Adam Funk wrote: > >> that. The only thing that really bugs me in Python 3 is that execfile >> has been removed (I find it useful for testing things interactively). > > The spelling has been changed to exec(open(...).read(), ... . If you use > it a lot, add a customized def execfile(filename, ... to your site > module or local utils module. Are you talking about this? https://docs.python.org/3/library/site.html Is there a dummies/quick-start guide to using USER_SITE stuff? -- No sport is less organized than Calvinball! -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers
On 2014-05-23, Chris Angelico wrote: > On Fri, May 23, 2014 at 8:27 PM, Adam Funk wrote: >> I've also used hashes of strings for other things involving >> deduplication or fast lookups (because integer equality is faster than >> string equality). I guess if it's just for deduplication, though, a >> set of byte arrays is as good as a set of int? > > Want a better way to do that? Use a set or dict and let Python do the > hashing. It'll be every bit as fast as explicit hashing, plus you get > the bonus of simplicity. Well, here's the way it works in my mind: I can store a set of a zillion strings (or a dict with a zillion string keys), but every time I test "if new_string in seen_strings", the computer hashes the new_string using some kind of "short hash", checks the set for matching buckets (I'm assuming this is how python tests set membership --- is that right?), then checks any hash-matches for string equality. Testing string equality is slower than integer equality, and strings (unless they are really short) take up a lot more memory than long integers. So for that kind of thing, I tend to store MD5 hashes or something similar. Is that wrong? -- With the breakdown of the medieval system, the gods of chaos, lunacy, and bad taste gained ascendancy. --- Ignatius J Reilly -- https://mail.python.org/mailman/listinfo/python-list
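Adam's mental model is close to what CPython actually does, and it can be watched in action: a membership test hashes the probe once, and falls back to full string equality only when a stored element's hash matches. A sketch using an instrumented subclass (the `SpyStr` name is invented for illustration):

```python
class SpyStr(str):
    """A str subclass that counts how often the set machinery probes it."""
    hash_calls = 0
    eq_calls = 0

    def __hash__(self):
        SpyStr.hash_calls += 1
        return super().__hash__()

    def __eq__(self, other):
        SpyStr.eq_calls += 1
        return super().__eq__(other)

seen = {SpyStr("alpha"), SpyStr("beta")}
SpyStr.hash_calls = SpyStr.eq_calls = 0

SpyStr("gamma") in seen   # a miss: hashed once, no string comparison at all
print(SpyStr.hash_calls, SpyStr.eq_calls)

SpyStr("alpha") in seen   # a hit: hashed once more, then one full comparison
print(SpyStr.hash_calls, SpyStr.eq_calls)
```

So the "short hash then equality" cost Adam describes is already the worst case; for most lookups the equality step never runs.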
Re: hashing strings to integers
On 5/23/2014 6:27 AM, Adam Funk wrote: that. The only thing that really bugs me in Python 3 is that execfile has been removed (I find it useful for testing things interactively). The spelling has been changed to exec(open(...).read(), ... . If you use it a lot, add a customized def execfile(filename, ... to your site module or local utils module. Execfile was a separate statement *only* because exec was a statement. Once exec was changed to a function taking arguments, that justification disappeared. Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
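A minimal Python 3 shim along the lines Terry suggests might look like this; the name and signature mirror the old builtin, and using compile() keeps the real filename in tracebacks:

```python
def execfile(filename, globals=None, locals=None):
    """Rough stand-in for Python 2's execfile() builtin."""
    if globals is None:
        globals = {}
    with open(filename, "rb") as f:
        # compile() records the filename, so tracebacks point at the script.
        code = compile(f.read(), filename, "exec")
    exec(code, globals, locals)
```

Dropped into a local utils module (or a module on the USER_SITE path), it restores the old interactive workflow.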
Re: hashing strings to integers
On Fri, May 23, 2014 at 8:36 PM, Adam Funk wrote: > BTW, I just tested that & it should be "big" for consistency with the > hexdigest: Yes, it definitely should be parsed big-endianly. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys)
On Fri, May 23, 2014 at 8:27 PM, Adam Funk wrote: > I've also used hashes of strings for other things involving > deduplication or fast lookups (because integer equality is faster than > string equality). I guess if it's just for deduplication, though, a > set of byte arrays is as good as a set of int? Want a better way to do that? Use a set or dict and let Python do the hashing. It'll be every bit as fast as explicit hashing, plus you get the bonus of simplicity. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers
On 2014-05-23, Adam Funk wrote: > On 2014-05-22, Peter Otten wrote: >> In Python 3 there's int.from_bytes() >> > h = hashlib.sha1(b"Hello world") > int.from_bytes(h.digest(), "little") >> 538059071683667711846616050503420899184350089339 > > Excellent, thanks for pointing that out. I've just recently started > using Python 3 instead of 2, & appreciate pointers to new things like > that. BTW, I just tested that & it should be "big" for consistency with the hexdigest: Python 3.3.2+ (default, Feb 28 2014, 00:52:16) [GCC 4.8.1] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import hashlib >>> h0 = hashlib.sha1(bytes('pants', 'UTF-8')).digest() >>> h1 = hashlib.sha1(bytes('pants', 'UTF-8')).hexdigest() >>> int.from_bytes(h0, 'big') 1315090007003469710610607131160586168131917298749 >>> int.from_bytes(h0, 'little') 352462323236431222976527983157432783788229548774 >>> int(h1, 16) 1315090007003469710610607131160586168131917298749 Thanks. -- The kid's a hot prospect. He's got a good head for merchandising, an agent who can take you downtown and one of the best urine samples I've seen in a long time. [Dead Kennedys t-shirt] -- https://mail.python.org/mailman/listinfo/python-list
hashing strings to integers (was: hashing strings to integers for sqlite3 keys)
On 2014-05-22, Peter Otten wrote: > Adam Funk wrote: >> Well, J*v* returns a byte array, so I used to do this: >> >> digester = MessageDigest.getInstance("MD5"); >> ... >> digester.reset(); >> byte[] digest = digester.digest(bytes); >> return new BigInteger(+1, digest); > > In Python 3 there's int.from_bytes() > h = hashlib.sha1(b"Hello world") int.from_bytes(h.digest(), "little") > 538059071683667711846616050503420899184350089339 Excellent, thanks for pointing that out. I've just recently started using Python 3 instead of 2, & appreciate pointers to new things like that. The only thing that really bugs me in Python 3 is that execfile has been removed (I find it useful for testing things interactively). >> I dunno why language designers don't make it easy to get a single big >> number directly out of these things. > > You hardly ever need to manipulate the numerical value of the digest. And on > its way into the database it will be re-serialized anyway. I now agree that my original plan to hash strings for the SQLite3 table was pointless, so I've changed the subject header. :-) I have had good reason to use int hashes in the past, however. I was doing some experiments with Andrei Broder's "sketches of shingles" technique for finding partial duplication between documents, & you need integers for that so you can do modulo arithmetic. I've also used hashes of strings for other things involving deduplication or fast lookups (because integer equality is faster than string equality). I guess if it's just for deduplication, though, a set of byte arrays is as good as a set of int? -- Classical Greek lent itself to the promulgation of a rich culture, indeed, to Western civilization. Computer languages bring us doorbells that chime with thirty-two tunes, alt.sex.bestiality, and Tetris clones. (Stoll 1995) -- https://mail.python.org/mailman/listinfo/python-list
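As a sketch of why integer hashes matter for the Broder-style shingling mentioned above: the modular arithmetic needs numbers, not strings. The parameters here (k=3, the Mersenne prime P) are illustrative choices, not from the thread:

```python
import hashlib

# Illustrative modulus: any sufficiently large prime works.
P = (1 << 61) - 1

def shingle_hashes(text, k=3):
    """Yield an integer hash (mod P) for each k-word shingle of text."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k]).encode("utf-8")
        yield int.from_bytes(hashlib.sha1(shingle).digest(), "big") % P

def resemblance(a, b):
    """Jaccard resemblance of two texts' shingle sets (Broder's measure)."""
    sa, sb = set(shingle_hashes(a)), set(shingle_hashes(b))
    return len(sa & sb) / len(sa | sb)
```

A full sketch implementation would keep only the smallest hashes under several random permutations, but the integer representation is the prerequisite either way.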
Re: hashing strings to integers for sqlite3 keys
Adam Funk wrote: > On 2014-05-22, Chris Angelico wrote: > >> On Thu, May 22, 2014 at 11:54 PM, Adam Funk wrote: > >>> That ties in with a related question I've been wondering about lately >>> (using MD5s & SHAs for other things) --- getting a hash value (which >>> is internally numeric, rather than string, right?) out as a hex string >>> & then converting that to an int looks inefficient to me --- is there >>> any better way to get an int? (I haven't seen any other way in the >>> API.) >> >> I don't know that there is, at least not with hashlib. You might be >> able to use digest() followed by the struct module, but it's no less >> convoluted. It's the same in several other languages' hashing >> functions; the result is a string, not an integer. > > Well, J*v* returns a byte array, so I used to do this: > > digester = MessageDigest.getInstance("MD5"); > ... > digester.reset(); > byte[] digest = digester.digest(bytes); > return new BigInteger(+1, digest); In Python 3 there's int.from_bytes() >>> h = hashlib.sha1(b"Hello world") >>> int.from_bytes(h.digest(), "little") 538059071683667711846616050503420899184350089339 > I dunno why language designers don't make it easy to get a single big > number directly out of these things. You hardly ever need to manipulate the numerical value of the digest. And on its way into the database it will be re-serialized anyway. > I just had a look at the struct module's fearsome documentation & > think it would present a good shoot(self, foot) opportunity. -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On Fri, May 23, 2014 at 12:47 AM, Adam Funk wrote: >> I don't know that there is, at least not with hashlib. You might be >> able to use digest() followed by the struct module, but it's no less >> convoluted. It's the same in several other languages' hashing >> functions; the result is a string, not an integer. > > Well, J*v* returns a byte array... I counted byte arrays along with strings. Whether it's notionally a string of bytes or characters makes no difference - it's not an integer. > I dunno why language designers don't make it easy to get a single big > number directly out of these things. It's probably because these sorts of hashes are usually done on large puddles of memory, to create a smaller puddle of memory. How you interpret the resulting puddle is up to you; maybe you want to think of it as a number, maybe as a string, but really it's just a sequence of bytes. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22, Chris Angelico wrote: > On Thu, May 22, 2014 at 11:54 PM, Adam Funk wrote: >> That ties in with a related question I've been wondering about lately >> (using MD5s & SHAs for other things) --- getting a hash value (which >> is internally numeric, rather than string, right?) out as a hex string >> & then converting that to an int looks inefficient to me --- is there >> any better way to get an int? (I haven't seen any other way in the >> API.) > > I don't know that there is, at least not with hashlib. You might be > able to use digest() followed by the struct module, but it's no less > convoluted. It's the same in several other languages' hashing > functions; the result is a string, not an integer. Well, J*v* returns a byte array, so I used to do this: digester = MessageDigest.getInstance("MD5"); ... digester.reset(); byte[] digest = digester.digest(bytes); return new BigInteger(+1, digest); I dunno why language designers don't make it easy to get a single big number directly out of these things. I just had a look at the struct module's fearsome documentation & think it would present a good shoot(self, foot) opportunity. -- A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on usenet and in e-mail? -- https://mail.python.org/mailman/listinfo/python-list
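The struct route Chris mentions does work for a 16-byte MD5 digest, though Python 3's int.from_bytes (which comes up elsewhere in the thread) is simpler; a sketch showing both give the same value:

```python
import hashlib
import struct

digest = hashlib.md5(b"Hello world").digest()   # 16 raw bytes

# Unpack as two big-endian unsigned 64-bit halves and recombine.
hi, lo = struct.unpack(">QQ", digest)
n = (hi << 64) | lo

# In Python 3, int.from_bytes gives the same integer with less fuss.
assert n == int.from_bytes(digest, "big")
```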
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22, Chris Angelico wrote: > On Thu, May 22, 2014 at 11:41 PM, Adam Funk wrote: >> On further reflection, I think I asked for that. In fact, the table >> I'm using only has one column for the hashes --- I wasn't going to >> store the strings at all in order to save disk space (maybe my mind is >> stuck in the 1980s). > > That's a problem, then, because you will see hash collisions. Maybe > not often, but they definitely will occur if you have enough strings > (look up the birthday paradox - with a 32-bit arbitrarily selected > integer (such as a good crypto hash that you then truncate to 32 > bits), you have a 50% chance of a collision at just 77,000 strings). Ah yes, there's a handy table for that: https://en.wikipedia.org/wiki/Birthday_attack#Mathematics > Do you have enough RAM to hold all the strings directly? Just load 'em > all up into a Python set. Set operations are fast, clean, and easy. > Your already_seen function becomes a simple 'in' check. These days you > can get 16GB or 32GB of RAM in a PC inexpensively enough; with an > average string size of 80 characters, and assuming Python 3.3+, that's > about 128 bytes each - close enough, and a nice figure. 16GB divided > by 128 gives 128M strings - obviously you won't get all of that, but > that's your ball-park. Anything less than, say, a hundred million > strings, and you can dump the lot into memory. Easy! Good point, & since (as I explained in my other post) the substrings are being deduplicated in their own table anyway it's probably not worth bothering with persistence between runs for this bit. -- Some say the world will end in fire; some say in segfaults. [XKCD 312] -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On Thu, 22 May 2014 12:47:31 +0100, Adam Funk wrote: > I'm using Python 3.3 and the sqlite3 module in the standard library. I'm > processing a lot of strings from input files (among other things, values > of headers in e-mail & news messages) and suppressing duplicates using a > table of seen strings in the database. > > It seems to me --- from past experience with other things, where testing > integers for equality is faster than testing strings, as well as from > reading the SQLite3 documentation about INTEGER PRIMARY KEY --- that the > SELECT tests should be faster if I am looking up an INTEGER PRIMARY KEY > value rather than TEXT PRIMARY KEY. Is that right? > > If so, what sort of hashing function should I use? The "maxint" for > SQLite3 is a lot smaller than the size of even MD5 hashes. The only > thing I've thought of so far is to use MD5 or SHA-something modulo the > maxint value. (Security isn't an issue --- i.e., I'm not worried about > someone trying to create a hash collision.) > > Thanks, > Adam Why not just set the field in the DB to be unique & then catch the error when you try to write a duplicate? Let the DB engine handle the task. -- Your step will soil many countries. -- https://mail.python.org/mailman/listinfo/python-list
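A sketch of that suggestion, with the table and column names invented for the example: sqlite3 raises IntegrityError when an insert would violate the primary key, so the duplicate check falls out of the insert itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (header TEXT PRIMARY KEY)")

def already_seen(value):
    """True if value was inserted before; otherwise record it now."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO seen VALUES (?)", (value,))
        return False
    except sqlite3.IntegrityError:
        return True
```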
Re: hashing strings to integers for sqlite3 keys
On Thu, May 22, 2014 at 11:54 PM, Adam Funk wrote: >> >>> from hashlib import sha1 >> >>> s = "Hello world" >> >>> h = sha1(s) >> >>> h.hexdigest() >> '7b502c3a1f48c8609ae212cdfb639dee39673f5e' >> >>> int(h.hexdigest(), 16) >> 703993777145756967576188115661016000849227759454L > > That ties in with a related question I've been wondering about lately > (using MD5s & SHAs for other things) --- getting a hash value (which > is internally numeric, rather than string, right?) out as a hex string > & then converting that to an int looks inefficient to me --- is there > any better way to get an int? (I haven't seen any other way in the > API.) I don't know that there is, at least not with hashlib. You might be able to use digest() followed by the struct module, but it's no less convoluted. It's the same in several other languages' hashing functions; the result is a string, not an integer. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On Thu, May 22, 2014 at 11:41 PM, Adam Funk wrote: > On further reflection, I think I asked for that. In fact, the table > I'm using only has one column for the hashes --- I wasn't going to > store the strings at all in order to save disk space (maybe my mind is > stuck in the 1980s). That's a problem, then, because you will see hash collisions. Maybe not often, but they definitely will occur if you have enough strings (look up the birthday paradox - with a 32-bit arbitrarily selected integer (such as a good crypto hash that you then truncate to 32 bits), you have a 50% chance of a collision at just 77,000 strings). Do you have enough RAM to hold all the strings directly? Just load 'em all up into a Python set. Set operations are fast, clean, and easy. Your already_seen function becomes a simple 'in' check. These days you can get 16GB or 32GB of RAM in a PC inexpensively enough; with an average string size of 80 characters, and assuming Python 3.3+, that's about 128 bytes each - close enough, and a nice figure. 16GB divided by 128 gives 128M strings - obviously you won't get all of that, but that's your ball-park. Anything less than, say, a hundred million strings, and you can dump the lot into memory. Easy! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
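Chris's 77,000 figure can be reproduced from the standard birthday approximation, n ≈ √(2N ln 2) for a 50% collision chance among N equally likely hash values:

```python
import math

def fifty_percent_collision(bits):
    """Roughly how many random values of this bit-width give a 50% collision chance."""
    n_values = 2 ** bits
    return int(math.sqrt(2 * n_values * math.log(2)))

print(fifty_percent_collision(32))    # on the order of 77,000
print(fifty_percent_collision(128))   # why accidental MD5 collisions stay negligible
```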
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22, Tim Chase wrote: > On 2014-05-22 12:47, Adam Funk wrote: >> I'm using Python 3.3 and the sqlite3 module in the standard library. >> I'm processing a lot of strings from input files (among other >> things, values of headers in e-mail & news messages) and suppressing >> duplicates using a table of seen strings in the database. >> >> It seems to me --- from past experience with other things, where >> testing integers for equality is faster than testing strings, as >> well as from reading the SQLite3 documentation about INTEGER >> PRIMARY KEY --- that the SELECT tests should be faster if I am >> looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY >> KEY. Is that right? > > If sqlite can handle the absurd length of a Python long, you *can* do > it as ints: It can't. SQLite3 INTEGER is an 8-byte signed one. https://www.sqlite.org/datatype3.html But after reading the other replies to my question, I've concluded that what I was trying to do is pointless. > >>> from hashlib import sha1 > >>> s = "Hello world" > >>> h = sha1(s) > >>> h.hexdigest() > '7b502c3a1f48c8609ae212cdfb639dee39673f5e' > >>> int(h.hexdigest(), 16) > 703993777145756967576188115661016000849227759454L That ties in with a related question I've been wondering about lately (using MD5s & SHAs for other things) --- getting a hash value (which is internally numeric, rather than string, right?) out as a hex string & then converting that to an int looks inefficient to me --- is there any better way to get an int? (I haven't seen any other way in the API.) -- A firm rule must be imposed upon our nation before it destroys itself. The United States needs some theology and geometry, some taste and decency. I suspect that we are teetering on the edge of the abyss. --- Ignatius J Reilly -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22, Chris Angelico wrote: > On Thu, May 22, 2014 at 9:47 PM, Adam Funk wrote: >> I'm using Python 3.3 and the sqlite3 module in the standard library. >> I'm processing a lot of strings from input files (among other things, >> values of headers in e-mail & news messages) and suppressing >> duplicates using a table of seen strings in the database. >> >> It seems to me --- from past experience with other things, where >> testing integers for equality is faster than testing strings, as well >> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY >> --- that the SELECT tests should be faster if I am looking up an >> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that >> right? > > It might be faster to use an integer primary key, but the possibility > of even a single collision means you can't guarantee uniqueness > without a separate check. I don't know sqlite3 well enough to say, but > based on what I know of PostgreSQL, it's usually best to make your > schema mimic your logical structure, rather than warping it for the > sake of performance. With a good indexing function, the performance of > a textual PK won't be all that much worse than an integral one, and > everything you do will read correctly in the code - no fiddling around > with hashes and collision checks. > > Stick with the TEXT PRIMARY KEY and let the database do the database's > job. If you're processing a really large number of strings, you might > want to consider moving from sqlite3 to PostgreSQL anyway (I've used > psycopg2 quite happily), as you'll get better concurrency; and that > might solve your performance problem as well, as Pg plays very nicely > with caches. Well, actually I'm thinking about doing away with checking for duplicates at this stage, since the substrings that I pick out of the deduplicated header values go into another table as the TEXT PRIMARY KEY anyway, with deduplication there. So I think this stage reeks of premature optimization. 
-- The history of the world is the history of a privileged few. --- Henry Miller -- https://mail.python.org/mailman/listinfo/python-list
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22, Peter Otten wrote: > Adam Funk wrote: > >> I'm using Python 3.3 and the sqlite3 module in the standard library. >> I'm processing a lot of strings from input files (among other things, >> values of headers in e-mail & news messages) and suppressing >> duplicates using a table of seen strings in the database. >> >> It seems to me --- from past experience with other things, where >> testing integers for equality is faster than testing strings, as well >> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY >> --- that the SELECT tests should be faster if I am looking up an >> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that >> right? > > My gut feeling tells me that this would matter more for join operations than > lookup of a value. If you plan to do joins you could use an autoinc integer > as the primary key and an additional string key for lookup. I'm not doing any join operations. I'm using sqlite3 for storing big piles of data & persistence between runs --- not really "proper relational database use". In this particular case, I'm getting header values out of messages & doing this: for this_string in these_strings: if not already_seen(this_string): process(this_string) # ignore if already seen ... > and only if you can demonstrate a significant speedup keep the complication > in your code. > > If you find such a speedup I'd like to see the numbers because this cries > PREMATURE OPTIMIZATION... On further reflection, I think I asked for that. In fact, the table I'm using only has one column for the hashes --- I wasn't going to store the strings at all in order to save disk space (maybe my mind is stuck in the 1980s). -- But the government always tries to coax well-known writers into the Establishment; it makes them feel educated. [Robert Graves] -- https://mail.python.org/mailman/listinfo/python-list
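If persistence between runs turns out not to matter, the whole already_seen check from the loop above collapses to a set lookup; a minimal sketch:

```python
seen_strings = set()

def already_seen(s):
    """Report whether s has been seen before, and remember it either way."""
    if s in seen_strings:
        return True
    seen_strings.add(s)
    return False
```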
Re: hashing strings to integers for sqlite3 keys
On 2014-05-22 12:47, Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other
> things, values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as
> well as from reading the SQLite3 documentation about INTEGER
> PRIMARY KEY --- that the SELECT tests should be faster if I am
> looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY
> KEY. Is that right?

If sqlite can handle the absurd length of a Python long, you *can* do
it as ints:

    >>> from hashlib import sha1
    >>> s = "Hello world"
    >>> h = sha1(s)
    >>> h.hexdigest()
    '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
    >>> int(h.hexdigest(), 16)
    703993777145756967576188115661016000849227759454L

That's a pretty honkin' huge int for a DB key, but you can use it. And
its length is capped regardless of the length of the underlying string.

> If so, what sort of hashing function should I use? The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes. The only
> thing I've thought of so far is to use MD5 or SHA-something modulo
> the maxint value. (Security isn't an issue --- i.e., I'm not
> worried about someone trying to create a hash collision.)

You could truncate that to something like

    >>> int(h.hexdigest()[-8:], 16)

which should give you a 32-bit number that fits in sqlite's int.

-tkc

-- 
https://mail.python.org/mailman/listinfo/python-list
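The interpreter session above is from Python 2 (note the trailing L). On Python 3.3, which the original poster is using, sha1 wants bytes rather than str; a sketch of the same full-digest and truncated variants, with the extra observation (an assumption about SQLite on my part, though documented behaviour) that SQLite's INTEGER is a signed 64-bit value, so up to 15 hex digits of the digest still fit:

```python
from hashlib import sha1

s = "Hello world"
h = sha1(s.encode("utf-8"))  # Python 3: hash bytes, not str

full = int(h.hexdigest(), 16)        # the whole 160-bit digest
small = int(h.hexdigest()[-8:], 16)  # last 32 bits, as suggested above
wide = int(h.hexdigest()[-15:], 16)  # last 60 bits: still within
                                     # SQLite's signed 64-bit INTEGER

assert small < 2 ** 32
assert wide < 2 ** 63
```

Using more bits of the digest shrinks the collision probability considerably compared with the 32-bit truncation, at no storage cost in SQLite.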
Re: hashing strings to integers for sqlite3 keys
On Thu, May 22, 2014 at 9:47 PM, Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
> right?

It might be faster to use an integer primary key, but the possibility
of even a single collision means you can't guarantee uniqueness without
a separate check. I don't know sqlite3 well enough to say, but based on
what I know of PostgreSQL, it's usually best to make your schema mimic
your logical structure, rather than warping it for the sake of
performance. With a good indexing function, the performance of a
textual PK won't be all that much worse than an integral one, and
everything you do will read correctly in the code - no fiddling around
with hashes and collision checks. Stick with the TEXT PRIMARY KEY and
let the database do the database's job.

If you're processing a really large number of strings, you might want
to consider moving from sqlite3 to PostgreSQL anyway (I've used
psycopg2 quite happily), as you'll get better concurrency; and that
might solve your performance problem as well, as Pg plays very nicely
with caches.

ChrisA

-- 
https://mail.python.org/mailman/listinfo/python-list
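Letting the database do the database's job, as suggested here, can be done with INSERT OR IGNORE on a TEXT PRIMARY KEY: the primary-key constraint enforces uniqueness, and the insert is a silent no-op on duplicates. A minimal sketch (the already_seen name and table layout are my inventions, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real file path in practice
conn.execute("CREATE TABLE IF NOT EXISTS seen (s TEXT PRIMARY KEY)")

def already_seen(s):
    # INSERT OR IGNORE does nothing on a duplicate key, and then
    # rowcount is 0; on a genuinely new string it is 1.
    cur = conn.execute("INSERT OR IGNORE INTO seen (s) VALUES (?)", (s,))
    return cur.rowcount == 0
```

No hashing, no collision handling: the stored string itself is the key, which is exactly the "schema mimics logical structure" point made above.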
Re: hashing strings to integers for sqlite3 keys
Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
> right?

My gut feeling tells me that this would matter more for join operations
than lookup of a value. If you plan to do joins you could use an
autoinc integer as the primary key and an additional string key for
lookup.

> If so, what sort of hashing function should I use? The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes. The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value. (Security isn't an issue --- i.e., I'm not worried
> about someone trying to create a hash collision.)

Start with the cheapest operation you can think of,

    md5(s) % MAXINT

or even

    hash(s) % MAXINT  # don't forget to set PYTHONHASHSEED

then compare performance with just s and only if you can demonstrate a
significant speedup keep the complication in your code.

If you find such a speedup I'd like to see the numbers because this
cries PREMATURE OPTIMIZATION...

-- 
https://mail.python.org/mailman/listinfo/python-list
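A rough harness for the measurement Peter suggests might look like the sketch below. It uses in-memory sets rather than sqlite3 so that only the hashing overhead is being compared, and the sample strings are invented for illustration; this is one way to get the numbers, not a benchmark from the thread:

```python
import hashlib
import timeit

# Invented sample data standing in for message headers.
strings = ["Message-ID: <%d@example.com>" % i for i in range(1000)]

as_text = set(strings)
as_ints = {int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % 2 ** 63
           for s in strings}

def lookup_text():
    # Membership tests against the plain strings.
    return sum(s in as_text for s in strings)

def lookup_int():
    # Hash each string first, then test membership of the integer.
    return sum(
        int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % 2 ** 63
        in as_ints
        for s in strings)

print("text lookups:", timeit.timeit(lookup_text, number=100))
print("int lookups: ", timeit.timeit(lookup_int, number=100))
```

Since the integer path has to MD5 every string before it can test membership anyway, one would expect it to lose this comparison, which is the point being made about premature optimization.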
hashing strings to integers for sqlite3 keys
I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other things,
values of headers in e-mail & news messages) and suppressing duplicates
using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as well
as from reading the SQLite3 documentation about INTEGER PRIMARY KEY ---
that the SELECT tests should be faster if I am looking up an INTEGER
PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that right?

If so, what sort of hashing function should I use? The "maxint" for
SQLite3 is a lot smaller than the size of even MD5 hashes. The only
thing I've thought of so far is to use MD5 or SHA-something modulo the
maxint value. (Security isn't an issue --- i.e., I'm not worried about
someone trying to create a hash collision.)

Thanks,
Adam

-- 
"It is the role of librarians to keep government running in difficult
times," replied Dramoren. "Librarians are the last line of defence
against chaos." (McMullen 2001)
-- 
https://mail.python.org/mailman/listinfo/python-list
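The MD5-modulo-maxint idea in the question can be sketched in a few lines. SQLite's "maxint" here is 2**63 - 1, since its INTEGER storage class is a signed 64-bit value; the function name is mine:

```python
import hashlib

SQLITE_MAXINT = 2 ** 63 - 1  # SQLite INTEGER is a signed 64-bit value

def int_key(s):
    # MD5 reduced modulo SQLite's maxint, as proposed in the question.
    # Collisions are astronomically unlikely but not impossible, which
    # is the caveat raised in the replies.
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()
    return int(digest, 16) % SQLITE_MAXINT
```

The result always fits an INTEGER PRIMARY KEY column, but as the replies point out, using it as a proxy for the string trades away guaranteed uniqueness.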