On 2017/06/10 6:27 AM, Jens Alfke wrote:
On Jun 9, 2017, at 3:05 PM, Simon Slavin <[email protected]> wrote:

Tangential to SQLite, but there’s little on the list at the moment so perhaps 
some of you might like this.
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439 
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>>
He makes some questionable points, like saying that an ASCII string of hex has 
a “9x cost in size” compared to a binary representation, or that hex strings 
would somehow get larger when converted from ISO-8859-1 to UTF-8.

Just in case people wonder about your assertion that his assertion is wrong - to be specific: UTF8 consumes the exact same space as ASCII when you use only characters from the first block (0..7F) - which is exactly what a UUID uses, at a maximum characters 0..9, A..F or a..f, -, {, and }. Nothing outside of the first Unicode block - though it should be noted that some systems may use Double-byte (16 bit) character representations internally whenever the DB table text *storage* type is set to UTF-anything (which should be noted is not the same thing as the *DB-Interface* type being UTF-anything).

So the blogger's point doesn't hold on that assertion.

Further, a UUID/GUID as per the standard (RFC4122) consists of 128 bit value formatted to present like this 36-character sequence:
xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx

where A is a variant and B is the version of the UUID represented. Variants define different methods of calculation, like whether the MAC address with a time component was used, or a Domain/Namespace based UUID etc. In DB systems we usually use variant 1 (MAC+Time with 100 nanosecond precision) which, unless mechanical failure or intentional deceit, must be unique (i.e. probability for global collision = 0 if created exactly as described and all systems work as designed, and some cosmic ray doesn't hit your processor just right [or is it just wrong?]).

Anyway, about the layout, you can of course simply store the UUID as a 128 bit value (or 2 64-bit INTs - considering you use the exact same variant and version for all your IDs, but this takes processing and you end up with a value that needs to be re-computed before it can be compared to anything outside of your system), or at a minimum remove any dashes and braces, but in reality most people will just plop it as-is into a Text/Varchar field that's been Uniqued and probably PK'd.

In that worst case scenario (all of the UUID plus dashes and braces), the full storage requirement for a UUID would look like this: {xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of ASCII (or UTF-8) text space which totals 38 bytes. Let's be generous and assume the user made VARCHAR(40) provision on an old-style DB which reserves all the bytes, or better yet, a modern one with a length definition that takes a further 32-bit value, so 42 bytes then. Even in this very worst case scenario, the full space requirement for a UUID is a dismal ~2.7 times more than the 16 bytes of space the original 128-bit value consumed. Let's further assume the worst text storage system using DBCS to store 16 bits per character (and nobody really does this), even then we only get to just over 5 times. Where did he get 9 times from?? The typical usage, storing full text UUID minus braces in an ASCII/UTF-8 sequence will result in a hair over 2.3 times[1] the storage of INTs. Not really that bad I think.

I find it fascinating that the number 1 reason to not use UUIDs, and probably the only reason, he never even mentioned. Sheer speed. (He refers sorting speed, but the real gain is look-up speed, which gets compounded in a compound query). In MSSQL I measured almost double the lookup speed using INTs in a PK in stead of VARCHARs (I didn't even use UUIDs, simply 6-character client codes of the form ABC001 etc.).

Where I DO agree with the blogger: Where space is not a big concern, use both UUIDs and INTs locally in your DB, that way it is always scalable, always merge-able with other global data and always fast with the right query.

Cheers,
Ryan

[1] - It's hard to say exactly, most DBs use extra bits/bytes for field specifications, lengths etc, even for the INT fields, so making an exact blanket assertion here about ratio of char vs. int storage is not possible, but the given ratio should be close.

_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to