Re: serial + db key, or guid?

2020-08-15 Thread Mark Phillips
Thanks to everyone who replied. All helpful. I learned and have new ideas to 
work with.

> On Aug 11, 2020, at 10:42 AM, Adam Brusselback  
> wrote:
> 
> I mentioned this in another email thread yesterday about a similar topic, but 
> I'd highly suggest if you do go the UUID route, do not use the standard UUID 
> generation functions, they all suck for database use (v1 also sucks).
> 
> I use: https://pgxn.org/dist/sequential_uuids/ 
>  written by Thomas Vondara (writeup 
> here: https://2ndquadrant.com/en/blog/sequential-uuid-generators/ 
>  )
> 
> I don't mind having a time component correlated with my UUID's because it's 
> simply not a threat model that matters for my use case, so I use the time 
> based variant. It helped me immensely with FPW and 
> write amplification when I switched from V4 UUIDs. It is still not as fast as 
> an int, but it is much much better than random UUIDs.
> 
>  



Re: serial + db key, or guid?

2020-08-11 Thread Adam Brusselback
I mentioned this in another email thread yesterday about a similar topic,
but I'd highly suggest if you do go the UUID route, do not use the standard
UUID generation functions, they all suck for database use (v1 also sucks).

I use: https://pgxn.org/dist/sequential_uuids/ written by Thomas Vondara
(writeup here: https://2ndquadrant.com/en/blog/sequential-uuid-generators/ )

I don't mind having a time component correlated with my UUID's because it's
simply not a threat model that matters for my use case, so I use the time
based variant. It helped me immensely with FPW and
write amplification when I switched from V4 UUIDs. It is still not as fast
as an int, but it is much much better than random UUIDs.


Re: serial + db key, or guid?

2020-08-11 Thread Christopher Browne
On Tue, 11 Aug 2020 at 12:40, Christophe Pettus  wrote:

> > On Aug 11, 2020, at 09:37, Mark Phillips 
> wrote:
> >
> > I posed the question on the chance things had evolved since 2012,
> specifically as it relates to postgres.
>
> The essentials haven't changed.  Keys (such as UUIDs, especially UUID v4)
> that have most of their randomness in the most significant bits can cause
> significant cache hit problems on large indexes.  128 bit keys are usually
> overkill for most applications, unless you need actual *global* uniqueness
> across more than a single database or installation; 64 bit keys are usually
> sufficient.
>

Thus, if performance is highly significant (e.g. - there's lots of data in
the table, and it is heavily read/written) then it may be tempting to use a
sequential value instead because that can be smaller, faster to compare,
and won't trash caches (e.g. - with UUIDs, seemingly adjacent data will be
spread wider across indexes and will need more cache accesses to get to the
data).

If the table is small, or data is infrequently queried/updated, these costs
may be irrelevant.


> UUIDs (and similar very large random keys) do have the advantage that they
> are somewhat self-secure: You can expose them to outsiders without having
> to worry about other keys being guessable.
>

Not overly obvious from this; the "not guessable" part comes in that
chronologically adjacent records won't have any apparent similarity.

With serially assigned transaction IDs, if you, as a user, buy something,
and discover that your transaction ID was 1460795, you might well guess
that other recent purchases were on transactions 1460794, 1460793, and
such, and maybe get at someone else's data by messing with a web URL or
such.   Whereas, here's 5 uuids I just generated (dbus-uuidgen isn't
generating RFC 4122 compliant values, but in context of a little
illustration, who cares?)
$ for i in 1 2 3 4 5; do
\ dbus-uuidgen
\ done
0ff745301515c646498cd1165f32cc6e
a9ca459ab6330f24d24af5095f32cc6e
b1cff235d77b1f4d8504920a5f32cc6e
58773af20b34b3c550f4eebf5f32cc6e
f9a13ce961b28751b102c5545f32cc6e

There are some identical low-order bits, but they are pretty well hidden by
the high-order stuff.

It's somewhat cache-destroying, but not especially well guessable.

There is something which has evolved since 2012; see <
https://www.postgresql.org/docs/9.5/uuid-ossp.html>, notably
F.43.2. Building uuid-ossp

It used to be that adding UUID generator support required the OSSP library,
which was sufficiently inconvenient that this would often not be built-in.
Since 9.4 (released in 2014), uuid-ossp can use common built-in libraries
on Linux, OSX, BSD systems, so it's more likely that it will be included
"out of the box" on package-managed deployments of PostgreSQL.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: serial + db key, or guid?

2020-08-11 Thread Christophe Pettus



> On Aug 11, 2020, at 09:37, Mark Phillips  wrote:
> 
> I posed the question on the chance things had evolved since 2012, 
> specifically as it relates to postgres.

The essentials haven't changed.  Keys (such as UUIDs, especially UUID v4) that 
have most of their randomness in the most significant bits can cause 
significant cache hit problems on large indexes.  128 bit keys are usually 
overkill for most applications, unless you need actual *global* uniqueness 
across more than a single database or installation; 64 bit keys are usually 
sufficient.

UUIDs (and similar very large random keys) do have the advantage that they are 
somewhat self-secure: You can expose them to outsiders without having to worry 
about other keys being guessable.
--
-- Christophe Pettus
   x...@thebuild.com





Re: serial + db key, or guid?

2020-08-11 Thread Mark Phillips
Thank you for the reply.

The article is a good one. I posed the question on the chance things had 
evolved since 2012, specifically as it relates to postgres.

> On Aug 10, 2020, at 3:21 PM, Christophe Pettus  wrote:
> 
> 
> 
>> On Aug 10, 2020, at 15:19, Mark Phillips  wrote:
>> Advice, cautionary tales, suggestions and such will be warmly received.
> 
> Here's one solution a company found for this; it seems to work very well:
> 
>   https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
> 
> --
> -- Christophe Pettus
>   x...@thebuild.com
> 





Re: serial + db key, or guid?

2020-08-10 Thread Christophe Pettus



> On Aug 10, 2020, at 15:19, Mark Phillips  wrote:
> Advice, cautionary tales, suggestions and such will be warmly received.

Here's one solution a company found for this; it seems to work very well:

https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

--
-- Christophe Pettus
   x...@thebuild.com





serial + db key, or guid?

2020-08-10 Thread Mark Phillips
Given four instances of posgres, each with a database, each instance receiving 
new data, and desiring a data “merge” a la BDR or similar multiple database 
solutions, my team has been discussing the pros and cons of generating unique 
keys in each table.

1. create a unique “database” id for each database, add a column to all tables 
for this id value and combine that with a serial id
2. use guid from pg functions
3. create id ranges for each database, e.g. db1 gets 1 to 1M, db2 gets 1M+1 to 
2M, and so on

All get us to the finish line, but each has its drawbacks. 

Advice, cautionary tales, suggestions and such will be warmly received.

 - Mark