Re: Sequence vs UUID

2023-02-09 Thread Merlin Moncure
On Wed, Feb 8, 2023 at 5:33 AM Peter J. Holzer  wrote:

> On 2023-02-08 14:48:03 +0530, veem v wrote:
> > So wanted to know from experts here, is there really exists any scenario
> in
> > which UUID really cant be avoided?
>
> Probably not. The question is usually not "is this possible" but "does
> this meet the requirements at acceptable cost".
>
>
> > Sequence Number = n*d+m+offset. Where n is the sequence order number, d
> is the
> > dimensions of the multi-master replication, m ranges from 0 to n-1 is the
> > number assigned to each node in the replication, and offset is the
> number to
> > offset the sequence numbers.
>
> Yes, you can do this. In fact, people (including me) have already done
> this.
>
> But it's relatively easy to mess this up:
>
> Firstly, you have to make sure that d is larger than your number of
> (active) replicas will ever be, but still small enough that you will
> never overflow. Probably not a problem with 64 bit sequences (if you set
> d to 1E6, you can still count to 9E12 on each node), but might be a
> problem if you are for some reason limited to 32 bits.
>
> Secondly (and IMHO more importantly) you have to make sure each node
> gets its own unique offset. So this needs to be ensured during
> deployment, but also during migrations, restores from backups and other
> infrequent events.


??  All you have to do is ensure each node has its own unique id, and that
id is involved in sequence generation.  This has to be done for reasons
other than id generation anyway, and is a zero-effort, zero-risk process.

The id would then contain the identifier of the node that *generated* the
id, rather than the node that contains the id.  This is exactly analogous
to strategies that use the MAC address as part of an id prefix, for
example.  Once generated, it's known unique and you don't have to consider
anything.  This is exactly what I do, and there is no interaction with
backups, deployments, migrations, etc.  Node expansion does require that
each node be assigned a unique node id, and that's it.
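The approach described here (a node id baked into every generated key) can be sketched in a few lines. This is a hypothetical Python illustration, not anything from the thread; the 10-bit node-id width and the bit layout are assumptions:

```python
# Hypothetical sketch: embed the generating node's id in every key.
NODE_BITS = 10  # assumption: room for up to 1024 nodes

def make_id(node_id: int, local_seq: int) -> int:
    """Pack the node id into the low bits of a 64-bit key."""
    assert 0 <= node_id < (1 << NODE_BITS)
    return (local_seq << NODE_BITS) | node_id

# Two nodes can never collide, whatever their local counters do.
a = make_id(node_id=1, local_seq=100)
b = make_id(node_id=2, local_seq=100)
assert a != b
# The generating node is recoverable from the key itself.
assert a % (1 << NODE_BITS) == 1
```

Because uniqueness is decided at generation time, nothing downstream (backups, restores, migrations) can create a duplicate.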

merlin


Re: Sequence vs UUID

2023-02-08 Thread Miles Elam
On Wed, Feb 8, 2023 at 11:56 AM Kirk Wolak  wrote:
>
> CREATE FUNCTION generate_ulid() RETURNS uuid
> LANGUAGE sql
> RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * 
> (1000000)::numeric)))::bigint), 14, '0'::text)
>   || encode(gen_random_bytes(9), 'hex'::text)))::uuid;

You can save yourself some CPU by skipping the extra cast, omitting
the lpad, to_hex, and floor, and just grabbing the bytes from the
bigint directly along with the random part since bytea can be
concatenated.

SELECT encode(int8send((EXTRACT(epoch FROM clock_timestamp()) *
1000000)::bigint)
  || gen_random_bytes(8), 'hex')::uuid
;
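For readers following along outside SQL, the byte layout this query builds (8 big-endian timestamp bytes followed by 8 random bytes) can be mimicked in Python. This is an illustrative sketch, not the Postgres implementation, and the function name is made up:

```python
import os
import struct
import time
import uuid

def ulid_like_uuid() -> uuid.UUID:
    # 8 big-endian bytes of the microsecond epoch (what int8send yields),
    # followed by 8 random bytes: exactly 16 bytes, i.e. one uuid.
    micros = int(time.time() * 1_000_000)
    return uuid.UUID(bytes=struct.pack(">q", micros) + os.urandom(8))

u = ulid_like_uuid()
# The timestamp is recoverable from the first 8 bytes.
ts = struct.unpack(">q", u.bytes[:8])[0]
```

Because the timestamp occupies the most significant bytes, later values compare greater, which is what gives the index its locality.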

Note that you refer to it as a ULID, but it is stored as a UUID.
Hopefully nothing downstream ever relies on UUID versioning/spec
compliance. Now that I think of it, I could probably speed up my
tagged interval UUID implementation using some of this at the expense
of configurability.




Re: Sequence vs UUID

2023-02-08 Thread Kirk Wolak
On Wed, Feb 8, 2023 at 4:18 AM veem v  wrote:

> Thank you So much all for such valuable feedback.
> ..
> So wanted to know from experts here, is there really exists any scenario
> in which UUID really cant be avoided?
>
Funny you are asking about this.  My recent experience is that UUIDs
really get crushed on performance in medium (> 5 million rows) tables.
I found an article by Dave Allie on ULID, and I modified his implementation
to create a timestamp(6) (microsecond-level) sequenced version.

I'm doing an article on this soon.  But WITHOUT calling "gen_random_bytes"
I can generate 2 timestamps at the same microsecond level.
Once that call is included in the function, I've never come close to
returning 2 timestamps at the same microsecond, although I did not
run this on multiple threads.  This fit our needs for an efficient
UUID-formatted key...

9 Bytes (18 Hex Digits) of Randomness at the far right.

Oh, and some time after the year 10,000 you will get some wrap around...
But I expect 256 bit UUIDs will take over before then.


CREATE FUNCTION generate_ulid() RETURNS uuid
LANGUAGE sql
RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) *
(1000000)::numeric)))::bigint), 14, '0'::text)

  || encode(gen_random_bytes(9), 'hex'::text)))::uuid;
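As a cross-check of the layout above (14 zero-padded hex digits of microseconds, then 18 hex digits of randomness, 32 hex digits in total), here is a hedged Python sketch of the same construction; the function name is hypothetical:

```python
import os
import time
import uuid

def generate_ulid_py() -> uuid.UUID:
    # 14 hex digits (56 bits) of microseconds since the epoch, zero-padded,
    # plus 18 hex digits (9 random bytes): 32 hex digits, one uuid.
    ts_hex = format(int(time.time() * 1_000_000), "014x")
    return uuid.UUID(hex=ts_hex + os.urandom(9).hex())

u1, u2 = generate_ulid_py(), generate_ulid_py()
# Later values sort higher unless both fall in the same microsecond.
assert u1 < u2 or u1.hex[:14] == u2.hex[:14]
```

The 56-bit timestamp field is what bounds the eventual wrap-around the author mentions.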


Re: Sequence vs UUID

2023-02-08 Thread Peter J. Holzer
On 2023-02-08 14:48:03 +0530, veem v wrote:
> So wanted to know from experts here, is there really exists any scenario in
> which UUID really cant be avoided?

Probably not. The question is usually not "is this possible" but "does
this meet the requirements at acceptable cost".


> Sequence Number = n*d+m+offset. Where n is the sequence order number, d is the
> dimensions of the multi-master replication, m ranges from 0 to n-1 is the
> number assigned to each node in the replication, and offset is the number to
> offset the sequence numbers. 

Yes, you can do this. In fact, people (including me) have already done
this.

But it's relatively easy to mess this up:

Firstly, you have to make sure that d is larger than your number of
(active) replicas will ever be, but still small enough that you will
never overflow. Probably not a problem with 64 bit sequences (if you set
d to 1E6, you can still count to 9E12 on each node), but might be a
problem if you are for some reason limited to 32 bits.

Secondly (and IMHO more importantly) you have to make sure each node
gets its own unique offset. So this needs to be ensured during
deployment, but also during migrations, restores from backups and other
infrequent events.

With random Ids you don't have to worry about this.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: Sequence vs UUID

2023-02-08 Thread veem v
Thank you So much all for such valuable feedback.

As "Julian" suggested, I also tried to test the INSERT independently (as
in the test case below), without keeping the "generate_series" in the
inline query. But in all cases the sequence performs better than both UUID
v4 and UUID v7, and the same holds for the index access path, i.e. an
index on a sequence column performs better than an index on a UUID column.
So I believe it's understood that, performance-wise, a sequence is better
than even a sequential UUID (like UUID v7). It seems that even though UUID
v7 maintains better locality because of its sequential nature, the
performance hit may still be tied to the higher space/memory consumption
of a UUID, given its bigger size compared to a sequence.

But I saw a few comments in this discussion regarding scenarios that
mandate the usage of UUIDs, like "multi-master replication", "sharding",
"need to be able to move data between databases", etc. So I wanted to
understand: why can't we use sequences as the PK in those scenarios? For
example, in the case of multi-master replication we can use a sequence as
below.

So I wanted to know from the experts here: does there really exist any
scenario in which a UUID truly can't be avoided?

Sequence Number = n*d + m + offset, where n is the sequence order number,
d is the number of nodes in the multi-master replication, m (ranging from
0 to d-1) is the number assigned to each node in the replication, and
offset is the number by which the sequence numbers are offset.

For a 4-way multi-master replication, d=4, m is in (0, 1, 2, 3), and
offset is 100.
Node #1 (m=0) :Sequence number = n*4+100
Node #2 (m=1): Sequence number = n*4+101
Node #3 (m=2): Sequence number = n*4+102
Node #4 (m=3): Sequence number = n*4+103

Each node's sequence will be:
100, 104, 108, 112, 116, 120, ...
101, 105, 109, 113, 117, 121, ...
102, 106, 110, 114, 118, 122, ...
103, 107, 111, 115, 119, 123, ...
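The interleaving can be verified mechanically. A small Python sketch of the same arithmetic (d and offset as above) shows the four streams never overlap:

```python
# Sketch of the interleaved sequences above: d=4 nodes, offset=100.
d, offset = 4, 100

def seq(node_m: int, count: int) -> list[int]:
    # Sequence Number = n*d + m + offset for n = 0, 1, 2, ...
    return [n * d + node_m + offset for n in range(count)]

streams = [seq(m, 6) for m in range(d)]
assert streams[0] == [100, 104, 108, 112, 116, 120]

# No id is ever produced by two different nodes.
all_ids = [i for s in streams for i in s]
assert len(all_ids) == len(set(all_ids))
```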

* Test case *

CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_bigint_1 ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid_1 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7_1 ( id uuid PRIMARY KEY);
create sequence myseq cache 32767;

*** Insert Test***
INSERT INTO test_bigint select nextval('myseq') from
generate_series(1,100000);
INSERT INTO test_uuid select gen_random_uuid() from
generate_series(1,100000);
INSERT INTO test_uuid7 select uuid_generate_v7() from
generate_series(1,100000);


explain analyze INSERT INTO test_bigint_1 select id from test_bigint;

EXPLAIN
QUERY PLAN
Insert on test_bigint_1 (cost=0.00..1444.18 rows=0 width=0) (actual
time=220.689..220.690 rows=0 loops=1)
-> Seq Scan on test_bigint (cost=0.00..1444.18 rows=100118 width=8) (actual
time=0.012..17.488 rows=100000 loops=1)
Planning Time: 0.137 ms
Execution Time: 220.714 ms

explain analyze INSERT INTO test_uuid_1 select id from test_uuid;

EXPLAIN
QUERY PLAN
Insert on test_uuid_1 (cost=0.00..1541.85 rows=0 width=0) (actual
time=311.949..311.950 rows=0 loops=1)
-> Seq Scan on test_uuid (cost=0.00..1541.85 rows=100085 width=16) (actual
time=0.010..20.173 rows=100000 loops=1)
Planning Time: 0.082 ms
Execution Time: 311.973 ms


explain analyze INSERT INTO test_uuid7_1 select id from test_uuid7;

EXPLAIN
QUERY PLAN
Insert on test_uuid7_1 (cost=0.00..1541.85 rows=0 width=0) (actual
time=244.531..244.532 rows=0 loops=1)
-> Seq Scan on test_uuid7 (cost=0.00..1541.85 rows=100085 width=16) (actual
time=0.011..16.390 rows=100000 loops=1)
Planning Time: 0.084 ms
Execution Time: 244.554 ms

Explain analyze select * from test_bigint where id in (select id from
test_bigint limit 100000);

Nested Loop (cost=2692.77..3793.23 rows=50059 width=8) (actual
time=53.739..368.110 rows=100000 loops=1)
-> HashAggregate (cost=2692.48..2694.48 rows=200 width=8) (actual
time=53.681..93.044 rows=100000 loops=1)
Group Key: test_bigint_1.id
Batches: 5 Memory Usage: 11073kB Disk Usage: 208kB
-> Limit (cost=0.00..1442.48 rows=100000 width=8) (actual
time=0.020..18.985 rows=100000 loops=1)
-> Seq Scan on test_bigint test_bigint_1 (cost=0.00..1444.18 rows=100118
width=8) (actual time=0.019..11.330 rows=100000 loops=1)
-> Index Only Scan using test_bigint_pkey on test_bigint (cost=0.29..6.53
rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=100000)
Index Cond: (id = test_bigint_1.id)
Heap Fetches: 100000
Planning Time: 0.373 ms
Execution Time: 373.440 ms
EXPLAIN

Explain analyze select * from test_uuid where id in (select id from
test_uuid limit 100000);

QUERY PLAN
Nested Loop (cost=2790.96..4006.29 rows=50042 width=16) (actual
time=48.251..410.786 rows=100000 loops=1)
-> HashAggregate (cost=2790.54..2792.54 rows=200 width=16) (actual
time=48.157..76.176 rows=100000 loops=1)
Group Key: test_uuid_1.id
Batches: 1 Memory 

Re: Sequence vs UUID

2023-02-07 Thread Dominique Devienne
On Tue, Feb 7, 2023 at 3:47 PM Merlin Moncure  wrote:

> On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer  wrote:
>
>> On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
>> But UUIDs are random and that plays havoc with locality.
>
>
> This is really key.  [...] the databases I've seen that are written with
> the
>
UUID pattern appear to be written by developers oblivious to this fact.
>

Well, perhaps these developers are not dealing with temporally clustered
data, like commerce-related DBs, but rather with scientific data? In any
case, this developer will definitely investigate ULIDs vs UUIDs vs ints
(sequences), based on the info from this thread. I'm aware of
fragmentation issues, cache hit/miss issues, etc. in general, but was
probably not sufficiently making the mental connection with UUIDs and
PostgreSQL. So thanks everyone.


Re: Sequence vs UUID

2023-02-07 Thread Merlin Moncure
On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer  wrote:

> On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
> > I don't really understand what you mean by 'performance'. To me it is not
> > surprising that incrementing (I know it is not just incrementing) a
> > 64bit integer is faster than generating 128 bit data with a good amount
> of
> > random data even if it seems to be too slow.
>
> But UUIDs are random and that plays havoc with locality. For example
> consider one table with invoices and another with invoice items. If you
> want to get all the invoices including the items of a single day, the
> data is probably nicely clustered together in the tables. But the join
> needs to look up random ids in the index, which will be spread all over
> the index. In a simple benchmark for this scenario the UUIDs were about
> 4.5 times slower than sequential ids. (In other benchmarks the
> difference was only a few percent)


This is really key.

While many of the people posting here may understand this, all of the
databases I've seen that are written with the UUID pattern appear to be
written by developers oblivious to this fact.  The UUID pattern seems to
be popular with developers who seek to abstract away the database
underneath the code, tend to use an ORM, and tend to be weaker in terms of
database-facing constraint checking.  My direct observation is that these
databases scale poorly and the developers spend a lot of time building
tools that fix broken data stemming from application bugs.

I'm certain this is not the experience of everyone here.  I do however
find the counter-sequence arguments to be somewhat silly; partition-safe
sequence generation is simple to solve using simple methods.  The "ID
guessing" argument is similarly weak; if your application relies on id
obfuscation to be secure, you might have much bigger issues to contend
with IMO.

merlin


Re: Sequence vs UUID

2023-02-06 Thread Peter J. Holzer
On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
> I don't really understand what you mean by 'performance'. To me it is not
> surprising that incrementing (I know it is not just incrementing) a
> 64bit integer is faster than generating 128 bit data with a good amount of
> random data even if it seems to be too slow.

That's not really the problem with UUIDs, though. My (not very fast)
laptop can call getrandom() 1 million times per second (in a single
thread). Avoiding the system call[1] could make this fast enough to be
completely negligible compared to the time of writing a row to disk.

But UUIDs are random and that plays havoc with locality. For example
consider one table with invoices and another with invoice items. If you
want to get all the invoices including the items of a single day, the
data is probably nicely clustered together in the tables. But the join
needs to look up random ids in the index, which will be spread all over
the index. In a simple benchmark for this scenario the UUIDs were about
4.5 times slower than sequential ids. (In other benchmarks the
difference was only a few percent)

So depending on the queries, the difference may be negligible or huge.

It really depends on your access patterns.
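This locality effect can be illustrated without a database. The sketch below is a toy model only (the numbers are assumptions: 100 index entries per page, a 1M-row table, a 10,000-row "day"), counting how many index pages a batch of lookups touches:

```python
import random

ENTRIES_PER_PAGE = 100   # assumed entries per index page
TOTAL_ROWS = 1_000_000   # assumed table size
BATCH = 10_000           # one "day's" worth of rows

# Sequential ids from one day sit in adjacent index pages...
seq_ids = range(500_000, 500_000 + BATCH)
seq_pages = {i // ENTRIES_PER_PAGE for i in seq_ids}

# ...while random ids are scattered over the whole index.
rng = random.Random(42)
rand_ids = rng.sample(range(TOTAL_ROWS), BATCH)
rand_pages = {i // ENTRIES_PER_PAGE for i in rand_ids}

print(len(seq_pages), "pages touched with sequential ids")
print(len(rand_pages), "pages touched with random ids")
```

The random batch touches a large fraction of the whole index, so far fewer of those page reads hit cache.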

hp


[1] There was even a discussion about making that much faster on the
LKML recently.


-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Sequence vs UUID

2023-02-06 Thread Julian Backes
I don't really understand what you mean by 'performance'. To me it is not
surprising that incrementing (I know it is not just incrementing) a
64bit integer is faster than generating 128 bit data with a good amount of
random data even if it seems to be too slow. So in my opinion you need to
separate
1) generating data (which might happen on the client in case of UUID and
not in the db...)
2) inserting data
3) selecting data

in both sequential as well as parallel scenarios.
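Step (1), generating the data, can indeed be measured entirely outside the database. A minimal Python sketch (timings will vary by machine; the 100k count mirrors the tests in this thread):

```python
import itertools
import timeit
import uuid

counter = itertools.count(1)  # stand-in for a sequence

# Time key generation alone, with no insert or select involved.
gen_seq_time = timeit.timeit(lambda: next(counter), number=100_000)
gen_uuid_time = timeit.timeit(uuid.uuid4, number=100_000)

print(f"100k sequence-style ids: {gen_seq_time:.3f}s")
print(f"100k random UUIDs:       {gen_uuid_time:.3f}s")
```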

On Mon, Feb 6, 2023 at 19:32, veem v  wrote:

> So, it may be the machine on which the code is getting executed behind the
> scene , in the site "https://dbfiddle.uk/" is playing a key role in the
> speed, however, the comparative performance of UUID vs sequence should stay
> the same.
>  So I think, after this test we can safely conclude that if we compare
> the performance of the UUID(both version-4, version 7) VS sequence. The
> UUID performs a lot worse as compared to sequence. So unless there exists
> some strong reason/justification for UUID, we should default use the
> sequence. Correct me if I'm wrong. And also  I understand the cases
> of multi master replication/sharding etc, may be a factor but other than
> that I can't think of any scenario where sequences can be used.
>
>
>
> On Fri, 3 Feb 2023 at 23:07, Dominique Devienne 
> wrote:
>
>> On Fri, Feb 3, 2023 at 5:48 PM veem v  wrote:
>>
>>> Actually I did the testing by connecting to "https://dbfiddle.uk/",
>>> postgres version 15.
>>>
>>> PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
>>> 20210514 (Red Hat 8.5.0-10), 64-bit
>>>
>>> Am I doing it wrong, please confirm?
>>>
>>>
>> No clue. But even SQLite can generate 1M uuids (16-byte random blobs
>> are equivalent) in 1/3 of the time, so 30x faster.
>> So your timings of generating 100K uuids and counting them seems way too
>> slow to me. --DD
>>
>> sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
>> QUERY PLAN
>> `--SCAN generate_series VIRTUAL TABLE INDEX 3:
>> addr  opcode p1p2p3p4 p5  comment
>>   -        -  --  -
>> 0 Init   0 1500   Start at 15
>> 1 Null   0 1 10   r[1..1]=NULL
>> 2 VOpen  0 0 0 vtab:274D3E0   0
>> 3 Integer1 4 00   r[4]=1
>> 4 Multiply   6 6 50   r[5]=r[6]*r[6]
>> 5 Integer3 2 00   r[2]=3
>> 6 Integer2 3 00   r[3]=2
>> 7 VFilter0 1120   iplan=r[2]
>> zplan=''
>> 8   Function   1 8 7 randomblob(1)  0
>> r[7]=func(r[8])
>> 9   AggStep0 7 1 count(1)   1   accum=r[1]
>> step(r[7])
>> 10VNext  0 8 00
>> 11AggFinal   1 1 0 count(1)   0   accum=r[1] N=1
>> 12Copy   1 9 00   r[9]=r[1]
>> 13ResultRow  9 1 00   output=r[9]
>> 14Halt   0 0 00
>> 15Transaction0 0 1 0  1
>> usesStmtJournal=0
>> 16Integer1000  6 00   r[6]=1000
>> 17Integer168 00   r[8]=16
>> 18Goto   0 1 00
>> ┌───┐
>> │ count(randomblob(16)) │
>> ├───┤
>> │ 1000000               │
>> └───┘
>> Run Time: real 0.278 user 0.25 sys 0.00
>>
>>
>>> On Fri, 3 Feb 2023 at 21:28, Dominique Devienne 
>>> wrote:
>>>
 Something's off regarding Guid generations IMHO...
 You generate 100K Guids in ~1s. While we generate (in C++, Windows
 Release, using Boost) 16M of them in +/- the same time:

>>>


Re: Sequence vs UUID

2023-02-06 Thread veem v
So it may be that the machine on which the code is getting executed behind
the scenes, at the site "https://dbfiddle.uk/", is playing a key role in
the speed; however, the comparative performance of UUID vs sequence should
stay the same.
 So I think, after this test, we can safely conclude that if we compare
the performance of UUID (both version 4 and version 7) vs sequence, the
UUID performs a lot worse. So unless there exists some strong
reason/justification for UUID, we should use the sequence by default.
Correct me if I'm wrong. I also understand that cases like multi-master
replication/sharding etc. may be a factor, but other than that I can't
think of any scenario where sequences can't be used.



On Fri, 3 Feb 2023 at 23:07, Dominique Devienne  wrote:

> On Fri, Feb 3, 2023 at 5:48 PM veem v  wrote:
>
>> Actually I did the testing by connecting to "https://dbfiddle.uk/",
>> postgres version 15.
>>
>> PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
>> 20210514 (Red Hat 8.5.0-10), 64-bit
>>
>> Am I doing it wrong, please confirm?
>>
>>
> No clue. But even SQLite can generate 1M uuids (16-byte random blobs are
> equivalent) in 1/3 of the time, so 30x faster.
> So your timings of generating 100K uuids and counting them seems way too
> slow to me. --DD
>
> sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
> QUERY PLAN
> `--SCAN generate_series VIRTUAL TABLE INDEX 3:
> addr  opcode p1p2p3p4 p5  comment
>   -        -  --  -
> 0 Init   0 1500   Start at 15
> 1 Null   0 1 10   r[1..1]=NULL
> 2 VOpen  0 0 0 vtab:274D3E0   0
> 3 Integer1 4 00   r[4]=1
> 4 Multiply   6 6 50   r[5]=r[6]*r[6]
> 5 Integer3 2 00   r[2]=3
> 6 Integer2 3 00   r[3]=2
> 7 VFilter0 1120   iplan=r[2]
> zplan=''
> 8   Function   1 8 7 randomblob(1)  0   r[7]=func(r[8])
> 9   AggStep0 7 1 count(1)   1   accum=r[1]
> step(r[7])
> 10VNext  0 8 00
> 11AggFinal   1 1 0 count(1)   0   accum=r[1] N=1
> 12Copy   1 9 00   r[9]=r[1]
> 13ResultRow  9 1 00   output=r[9]
> 14Halt   0 0 00
> 15Transaction0 0 1 0  1   usesStmtJournal=0
> 16Integer1000  6 00   r[6]=1000
> 17Integer168 00   r[8]=16
> 18Goto   0 1 00
> ┌───┐
> │ count(randomblob(16)) │
> ├───┤
> │ 1000000               │
> └───┘
> Run Time: real 0.278 user 0.25 sys 0.00
>
>
>> On Fri, 3 Feb 2023 at 21:28, Dominique Devienne 
>> wrote:
>>
>>> Something's off regarding Guid generations IMHO...
>>> You generate 100K Guids in ~1s. While we generate (in C++, Windows
>>> Release, using Boost) 16M of them in +/- the same time:
>>>
>>


Re: Sequence vs UUID

2023-02-03 Thread Dominique Devienne
On Fri, Feb 3, 2023 at 5:48 PM veem v  wrote:

> Actually I did the testing by connecting to "https://dbfiddle.uk/",
> postgres version 15.
>
> PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
> 20210514 (Red Hat 8.5.0-10), 64-bit
>
> Am I doing it wrong, please confirm?
>
>
No clue. But even SQLite can generate 1M uuids (16-byte random blobs are
equivalent) in 1/3 of the time, so 30x faster.
So your timings of generating 100K uuids and counting them seems way too
slow to me. --DD

sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
QUERY PLAN
`--SCAN generate_series VIRTUAL TABLE INDEX 3:
addr  opcode p1p2p3p4 p5  comment
  -        -  --  -
0 Init   0 1500   Start at 15
1 Null   0 1 10   r[1..1]=NULL
2 VOpen  0 0 0 vtab:274D3E0   0
3 Integer1 4 00   r[4]=1
4 Multiply   6 6 50   r[5]=r[6]*r[6]
5 Integer3 2 00   r[2]=3
6 Integer2 3 00   r[3]=2
7 VFilter0 1120   iplan=r[2]
zplan=''
8   Function   1 8 7 randomblob(1)  0   r[7]=func(r[8])
9   AggStep0 7 1 count(1)   1   accum=r[1]
step(r[7])
10VNext  0 8 00
11AggFinal   1 1 0 count(1)   0   accum=r[1] N=1
12Copy   1 9 00   r[9]=r[1]
13ResultRow  9 1 00   output=r[9]
14Halt   0 0 00
15Transaction0 0 1 0  1   usesStmtJournal=0
16Integer1000  6 00   r[6]=1000
17Integer168 00   r[8]=16
18Goto   0 1 00
┌───┐
│ count(randomblob(16)) │
├───┤
│ 1000000               │
└───┘
Run Time: real 0.278 user 0.25 sys 0.00


> On Fri, 3 Feb 2023 at 21:28, Dominique Devienne 
> wrote:
>
>> Something's off regarding Guid generations IMHO...
>> You generate 100K Guids in ~1s. While we generate (in C++, Windows
>> Release, using Boost) 16M of them in +/- the same time:
>>
>


Re: Sequence vs UUID

2023-02-03 Thread veem v
Actually I did the testing by connecting to "https://dbfiddle.uk/",
postgres version 15.

PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
20210514 (Red Hat 8.5.0-10), 64-bit

Am I doing it wrong, please confirm?


On Fri, 3 Feb 2023 at 21:28, Dominique Devienne  wrote:

> Copying the list...
>
> -- Forwarded message -
> From: Dominique Devienne 
> Date: Fri, Feb 3, 2023 at 4:57 PM
> Subject: Re: Sequence vs UUID
> To: veem v 
>
> On Thu, Feb 2, 2023 at 8:47 PM veem v  wrote:
>
>> Tested the UUIDv7 generator for postgres as below.
>> With regards to performance , It's still way behind the sequence. [...]
>> explain analyze  select count(nextval('myseq') ) from
>> generate_series(1,100000);
>> Execution Time: 59.687 ms
>>
>> explain analyze  select count(gen_random_uuid()) from
>> generate_series(1,100000);
>> Execution Time: 904.868 ms
>>
>> explain analyze  select count(uuid_generate_v7()) from
>> generate_series(1,100000);
>> Execution Time: 1711.187 ms
>>
>
> Something's off regarding Guid generations IMHO...
>
> You generate 100K Guids in ~1s. While we generate (in C++, Windows
> Release, using Boost) 16M of them in +/- the same time:
>
> Enabling Performance tests
>
> generate 16'000'000 guids in 0.980s (user: 0.984s) 12 MB
> generate 16'000'000 guids in parallel on 4 CPUs in 0.309s (user: 1.188s)
> 12 MB
>
> That's 2 orders of magnitude faster. Sure there's some overhead from the
> SQL, but still. Something seems fishy.
> And that's on a 2.5y old desktop. --DD
>


Fwd: Sequence vs UUID

2023-02-03 Thread Dominique Devienne
Copying the list...

-- Forwarded message -
From: Dominique Devienne 
Date: Fri, Feb 3, 2023 at 4:57 PM
Subject: Re: Sequence vs UUID
To: veem v 

On Thu, Feb 2, 2023 at 8:47 PM veem v  wrote:

> Tested the UUIDv7 generator for postgres as below.
> With regards to performance , It's still way behind the sequence. [...]
> explain analyze  select count(nextval('myseq') ) from
> generate_series(1,10);
> Execution Time: 59.687 ms
>
> explain analyze  select count(gen_random_uuid()) from
> generate_series(1,100'000);
> Execution Time: 904.868 ms
>
> explain analyze  select count(uuid_generate_v7()) from
> generate_series(1,10);
> Execution Time: 1711.187 ms
>

Something's off regarding Guid generations IMHO...

You generate 100K Guids in ~1s. While we generate (in C++, Windows Release,
using Boost) 16M of them in +/- the same time:

Enabling Performance tests

> generate 16'000'000 guids in 0.980s (user: 0.984s) 12 MB
> generate 16'000'000 guids in parallel on 4 CPUs in 0.309s (user: 1.188s)
> 12 MB
That's 2 orders of magnitude faster. Sure there's some overhead from the
SQL, but still. Something seems fishy.
And that's on a 2.5y old desktop. --DD


Re: Sequence vs UUID

2023-02-03 Thread Miles Elam
On Thu, Feb 2, 2023 at 11:47 AM veem v  wrote:

> Tested the UUIDv7 generator for postgres as below.
>
> With regards to performance , It's still way behind the sequence. I was
> expecting the insert performance of UUID v7 to be closer to the sequence ,
> but it doesn't seem so, as it's 500ms vs 3000ms. And the generation takes a
> lot longer time as compared to sequence too i.e. 59ms vs 1700ms. Read time
> or the index scan looks close i.e. 2.3ms vs 2.6ms.
>

Thank you for taking the effort in testing and measuring this.

Those numbers make some intuitive sense to me. The function is written in
plpgsql, not C, and depends on generating a UUIDv4 and then modifying it
to include the timestamp and version change. While I suspect it will never
beat a bigint, by virtue of 64 bits always being half the size of 128, the
read time on the index scan after generation is encouraging, strongly
suggesting there's a lot of low-hanging fruit for improvement.

Also, like UUIDv4, v7 can be generated by clients, ameliorating the
generation bottleneck.
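Client-side generation is indeed straightforward. Below is a hedged Python sketch of UUIDv7 following the draft RFC layout (48-bit unix-millisecond timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits); it is an illustration, not a vetted implementation:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # 48-bit millisecond timestamp in the most significant bits.
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF   # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0x2 << 62) | rand_b
    return uuid.UUID(int=value)

u = uuid7()
assert u.version == 7
```

Generated this way on the client, the database only stores the value, so the generation cost disappears from the insert path.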

Once again, thank you for following up with good quality analysis.


Re: Sequence vs UUID

2023-02-02 Thread Rob Sargent



> On Feb 2, 2023, at 1:26 PM, Benedict Holland  
> wrote:
> 
> 
> No idea at all. We had the data for the insert and had to insert it again. It 
> was extremely confusing but oh boy did it wreck our systems.
> 
> Thanks,
> Ben

Someone has a baked-in uuid in a script I suspect. 

 
> 




Re: Sequence vs UUID

2023-02-02 Thread Benedict Holland
No idea at all. We had the data for the insert and had to insert it again.
It was extremely confusing but oh boy did it wreck our systems.

Thanks,
Ben

On Thu, Feb 2, 2023, 6:17 PM Ron  wrote:

> On 2/2/23 17:11, Peter J. Holzer wrote:
> > On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
> >> Well... until two processes generate an identical UUID. That happened
> to me
> >> several times.
> > How did that happen? Pure software implementation with non-random seed?
> > Hardware with insufficient entropy source?
>
> Poorly implemented algorithm?
>
> --
> Born in Arizona, moved to Babylonia.
>
>
>


Re: Sequence vs UUID

2023-02-02 Thread Ron

On 2/2/23 17:11, Peter J. Holzer wrote:

On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:

Well... until two processes generate an identical UUID. That happened to me
several times.

How did that happen? Pure software implementation with non-random seed?
Hardware with insufficient entropy source?


Poorly implemented algorithm?

--
Born in Arizona, moved to Babylonia.




Re: Sequence vs UUID

2023-02-02 Thread Peter J. Holzer
On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
> Well... until two processes generate an identical UUID. That happened to me
> several times.

How did that happen? Pure software implementation with non-random seed?
Hardware with insufficient entropy source?

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Sequence vs UUID

2023-02-02 Thread veem v
  ->  HashAggregate  (cost=27.91..29.91 rows=200 width=16) (actual
time=0.399..0.556 rows=1000 loops=1)
Group Key: test_uuid_1.id
Batches: 1  Memory Usage: 145kB
->  Limit  (cost=0.00..15.41 rows=1000 width=16) (actual
time=0.011..0.185 rows=1000 loops=1)
  ->  Seq Scan on test_uuid test_uuid_1  (cost=0.00..1541.85
rows=100085 width=16) (actual time=0.010..0.093 rows=1000 loops=1)
  ->  Index Only Scan using test_uuid_pkey on test_uuid  (cost=0.42..7.21
rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
Index Cond: (id = test_uuid_1.id)
Heap Fetches: 1000
Planning Time: 0.234 ms
Execution Time: 3.419 ms

Explain analyze  select * from test_uuid7 where  id in (select id from
test_uuid7 limit 1000);

QUERY PLAN
Nested Loop  (cost=28.32..1416.01 rows=1000 width=16) (actual
time=0.403..2.586 rows=1000 loops=1)
  ->  HashAggregate  (cost=27.91..29.91 rows=200 width=16) (actual
time=0.371..0.546 rows=1000 loops=1)
Group Key: test_uuid7_1.id
Batches: 1  Memory Usage: 145kB
->  Limit  (cost=0.00..15.41 rows=1000 width=16) (actual
time=0.011..0.161 rows=1000 loops=1)
  ->  Seq Scan on test_uuid7 test_uuid7_1  (cost=0.00..1541.85
rows=100085 width=16) (actual time=0.010..0.091 rows=1000 loops=1)
  ->  Index Only Scan using test_uuid7_pkey on test_uuid7  (cost=0.42..6.99
rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
Index Cond: (id = test_uuid7_1.id)
Heap Fetches: 1000
Planning Time: 0.101 ms
Execution Time: 2.661 ms

On Thu, 2 Feb 2023 at 20:52, Benedict Holland 
wrote:

> Well... until two processes generate an identical UUID. That happened to
> me several times. It's rare but when that happens, oh boy that is a mess to
> figure out.
>
> Thanks,
> Ben


Re: Sequence vs UUID

2023-02-02 Thread Benedict Holland
Well... until two processes generate an identical UUID. That happened to me
several times. It's rare but when that happens, oh boy that is a mess to
figure out.

Thanks,
Ben



Re: Sequence vs UUID

2023-02-02 Thread Miles Elam
On Wed, Feb 1, 2023 at 10:48 AM Kirk Wolak  wrote:

> This article explains in detail what is going on.  If I were doing this, I
> would strongly consider a ULID because of getting the best of both worlds.
> https://blog.daveallie.com/ulid-primary-keys
>
> Of course, YMMV...  And since ULIDs are not native to PG, there is
> overhead, but it is far more performant, IMO...
>

Biased comparison. ULIDs have a timestamp component. The closest UUID
equivalent in Postgres is UUIDv1 from the uuid-ossp extension, not v4.
Another difference not mentioned in the blog article is that UUID is
versioned, meaning you can figure out what kind of data is in the UUID,
whereas ULIDs are a "one size fits all" solution.

There is an implementation of sequential UUIDs for Postgres I posted
earlier in this thread. In addition, here is an implementation of UUIDv7
for Postgres:

https://gist.github.com/kjmph/5bd772b2c2df145aa645b837da7eca74

I would suggest running your tests against v1, v7, and sequential UUID
before jumping on ULID, which has no native type/indexing in Postgres.

It should also be noted that apps cannot provide a bigint ID due to
collisions, but an app can generate UUIDs and ULIDs without fear,
essentially shifting the generation time metric in UUID/ULID's favor over a
bigserial.

- Miles


Re: Sequence vs UUID

2023-02-01 Thread Kirk Wolak
On Wed, Feb 1, 2023 at 1:34 PM veem v  wrote:

> I quickly tested this on dbfiddle with UUID as the data type, and in
> each of the cases below UUID performance seems drastically reduced
> compared to sequence performance. Let me know if anything is wrong in
> my testing here?
>
> 1) sequence generation vs UUID generation, execution time increased from
> ~291ms to 5655ms.
> 2) Insert performance of "sequence" vs "UUID"  execution time increased
> from ~2031ms to 10599ms.
> 3) Index performance for sequence vs UUID,  execution time increased from
> ~.3ms to .5ms.
>
>
Yes: the problem is assuming that UUIDs would be efficient as keys when
they are randomly generated, versus sequences (which tend to grow in one
direction, and have been relatively optimized for years).

This article explains in detail what is going on.  If I were doing this, I
would strongly consider a ULID because of getting the best of both worlds.
https://blog.daveallie.com/ulid-primary-keys

Of course, YMMV...  And since ULIDs are not native to PG, there is
overhead, but it is far more performant, IMO...

Also, I hold out hope that one of the Gods of PostgreSQL on this list might
give us an internal ULID implementation fixing that last problem...
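For anyone who has not met ULIDs, the layout the linked article leans on is 48 bits of millisecond timestamp followed by 80 random bits, rendered as 26 Crockford base32 characters. A minimal sketch of that layout (an illustration of the format, not the reference implementation):

```python
import secrets
import time

_B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford alphabet (no I, L, O, U)

def ulid(ts_ms=None):
    """48-bit ms timestamp + 80 random bits, as 26 base32 characters."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms << 80) | secrets.randbits(80)  # 128 bits total
    out = []
    for _ in range(26):                           # 26 * 5 = 130 bits, zero-padded
        out.append(_B32[value & 0x1F])
        value >>= 5
    return "".join(reversed(out))

# The timestamp leads, so lexicographic order tracks creation order:
print(ulid(1_000) < ulid(2_000))  # True
```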

HTH




Re: Sequence vs UUID

2023-02-01 Thread veem v
I quickly tested this on dbfiddle with UUID as the data type, and in each
of the cases below UUID performance seems drastically reduced compared to
sequence performance. Let me know if anything is wrong in my testing here?

1) sequence generation vs UUID generation, execution time increased from
~291ms to 5655ms.
2) Insert performance of "sequence" vs "UUID"  execution time increased
from ~2031ms to 10599ms.
3) Index performance for sequence vs UUID,  execution time increased from
~.3ms to .5ms.


 PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
20210514 (Red Hat 8.5.0-10), 64-bit

CREATE UNLOGGED TABLE test_bigint (  id bigint  PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid (   id uuid  PRIMARY KEY);
create sequence myseq cache 32767;

 sequence generation vs UUID generation Test**
explain analyze  select count(nextval('myseq') ) from
generate_series(1,10);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual
time=291.030..291.030 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=10
width=0) (actual time=53.332..63.941 rows=10 loops=1)
Planning Time: 0.155 ms
Execution Time: 291.719 ms

explain analyze  select count(gen_random_uuid()) from
generate_series(1,10);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual
time=5654.453..5654.454 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=10
width=0) (actual time=84.328..514.214 rows=10 loops=1)
Planning Time: 0.082 ms
Execution Time: 5655.158 ms

*** Insert Test***

explain analyze INSERT INTO test_bigint select nextval('myseq')  from
generate_series(1,10);

QUERY PLAN
Insert on test_bigint  (cost=0.00..2250.00 rows=0 width=0) (actual
time=2030.960..2030.961 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=10
width=8) (actual time=48.102..636.311 rows=10 loops=1)
Planning Time: 0.065 ms
Execution Time: 2031.469 ms

explain analyze  INSERT INTO test_uuid  select gen_random_uuid()  from
generate_series(1,10);


QUERY PLAN
Insert on test_uuid  (cost=0.00..2250.00 rows=0 width=0) (actual
time=10599.230..10599.230 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=10
width=16) (actual time=36.975..6289.811 rows=10 loops=1)
Planning Time: 0.056 ms
Execution Time: 10599.805 ms

 Index performance

Explain analyze select * from test_bigint where id in (select id from
test_bigint limit 100);

QUERY PLAN
Nested Loop  (cost=2.98..734.71 rows=100 width=8) (actual time=0.083..0.269
rows=100 loops=1)
  ->  HashAggregate  (cost=2.69..3.69 rows=100 width=8) (actual
time=0.046..0.062 rows=100 loops=1)
Group Key: test_bigint_1.id
Batches: 1  Memory Usage: 24kB
->  Limit  (cost=0.00..1.44 rows=100 width=8) (actual
time=0.011..0.025 rows=100 loops=1)
  ->  Seq Scan on test_bigint test_bigint_1
 (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.011..0.017
rows=100 loops=1)
  ->  Index Only Scan using test_bigint_pkey on test_bigint
 (cost=0.29..7.31 rows=1 width=8) (actual time=0.002..0.002 rows=1
loops=100)
Index Cond: (id = test_bigint_1.id)
Heap Fetches: 100
Planning Time: 0.279 ms
Execution Time: 0.302 ms

Explain analyze  select * from test_uuid where  id in (select id from
test_uuid limit 100);

QUERY PLAN
Nested Loop  (cost=3.21..783.31 rows=100 width=16) (actual
time=0.080..0.474 rows=100 loops=1)
  ->  HashAggregate  (cost=2.79..3.79 rows=100 width=16) (actual
time=0.046..0.066 rows=100 loops=1)
Group Key: test_uuid_1.id
Batches: 1  Memory Usage: 24kB
->  Limit  (cost=0.00..1.54 rows=100 width=16) (actual
time=0.010..0.025 rows=100 loops=1)
  ->  Seq Scan on test_uuid test_uuid_1  (cost=0.00..1541.85
rows=100085 width=16) (actual time=0.009..0.016 rows=100 loops=1)
  ->  Index Only Scan using test_uuid_pkey on test_uuid  (cost=0.42..7.79
rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=100)
Index Cond: (id = test_uuid_1.id)
Heap Fetches: 100
Planning Time: 0.180 ms
Execution Time: 0.510 ms


Re: Sequence vs UUID

2023-01-30 Thread Ron



And populate that column with UUIDs generated by the gen_random_uuid() function.

(Requires v13.)

On 1/30/23 13:46, Adrian Klaver wrote:

On 1/30/23 11:43, veem v wrote:
Thank You So much for the details. I am a bit new to postgres. And these 
test results I picked were from a dev system. If I understand it 
correctly, do you mean these settings(usage of C locale or "native" 
16-byte uuid) which you mentioned should be there in a production system 
  and thus we should test the performance of the UUID vs sequence on a 
similar setup? Or say if this sort of degradation of UUID performance is 
not expected then , how to get these settings tweaked ON, so as to see 
the best string type or UUID performance, can you please guide me here?


No, what is being said is to change:

source_id varchar(36)

to

source_id uuid

as in:

https://www.postgresql.org/docs/current/datatype-uuid.html







--
Born in Arizona, moved to Babylonia.




Re: Sequence vs UUID

2023-01-30 Thread Adrian Klaver

On 1/30/23 11:43, veem v wrote:
Thank You So much for the details. I am a bit new to postgres. And these 
test results I picked were from a dev system. If I understand it 
correctly, do you mean these settings(usage of C locale or "native" 
16-byte uuid) which you mentioned should be there in a production system 
  and thus we should test the performance of the UUID vs sequence on a 
similar setup? Or say if this sort of degradation of UUID performance is 
not expected then , how to get these settings tweaked ON, so as to see 
the best string type or UUID performance, can you please guide me here?


No, what is being said is to change:

source_id varchar(36)

to

source_id uuid

as in:

https://www.postgresql.org/docs/current/datatype-uuid.html





--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Sequence vs UUID

2023-01-30 Thread veem v
Thank you so much for the details. I am a bit new to Postgres, and these
test results were from a dev system. If I understand correctly, do you
mean these settings (C locale, or a "native" 16-byte uuid column) should
be in place on a production system, and thus we should test UUID vs
sequence performance on a similar setup? Or, if this sort of UUID
performance degradation is not expected, how do we get these settings
tweaked on, so as to see the best string-type or UUID performance? Can
you please guide me here?



Re: Sequence vs UUID

2023-01-30 Thread Tom Lane
Dominique Devienne  writes:
> On Mon, Jan 30, 2023 at 5:11 PM veem v  wrote:
>> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name 
>> varchar(20) );

> Maybe if you used a "native" 16-byte uuid, instead of its textual
> serialization with dashes (36 bytes + length overhead), the gap would
> narrow.

Yeah, especially if your database is not using C locale.  The
strcoll or ICU-based comparisons done on string types can be
enormously more expensive than the memcmp() used for binary
types like native uuid.

regards, tom lane




Re: Sequence vs UUID

2023-01-30 Thread Dominique Devienne
On Mon, Jan 30, 2023 at 5:11 PM veem v  wrote:
> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name 
> varchar(20) );

Maybe if you used a "native" 16-byte uuid, instead of its textual
serialization with dashes (36 bytes + length overhead), the gap would
narrow.
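The size gap is easy to check (Python's stdlib `uuid` used purely to illustrate the two representations):

```python
import uuid

u = uuid.uuid4()

# Native uuid column: a fixed 16-byte value, compared byte-wise (memcmp).
print(len(u.bytes))  # 16

# Textual form: 32 hex digits + 4 dashes = 36 characters. Stored in
# varchar(36) this carries length overhead and, outside the C locale,
# is compared through strcoll/ICU collation rules instead of memcmp.
print(len(str(u)))   # 36
```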




Re: Sequence vs UUID

2023-01-30 Thread veem v
I was trying to test simple read/write performance for bigint vs UUID.
What we see is ~3x performance degradation when joining on UUID columns
versus bigint columns. Even generation alone degrades by ~3x: producing
UUIDs versus fetching sequence values. Insert performance on the same
table for ~10 million rows is also ~1 min 39 sec for bigint vs ~3 min 11
sec for UUID. Is such an extent of performance degradation expected for
UUID?

CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name
varchar(20) );
CREATE TABLE test2_UUID (id bigint,source_id varchar(36) PRIMARY KEY,Name
varchar(20) );

CREATE TABLE test1_bigint ( id bigint PRIMARY KEY, source_id varchar(36) ,
Name varchar(20));
CREATE TABLE test2_bigint ( id bigint PRIMARY KEY, source_id varchar(36) ,
Name varchar(20));

Loaded same 10million rows.

explain Analyze select * from test1_bigint a , test2_bigint b where a.id =
b.id

Merge Join (cost=12.31..875534.52 rows=1021 width=100) (actual
time=0.042..6974.575 rows=1000 loops=1)
  Merge Cond: (a.id = b.id)
  -> Index Scan using test1_bigint_pkey on test1_bigint a
(cost=0.43..362780.75 rows=1021 width=50) (actual time=0.020..2070.079
rows=1000 loops=1)
  -> Index Scan using test2_bigint_2_pkey on test2_bigint b
(cost=0.43..362780.75 rows=1021 width=50) (actual time=0.019..2131.086
rows=1000 loops=1)
Planning Time: 0.207 ms
Execution Time: 7311.210 ms

set enable_seqscan=off;

explain Analyze select * from test1_UUID a , test2_UUID b where a.source_id
= b.source_id;

Merge Join (cost=2.75..2022857.05 rows=1021 width=100) (actual
time=0.043..21954.213 rows=1000 loops=1)
  Merge Cond: ((a.source_id)::text = (b.source_id)::text)
  -> Index Scan using test1_uuid_pkey on test1_UUID a (cost=0.56..936420.18
rows=1021 width=50) (actual time=0.022..7854.143 rows=1000 loops=1)
  -> Index Scan using test2_uuid_2_pkey on test2_UUID b
(cost=0.56..936437.90 rows=1021 width=50) (actual time=0.017..7971.187
rows=1000 loops=1)
Planning Time: 0.516 ms
Execution Time: 22292.801 ms

**

create sequence myseq cache 32767;

select count(nextval('myseq') ) from generate_series(1,1000)
 1 row retrieved starting from 1 in 4 s 521 ms (execution: 4 s 502 ms,
fetching: 19 ms)

 select count(gen_random_uuid()) from generate_series(1,1000)
 1 row retrieved starting from 1 in 11 s 145 ms (execution: 11 s 128 ms,
fetching: 17 ms)
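The same shape shows up in a toy benchmark outside the database (illustrative only; absolute numbers depend on the machine, and `itertools.count` merely stands in for a cached sequence):

```python
import itertools
import timeit
import uuid

n = 100_000
counter = itertools.count(1)  # stands in for a cached sequence

seq_time = timeit.timeit(lambda: next(counter), number=n)
uuid_time = timeit.timeit(uuid.uuid4, number=n)

# uuid4 gathers and formats 16 random bytes per call, so it trails a
# plain integer increment by a wide margin -- the same direction as the
# nextval() vs gen_random_uuid() gap measured above.
print(uuid_time > seq_time)
```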





Re: Sequence vs UUID

2023-01-30 Thread veem v
I have a question. As I understand it, usage-wise there are multiple
benefits of UUID over sequences: for example in a distributed app, where
we may not be able to rely on a single-point generator like a sequence,
or in multi-master architectures and sharding.

Purely in performance terms, the key advantage of a sequence is its
smaller storage size: more index rows fit in cache, which benefits read
queries and joins, and fetching a value from a sequence is cheaper than
computing a UUID. The downside is on writes: under concurrent data load,
a sequence-based key can become a point of contention, because every
incoming request tries to modify the same table/index page/block. But as
mentioned in this blog (
https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/), UUIDs
can be made sequential, e.g. with a timestamp prefix. Isn't making the
UUID sequential then a disadvantage again, and a contention point for the
unique index? Each incoming write will now fight for the same block/page
during concurrent data load, contending for the same table block or one
side of the index branch/leaf blocks, whereas with random UUIDs the
writes were spread across multiple blocks, so there was no contention on
any specific block. Please correct me if my understanding is wrong.





Re: Sequence vs UUID

2023-01-28 Thread Miles Elam
On Sat, Jan 28, 2023 at 8:02 PM Ron  wrote:
>
> Then it's not a Type 4 UUID, which is perfectly fine; just not random.

Yep, which is why it really should be re-versioned to UUIDv8 to be
pedantic. In everyday use though, almost certainly doesn't matter.

> Also, should now() be replaced by clock_timestamp(), so that it can be
> called multiple times in the same transaction?

Not necessary. Instead of 122 bits of entropy, you get 106 bits of
entropy and a new incremented prefix every minute. now() vs
clock_timestamp() wouldn't make a substantive difference. Should still
be reasonably safe against the birthday paradox for more than a
century when creating more than a million UUIDs per second.
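That century figure follows from the standard birthday approximation, p ≈ n²/2^(d+1), applied per one-minute prefix window and summed over windows. A back-of-envelope check under those stated assumptions:

```python
# Birthday approximation: p ≈ n^2 / 2^(d+1) when n values are drawn
# uniformly from a space of 2^d possibilities.
bits = 106                     # entropy remaining after the 2-byte prefix
n = 1_000_000 * 60             # a million UUIDs/sec within one minute window
p_per_window = n ** 2 / 2 ** (bits + 1)

windows_per_century = 60 * 24 * 365 * 100       # minutes in ~100 years
p_century = p_per_window * windows_per_century  # union bound over windows

print(p_century < 1e-8)  # True: collision odds stay negligible
```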




Re: Sequence vs UUID

2023-01-28 Thread Ron

Then it's not a Type 4 UUID, which is perfectly fine; just not random.

Also, should now() be replaced by clock_timestamp(), so that it can be 
called multiple times in the same transaction?


--
Born in Arizona, moved to Babylonia.




Re: Sequence vs UUID

2023-01-28 Thread Miles Elam
On Sat, Jan 28, 2023 at 6:02 PM Ron  wrote:
>
> Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop 
> through the floor.
>
> This lesson was burned into my psyche way back in the Clinton 
> administration.  It was my task to speed up a five hour batch job which read 
> input records from a flat file, did some validations and then inserted them.  
> Simply sorting the input file on the primary key fields -- we used natural 
> keys, not synthetics -- dropped the run time to two hours.  (VMS SORT saved 
> the day, because you could tell it the sort order you wanted; thus, I could 
> preserve the header record at the top of the file, and the trailer record at 
> the end of the file without jumping through a bunch of hoops.)

This can be mitigated with judicious use of a sequence at the front of
the uuidv4.

https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

More effort than just calling the built-in gen_random_uuid() or
equivalent in app code, but a substantial performance gain for your
effort.

https://github.com/tvondra/sequential-uuids

And in a managed environment where you can't install custom
extensions, a fairly simple function with divide on unix epoch seconds
combined with a call to overlay(...) should suffice performance-wise.
At 60 seconds, this will loop every 45 days or so, and you can choose
how much "sequentialness" works for you, from 1 to 4 bytes at the
expense of pseudo-randomness.

-

-- Generate time interval UUID
CREATE OR REPLACE FUNCTION gen_interval_uuid(interval_seconds int4 =
60, block_num_bytes int2 = 2)
RETURNS uuid LANGUAGE sql VOLATILE PARALLEL SAFE AS $$
  SELECT encode(
overlay(
  -- convert the uuid to byte array
  uuid_send(gen_random_uuid())
  -- use only the bottom bytes
  PLACING substring(
  int4send((extract(epoch FROM now()) / interval_seconds)::int4)
  FROM (5 - block_num_bytes)
  )
  -- place at the front two bytes of the uuid
  FROM 1
)
-- convert the resulting byte array to hex for conversion to uuid
, 'hex')::uuid
  WHERE interval_seconds > 0 AND block_num_bytes BETWEEN 1 AND 4
$$;

-

Technically this deviates from the v4 spec. You could always mark the
result as a UUIDv8, which is the intention behind that new version,
even though the standard hasn't been ratified yet.


Cheers,

Miles Elam




Re: Sequence vs UUID

2023-01-28 Thread Ron
Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop 
through the floor.


This lesson was burned into my psyche way back in the Clinton 
administration.  It was my task to speed up a five hour batch job which read 
input records from a flat file, did some validations and then inserted 
them.  Simply sorting the input file on the primary key fields -- we used 
natural keys, not synthetics -- dropped the run time to two hours.  (VMS 
SORT saved the day, because you could tell it the sort order you wanted; 
thus, I could preserve the header record at the top of the file, and the 
trailer record at the end of the file without jumping through a bunch of hoops.)


On 1/28/23 19:44, Benedict Holland wrote:
Why is it a terrible idea? I have been using them for years without a 
single problem. I don't rely on them for create order. Terrible seems a bit 
extreme.


Thanks,
Ben

On Sat, Jan 28, 2023, 3:39 PM Erik Wienhold  wrote:

> On 27/01/2023 01:48 CET Ron  wrote:
>
> On 1/26/23 15:55, Erik Wienhold wrote:
> >
> > There are arguments against sequential PK, e.g. they give away too
much info and
> > allow attacks such as forced browsing[2].  The first I can
understand: you may
> > not want to reveal the number of users or customers.  But access
control should
> > prevent forced browsing.
>
> Shouldn't your application layer isolate the users from the
database?  UUIDs
> are all over the DBs I manage, but the PKs are all sequences.

Yes, I meant the application layer, not Postgres' access control.

--
Erik




--
Born in Arizona, moved to Babylonia.

Re: Sequence vs UUID

2023-01-28 Thread Benedict Holland
Why is it a terrible idea? I have been using them for years without a
single problem. I don't rely on them for create order. Terrible seems a bit
extreme.

Thanks,
Ben

On Sat, Jan 28, 2023, 3:39 PM Erik Wienhold  wrote:

> > On 27/01/2023 01:48 CET Ron  wrote:
> >
> > On 1/26/23 15:55, Erik Wienhold wrote:
> > >
> > > There are arguments against sequential PK, e.g. they give away too
> much info and
> > > allow attacks such as forced browsing[2].  The first I can understand:
> you may
> > > not want to reveal the number of users or customers.  But access
> control should
> > > prevent forced browsing.
> >
> > Shouldn't your application layer isolate the users from the database?
> UUIDs
> > are all over the DBs I manage, but the PKs are all sequences.
>
> Yes, I meant the application layer, not Postgres' access control.
>
> --
> Erik
>
>
>


Re: Sequence vs UUID

2023-01-28 Thread Erik Wienhold
> On 27/01/2023 01:48 CET Ron  wrote:
>
> On 1/26/23 15:55, Erik Wienhold wrote:
> >
> > There are arguments against sequential PK, e.g. they give away too much 
> > info and
> > allow attacks such as forced browsing[2].  The first I can understand: you 
> > may
> > not want to reveal the number of users or customers.  But access control 
> > should
> > prevent forced browsing.
> 
> Shouldn't your application layer isolate the users from the database?  UUIDs 
> are all over the DBs I manage, but the PKs are all sequences.

Yes, I meant the application layer, not Postgres' access control.

--
Erik




Re: Sequence vs UUID

2023-01-27 Thread Rob Sargent


> So forget about performance issues (there will ALWAYS be need for faster 
> systems). The ease and functionality with UUID
> is so much better. Sequence keys are a terrible idea!
> 
> // GH
> 
Wow. I am not alone
> 




Re: Sequence vs UUID

2023-01-27 Thread G Hasse

Hello.

I have been using UUID for quite a long time now. The reason I began to use 
UUID was the need to be able to move data between
databases and the need to create records outside the database. You should use 
UUID as a primary key for a record and also have
some bookkeeping UUIDs in the record, like the "origin" of the record. In this way it is 
"easy" to handle different sources of
data.

We have also written some nice replicating software on this basis (postsync) 
that can check for alterations in one database
and update others. In this way we can keep one or many replicas of databases.

So forget about performance issues (there will ALWAYS be need for faster 
systems). The ease and functionality with UUID
is so much better. Sequence keys are a terrible idea!

// GH


Den 2023-01-26 kl. 20:17, skrev veem v:

Hello, We were trying to understand whether we should use UUID or Sequence in 
general for primary keys. In many of the blogs (one is below) across multiple 
databases, I saw over the internet and all are mostly stating the sequence is 
better as compared to UUID. But I think in the case of concurrent data load 
scenarios UUID will spread the contention point whereas sequence can be a 
single point of contention.

So we want to understand from experts here, if there are any clear rules 
available or if we have any pros vs cons list available for each of those to 
understand the exact scenario in which we should go for one over the other? 
Basically I wanted to see if we can perform some test on sample data to see the 
percentage of overhead on read and write performances of the query in presence 
of UUID vs Sequence to draw some conclusion in general? And also considering 
open source postgres as the base for many databases like redshift etc, so the 
results which apply to postgres would apply to others as well.

https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
 




--
gor...@raditex.nu
http://www.raditex.nu
Raditex Control AB
Bo 229, 715 91 ODENSBACKEN
Mob: 070-5530148




Re: Sequence vs UUID

2023-01-26 Thread Ron

On 1/26/23 15:55, Erik Wienhold wrote:

On 26/01/2023 20:17 CET veem v  wrote:

Hello, We were trying to understand whether we should use UUID or Sequence in
general for primary keys. In many of the blogs (one is below) across multiple
databases, I saw over the internet and all are mostly stating the sequence is
better as compared to UUID. But I think in the case of concurrent data load
scenarios UUID will spread the contention point whereas sequence can be a
single point of contention.

So we want to understand from experts here, if there are any clear rules
available or if we have any pros vs cons list available for each of those to
understand the exact scenario in which we should go for one over other?
Basically I wanted to see if we can perform some test on sample data to see
the percentage of overhead on read and write performances of the query in
presence of UUID VS Sequence to draw some conclusion in general? And also
considering open source postgres as the base for many databases like redshift
etc, so the results which apply to postgres would apply to others as well.

https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/

I think that[1] provides a good summary.  Performance consideration is just one
aspect.  Is there a technical requirement for using UUID over sequential values?

If there's a single generator of primary keys use bigint sequences.  In case of
multiple generators (multi-master replication, sharding, clients generating IDs)
consider UUID.

There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2].  The first I can understand: you may
not want to reveal the number of users or customers.  But access control should
prevent forced browsing.


Shouldn't your application layer isolate the users from the database?  UUIDs 
are all over the DBs I manage, but the PKs are all sequences.


--
Born in Arizona, moved to Babylonia.




Re: Sequence vs UUID

2023-01-26 Thread Benedict Holland
You could always create a uuid matching table for anything displayed to
users and keep a private ID for anything internal. From my practical
standpoint, one is 8 bytes, the other is 16 (or 32 as hex text). Any database
implementation should guarantee a unique value. I have had cases where it
didn't work but those were rare and possibly a coding error. Basically, I
don't see a practical difference apart from url access and giving away a PK.
I haven't had an issue with it. Some people have. I still use them internally
but my requirements are not public facing.

Thanks,
Ben



On Thu, Jan 26, 2023, 4:55 PM Erik Wienhold  wrote:

> > On 26/01/2023 20:17 CET veem v  wrote:
> >
> > Hello, We were trying to understand whether we should use UUID or
> Sequence in
> > general for primary keys. In many of the blogs (one is below) across
> multiple
> > databases, I saw over the internet and all are mostly stating the
> sequence is
> > better as compared to UUID. But I think in the case of concurrent data
> load
> > scenarios UUID will spread the contention point whereas sequence can be a
> > single point of contention.
> >
> > So we want to understand from experts here, if there are any clear rules
> > available or if we have any pros vs cons list available for each of
> those to
> > understand the exact scenario in which we should go for one over other?
> > Basically I wanted to see if we can perform some test on sample data to
> see
> > the percentage of overhead on read and write performances of the query in
> > presence of UUID VS Sequence to draw some conclusion in general? And also
> > considering open source postgres as the base for many databases like
> redshift
> etc, so the results which apply to postgres would apply to others as
> well.
> >
> >
> https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
>
> I think that[1] provides a good summary.  Performance consideration is
> just one
> aspect.  Is there a technical requirement for using UUID over sequential
> values?
>
> If there's a single generator of primary keys use bigint sequences.  In
> case of
> multiple generators (multi-master replication, sharding, clients
> generating IDs)
> consider UUID.
>
> There are arguments against sequential PK, e.g. they give away too much
> info and
> allow attacks such as forced browsing[2].  The first I can understand: you
> may
> not want to reveal the number of users or customers.  But access control
> should
> prevent forced browsing.
>
> [1]
> https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
> [2] https://owasp.org/www-community/attacks/Forced_browsing
>
> --
> Erik
>
>
>


Re: Sequence vs UUID

2023-01-26 Thread Erik Wienhold
> On 26/01/2023 20:17 CET veem v  wrote:
>
> Hello, We were trying to understand whether we should use UUID or Sequence in
> general for primary keys. In many of the blogs (one is below) across multiple
> databases, I saw over the internet and all are mostly stating the sequence is
> better as compared to UUID. But I think in the case of concurrent data load
> scenarios UUID will spread the contention point whereas sequence can be a
> single point of contention.
>
> So we want to understand from experts here, if there are any clear rules
> available or if we have any pros vs cons list available for each of those to
> understand the exact scenario in which we should go for one over other?
> Basically I wanted to see if we can perform some test on sample data to see
> the percentage of overhead on read and write performances of the query in
> presence of UUID VS Sequence to draw some conclusion in general? And also
> considering open source postgres as the base for many databases like redshift
> etc, so the results which apply to postgres would apply to others as well.
>
> https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/

I think that[1] provides a good summary.  Performance consideration is just one
aspect.  Is there a technical requirement for using UUID over sequential values?

If there's a single generator of primary keys use bigint sequences.  In case of
multiple generators (multi-master replication, sharding, clients generating IDs)
consider UUID.
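
For the multi-generator case, another option discussed in this thread
is to partition a bigint sequence by node, so each node generates ids
in its own residue class. A toy sketch (the NodeSequence class and the
d = 1000 limit are invented for illustration; as noted upthread, d must
exceed the number of nodes you will ever have):

```python
class NodeSequence:
    """Toy interleaved sequence: node m of d emits ids n*d + m,
    so two nodes can never collide without any coordination."""

    def __init__(self, node_id: int, d: int = 1000):
        assert 0 <= node_id < d
        self.node_id, self.d, self.n = node_id, d, 0

    def nextval(self) -> int:
        self.n += 1
        return self.n * self.d + self.node_id

node0, node1 = NodeSequence(0), NodeSequence(1)
ids = [s.nextval() for s in (node0, node1) for _ in range(100)]
assert len(set(ids)) == 200  # disjoint id streams across nodes
```

In Postgres the same effect falls out of `CREATE SEQUENCE ... INCREMENT
BY d START WITH m`, at the cost of ensuring each node's m stays unique.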

There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2].  The first I can understand: you may
not want to reveal the number of users or customers.  But access control should
prevent forced browsing.

[1] 
https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
[2] https://owasp.org/www-community/attacks/Forced_browsing

--
Erik




Re: Sequence vs UUID

2023-01-26 Thread Rob Sargent

On 1/26/23 14:36, Merlin Moncure wrote:

On Thu, Jan 26, 2023 at 1:18 PM veem v  wrote:

Hello, We were trying to understand whether we should use UUID or
Sequence in general for primary keys. In many of the blogs (one is
below) across multiple databases, I saw over the internet and all
are mostly stating the sequence is better as compared to UUID. But
I think in the case of concurrent data load scenarios UUID will
spread the contention point whereas sequence can be a single point
of contention.

So we want to understand from experts here, if there are any clear
rules available or if we have any pros vs cons list available for
each of those to understand the exact scenario in which we should
go for one over the other? Basically I wanted to see if we can
perform some test on sample data to see the percentage of overhead
on read and write performances of the query in presence of UUID vs
Sequence to draw some conclusion in general? And also considering
open source postgres as the base for many databases like redshift
etc, so the results which apply to postgres would apply to others
as well.


https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/



* UUIDs are big for identifiers (16 bytes); sequences are 4 or 8
* sequences are NOT a contention point; the database uses some tricks to 
work around that
* UUIDs are basically random data, causing page fragmentation.  This is 
particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping 
ranges of records on large tables.


merlin

"Sweeping ranges of records" by ID suggests you have information in your 
id.  If you're comfortable with sequence generation as a surrogate for 
time-point of entry, by all means have at it.


For many of us, contemporaneously generated records have nothing to do 
with each other. (And they carry attributes which groups them.)


UUIDs do a very good job of avoiding id-overlap across domains (compare 
to a sequence for each domain).





Re: Sequence vs UUID

2023-01-26 Thread Merlin Moncure
On Thu, Jan 26, 2023 at 1:18 PM veem v  wrote:

> Hello, We were trying to understand whether we should use UUID or Sequence
> in general for primary keys. In many of the blogs (one is below) across
> multiple databases, I saw over the internet and all are mostly stating the
> sequence is better as compared to UUID. But I think in the case of
> concurrent data load scenarios UUID will spread the contention point
> whereas sequence can be a single point of contention.
>
> So we want to understand from experts here, if there are any clear rules
> available or if we have any pros vs cons list available for each of those
> to understand the exact scenario in which we should go for one over other?
> Basically I wanted to see if we can perform some test on sample data to
> see the percentage of overhead on read and write performances of the query
> in presence of UUID VS Sequence to draw some conclusion in general? And
> also considering open source postgres as the base for many databases like
> redshift etc, so the results which apply to postgres would apply to others
> as well.
>
>
> https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
>
* UUIDs are big for identifiers (16 bytes); sequences are 4 or 8
* sequences are NOT a contention point; the database uses some tricks to work
around that
* UUIDs are basically random data, causing page fragmentation.  This is
particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges
of records on large tables.

merlin
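
The buffer-hit point can be illustrated with a toy model: treat the
index as fixed-size leaf pages in key order and count how many distinct
pages a batch of inserts lands on. The page and table sizes below are
made-up numbers for illustration, not PostgreSQL internals:

```python
import random

PAGE_ENTRIES = 200      # pretend keys per index leaf page (assumption)
EXISTING = 1_000_000    # rows already in the table (assumption)
BATCH = 1_000           # new rows inserted in one batch

def pages_touched(new_keys):
    # A key k lives on leaf page k // PAGE_ENTRIES in key order.
    return len({k // PAGE_ENTRIES for k in new_keys})

# Sequential ids append past the current maximum: a handful of hot pages.
seq_keys = range(EXISTING, EXISTING + BATCH)

# Random ids (the UUIDv4 case) scatter across the whole key space.
rng = random.Random(42)
rand_keys = [rng.randrange(EXISTING) for _ in range(BATCH)]

print(pages_touched(seq_keys))   # 5 pages
print(pages_touched(rand_keys))  # close to one page per insert
```

Five hot pages stay cached; a thousand scattered pages mostly don't,
which is where the cache hit rate collapse in this thread comes from.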


Re: Sequence vs UUID

2023-01-26 Thread Christophe Pettus



> On Jan 26, 2023, at 11:17, veem v  wrote:
> So we want to understand from experts here, if there are any clear rules 
> available or if we have any pros vs cons list available for each of those to 
> understand the exact scenario in which we should go for one over other?

Clear rules are a bit difficult to come by here, but there are general 
guidelines.

First, the distinction isn't exactly UUIDs vs sequences.  There are two 
distinctions:

1. UUIDs vs bigints.
2. Sequential values vs random values.

The advantage of bigints vs UUIDs is that bigints will be faster for PostgreSQL to 
process in a variety of ways, as well as being half the size of a UUID (so, 
smaller tables, smaller indexes, etc.).  The main advantage of UUIDs vs bigints 
is, if you are using random values, that there's existing easy-to-use infrastructure 
for generating UUIDs vs generating random bigints.

The advantage of sequential values is that they interact much better with 
caching than random values.  Random values will have a harder time maintaining 
a reasonable in-memory working set than sequential values.  (Of course, if the 
database fits entirely in memory, this isn't as much of an issue).  The 
advantage of random values is that they are more opaque; if there is a 
123480102 in a sequential key, an attacker can be confident there's also a 
123480103, which a random value avoids.  There are algorithms for generating 
sequential values that avoid this by having hard-to-guess less significant 
digits.

Another advantage of sequential values is that they are (roughly) time-ordered, 
so they can be used to get "most recent" efficiently.

One concern about sequential values that generally is not a real issue is the 
bottleneck of creating new sequential values.  The sequence functionality in 
PostgreSQL is very concurrency-friendly.

UUIDs can be generated in such a way that they have sequential properties; see:

https://github.com/tvondra/sequential-uuids
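
As a rough illustration of what such generators do, here is a sketch
along the lines of the draft UUIDv7 layout (48-bit millisecond
timestamp up front, random bits below). This is illustrative only, not
the algorithm of the extension linked above:

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Time-ordered UUID sketch per the draft UUIDv7 layout:
    48-bit Unix millisecond timestamp, then version/variant bits,
    then random bits for the hard-to-guess lower digits."""
    ms = time.time_ns() // 1_000_000
    value = ((ms & ((1 << 48) - 1)) << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # set version nibble to 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # set RFC 4122 variant bits
    return uuid.UUID(int=value)

a = uuid7_like()
time.sleep(0.002)   # move to a later millisecond
b = uuid7_like()
assert a < b        # later creation time sorts later
```

The ids remain opaque below millisecond granularity while still being
time-ordered, which is the combination discussed above.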






Sequence vs UUID

2023-01-26 Thread veem v
Hello, We were trying to understand whether we should use UUID or Sequence
in general for primary keys. In many of the blogs (one is below) across
multiple databases, I saw over the internet and all are mostly stating the
sequence is better as compared to UUID. But I think in the case of
concurrent data load scenarios UUID will spread the contention point
whereas sequence can be a single point of contention.

So we want to understand from experts here, if there are any clear rules
available or if we have any pros vs cons list available for each of those
to understand the exact scenario in which we should go for one over
the other? Basically
I wanted to see if we can perform some test on sample data to see the
percentage of overhead on read and write performances of the query in
presence of UUID VS Sequence to draw some conclusion in general? And also
considering open source postgres as the base for many databases like
redshift etc, so the results which apply to postgres would apply to others
as well.

https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/