Re: [HACKERS] Generating Lots of PKs with nextval(): A Feature Proposal

2010-05-14 Thread Peter Crabtree
On Fri, May 14, 2010 at 5:29 PM, Robert Haas  wrote:
> On Fri, May 14, 2010 at 5:04 PM, hubert depesz lubaczewski wrote:
>> On Fri, May 14, 2010 at 02:07:27PM -0500, Kenneth Marshall wrote:
>>> Hi Peter,
>>>
>>> All you need to do is define your own sequence with an
>>> increment of 500. Look at:
>>>
>>> http://www.postgresql.org/docs/8.4/static/sql-createsequence.html
>>
>> This is often not enough. For example - I want a standard increment of 1,
>> but right now I'm importing a large batch of objects, and it would be
>> simpler for me to get a matching number of ids. Preferably in one block.
>>
>> This is not achievable now. I know I can 'alter sequence set increment
>> by' - but this will also affect concurrent sessions, which might not be
>> a problem, but it's a side effect that I don't want.
>>
>> +1 for original proposition, would love to get it.
>
> If we do this, I'm inclined to think that the extra argument to
> nextval() should be treated as overriding the base increment rather
> than specifying a multiplier for it.  Other than that nitpick, it
> sounds like a reasonable thing to allow.
>

After giving it some thought, that sounds better. You gain some
functionality that way (temporarily overriding the interval) and lose
none.
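To make sure we mean the same thing - under the override interpretation,
a call would look like this (a sketch only; the two-argument nextval()
doesn't exist yet):

SELECT nextval('my_seq', 500);

and the 500 would be used as the increment for that single call,
reserving 500 values and returning the first of them, regardless of the
sequence's declared INCREMENT BY.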

Peter

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generating Lots of PKs with nextval(): A Feature Proposal

2010-05-14 Thread Peter Crabtree
On Fri, May 14, 2010 at 5:27 PM, Tom Lane  wrote:
> Peter Crabtree  writes:
>> Now, I was reminded that I could simply do this:
>
>> SELECT nextval('my_seq') FROM generate_series(1, 500);
>
>> But of course then I would have no guarantee that I would get a
>> contiguous block of ids,
>
> The existing "cache" behavior will already handle that for you,
> I believe.  I don't really see a need for new features here.

I don't see how that works for this case, because the "cache" setting
is "static", and also shared between sessions. So if I have 10 records
one time, and 100 records the next, and 587 the third time, what
should my CACHE be set to for that sequence?

And if I do ALTER SEQUENCE SET CACHE each time, I have either killed
concurrency (because I'm locking other sessions out of using that
sequence until I'm finished with it), or I have a race condition (if
someone else issues an ALTER SEQUENCE before I call nextval()). The
same problem exists with using ALTER SEQUENCE SET INCREMENT BY.
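To spell that out, the transactional variant would look something like
this sketch - the ALTER takes an exclusive lock on the sequence until
commit, so every other session's nextval('my_seq') blocks in the
meantime:

BEGIN;
ALTER SEQUENCE my_seq INCREMENT BY 500;
SELECT nextval('my_seq');
ALTER SEQUENCE my_seq INCREMENT BY 1;
COMMIT;

And running the two ALTERs outside a transaction is exactly the race:
another session can call nextval() between them and get a stride of 500.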

Peter



[HACKERS] Generating Lots of PKs with nextval(): A Feature Proposal

2010-05-14 Thread Peter Crabtree
Recently, in preparation for migrating an application to postgres, I
got to this part of the manual (which is *excellent* so far, by the
way):

http://www.postgresql.org/docs/8.4/interactive/functions-sequence.html

A quick check with the folks on #postgresql confirmed my
understanding, which was that the locking semantics of setval() and
nextval() make this unsafe:

SELECT setval('my_seq', nextval('my_seq') + 500);

Now, I was reminded that I could simply do this:

SELECT nextval('my_seq') FROM generate_series(1, 500);

But of course then I would have no guarantee that I would get a
contiguous block of ids. That matters if I'm using this to do a mass
insert of records which refer to each other's ids (example: storing a
directed, linear graph). In that case I either have to do a correlated
update on the client side after transferring the keys (consider the
cost of doing this for a few million records: roughly 4 MB in keys per
million records, for, in extreme cases, 12 MB of data to be inserted --
a 33% overhead in the worst case, presuming symmetric bandwidth), or I
have to insert into a temporary table, have the db backend do the
update, and then insert from there into the real table. Both are
imperfect options in terms of performance and complexity.
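For reference, the temporary-table variant I have in mind would look
roughly like this (table and column names are invented for
illustration):

CREATE TEMP TABLE staging (pk integer, next_node integer);
-- COPY the client-side rows (keys 0 .. n-1) into staging, then:
CREATE TEMP TABLE keymap AS
  SELECT pk AS old_pk, nextval('my_seq') AS new_pk FROM staging;
INSERT INTO my_table (pk, next_node)
  SELECT k1.new_pk, k2.new_pk
  FROM staging s
  JOIN keymap k1 ON k1.old_pk = s.pk
  JOIN keymap k2 ON k2.old_pk = s.next_node;

Workable, but it's an extra round trip through two scratch tables just
to renumber the keys.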

Thus, before I start work on it, I propose an extension to the current
nextval():

SELECT nextval('my_seq', 500);

This would advance the my_seq sequence by its increment * 500 and
return the first key of the reserved block. This both makes client code
that needs a bunch of PKs simpler to implement, and improves
performance, since the client can just replace all its PKs (presuming
they're currently a contiguous block from 0 to n) with my_starting_pk +
current_pk, so this:

 pk | next_node
----+-----------
  0 |         1
  1 |         2
  2 |         0

can be easily updated like this:

SELECT nextval('my_seq', (SELECT count(*) FROM my_table));
UPDATE my_table
   SET pk = currval('my_seq') + pk,
       next_node = currval('my_seq') + next_node;

to something like this:

   pk   | next_node
--------+-----------
 521650 |    521651
 521651 |    521652
 521652 |    521650

This is a net gain of performance and ease of implementation in many
cases where a large number of ids from a sequence are needed -- with a
small added benefit of the keys being guaranteed to be contiguous.

I don't see any technical problems with this; postgres already can
"pre-allocate" more than one key, but the number is semi-static (the
CACHE parameter to CREATE SEQUENCE). This might break existing user
code if they've defined their own nextval(regclass, integer), but I
don't see any way around that.
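For comparison, the closest existing knob is fixed at sequence
definition time:

CREATE SEQUENCE my_seq CACHE 500;

which makes each backend grab 500 values at a time - but that 500 is
static, and the cached values are private to the session rather than
being handed to the caller as one known, contiguous block.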

Finally, I've checked sequence.c -- this looks pretty straightforward
to implement, but I figured checking with this list was wise before
starting work. Apologies if I've been overly wordy.

Peter
