Re: How to do faster DML

2024-02-17 Thread Peter J. Holzer
On 2024-02-16 01:34:01 +0100, Peter J. Holzer wrote:
> On 2024-02-15 16:51:56 -0700, David G. Johnston wrote:
> > What you see with an exact type is what you get, which allows for 
> > implementing
> > equality, unlike inexact which requires epsilon checking.
> 
> You can check binary fp values for equality. If they are equal, they
> will compare equal. If they aren't, they won't.
> 
> What you can't expect is that the laws of commutativity, associativity,
> etc. hold. If you compute a value in two different ways which should be
> equivalent mathematically (e.g. a*b/c vs. a/c*b), the result may be
> different and an equality test may fail. But that is also the case for
> numeric (and of course integer).

To illustrate that point:

hjp=> create table t_n (a numeric, b numeric, c numeric);
CREATE TABLE

hjp=> insert into t_n values(47, 52, 10);
INSERT 0 1

-- the numbers are not specially chosen. I just invoked
-- select (random()*100)::int;
-- three times, and they were the ones that came up.

hjp=> select *, a / b * c, a * c / b, a / b * c = a * c / b from t_n;
╔════╤════╤════╤════════════════════════╤════════════════════╤══════════╗
║ a  │ b  │ c  │        ?column?        │      ?column?      │ ?column? ║
╟────┼────┼────┼────────────────────────┼────────────────────┼──────────╢
║ 47 │ 52 │ 10 │ 9.03846153846153846150 │ 9.0384615384615385 │ f        ║
╚════╧════╧════╧════════════════════════╧════════════════════╧══════════╝
(1 row)

So with type numeric, two expressions which should be mathematically
equal aren't in fact equal.

Now let's try the same thing with binary floating point:

hjp=> create table t_f (a float8, b float8, c float8);
CREATE TABLE

hjp=> insert into t_f values(47, 52, 10);
INSERT 0 1

hjp=> select *, a / b * c, a * c / b, a / b * c = a * c / b from t_f;
╔════╤════╤════╤═══════════════════╤═══════════════════╤══════════╗
║ a  │ b  │ c  │     ?column?      │     ?column?      │ ?column? ║
╟────┼────┼────┼───────────────────┼───────────────────┼──────────╢
║ 47 │ 52 │ 10 │ 9.038461538461538 │ 9.038461538461538 │ t        ║
╚════╧════╧════╧═══════════════════╧═══════════════════╧══════════╝
(1 row)

Now they are indeed equal. This is *not* guaranteed and I got a bit
lucky here, but the fact that I got lucky on the first try shows that
"float bad, numeric good" is not backed up by reality.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-16 Thread Peter J. Holzer
On 2024-02-16 12:10:20 +0530, veem v wrote:
> 
> On Fri, 16 Feb 2024 at 06:04, Peter J. Holzer  wrote:
> 
> On 2024-02-15 16:51:56 -0700, David G. Johnston wrote:
> > On Thu, Feb 15, 2024 at 4:31 PM Peter J. Holzer  
> wrote:
> >     On 2024-02-14 22:55:01 -0700, David G. Johnston wrote:
> >     > On Tuesday, February 13, 2024, veem v  wrote:
> >     >
> >     >     float data types rather than numeric. This will give better
> >     >     performance.
> >     >
> >     >
> >     > Only use an inexact floating-point data type if you truly
> >     > understand what you are getting yourself into.  Quickly getting
> >     > the wrong answer isn’t tolerable solution.
> >
> >     Do NOT assume that a decimal type (even if it can grow to 
> ridiculuous
> >     lengths like PostgreSQL's numeric) is exact in the mathematical
> sense.
> >     It isn't. It cannot represent almost all real numbers
> >
> >
> 
> 
> Thank You.
> 
> So it looks like the use cases where we don't need precision or decimal point
> values to be stored in postgres , integer data type is the way to go without a
> doubt.
> 
> However in cases of precision is required, as you all mentioned there are
> certain issues(rounding error etc) with "Float" data type and considering a
> normal developers usage point of view, it should be the Numeric type which we
> should use.

You misunderstood. My point was that these rounding errors also happen
with numeric, and if you want to avoid or minimize them you have to
understand what you are doing. For a hilarious example of what happens
if you don't understand that, see
https://en.wikipedia.org/wiki/Vancouver_Stock_Exchange#Rounding_errors_on_its_Index_price

I basically see two reasons to use numeric:

* Your numbers are amounts of money. Accountants are neither mathematicians
  nor engineers, and numeric mimics the way they think. So the results
  will be wrong in the correct way ;-)
* You need lots (more than 15 or 18) digits.

For anything else there is a good chance that float8 or int8 is a better
choice, because those types behave much more consistently.

Of course there are valid reasons to use other types (including numeric)
but the point is that each type has real pros and cons and false
arguments like "numeric is an exact type and float isn't" are not
helpful. That said, "I don't understand binary numbers" might be a valid
reason.
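
To make the "lots of digits" point concrete, a quick sketch (the constants
are just 2^53 and 2^53 + 1, chosen for illustration):

-- float8 has 53 bits of mantissa (about 15-17 significant decimal digits),
-- so these two distinct integers collapse to the same float8 value:
select 9007199254740993::float8 = 9007199254740992::float8;   -- t

-- numeric keeps every digit, so it can still tell them apart:
select 9007199254740993::numeric = 9007199254740992::numeric; -- f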

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-15 Thread veem v
On Fri, 16 Feb 2024 at 06:04, Peter J. Holzer  wrote:

> On 2024-02-15 16:51:56 -0700, David G. Johnston wrote:
> > On Thu, Feb 15, 2024 at 4:31 PM Peter J. Holzer 
> wrote:
> > On 2024-02-14 22:55:01 -0700, David G. Johnston wrote:
> > > On Tuesday, February 13, 2024, veem v  wrote:
> > >
> > > float data types rather than numeric. This will give better
> > > performance.
> > >
> > >
> > > Only use an inexact floating-point data type if you truly
> > > understand what you are getting yourself into.  Quickly getting
> > > the wrong answer isn’t tolerable solution.
> >
> > Do NOT assume that a decimal type (even if it can grow to ridiculuous
> > lengths like PostgreSQL's numeric) is exact in the mathematical
> sense.
> > It isn't. It cannot represent almost all real numbers
> >
> >
>
> Thank You.

So it looks like, for the use cases where we don't need precision or decimal
point values to be stored in postgres, an integer data type is the way to go
without a doubt.

However, in cases where precision is required, as you all mentioned there are
certain issues (rounding errors etc.) with the "float" data types, and from a
normal developer's usage point of view it should be the numeric type which
we should use. I think the consistent working or functionality of an
application takes precedence over performance. And I believe, in most real
life scenarios where we need precision, we expect the values to behave
consistently across all the application and database platforms (mainly in
banking industries), and thus numeric seems the safest data type to use
across multiple database platforms. Please correct me if I'm wrong.

Regards
Veem


Re: How to do faster DML

2024-02-15 Thread Peter J. Holzer
On 2024-02-15 16:51:56 -0700, David G. Johnston wrote:
> On Thu, Feb 15, 2024 at 4:31 PM Peter J. Holzer  wrote:
> On 2024-02-14 22:55:01 -0700, David G. Johnston wrote:
> > On Tuesday, February 13, 2024, veem v  wrote:
> >
> >     float data types rather than numeric. This will give better
> > performance.
> >
> >
> > Only use an inexact floating-point data type if you truly
> > understand what you are getting yourself into.  Quickly getting
> > the wrong answer isn’t tolerable solution.
> 
> Do NOT assume that a decimal type (even if it can grow to ridiculuous
> lengths like PostgreSQL's numeric) is exact in the mathematical sense.
> It isn't. It cannot represent almost all real numbers
> 
> 
> That is an unusual definition for exact, I wouldn't have considered the
> requirement to represent all real numbers to be included in it.
> 
> What you see with an exact type is what you get, which allows for implementing
> equality, unlike inexact which requires epsilon checking.

You can check binary fp values for equality. If they are equal, they
will compare equal. If they aren't, they won't.

What you can't expect is that the laws of commutativity, associativity,
etc. hold. If you compute a value in two different ways which should be
equivalent mathematically (e.g. a*b/c vs. a/c*b), the result may be
different and an equality test may fail. But that is also the case for
numeric (and of course integer). You might get around that by epsilon
checking, but whether that's the right thing to do depends on your
application.

And most importantly - what gets most people on the "oh noes, binary fp is
inexact" bandwagon - is that decimal fractions (1/10, 1/100, ...) are not
exactly representable in binary, just like 1/3, 1/7, 1/11, ... aren't
exactly representable in decimal. People are used to the latter, but not to
the former. But mathematically, that doesn't really make a difference.
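
A small sketch of both halves of that point (constants made up for
illustration):

-- 1/10 has no exact binary representation, so binary fp rounds it:
select 0.1::float8 + 0.2::float8 = 0.3::float8;   -- f

-- and 1/3 has no exact decimal representation, so numeric rounds it:
select (1::numeric / 3) * 3 = 1::numeric;         -- f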

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-15 Thread David G. Johnston
On Thu, Feb 15, 2024 at 4:31 PM Peter J. Holzer  wrote:

> On 2024-02-14 22:55:01 -0700, David G. Johnston wrote:
> > On Tuesday, February 13, 2024, veem v  wrote:
> >
> > float data types rather than numeric. This will give better
> performance.
> >
> >
> >  Only use an inexact floating-point data type if you truly understand
> what you
> > are getting yourself into.  Quickly getting the wrong answer isn’t
> tolerable
> > solution.
>
> Do NOT assume that a decimal type (even if it can grow to ridiculuous
> lengths like PostgreSQL's numeric) is exact in the mathematical sense.
> It isn't. It cannot represent almost all real numbers
>

That is an unusual definition for exact; I wouldn't have considered the
requirement to represent all real numbers to be included in it.

What you see with an exact type is what you get, which allows for
implementing equality, unlike inexact which requires epsilon checking.
That you need to round some values to the nearest exact value is true but
doesn't make represented values less exact.  But yes, numbers in computers
are complicated and require attention to use.  But not having to worry
about epsilon is still a win.

David J.


Re: How to do faster DML

2024-02-15 Thread Peter J. Holzer
On 2024-02-14 22:55:01 -0700, David G. Johnston wrote:
> On Tuesday, February 13, 2024, veem v  wrote:
> 
> float data types rather than numeric. This will give better performance.
> 
> 
>  Only use an inexact floating-point data type if you truly understand what you
> are getting yourself into.  Quickly getting the wrong answer isn’t tolerable
> solution.

Do NOT assume that a decimal type (even if it can grow to ridiculous
lengths like PostgreSQL's numeric) is exact in the mathematical sense.
It isn't. It cannot represent almost all real numbers. No pi or e of
course, but also no seemingly simple fractions like 1/3 or 1/7.

Unless you never divide anything, you will need to be aware of the
rounding behaviour, just as you have to with binary floating point
types. And if you use a finite precision you will also have to deal with
rounding on multiplication (and possibly even addition and subtraction,
if you use different precisions).
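
For example (a sketch; the values and the declared precision are arbitrary):

-- division has to round at some finite scale, so the "exact" type
-- does not give you back the mathematically exact result:
select (1::numeric / 3) * 3;                     -- 0.99..., not 1

-- and with a declared precision, multiplication rounds too:
select (1.05::numeric(4,2) * 1.05::numeric(4,2))::numeric(4,2);  -- 1.10, the exact 1.1025 is rounded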

Almost 40 years ago, our numerical methods professor started his first
lecture with the sentence "You can use a computer for anything - except
computing". He spent the rest of the semester proving himself wrong,
of course, but computing correctly is hard - and choosing a data type
which more closely mimics the way we learn to compute in primary school
doesn't necessarily make it easier. Mostly it just makes it harder to
spot the errors ;-).

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
I really worry you are overthinking this. The only real concern is going
from INT to BIGINT, jumping from 4 to 8 bytes of storage. That really
covers 99% of real world cases, and the canonical advice is to start with
BIGINT if you ever think your rows are going to be numbered in the
billions. Also, a NUMERIC can cause a table rewrite - try changing the
scale, not just the precision. And if your scale is 0, why are you using
numeric? :)
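
A rough way to check which changes rewrite the table is to compare the
relation's filenode before and after (a sketch with made-up names; the exact
set of non-rewrite cases depends on your PostgreSQL version):

create table num_test (amount numeric(10,2));
select pg_relation_filenode('num_test');

-- increasing only the precision should not force a rewrite:
alter table num_test alter column amount type numeric(12,2);
select pg_relation_filenode('num_test');   -- expected unchanged

-- changing the scale does force a rewrite:
alter table num_test alter column amount type numeric(12,3);
select pg_relation_filenode('num_test');   -- expected to be a new filenode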

Cheers,
Greg


Re: How to do faster DML

2024-02-15 Thread David G. Johnston
On Thu, Feb 15, 2024 at 12:18 PM veem v  wrote:

>
> So one learning for me, i.e. one of the downside of fixed length data type
> is, with fixed length data types any future changes to it , will be a full
> table rewrite. And thus this sort of change for big tables will be a
> nightmare.
>
>
Yes, using the wording in the documentation, there is no such thing as a
"binary coercible" change for a fixed-width data type.  Or for most types
really.  Text is one of the few for which the typmod has no meaning and
there are variant spellings like varchar that allow for the underlying
storage representation to be the same.
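
For example (a sketch; names are made up), switching between text and
unbounded varchar should be a catalog-only change:

create table t_coerce (val text);
select pg_relation_filenode('t_coerce');

alter table t_coerce alter column val type varchar;
select pg_relation_filenode('t_coerce');   -- expected unchanged, i.e. no rewrite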

David J.


Re: How to do faster DML

2024-02-15 Thread veem v
On Thu, 15 Feb 2024 at 22:40, Adrian Klaver 
wrote:

> On 2/15/24 09:00, Greg Sabino Mullane wrote:
> > On Thu, Feb 15, 2024 at 11:43 AM Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>> wrote:
> >
> > That is a mixed bag:
> >
> >
> > Ha! Good point. Our contrived example table does suffer from that, so
> > perhaps the test should be:
> >
> > create table int_test(c1 int, c2 int);
>
> Alright now I see:
>
> test=# create table int_test(c1 int, c2 int);
> CREATE TABLE
>
> test=# select pg_relation_filenode('int_test');
>   pg_relation_filenode
> --
>  70021
> (1 row)
>
>
> test=# insert into int_test select a, a+1  from generate_series(1,
> 1, 1) as t(a);
> INSERT 0 1
>
> test=# select pg_relation_size('int_test');
>   pg_relation_size
> --
> 368640
> (1 row)
>
> test=# alter table int_test alter column c2 set data type bigint;
> ALTER TABLE
>
> test=# select pg_relation_filenode('int_test');
>   pg_relation_filenode
> --
>  70024
> (1 row)
>
> test=# select pg_relation_size('int_test');
>   pg_relation_size
> --
> 450560
> (1 row)
>
>
Thank you.

Did a similar test as below using DB Fiddle. Same result for the fixed-length
data type, i.e. the size increased. However for variable-length types (like
numeric) it stayed the same, so that must be just a metadata change and thus
should be quick enough even for a big table.
So one learning for me: one downside of fixed-length data types is that any
future change to them will be a full table rewrite, and thus this sort of
change for big tables will be a nightmare.

https://dbfiddle.uk/_gNknf0D

Regards
Veem


Re: How to do faster DML

2024-02-15 Thread Adrian Klaver

On 2/15/24 09:00, Greg Sabino Mullane wrote:
On Thu, Feb 15, 2024 at 11:43 AM Adrian Klaver 
mailto:adrian.kla...@aklaver.com>> wrote:


That is a mixed bag:


Ha! Good point. Our contrived example table does suffer from that, so 
perhaps the test should be:


create table int_test(c1 int, c2 int);


Alright now I see:

test=# create table int_test(c1 int, c2 int);
CREATE TABLE

test=# select pg_relation_filenode('int_test');
 pg_relation_filenode
--
70021
(1 row)


test=# insert into int_test select a, a+1  from generate_series(1, 
1, 1) as t(a);

INSERT 0 1

test=# select pg_relation_size('int_test');
 pg_relation_size
--
   368640
(1 row)

test=# alter table int_test alter column c2 set data type bigint;
ALTER TABLE

test=# select pg_relation_filenode('int_test');
 pg_relation_filenode
--
70024
(1 row)

test=# select pg_relation_size('int_test');
 pg_relation_size
--
   450560
(1 row)



Cheers,
Greg



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
On Thu, Feb 15, 2024 at 11:43 AM Adrian Klaver 
wrote:

> That is a mixed bag:
>

Ha! Good point. Our contrived example table does suffer from that, so
perhaps the test should be:

create table int_test(c1 int, c2 int);

Cheers,
Greg


Re: How to do faster DML

2024-02-15 Thread Adrian Klaver

On 2/15/24 08:16, Greg Sabino Mullane wrote:

So as I also tested the same as you posted, there has been no change
in "ctid" , when I altered the column data type from 'int' to
'bigint' in the table, so that means full table rewriting
won't happen in such a scenario.


No it was definitely rewritten - do not depend on the ctid to verify 
that. Take our word for it, or 
use *pg_relation_filenode('int_test')* before and after, as well as 
*pg_relation_size('int_test')*;


That is a mixed bag:

test=# select pg_relation_filenode('int_test');
 pg_relation_filenode
--
6
(1 row)

test=# select pg_relation_size('int_test');
 pg_relation_size
--
   368640
(1 row)

test=# alter table int_test alter column int_fld set data type bigint;
ALTER TABLE
test=# select pg_relation_filenode('int_test');
 pg_relation_filenode
--
70002
(1 row)

test=# select pg_relation_size('int_test');
 pg_relation_size
--
   368640



Cheers,
Greg



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
>
> So as I also tested the same as you posted, there has been no change in
> "ctid" , when I altered the column data type from 'int' to 'bigint' in the
> table, so that means full table rewriting won't happen in such a scenario.


No it was definitely rewritten - do not depend on the ctid to verify that.
Take our word for it, or use *pg_relation_filenode('int_test')* before and
after, as well as *pg_relation_size('int_test')*;

Cheers,
Greg


Re: How to do faster DML

2024-02-14 Thread David G. Johnston
On Tuesday, February 13, 2024, veem v  wrote:
>
> float data types rather than numeric. This will give better performance.
>

 Only use an inexact floating-point data type if you truly understand what
you are getting yourself into.  Quickly getting the wrong answer isn’t
a tolerable solution.

David J.


Re: How to do faster DML

2024-02-14 Thread Ron Johnson
On Wed, Feb 14, 2024 at 11:58 PM veem v  wrote:

>
>
> On Thu, 15 Feb 2024 at 00:43, Adrian Klaver 
> wrote:
>
>> It depends:
>>
>> https://www.postgresql.org/docs/current/sql-altertable.html
>>
>> "Adding a column with a volatile DEFAULT or changing the type of an
>> existing column will require the entire table and its indexes to be
>> rewritten. As an exception, when changing the type of an existing column,
>> if the USING clause does not change the column contents and the old type
>> is either binary coercible to the new type or an unconstrained domain over
>> the new type, a table rewrite is not needed. However, indexes must always
>> be rebuilt unless the system can verify that the new index would be
>> logically equivalent to the existing one. For example, if the collation for
>> a column has been changed, an index rebuild is always required because the
>> new sort order might be different. However, in the absence of a collation
>> change, a column can be changed from text to varchar (or vice versa)
>> without rebuilding the indexes because these data types sort identically.
>> Table and/or index rebuilds may take a significant amount of time for a
>> large table; and will temporarily require as much as double the disk space.
>>
>> "
>>
>>
>> create table int_test(int_fld integer);
>>
>> insert into int_test select * from generate_series(1, 1, 1);
>>
>>
>> select ctid, int_fld from int_test ;
>>
>> ctid   | int_fld
>> --+-
>>  (0,1)|   1
>>  (0,2)|   2
>>  (0,3)|   3
>>  (0,4)|   4
>>  (0,5)|   5
>>  (0,6)|   6
>>  (0,7)|   7
>>  (0,8)|   8
>>  (0,9)|   9
>>  (0,10)   |  10
>>
>>
>> alter table int_test alter column int_fld set data type bigint;
>>
>> select ctid, int_fld from int_test ;
>>
>>   ctid   | int_fld
>> --+-
>>  (0,1)|   1
>>  (0,2)|   2
>>  (0,3)|   3
>>  (0,4)|   4
>>  (0,5)|   5
>>  (0,6)|   6
>>  (0,7)|   7
>>  (0,8)|   8
>>  (0,9)|   9
>>  (0,10)   |  10
>>
>> update int_test set  int_fld = int_fld;
>>
>> select ctid, int_fld from int_test  order by int_fld;
>>
>> (63,1)|   1
>>  (63,2)|   2
>>  (63,3)|   3
>>  (63,4)|   4
>>  (63,5)|   5
>>  (63,6)|   6
>>  (63,7)|   7
>>  (63,8)|   8
>>  (63,9)|   9
>>  (63,10)   |  10
>>
>>
>> Where ctid is:
>>
>> https://www.postgresql.org/docs/current/ddl-system-columns.html
>>
>>
>>  Thank you so much.
> So as I also tested the same as you posted, there has been no change in
> "ctid" , when I altered the column data type from 'int' to 'bigint' in the
> table, so that means full table rewriting won't happen in such a scenario.
>

It happened when I altered columns from INTEGER to BIGINT.  How do I know?

The disk filled up.

>


Re: How to do faster DML

2024-02-14 Thread veem v
On Thu, 15 Feb 2024 at 00:43, Adrian Klaver 
wrote:

> It depends:
>
> https://www.postgresql.org/docs/current/sql-altertable.html
>
> "Adding a column with a volatile DEFAULT or changing the type of an
> existing column will require the entire table and its indexes to be
> rewritten. As an exception, when changing the type of an existing column,
> if the USING clause does not change the column contents and the old type
> is either binary coercible to the new type or an unconstrained domain over
> the new type, a table rewrite is not needed. However, indexes must always
> be rebuilt unless the system can verify that the new index would be
> logically equivalent to the existing one. For example, if the collation for
> a column has been changed, an index rebuild is always required because the
> new sort order might be different. However, in the absence of a collation
> change, a column can be changed from text to varchar (or vice versa)
> without rebuilding the indexes because these data types sort identically.
> Table and/or index rebuilds may take a significant amount of time for a
> large table; and will temporarily require as much as double the disk space.
>
> "
>
>
> create table int_test(int_fld integer);
>
> insert into int_test select * from generate_series(1, 1, 1);
>
>
> select ctid, int_fld from int_test ;
>
> ctid   | int_fld
> --+-
>  (0,1)|   1
>  (0,2)|   2
>  (0,3)|   3
>  (0,4)|   4
>  (0,5)|   5
>  (0,6)|   6
>  (0,7)|   7
>  (0,8)|   8
>  (0,9)|   9
>  (0,10)   |  10
>
>
> alter table int_test alter column int_fld set data type bigint;
>
> select ctid, int_fld from int_test ;
>
>   ctid   | int_fld
> --+-
>  (0,1)|   1
>  (0,2)|   2
>  (0,3)|   3
>  (0,4)|   4
>  (0,5)|   5
>  (0,6)|   6
>  (0,7)|   7
>  (0,8)|   8
>  (0,9)|   9
>  (0,10)   |  10
>
> update int_test set  int_fld = int_fld;
>
> select ctid, int_fld from int_test  order by int_fld;
>
> (63,1)|   1
>  (63,2)|   2
>  (63,3)|   3
>  (63,4)|   4
>  (63,5)|   5
>  (63,6)|   6
>  (63,7)|   7
>  (63,8)|   8
>  (63,9)|   9
>  (63,10)   |  10
>
>
> Where ctid is:
>
> https://www.postgresql.org/docs/current/ddl-system-columns.html
>
>
>  Thank you so much.
So as I also tested the same as you posted, there was no change in
"ctid" when I altered the column data type from 'int' to 'bigint' in the
table, so that means a full table rewrite won't happen in such a scenario.

Regards
Veem


Re: How to do faster DML

2024-02-14 Thread Adrian Klaver


On 2/14/24 10:11 AM, veem v wrote:


On Wed, 14 Feb, 2024, 10:30 am Ron Johnson,  
wrote:


On Tue, Feb 13, 2024 at 4:17 PM veem v  wrote:
[snip]

One question here, if we have defined one column as a
fixed length data type "integer" and slowly we noticed the
length of data keeps increasing (in case of a sequence
generated PK column which will keep increasing), and we want
to alter the column to "bigint" now. In such scenario, will it
append/pad the additional spaces to all the existing values
which were already stored  with integer type initially in the
table. And that would be then an increase to the existing
table storage. Please correct me if I'm wrong.


ALTER TABLE foo ALTER COLUMN bar BIGINT; rewrites the whole
table.  Do yourself a favor, and start with BIGINT.



Not yet tried, but that's interesting. I was initially thinking as it 
will be increasing the length, so it would be just a metadata change 
and finish within seconds.



It depends:

https://www.postgresql.org/docs/current/sql-altertable.html

"Adding a column with a volatile |DEFAULT| or changing the type of an 
existing column will require the entire table and its indexes to be 
rewritten. As an exception, when changing the type of an existing 
column, if the |USING| clause does not change the column contents and 
the old type is either binary coercible to the new type or an 
unconstrained domain over the new type, a table rewrite is not needed. 
However, indexes must always be rebuilt unless the system can verify 
that the new index would be logically equivalent to the existing one. 
For example, if the collation for a column has been changed, an index 
rebuild is always required because the new sort order might be 
different. However, in the absence of a collation change, a column can 
be changed from |text| to |varchar| (or vice versa) without rebuilding 
the indexes because these data types sort identically. Table and/or 
index rebuilds may take a significant amount of time for a large table; 
and will temporarily require as much as double the disk space.


"


create table int_test(int_fld integer);

insert into int_test select * from generate_series(1, 1, 1);


select ctid, int_fld from int_test ;

ctid   | int_fld
--+-
 (0,1)    |   1
 (0,2)    |   2
 (0,3)    |   3
 (0,4)    |   4
 (0,5)    |   5
 (0,6)    |   6
 (0,7)    |   7
 (0,8)    |   8
 (0,9)    |   9
 (0,10)   |  10


alter table int_test alter column int_fld set data type bigint;

select ctid, int_fld from int_test ;

  ctid   | int_fld
--+-
 (0,1)    |   1
 (0,2)    |   2
 (0,3)    |   3
 (0,4)    |   4
 (0,5)    |   5
 (0,6)    |   6
 (0,7)    |   7
 (0,8)    |   8
 (0,9)    |   9
 (0,10)   |  10

update int_test set  int_fld = int_fld;

select ctid, int_fld from int_test  order by int_fld;

(63,1)    |   1
 (63,2)    |   2
 (63,3)    |   3
 (63,4)    |   4
 (63,5)    |   5
 (63,6)    |   6
 (63,7)    |   7
 (63,8)    |   8
 (63,9)    |   9
 (63,10)   |  10


Where ctid is:

https://www.postgresql.org/docs/current/ddl-system-columns.html


"

ctid

   The physical location of the row version within its table. Note that
   although the ctid can be used to locate the row version very
   quickly, a row's ctid will change if it is updated or moved by
   VACUUM FULL. Therefore ctid is useless as a long-term row
   identifier. A primary key should be used to identify logical rows.

"



 But as you mentioned, it seems to be the effect of "fixed length data 
type" which is why it's going to rewrite whole table even we just 
increases the column length. Hope it won't be the case in variable 
length data type.



--
Adrian Klaver
adrian.kla...@aklaver.com


Re: How to do faster DML

2024-02-14 Thread veem v
On Wed, 14 Feb, 2024, 10:30 am Ron Johnson,  wrote:

> On Tue, Feb 13, 2024 at 4:17 PM veem v  wrote:
> [snip]
>
>> One question here, if we have defined one column as a fixed length data
>> type "integer" and slowly we noticed the length of data keeps increasing
>> (in case of a sequence generated PK column which will keep increasing), and
>> we want to alter the column to "bigint" now. In such scenario, will it
>> append/pad the additional spaces to all the existing values which were
>> already stored  with integer type initially in the table. And that would be
>> then an increase to the existing table storage. Please correct me if I'm
>> wrong.
>>
>
> ALTER TABLE foo ALTER COLUMN bar BIGINT; rewrites the whole table.  Do
> yourself a favor, and start with BIGINT.
>


Not yet tried, but that's interesting. I was initially thinking that, since it
only increases the length, it would be just a metadata change and finish
within seconds.

 But as you mentioned, it seems to be an effect of the "fixed length data
type", which is why it's going to rewrite the whole table even though we just
increase the column length. Hopefully that won't be the case for variable
length data types.

>


Re: How to do faster DML

2024-02-13 Thread Ron Johnson
On Tue, Feb 13, 2024 at 4:17 PM veem v  wrote:
[snip]

> One question here, if we have defined one column as a fixed length data
> type "integer" and slowly we noticed the length of data keeps increasing
> (in case of a sequence generated PK column which will keep increasing), and
> we want to alter the column to "bigint" now. In such scenario, will it
> append/pad the additional spaces to all the existing values which were
> already stored  with integer type initially in the table. And that would be
> then an increase to the existing table storage. Please correct me if I'm
> wrong.
>

ALTER TABLE foo ALTER COLUMN bar TYPE BIGINT; rewrites the whole table.  Do
yourself a favor, and start with BIGINT.


Re: How to do faster DML

2024-02-13 Thread veem v
On Tue, 13 Feb 2024 at 20:32, Peter J. Holzer  wrote:

> Please do not conflate "char(n)" with native machine types like int or
> float. These are very different things. A char(n) is string of fixed but
> arbitrary length. This is not something a CPU can process in a single
> instruction. It has to go over it character by character.
>
> There is almost never a reason to use char(n). Just use varchar(n) or in
> the case of PostgreSQL just varchar or text.
>
> > However I do see even in Oracle databases, we have Integer type too,
>
> Not really. INTEGER is just an alias for NUMBER(38) in Oracle (see for
> example
>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/Data-Types.html
> ).
> It's not the same as an INTEGER in PostgreSQL.
>
>
Thank you so much for the clarification.

Basically, as I understood it, we can follow the below steps in order:

Step-1) Wherever possible, use smallint, integer, bigint or float data types
rather than numeric. This will give better performance.

Step-2) Put the frequently queried columns first and the least frequently
queried columns towards the end of the row while creating the table. This
too is intended for better performance.

Step-3) Order the columns by typlen descending as per the below query (the
"column tetris" symptom). This is for better storage space utilization.

SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY t.typlen DESC;

One question here: say we have defined one column as the fixed length data
type "integer" and slowly we noticed the length of the data keeps increasing
(in the case of a sequence-generated PK column, which will keep increasing),
and we now want to alter the column to "bigint". In such a scenario, will it
append/pad additional space to all the existing values which were already
stored with the integer type in the table? That would then be an increase to
the existing table storage. Please correct me if I'm wrong.

Regards
Veem


Re: How to do faster DML

2024-02-13 Thread Peter J. Holzer
On 2024-02-13 01:53:25 +0530, veem v wrote:
> On Mon, 12 Feb 2024 at 03:40, Peter J. Holzer  wrote:
> 
> The fixed width types are those that the CPU can directly process:
> Integers with 16, 32 and 64 bits, floating point numbers with 32 and 64
> bits. The CPU can read and write them with a single memory access, it
> can do arithmetic with a single instruction, etc.
> 
> Number/Numeric are not native types on any CPU. To read them the CPU
> needs several memory accesses (probably one per byte unless you get
> really clever) and then it can't do any calculations with them
> directly, instead it has run a subroutine which does operations on
> little chunks and then puts those chunks together again - basically the
> same as you do when you're doing long addition or multiplication on
> paper. So that's not very efficient.
> 
> 
> So it looks like the fixed length data type(like integer, float) should be the
> first choice while choosing the data type of the attributes wherever possible,
> as these are native types. (Like choosing "Integer/float" over "Numeric",
> "Char" over "Varchar" etc). 

Please do not conflate "char(n)" with native machine types like int or
float. These are very different things. A char(n) is a string of fixed but
arbitrary length. This is not something a CPU can process in a single
instruction. It has to go over it character by character.

There is almost never a reason to use char(n). Just use varchar(n) or in
the case of PostgreSQL just varchar or text.

> However I do see even in Oracle databases, we have Integer type too,

Not really. INTEGER is just an alias for NUMBER(38) in Oracle (see for
example
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/Data-Types.html).
It's not the same as an INTEGER in PostgreSQL.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-13 Thread Peter J. Holzer
On 2024-02-12 11:28:41 -0500, Greg Sabino Mullane wrote:
> On Mon, Feb 12, 2024 at 1:50 AM veem v  wrote:
> 
> So we were thinking, adding many column to a table should be fine in
> postgres (as here we have a use case in which total number of columns may
> go till ~500+). But then,  considering the access of columns towards the
> end of a row is going to add more time as compared to the column which is
> at the start of the row. As you mentioned, accessing 100th column may add 
> 4
> to 5 times more as compared to the access of first column. So , is it
> advisable here to go for similar approach of breaking the table into two ,
> if the total number of column reaches certain number/threshold for a 
> table?
> 
> 
> I'm not sure of what Peter was testing exactly to get those 4-5x figures,

Sorry, I should have included my benchmark code (it's short enough - see
below).

What I was actually timing was 

select count(*) from t_postgresql_column_bench where v{i} = 'a'

for various i.

> but I presume that is column access time,

That was the goal. Of course there is always some overhead but I figured
that by counting rows where a column has a constant value the overhead
is minimal or at least constant.

> which would not mean a direct effect on your total query time of 4-5x.

Right. In any real application the column access time is only a part of
the total processing time and probably a small part, so the effect on
total processing time is correspondingly smaller.

hp


#!/usr/bin/python3

import random
import time
import psycopg2

n_cols = 100
n_rows = 10

db = psycopg2.connect("")
csr = db.cursor()

csr.execute("drop table if exists t_postgresql_column_bench")
q = "create table t_postgresql_column_bench ("
q += ", ".join(f"v{i} text" for i in range(n_cols))
q += ")"
csr.execute(q)
q = "insert into t_postgresql_column_bench values("
q += ", ".join("%s" for i in range(n_cols))
q += ")"

for j in range(n_rows):
    v = [ chr(random.randint(96+1, 96+26)) for i in range(n_cols)]
    csr.execute(q, v)
db.commit()

for i in range(n_cols):
    q = f"select count(*) from t_postgresql_column_bench where v{i} = 'a'"
    t0 = time.clock_gettime(time.CLOCK_MONOTONIC)
    csr.execute(q)
    r = csr.fetchall()
    print(r)
    t1 = time.clock_gettime(time.CLOCK_MONOTONIC)
    print(i, t1 - t0)
    db.commit()


-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-12 Thread veem v
On Tue, 13 Feb 2024 at 02:01, Ron Johnson  wrote:

> On Mon, Feb 12, 2024 at 3:23 PM veem v  wrote:
> [snip]
>
>> So it looks like the fixed length data type(like integer, float) should
>> be the first choice while choosing the data type of the attributes
>> wherever possible, as these are native types.
>>
>
> Correct.
>
>
>> (Like choosing "Integer/float" over "Numeric", "Char" over "Varchar"
>> etc).
>>
> However I do see even in Oracle databases, we have Integer type too, but
>> it's suggesting(For e.g. in below blog) to rather go with Number types over
>> Integer and Varchar2 over Char, which is opposite of what we are discussing
>> here. Is the fixed length data type behaves differently in postgres vs
>> oracle and thus should be treated differently?
>>
>> https://www.databasestar.com/oracle-data-types/
>>
>
>>
>> From above blog:-
>>
>
> Oracle is not Postgresql.  WTH are you using an Oracle blog to decide on
> Postgresql data types
>
>
>>
>>
>> *When to use CHAR: There should be no reason to use the CHAR data type,
>> as it is similar to a VARCHAR2 and it’s better to be consistent.*
>> *When to use INTEGER: You should use the NUMBER data type instead.*
>>
>>
> Did you actually read that blog post?
>
> Have you even read the Postgresql documentation on data types?
>
>

My apologies if I interpreted it wrongly. My thought was that, as fixed
length data types are native types, they should ideally be faster in all
databases, so I was comparing across different databases. And mainly, as I
had worked mostly with Oracle databases in the past, this was a key learning
here and seemed totally opposite, so I was curious to know.

Regards
Veem


Re: How to do faster DML

2024-02-12 Thread Ron Johnson
On Mon, Feb 12, 2024 at 3:23 PM veem v  wrote:
[snip]

> So it looks like the fixed length data type(like integer, float) should be
> the first choice while choosing the data type of the attributes
> wherever possible, as these are native types.
>

Correct.


> (Like choosing "Integer/float" over "Numeric", "Char" over "Varchar" etc).
>
However I do see even in Oracle databases, we have Integer type too, but
> it's suggesting(For e.g. in below blog) to rather go with Number types over
> Integer and Varchar2 over Char, which is opposite of what we are discussing
> here. Is the fixed length data type behaves differently in postgres vs
> oracle and thus should be treated differently?
>
> https://www.databasestar.com/oracle-data-types/
>

>
> From above blog:-
>

Oracle is not Postgresql.  WTH are you using an Oracle blog to decide on
Postgresql data types?


>
>
> *When to use CHAR: There should be no reason to use the CHAR data type, as
> it is similar to a VARCHAR2 and it’s better to be consistent.*
> *When to use INTEGER: You should use the NUMBER data type instead.*
>
>
Did you actually read that blog post?

Have you even read the Postgresql documentation on data types?


Re: How to do faster DML

2024-02-12 Thread veem v
Thank you so much for the clarification.

On Mon, 12 Feb 2024 at 03:40, Peter J. Holzer  wrote:

> The fixed width types are those that the CPU can directly process:
> Integers with 16, 32 and 64 bits, floating point numbers with 32 and 64
> bits. The CPU can read and write them with a single memory access, it
> can do arithmetic with a single instruction, etc.
>
> Number/Numeric are not native types on any CPU. To read them the CPU
> needs several memory accesses (probably one per byte unless you get
> really clever) and then it can't do any calculations with them
> directly, instead it has run a subroutine which does operations on
> little chunks and then puts those chunks together again - basically the
> same as you do when you're doing long addition or multiplication on
> paper. So that's not very efficient.
>
So it looks like the fixed length data types (like integer, float) should be
the first choice while choosing the data type of the attributes
wherever possible, as these are native types (like choosing
"integer/float" over "numeric", "char" over "varchar" etc.).
However, I do see that even in Oracle databases we have an Integer type too,
but it's suggested (for e.g. in the below blog) to rather go with Number types
over Integer and Varchar2 over Char, which is the opposite of what we are
discussing here. Does the fixed length data type behave differently in
postgres vs oracle and thus should be treated differently?

https://www.databasestar.com/oracle-data-types/

From the above blog:


*When to use CHAR: There should be no reason to use the CHAR data type, as
it is similar to a VARCHAR2 and it’s better to be consistent.*
*When to use INTEGER: You should use the NUMBER data type instead.*

Regards
Veem


Re: How to do faster DML

2024-02-12 Thread Greg Sabino Mullane
On Mon, Feb 12, 2024 at 1:50 AM veem v  wrote:

> So we were thinking, adding many column to a table should be fine in
> postgres (as here we have a use case in which total number of columns may
> go till ~500+). But then,  considering the access of columns towards the
> end of a row is going to add more time as compared to the column which is
> at the start of the row. As you mentioned, accessing 100th column may add 4
> to 5 times more as compared to the access of first column. So , is it
> advisable here to go for similar approach of breaking the table into two ,
> if the total number of column reaches certain number/threshold for a table?
>

I'm not sure of what Peter was testing exactly to get those 4-5x figures,
but I presume that is column access time, which would not mean a direct
effect on your total query time of 4-5x. As far as breaking the table in
two, I would not recommend that as it adds complexity and introduces other
problems. On the other hand, 500 columns is pretty extreme, so maybe things
rarely accessed or searched for could go in another table. Really hard to
say without knowing more about your data access patterns. Hopefully, we
never see a "SELECT *" for a table like that! :)

Cheers,
Greg


Re: How to do faster DML

2024-02-12 Thread Greg Sabino Mullane
On Tue, Feb 6, 2024 at 12:15 AM Lok P  wrote:

> Another thing I noticed the shared_buffer parameters set as 2029684 in
> this instance, which comes to ~21MB and that seems very small for a
> database operating in large scale. And I see we have RAM in the instance
> showing as ~256GB. So thinking of bumping it to something as ~40-50GB.
>

shared_buffers has a unit of 8 kb blocks, so your cluster is set at 15GB,
not 21 MB. Even so, going to 50 would be fine if you have that much RAM.
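
A sketch of how to double-check that conversion on your own instance:

-- shared_buffers is reported in blocks (normally 8 kB):
select name, setting, unit from pg_settings where name = 'shared_buffers';

-- converted to a human-readable size:
select pg_size_pretty(setting::bigint * current_setting('block_size')::bigint)
  from pg_settings where name = 'shared_buffers';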

Hope that will help to some extent. Not sure if there is methods to
> manually,  cache some objects(tables/indexes) which were getting used
> frequently by the read queries.
>

That's one of the points of shared_buffers - keep things that are accessed
often in memory. Postgres keeps track of which things are used more often,
so in theory the most frequently used items are removed only when
absolutely necessary.
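
If you want to see what is actually sitting in shared_buffers, the
pg_buffercache contrib extension can show it (a sketch, assuming the
extension is available on the instance):

create extension if not exists pg_buffercache;

select c.relname, count(*) as buffers
  from pg_buffercache b
  join pg_class c
    on b.relfilenode = pg_relation_filenode(c.oid)
   and b.reldatabase in (0, (select oid from pg_database
                              where datname = current_database()))
 group by c.relname
 order by buffers desc
 limit 10;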

Cheers,
Greg

>


Re: How to do faster DML

2024-02-12 Thread Dominique Devienne
On Mon, Feb 12, 2024 at 7:50 AM veem v  wrote:

> I think that's not much of a concern with PostgreSQL because you can't
>> update a row in-place anyway because of MVCC.
>
>

> Good to know. So it means here in postgres, there is no such concern like
> "row chaining", "row migration" etc.
>
which we normally have in a non mvcc database (like Oracle say).
>

Don't confuse MVCC (Multi-Version Concurrency Control), and various
DB-specific MVCC implementation details.
BOTH Oracle and PostgreSQL *are* MVCC (read Tom Kyte). But they happen to
use very different implementations.
SQLite in WAL mode is also MVCC, but using yet another implementation.
All (MVCC) DBs do it differently, but achieve the same effect; albeit with
different trade-offs. --DD


Re: How to do faster DML

2024-02-11 Thread veem v
Thank you .

On Mon, 12 Feb 2024 at 03:52, Peter J. Holzer  wrote:

> On 2024-02-11 12:08:47 -0500, Ron Johnson wrote:
> > On Sun, Feb 11, 2024 at 11:54 AM veem v  wrote:
> > When you said "you would normally prefer those over numeric " I was
> > thinking the opposite. As you mentioned integer is a fixed length
> data type
> > and will occupy 4 bytes whether you store 15 or .But in case
> of
> > variable length type like Number or numeric , it will resize itself
> based
> > on the actual data, So is there any downside of going with the
> variable
> > length data type like Numeric,
> >
> >
> > Consider a table with a bunch of NUMERIC fields.  One of those records
> has
> > small values (aka three bytes).  It fits neatly in 2KiB.
> >
> > And then you update all those NUMERIC fields to big numbers that take 15
> > bytes.  Suddenly (or eventually, if you update them at different times),
> the
> > record does not fit in 2KiB, and so must be moved to its own.page.
> That causes
> > extra IO.
>
> I think that's not much of a concern with PostgreSQL because you can't
> update a row in-place anyway because of MVCC. So in any case you're
> writing a new row. If you're lucky there is enough free space in the same
> page and you can do a HOT update, but that's quite independent on
> whether the row changes size.
>
>
>
Good to know. So it means that here in postgres there is no such concern
like "row chaining", "row migration" etc., which we normally have in a
non-MVCC database (like Oracle, say). And there it's not advisable to have
more than ~255 columns in a table even if it's technically possible. And if
such a requirement arises, we normally break the table into 2 different
tables with some columns in common to join them.

https://jonathanlewis.wordpress.com/2015/02/19/255-columns/

So we were thinking adding many columns to a table should be fine in
postgres (as here we have a use case in which the total number of columns
may go up to ~500+). But then, considering that access to columns towards
the end of a row is going to add more time compared to columns at the start
of the row - as you mentioned, accessing the 100th column may add 4 to 5
times more compared to accessing the first column - is it advisable here to
go for a similar approach of breaking the table into two if the total number
of columns reaches a certain number/threshold for a table?

Regards
Veem


Re: How to do faster DML

2024-02-11 Thread Peter J. Holzer
On 2024-02-11 12:08:47 -0500, Ron Johnson wrote:
> On Sun, Feb 11, 2024 at 11:54 AM veem v  wrote:
> When you said "you would normally prefer those over numeric " I was
> thinking the opposite. As you mentioned integer is a fixed length data 
> type
> and will occupy 4 bytes whether you store 15 or .But in case of
> variable length type like Number or numeric , it will resize itself based
> on the actual data, So is there any downside of going with the variable
> length data type like Numeric,
> 
> 
> Consider a table with a bunch of NUMERIC fields.  One of those records has
> small values (aka three bytes).  It fits neatly in 2KiB.
> 
> And then you update all those NUMERIC fields to big numbers that take 15
> bytes.  Suddenly (or eventually, if you update them at different times), the
> record does not fit in 2KiB, and so must be moved to its own.page.  That 
> causes
> extra IO.

I think that's not much of a concern with PostgreSQL because you can't
update a row in-place anyway because of MVCC. So in any case you're
writing a new row. If you're lucky there is enough free space in the same
page and you can do a HOT update, but that's quite independent on
whether the row changes size.
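
If you want to check how often updates are HOT in practice, the statistics
views show it (a sketch; 'mytable' is a placeholder name):

select n_tup_upd, n_tup_hot_upd
  from pg_stat_user_tables
 where relname = 'mytable';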

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-11 Thread Peter J. Holzer
On 2024-02-11 22:23:58 +0530, veem v wrote:
> On Sun, 11 Feb 2024 at 19:02, Peter J. Holzer  wrote:
> 
> > Similarly for Number/Numeric data type.
> 
> Number in Oracle and numeric in PostgreSQL are variable length types.
> But in PostgreSQL you also have a lot of fixed length numeric types
> (from boolean to bigint as well as float4 and float8) and you would
> normally prefer those over numeric (unless you really need a decimal or
> very long type). So padding is something you would encounter in a
> typical PostgreSQL database while it just wouldn't happen in a typical
> Oracle database.
> 
> 
> 
> When you said "you would normally prefer those over numeric " I was thinking
> the opposite. As you mentioned integer is a fixed length data type and will
> occupy 4 bytes whether you store 15 or .But in case of variable
> length type like Number or numeric , it will resize itself based on the actual
> data, So is there any downside of going with the variable length data type 
> like
> Numeric, Varchar type always for defining the data elements?

The fixed width types are those that the CPU can directly process:
Integers with 16, 32 and 64 bits, floating point numbers with 32 and 64
bits. The CPU can read and write them with a single memory access, it
can do arithmetic with a single instruction, etc.

Number/Numeric are not native types on any CPU. To read them the CPU
needs several memory accesses (probably one per byte unless you get
really clever) and then it can't do any calculations with them
directly, instead it has run a subroutine which does operations on
little chunks and then puts those chunks together again - basically the
same as you do when you're doing long addition or multiplication on
paper. So that's not very efficient.

Also the space savings aren't that great and probably even negative: In
my example the length of a numeric type with at most 10 digits varied
between 3 and 7 bytes. Only for values between -99 and +99 is this (3
bytes) actually shorter, for other values it's the same length or
longer. So you would only save space if most of your values are in that
±99 range. But not if all of them are, because then you could simply use
a smallint (Range -32768..32767) in PostgreSQL and save another byte.
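
You can check such sizes yourself with pg_column_size() (a sketch; note that
it reports the in-memory datum size, so a numeric shows a 4-byte varlena
header here instead of the short 1-byte header it can use on disk):

select pg_column_size(42::smallint)        as smallint_bytes,
       pg_column_size(42::integer)         as integer_bytes,
       pg_column_size(42::bigint)          as bigint_bytes,
       pg_column_size(42::numeric)         as small_numeric_bytes,
       pg_column_size(1234567890::numeric) as bigger_numeric_bytes;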

Finally - and I'm probably biased in this as I learned programming 40
years ago - to me the native types feel more natural than product
specific variable-length decimal types.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-11 Thread Ron Johnson
On Sun, Feb 11, 2024 at 11:54 AM veem v  wrote:
[snip]

> When you said *"you would normally prefer those over numeric " *I was
> thinking the opposite. As you mentioned integer is a fixed length data type
> and will occupy 4 bytes whether you store 15 or .But in case of
> variable length type like Number or numeric , it will resize itself based
> on the actual data, So is there any downside of going with the variable
> length data type like Numeric,
>

Consider a table with a bunch of NUMERIC fields.  One of those records has
small values (aka three bytes).  It fits neatly in 2KiB.

And then you update all those NUMERIC fields to big numbers that take 15
bytes.  Suddenly (or eventually, if you update them at different times),
the record does *not* fit in 2KiB, and so must be moved to its own page.
That causes extra IO.


> Varchar type always for defining the data elements?
>

Internally, all character-type fields are stored as TEXT.  CHAR and
VARCHAR(XX)'s only purposes are SQL-compliance and length-limitation.
And length-limitation is "just" a legacy carried forward from the card
punch days.


Re: How to do faster DML

2024-02-11 Thread David G. Johnston
On Sunday, February 11, 2024, veem v  wrote:

>
> On Sun, 11 Feb 2024 at 19:02, Peter J. Holzer  wrote:
>
>> > Similarly for Number/Numeric data type.
>>
>> Number in Oracle and numeric in PostgreSQL are variable length types.
>> But in PostgreSQL you also have a lot of fixed length numeric types
>> (from boolean to bigint as well as float4 and float8) and you would
>> normally prefer those over numeric (unless you really need a decimal or
>> very long type). So padding is something you would encounter in a
>> typical PostgreSQL database while it just wouldn't happen in a typical
>> Oracle database.
>>
>>
>> When you said *"you would normally prefer those over numeric " *I was
> thinking the opposite. As you mentioned integer is a fixed length data type
> and will occupy 4 bytes whether you store 15 or .But in case of
> variable length type like Number or numeric , it will resize itself based
> on the actual data, So is there any downside of going with the variable
> length data type like Numeric, Varchar type always for defining the data
> elements?
>

Regardless of the size of the actual data in a variable width column expect
that the size and computational overhead is going to make using that field
more costly than using a fixed width field.  You don’t have a choice for
text, it is always variable width, but for numeric, if can use an integer
variant you will come out ahead versus numeric.

David J.


Re: How to do faster DML

2024-02-11 Thread veem v
On Sun, 11 Feb 2024 at 19:02, Peter J. Holzer  wrote:

> > Similarly for Number/Numeric data type.
>
> Number in Oracle and numeric in PostgreSQL are variable length types.
> But in PostgreSQL you also have a lot of fixed length numeric types
> (from boolean to bigint as well as float4 and float8) and you would
> normally prefer those over numeric (unless you really need a decimal or
> very long type). So padding is something you would encounter in a
> typical PostgreSQL database while it just wouldn't happen in a typical
> Oracle database.
>
>
> When you said *"you would normally prefer those over numeric " *I was
thinking the opposite. As you mentioned integer is a fixed length data type
and will occupy 4 bytes whether you store 15 or .But in case of
variable length type like Number or numeric , it will resize itself based
on the actual data, So is there any downside of going with the variable
length data type like Numeric, Varchar type always for defining the data
elements?

Regards
Veem


Re: How to do faster DML

2024-02-11 Thread Peter J. Holzer
On 2024-02-11 13:25:10 +0530, veem v wrote:
> On Sun, 11 Feb 2024 at 05:55, Peter J. Holzer  wrote:
> Yes. Numbers in Oracle are variable length, so most Oracle tables
> wouldn't contain many fixed length columns. In PostgreSQL must numeric
> types are fixed length, so you'll have quite a lot of them.
> 
> 
> 
> So it means , say in other databases like (oracle database), we were careless
> choosing the data length , say for example Varchar2(4000), if the real data
> which is inserted into the table holds a varchar string of length 20 bytes 
> then
> Oracle trimmed it to occupy the 20 bytes length only in the storage. but in
> postgre here we need to be cautious and define the length as what the data
> attribute can max contains , because that amount of fixed space is allocated
> to every value which is inserted into the table for that attribute/data
> element.

No. Varchar is (as the name implies) a variable length data type and
both Oracle and PostgreSQL store only the actual value plus some length
indicator in the database. Indeed, in PostgreSQL you don't need to
specify the maximum length at all.

However, if you specify a column as "integer" in PostgreSQL it will
always occupy 4 bytes, whether you store the value 15 in it or
9. In Oracle, there is no "integer" type and the roughly
equivalent number(10,0) is actually a variable length floating point
number. So 15 will occupy only 3 bytes and 9 will occupy 7
bytes[1].

> Similarly for Number/Numeric data type.

Number in Oracle and numeric in PostgreSQL are variable length types.
But in PostgreSQL you also have a lot of fixed length numeric types
(from boolean to bigint as well as float4 and float8) and you would
normally prefer those over numeric (unless you really need a decimal or
very long type). So padding is something you would encounter in a
typical PostgreSQL database while it just wouldn't happen in a typical
Oracle database.

But as Laurenz wrote that's a micro optimization which you usually
shouldn't spend much time on. OTOH, if you have hundreds of columns in a
table, maybe it is worthwhile to spend some time on analyzing access
patterns and optimizing the data layout.

hp

[1] From memory. I may be misremembering the details.

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How to do faster DML

2024-02-11 Thread Laurenz Albe
On Sun, 2024-02-11 at 13:25 +0530, veem v wrote:
> On Sun, 2024-02-11 at 01:25 +0100, Peter J. Holzer wrote:
> > On 2024-02-06 11:25:05 +0530, veem v wrote:
> > > With respect to the storage part:- In Oracle database we were supposed to 
> > > keep
> > > the frequently accessed column first and less frequently accessed columns
> > > towards last in the table while creating table DDL. This used to help the 
> > > query
> > > performance while accessing those columns as it's a row store database. 
> > > Are
> > > there any such performance implications exists in postgres? And there the 
> > > data
> > > types like varchar used to only occupy the space which the real data/value
> > > contains. 
> > >
> > > But here in postgres, it seems a bit different as it has significant
> > > overhead in storage, as these seem like being fixed length data types and 
> > > will
> > > occupy the space no matter what the read data holds.
> >
> > Yes. Numbers in Oracle are variable length, so most Oracle tables
> > wouldn't contain many fixed length columns. In PostgreSQL most numeric
> > types are fixed length, so you'll have quite a lot of them.
>
> So it means , say in other databases like (oracle database), we were careless
> choosing the data length , say for example Varchar2(4000), if the real data 
> which
> is inserted into the table holds a varchar string of length 20 bytes then 
> Oracle
> trimmed it to occupy the 20 bytes length only in the storage. but in postgre 
> here
> we need to be cautious and define the length as what the data attribute can 
> max
> contains , because that amount of fixed space is allocated to every value 
> which
> is inserted into the table for that attribute/data element. Similarly for
> Number/Numeric data type.  Please correct if my understanding is wrong.

Your understanding is wrong.  Oracle and PostgreSQL are not too different about
storing values.  The data type that Oracle calls "number", and that is called
"numeric" in PostgreSQL and in the SQL standard, is stored a variable length 
data
type in both Oracle and PostgreSQL.

It is just that Oracle has very few data types (I saw them gush about adding
"boolean" as a great new feature in version 23).  So, for example, there are
no integer data types in Oracle, and you have to store them as a variable
length data type.  PostgreSQL has integer data types, which are fixed length
(2, 4 or 8 bytes) and provide much more efficient storage for integers.

"character varying" is also pretty similar in Oracle and PostgreSQL, except
that Oracle calls it "varchar2".

The only fixed-length character data type is "character", but that is always
a bad choice, in Oracle as in PostgreSQL.

About your initial question: in PostgreSQL there is also a certain performance
gain if you store frequently used columns first, since the database has to
skip fewer columns to get to the data.  If the previous columns are fixed
length data types like integers, that is cheaper, because we don't have to
look at the data to know how long they are.

Another thing to consider is padding.  Each fixed-width data type has certain
alignment requirements (imposed by CPU processing) that you can find in
"pg_type.typalign".  This can lead to wasted space in the form of
"padding bytes".  For example, if a "bigint" follows a ASCII single-character
"text" or "varchar" value (which occupies two bytes), there will be six
padding bytes between them to align the "bigint" at a storage address that is
a multiple of eight.
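
A quick way to see the padding effect for yourself (a sketch; table and
column names are invented):

create table pad_wasteful (f1 boolean, v1 bigint, f2 boolean, v2 bigint);
create table pad_compact  (v1 bigint, v2 bigint, f1 boolean, f2 boolean);
insert into pad_wasteful select true, g, false, g from generate_series(1, 1000000) g;
insert into pad_compact  select g, g, true, false from generate_series(1, 1000000) g;
select pg_size_pretty(pg_relation_size('pad_wasteful')) as wasteful,
       pg_size_pretty(pg_relation_size('pad_compact'))  as compact;
-- the table with the bigints first should come out noticeably smaller,
-- because each boolean placed before a bigint drags seven padding bytes along.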

But although both of these considerations (skipping over previous columns and
padding) are relevant for performance, they are often a micro-optimization
that you won't be able to measure, and you shouldn't lose too much sleep
over them.

> 
> > So there's a bit of a tradeoff between minimizing alignment overhead and
> > arranging columns for fastest access.

Precisely.

Yours,
Laurenz Albe




Re: How to do faster DML

2024-02-10 Thread veem v
Thank You so much for the detailed explanation.

On Sun, 11 Feb 2024 at 05:55, Peter J. Holzer  wrote:

>
> Yes. Numbers in Oracle are variable length, so most Oracle tables
> wouldn't contain many fixed length columns. In PostgreSQL most numeric
> types are fixed length, so you'll have quite a lot of them.
>
>
So does it mean that in other databases (like Oracle) we could be careless
when choosing the data length, say for example Varchar2(4000)? If the real
data inserted into the table holds a varchar string of length 20 bytes,
then Oracle trims it so that it occupies only 20 bytes of storage. But in
Postgres we need to be cautious and define the length as the maximum the
data attribute can contain, because that amount of fixed space is allocated
to every value which is inserted into the table for that attribute/data
element. Similarly for the Number/Numeric data type.  Please correct me if
my understanding is wrong.

Regards
Veem


Re: How to do faster DML

2024-02-10 Thread Peter J. Holzer
On 2024-02-06 11:25:05 +0530, veem v wrote:
> With respect to the storage part:- In Oracle database we were supposed to keep
> the frequently accessed column first and less frequently accessed columns
> towards last in the table while creating table DDL. This used to help the 
> query
> performance while accessing those columns as it's a row store database. Are
> there any such performance implications exists in postgres? And there the data
> types like varchar used to only occupy the space which the real data/value
> contains. 
> 
> But here in postgres, it seems a bit different as it has significant
> overhead in storage, as these seem like being fixed length data types and will
> occupy the space no matter what the read data holds.

Yes. Numbers in Oracle are variable length, so most Oracle tables
wouldn't contain many fixed length columns. In PostgreSQL most numeric
types are fixed length, so you'll have quite a lot of them.

> So curious to know, if in this case will there be any performance
> impact accessing those columns, if such a column which is frequently
> accessed but has been put towards the end of the table because of
> storage space saving?

Yes. PostgreSQL has to skip over each column until it gets to the column
it wants to read. So reading a column at the end of the row will be
slower than reading one at the start. A very simplistic benchmark (100
columns of type text each containing a single character) suggests that
accessing column 100 takes about 4 or 5 times as long as column 1, and
the access times for the columns in between are pretty linear.

So there's a bit of a tradeoff between minimizing alignment overhead and
arranging columns for fastest access.
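
A rough sketch of such a benchmark (not necessarily the exact script, just
the idea):

do $$
declare
    cols text;
    vals text;
begin
    -- build a table with 100 single-character text columns
    select string_agg(format('c%s text', g), ', ') into cols
      from generate_series(1, 100) g;
    execute format('create table t_wide (%s)', cols);
    select string_agg(quote_literal('x'), ', ') into vals
      from generate_series(1, 100) g;
    execute format('insert into t_wide select %s from generate_series(1, 1000000)', vals);
end $$;

\timing on
select count(c1)   from t_wide;  -- first column: cheap to reach
select count(c100) from t_wide;  -- last column: 99 columns must be skipped first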

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: How to do faster DML

2024-02-08 Thread Greg Sabino Mullane
On Thu, Feb 8, 2024 at 12:12 AM Lok P  wrote:

> Rearranging the table columns by typlen desc, didn't give much storage
> space reduction.
>

It's not so much "biggest to smallest" as making sure you don't have any
gaps when you move from one data type to another. You may have not had any
"holes" originally, so it's hard to state anything without data. The other
thing you can do is use smaller data types if you can get away with it.
smallint better than int, int better than bigint, etc.
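
To see what you are working with, the storage length and alignment of each
column can be read straight from the catalogs (replace mytable1 with your
table name):

select a.attname, t.typname, t.typlen, t.typalign
from pg_attribute a
join pg_type t on t.oid = a.atttypid
where a.attrelid = 'mytable1'::regclass
  and a.attnum > 0
  and not a.attisdropped
order by a.attnum;
-- typlen = -1 means a variable-length type; the fixed-width columns with
-- typalign 'd' (8-byte alignment) are the ones worth grouping up front.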

So it seems the compression does not apply for the rows inserted using
> "CTAS" or "INSERT AS SELECT". Does that mean it is only applicable for the
> row by row inserts but not batch inserts(which a bigger system normally
> will have)? I was not expecting this though, so it was disappointing.
>

TOAST compression doesn't care how the data arrived. It does have criteria
though as to how large something is before it gets toasted and/or
compressed. See:

https://www.crunchydata.com/blog/postgres-toast-the-greatest-thing-since-sliced-bread
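
If you want to check what actually happened to a column's values,
pg_column_compression() (available since Postgres 14) is handy; for
example (table and column names are placeholders):

select pg_column_compression(big_text_col) as method, count(*)
from mytable1
group by 1;
-- NULL means the value was stored uncompressed (e.g. it never crossed the
-- roughly 2 kB TOAST threshold); 'pglz' or 'lz4' means it was compressed.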

Cheers,
Greg


Re: How to do faster DML

2024-02-07 Thread Lok P
> On Tue, Feb 6, 2024 at 8:34 AM Greg Sabino Mullane 
> wrote:
>
>> On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:
>>
>>> What I mean was, we will definitely need the data for querying purposes
>>> by the users, but just to keep the storage space incontrol (and to help the
>>> read queries), we were thinking of having the historical partitions
>>> compressed. And for that , if any compression strategy should we follow on
>>> postgres?
>>>
>>
>> There is no compression strategy, per se. There are ways Postgres
>> internally compresses the data (see "TOAST"), but in general, the table is
>> either there or not. If space is a concern you should examine if you truly
>> need 127 columns, make sure they are arranged correctly (search for
>> 'postgres column tetris'), and move unused and older partitions elsewhere -
>> like a separate cheaper Postgres server, or something offline.
>>
>> Cheers,
>> Greg
>>
>>
Rearranging the table columns by typlen desc, didn't give much storage
space reduction.

So, I was trying TOAST compression by creating the table from scratch and
using the LZ4 algorithm defined on the column definition level just for
varchar type columns , as it seems this compression only works for varchar
and text columns. And the table had 7 columns defined as varchar out of
total 12 columns. I wrote the DDL something like below

Column1 varchar(50) COMPRESSION lz4 NOT NULL

However, when I loaded the table using INSERT AS SELECT from the main
table (which is the uncompressed one), I see the size of the compressed
table remains the same. I also applied the function "pg_column_compression()"
to see if any column value is compressed using lz4, and it returns all
"null", which means nothing is compressed.

So it seems the compression does not apply for the rows inserted using
"CTAS" or "INSERT AS SELECT". Does that mean it is only applicable for the
row by row inserts but not batch inserts(which a bigger system normally
will have)? I was not expecting this though, so it was disappointing.

Regards
Lok


Re: How to do faster DML

2024-02-05 Thread veem v
On Tue, 6 Feb 2024 at 10:45, Lok P  wrote:

> Thank you Greg.
>
> *"and keeping all your active stuff in cache. Since you have 127 columns,
> only pull back the columns you absolutely need for each query."*
>
> Understood the point. As postgres is a "row store" database, so keeping
> the size of the row lesser by making the number of columns to minimum,
> should definitely help as more number of rows can be packed then into one
> page or block (which I believe is 8KB default size per block). And that
> mean more number of blocks can be cached while reading, so better cache hit
> ratio.
>
> As you rightly pointed out, Another thing I noticed the shared_buffer
> parameters set as 2029684 in this instance, which comes to ~21MB and that
> seems very small for a database operating in large scale. And I see we have
> RAM in the instance showing as ~256GB. So thinking of bumping it to
> something as ~40-50GB. Hope that will help to some extent. Not sure if
> there is methods to manually,  cache some objects(tables/indexes) which
> were getting used frequently by the read queries.
>
>
> On Tue, Feb 6, 2024 at 10:27 AM Greg Sabino Mullane 
> wrote:
>
>>
>> So when you said *"In that case, and based on the numbers you provided,
>>> daily partitioning seems a decent solution."*
>>> , does it mean that size of the partition (which will be ~100-150GB per
>>> partition here) is not an issue for the postgres to serve read latency
>>> sensitive queries?
>>>
>>
>> Yes, that should be fine. At the end of the day, what's going to be more
>> important is making sure you are using good indexing, and keeping all your
>> active stuff in cache. Since you have 127 columns, only pull back the
>> columns you absolutely need for each query.
>>
>> Cheers,
>> Greg
>>
>>
>
Good discussion!! Don't want to divert the thread. And apology for
comparing with another RDBMS, but curious to know the difference in
behaviour.

With respect to the storage part:- In Oracle database we were supposed to
keep the frequently accessed column first and less frequently accessed
columns towards last in the table while creating table DDL. This used to
help the query performance while accessing those columns as it's a row
store database. Do any such performance implications exist in
Postgres? And there, data types like varchar used to occupy only the
space which the real data/value contains.

But here in Postgres it seems a bit different, as there is significant
overhead in storage: these seem to be fixed length data types and will
occupy the space no matter what the real data holds. So I am curious to
know whether there will be any performance impact when accessing such a
column, if it is frequently accessed but has been put towards the end of
the table to save storage space?


Re: How to do faster DML

2024-02-05 Thread Lok P
Thank you Greg.

*"and keeping all your active stuff in cache. Since you have 127 columns,
only pull back the columns you absolutely need for each query."*

Understood the point. As postgres is a "row store" database, so keeping the
size of the row lesser by making the number of columns to minimum, should
definitely help, as more rows can then be packed into one page or
block (which I believe is the default 8KB size per block). And that means
more blocks can be cached while reading, so a better cache hit ratio.

As you rightly pointed out, another thing I noticed is the shared_buffers
parameter set as 2029684 in this instance, which comes to ~21MB, and that
seems very small for a database operating at large scale. And I see we have
RAM in the instance showing as ~256GB. So I am thinking of bumping it to
something like ~40-50GB. Hope that will help to some extent. Not sure if
there are methods to manually cache some objects (tables/indexes) which
are getting used frequently by the read queries.
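
(As an aside, a rough way to gauge the overall cache hit ratio, and, if the
pg_prewarm extension is available on the instance, to pull a hot table into
the buffer cache, would be something like the below; the table name is just
a placeholder.)

select sum(blks_hit) * 100.0 / nullif(sum(blks_hit) + sum(blks_read), 0)
       as cache_hit_pct
from pg_stat_database;

create extension if not exists pg_prewarm;
select pg_prewarm('transaction_tab');  -- hypothetical table name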


On Tue, Feb 6, 2024 at 10:27 AM Greg Sabino Mullane 
wrote:

>
> So when you said *"In that case, and based on the numbers you provided,
>> daily partitioning seems a decent solution."*
>> , does it mean that size of the partition (which will be ~100-150GB per
>> partition here) is not an issue for the postgres to serve read latency
>> sensitive queries?
>>
>
> Yes, that should be fine. At the end of the day, what's going to be more
> important is making sure you are using good indexing, and keeping all your
> active stuff in cache. Since you have 127 columns, only pull back the
> columns you absolutely need for each query.
>
> Cheers,
> Greg
>
>


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
> So when you said *"In that case, and based on the numbers you provided,
> daily partitioning seems a decent solution."*
> , does it mean that size of the partition (which will be ~100-150GB per
> partition here) is not an issue for the postgres to serve read latency
> sensitive queries?
>

Yes, that should be fine. At the end of the day, what's going to be more
important is making sure you are using good indexing, and keeping all your
active stuff in cache. Since you have 127 columns, only pull back the
columns you absolutely need for each query.

Cheers,
Greg


Re: How to do faster DML

2024-02-05 Thread Lok P
On Tue, Feb 6, 2024 at 8:34 AM Greg Sabino Mullane 
wrote:

> On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:
>
>> This table will always be queried on the transaction_date column as one
>> of the filters criteria. But the querying/search criteria can span from a
>> day to a month worth of transaction date data.
>>
>
> In that case, and based on the numbers you provided, daily partitioning
> seems a decent solution.
>
> What I mean was, we will definitely need the data for querying purposes by
>> the users, but just to keep the storage space incontrol (and to help the
>> read queries), we were thinking of having the historical partitions
>> compressed. And for that , if any compression strategy should we follow on
>> postgres?
>>
>
> There is no compression strategy, per se. There are ways Postgres
> internally compresses the data (see "TOAST"), but in general, the table is
> either there or not. If space is a concern you should examine if you truly
> need 127 columns, make sure they are arranged correctly (search for
> 'postgres column tetris'), and move unused and older partitions elsewhere -
> like a separate cheaper Postgres server, or something offline.
>
> With regards to loading data to the table faster, wondering why you said '
>> *NO*' to load the data first and enabling/Creating the Primary key and
>> Foreign key constraint later approach.
>>
>
> For an initial load, this is fine, if you do things very carefully. I'd
> build the child table indexes post load but still feed things into the main
> table as an initial tradeoff, but YMMV.
>
> Just looked back and saw this is actually Aurora, not Postgres. Most of
> the advice on this thread should still apply, but be aware that things are
> not the same as Postgres, especially at the storage layer. For all the
> money you are giving them, don't forget to bug them as well.
>
> Cheers,
> Greg
>
>
Thank you so much Greg. That helps.

So when you said *"In that case, and based on the numbers you provided,
daily partitioning seems a decent solution."*
, does it mean that size of the partition (which will be ~100-150GB per
partition here) is not an issue for the postgres to serve read latency
sensitive queries?

And yes, I think you are spot on in regards to the storage. Actually we
loaded csv files whose sizes summed to ~200GB, and I was wondering how
the table size became ~1TB when that data was loaded into the database. I
was not aware of "column tetris" and how Postgres pads additional space
into the column data based on the subsequent column's type; I am still
trying to digest the concept. We do have many columns with NULL values in
them. We will try to reorder the columns in the table and hopefully we
will be able to really reduce the storage space through that. Will raise
a case too.

Regards
Lok


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:

> This table will always be queried on the transaction_date column as one of
> the filters criteria. But the querying/search criteria can span from a day
> to a month worth of transaction date data.
>

In that case, and based on the numbers you provided, daily partitioning
seems a decent solution.
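
A minimal sketch of what that could look like (the column names here are
invented; the real table obviously has far more columns):

create table transactions (
    txn_id           bigint  not null,
    transaction_date date    not null,
    amount           numeric,
    primary key (txn_id, transaction_date)  -- the partition key must be part of the PK
) partition by range (transaction_date);

create table transactions_2024_02_05 partition of transactions
    for values from ('2024-02-05') to ('2024-02-06');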

What I mean was, we will definitely need the data for querying purposes by
> the users, but just to keep the storage space incontrol (and to help the
> read queries), we were thinking of having the historical partitions
> compressed. And for that , if any compression strategy should we follow on
> postgres?
>

There is no compression strategy, per se. There are ways Postgres
internally compresses the data (see "TOAST"), but in general, the table is
either there or not. If space is a concern you should examine if you truly
need 127 columns, make sure they are arranged correctly (search for
'postgres column tetris'), and move unused and older partitions elsewhere -
like a separate cheaper Postgres server, or something offline.

With regards to loading data to the table faster, wondering why you said '
> *NO*' to load the data first and enabling/Creating the Primary key and
> Foreign key constraint later approach.
>

For an initial load, this is fine, if you do things very carefully. I'd
build the child table indexes post load but still feed things into the main
table as an initial tradeoff, but YMMV.

Just looked back and saw this is actually Aurora, not Postgres. Most of the
advice on this thread should still apply, but be aware that things are not
the same as Postgres, especially at the storage layer. For all the money
you are giving them, don't forget to bug them as well.

Cheers,
Greg


Re: How to do faster DML

2024-02-05 Thread veem v
I don't have much experience with Postgres at this scale, so others may
provide better suggestions. However, at this scale you will have a single
table size of ~30TB+. And someone just pointed out in this thread ~180
partitions/table as the good-to-have limit; if that is true it would
be ~170GB per partition. That looks bulky for a system where read response
time is expected to be in milliseconds.

On Mon, 5 Feb 2024 at 16:51, Lok P  wrote:

>
>
> On Mon, 5 Feb, 2024, 2:21 am Lok P,  wrote:
>
>
> On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane 
> wrote:
>
> Partitioning will depend a lot on how you access the data. Is it always
> using that transaction_date column in the where clause? Can you share the
> table definition via \d? We've talked about this table quite a bit, but not
> actually seen what it looks like. We are flying blind a little bit. You
> mentioned your queries are slow, but can you share an explain analyze on
> one of these slow queries?
>
> 45 billion rows is quite manageable. How many rows are in each day? You
> may want to do a larger bucket than 24 hours per partition.
>
> 1)Even after partitioning the target table , to speed up the data load on
> this table , Is there an option to disable the primary and foreign keys and
> re-enable them post data load finishes. Will that be a good idea
>
>
> No.
>
> 3)As the size of the table or each partition is going to be very large and
> this will be a read intensive application, compressing the historical
> partition will help us save the storage space and will also help the read
> queries performance
>
>
> I am not sure what this means. If you are not going to need the data
> anymore, dump the data to deep storage and drop the partition.
>
> Cheers,
> Greg
>
>
>
> Thank you.
>
> The table has ~127 columns of different data types , combinations of
> Numeric, varchar, date etc. And is having current size ~1TB holding
> ~3billion rows currently and the row size is ~300bytes.
>
> Currently it has lesser volume , but in future the daily transaction per
> day which will be inserted into this table will be Max ~500million
> rows/day. And the plan is to persist at least ~3months of transaction data
> which will be around 45billion rows in total. And if all works well , we
> may need to persist ~6 months worth of data in this database in future and
> that will be ~90 billion.
>
> This table will always be queried on the transaction_date column as one of
> the filters criteria. But the querying/search criteria can span from a day
> to a month worth of transaction date data.
>
> When you said "*You may want to do a larger bucket than 24 hours per
> partition.*", do you mean to say partition by weekly or so? Currently as
> per math i.e. 1TB of storage for ~3billion rows. So the daily range
> partition size( to hold ~500million transactions/day) will be around
> ~100-150GB. Won't that be too much data for a single partition to operate
> on, and increasing the granularity further(say weekly) will make the
> partition more bulkier?
>
> What I mean was, we will definitely need the data for querying purposes by
> the users, but just to keep the storage space incontrol (and to help the
> read queries), we were thinking of having the historical partitions
> compressed. And for that , if any compression strategy should we follow on
> postgres?
>
> With regards to loading data to the table faster, wondering why you said '
> *NO*' to load the data first and enabling/Creating the Primary key and
> Foreign key constraint later approach. Because this table is a child table
> and the parent is already having data in it, loading data to this table in
> presence of PK and FK makes it too slow as it tries to validate those for
> each set of rows. So we were thinking if doing it at a later stage at
> oneshot will be a faster approach. Please suggest.
>
> I will try to collect some SELECT query and post the explain analyze.
> Currently we are trying to get rid of the duplicates.
>
> Regards
> Lok
>
>
> Any thoughts, based on above usage pattern?
>
> While I did the maths based on the current stats with limited data sets.
> The partitions size coming as 100 to 150gb as I explained above, if we keep
> it daily range partition. Should we have to make it hourly then?
>
>  So want some experts view if this going to work fine for a read latency
> intensive applications or we should have some different strategy?
>


Re: How to do faster DML

2024-02-05 Thread Lok P
On Mon, 5 Feb, 2024, 2:21 am Lok P,  wrote:


On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane 
wrote:

Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1)Even after partitioning the target table , to speed up the data load on
this table , Is there an option to disable the primary and foreign keys and
re-enable them post data load finishes. Will that be a good idea


No.

3)As the size of the table or each partition is going to be very large and
this will be a read intensive application, compressing the historical
partition will help us save the storage space and will also help the read
queries performance


I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg



Thank you.

The table has ~127 columns of different data types, combinations of
Numeric, varchar, date etc. It currently has a size of ~1TB, holding
~3 billion rows, and the row size is ~300 bytes.

Currently it has lesser volume , but in future the daily transaction per
day which will be inserted into this table will be Max ~500million
rows/day. And the plan is to persist at least ~3months of transaction data
which will be around 45billion rows in total. And if all works well , we
may need to persist ~6 months worth of data in this database in future and
that will be ~90 billion.

This table will always be queried on the transaction_date column as one of
the filters criteria. But the querying/search criteria can span from a day
to a month worth of transaction date data.

When you said "*You may want to do a larger bucket than 24 hours per
partition.*", do you mean to say partition by weekly or so? Currently as
per math i.e. 1TB of storage for ~3billion rows. So the daily range
partition size( to hold ~500million transactions/day) will be around
~100-150GB. Won't that be too much data for a single partition to operate
on, and increasing the granularity further(say weekly) will make the
partition more bulkier?

What I mean was, we will definitely need the data for querying purposes by
the users, but just to keep the storage space in control (and to help the
read queries), we were thinking of having the historical partitions
compressed. And for that, is there any compression strategy we should
follow on Postgres?

With regards to loading data to the table faster, wondering why you said '
*NO*' to load the data first and enabling/Creating the Primary key and
Foreign key constraint later approach. Because this table is a child table
and the parent is already having data in it, loading data to this table in
presence of PK and FK makes it too slow as it tries to validate those for
each set of rows. So we were thinking if doing it at a later stage at
oneshot will be a faster approach. Please suggest.

I will try to collect some SELECT query and post the explain analyze.
Currently we are trying to get rid of the duplicates.

Regards
Lok


Any thoughts, based on above usage pattern?

I did the math based on the current stats with limited data sets: the
partition size comes to 100 to 150GB, as I explained above, if we keep it
as a daily range partition. Should we make it hourly then?

So I want some expert views on whether this is going to work fine for a
read-latency-intensive application, or whether we should use a different strategy?


Re: How to do faster DML

2024-02-04 Thread Ron Johnson
On Sat, Feb 3, 2024 at 11:09 AM Lok P  wrote:
 [snip]

> show maintenance_work_mem; - 4155MB
> show work_mem; - 8MB
> show shared_buffers ; -22029684
>

Those are pretty small values.  What are your server specs?


Re: How to do faster DML

2024-02-04 Thread Lok P
On Sun, Feb 4, 2024 at 9:25 PM Ron Johnson  wrote:

> 1. Load the children before attaching them to the parent.
> 2. Create the child indices, PK and FKs before attaching to the parent.
> 3. Do step 2 in multiple parallel jobs.  (cron is your friend.)
> 4. Attach the children to the "naked" (no PK, no FK, no indices) parent.
> 5. Use the ONLY keyword on CREATE INDEX and ALTER TABLE to add the
> indices, PK and FK to the parent *after* step 4.
>
>>
>>
>>
>>
Thank You.

Can you please help me to understand these steps a bit  more accurately.

Say we have a parent table that already has data in it and also a primary
key defined. We will load the child table first, dropping its primary
key and foreign keys so that the data load will be faster.

Then we will create the primary key index on the child table. When you said
using multiple parallel jobs, do you mean creating the PK index on each
partition separately from different sessions rather than creating on the
table using a single statement or some other faster way to create the PK
index?

Now the last step is attaching the PK and FK of the above child table to
the parent. This will validate each and every row for the uniqueness and
also to ensure the foreign key is present in the parent table. Won't this
take a longer time in this step?


Re: How to do faster DML

2024-02-04 Thread Lok P
On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane 
wrote:

> Partitioning will depend a lot on how you access the data. Is it always
> using that transaction_date column in the where clause? Can you share the
> table definition via \d? We've talked about this table quite a bit, but not
> actually seen what it looks like. We are flying blind a little bit. You
> mentioned your queries are slow, but can you share an explain analyze on
> one of these slow queries?
>
> 45 billion rows is quite manageable. How many rows are in each day? You
> may want to do a larger bucket than 24 hours per partition.
>
> 1)Even after partitioning the target table , to speed up the data load on
>> this table , Is there an option to disable the primary and foreign keys and
>> re-enable them post data load finishes. Will that be a good idea
>>
>
> No.
>
> 3)As the size of the table or each partition is going to be very large and
>> this will be a read intensive application, compressing the historical
>> partition will help us save the storage space and will also help the read
>> queries performance
>
>
> I am not sure what this means. If you are not going to need the data
> anymore, dump the data to deep storage and drop the partition.
>
> Cheers,
> Greg
>
>

Thank you.

The table has ~127 columns of different data types, combinations of
Numeric, varchar, date etc. It currently has a size of ~1TB, holding
~3 billion rows, and the row size is ~300 bytes.

Currently it has lesser volume , but in future the daily transaction per
day which will be inserted into this table will be Max ~500million
rows/day. And the plan is to persist at least ~3months of transaction data
which will be around 45billion rows in total. And if all works well , we
may need to persist ~6 months worth of data in this database in future and
that will be ~90 billion.

This table will always be queried on the transaction_date column as one of
the filters criteria. But the querying/search criteria can span from a day
to a month worth of transaction date data.

When you said "*You may want to do a larger bucket than 24 hours per
partition.*", do you mean to say partition by weekly or so? Currently as
per math i.e. 1TB of storage for ~3billion rows. So the daily range
partition size( to hold ~500million transactions/day) will be around
~100-150GB. Won't that be too much data for a single partition to operate
on, and increasing the granularity further(say weekly) will make the
partition more bulkier?

What I mean was, we will definitely need the data for querying purposes by
the users, but just to keep the storage space in control (and to help the
read queries), we were thinking of having the historical partitions
compressed. And for that, is there any compression strategy we should
follow on Postgres?

With regards to loading data to the table faster, wondering why you said '
*NO*' to load the data first and enabling/Creating the Primary key and
Foreign key constraint later approach. Because this table is a child table
and the parent is already having data in it, loading data to this table in
presence of PK and FK makes it too slow as it tries to validate those for
each set of rows. So we were thinking if doing it at a later stage at
oneshot will be a faster approach. Please suggest.

I will try to collect some SELECT query and post the explain analyze.
Currently we are trying to get rid of the duplicates.

Regards
Lok

>


Re: How to do faster DML

2024-02-04 Thread Ron Johnson
On Sun, Feb 4, 2024 at 10:30 AM Lok P  wrote:

> On Sun, Feb 4, 2024 at 8:14 PM Dennis White 
> wrote:
>
>> I'm surprised no one has mentioned perhaps it's a good idea to partition
>> this table while adding the pk. By your own statements the table is
>> difficult to work with as is. Without partitioning the table, row inserts
>> would need to walk the pk index and could be a factor. If this is static
>> table then perhaps that's ok but if not...
>>
>> Anyway I don't recall what type the ID was or how it's set but i suggest
>> you seriously investigate using it to partition this table into manageable
>> smaller tables.
>> Best of luck.
>>
>> On Sun, Feb 4, 2024, 8:00 AM Marcos Pegoraro  wrote:
>>
>>> insert into mytable2 select * from mytable1 order by ctid limit
 10_000_000 offset 20_000_000;

>>>
>>> You can use min/max values grouping them by 10_000_000 records, so you
>>> don´t need that offset, then generate commands and run them.
>>>
>>> select format('insert into mytable2 select * from mytable1 where i
>>> between %s and %s;', max(i), min(i)) from t group by i/10_000_000;
>>>
>>
> Thank you so much.
>
> You are correct. It was seemingly difficult to operate on this table.
> Every read query is struggling and so partitioning is something we must
> have to think of. And hoping that, postgres will be able to handle this
> scale, with proper partitioning and indexing strategy.
>
> I have a few related questions.
>
> 1)Even after partitioning the target table , to speed up the data load on
> this table , Is there an option to disable the primary and foreign keys and
> re-enable them post data load finishes. Will that be a good idea or will it
> be cumbersome/resource intensive to re-enable the constraints , after
> persisting all the data in the table?
>

1. Load the children before attaching them to the parent.
2. Create the child indices, PK and FKs before attaching to the parent.
3. Do step 2 in multiple parallel jobs.  (cron is your friend.)
4. Attach the children to the "naked" (no PK, no FK, no indices) parent.
5. Use the ONLY keyword on CREATE INDEX and ALTER TABLE to add the indices,
PK and FK to the parent *after* step 4.
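
A rough sketch of steps 1, 2, 4 and 5 for a single daily child, assuming a
range-partitioned parent called "transactions" already exists (all names,
dates and the file path are made up; step 3 is just running step 2 from
several sessions):

-- 1+2: load and index the child while it is still standalone
create table txn_2024_02_06 (like transactions including defaults);
copy txn_2024_02_06 from '/data/txn_2024_02_06.csv' with (format csv);  -- or \copy from the client
create index txn_2024_02_06_id_idx on txn_2024_02_06 (txn_id);

-- a CHECK constraint matching the partition bounds lets ATTACH skip its validation scan
alter table txn_2024_02_06 add constraint txn_2024_02_06_chk
    check (transaction_date >= date '2024-02-06' and transaction_date < date '2024-02-07');

-- 4: attach the loaded child to the parent
alter table transactions attach partition txn_2024_02_06
    for values from ('2024-02-06') to ('2024-02-07');

-- 5: create the parent-level index with ONLY, then attach the matching child index
create index txn_id_idx on only transactions (txn_id);
alter index txn_id_idx attach partition txn_2024_02_06_id_idx;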


2)I understand there is no limitation theoretically on the number or size
> of partitions a table can have in postgres. But I want to know from experts
> here, from their real life experience, if there exists any such thing which
> we should take care of before deciding the partitioning strategy, so as to
> have the soft limit (both with respect to size and number of partitions)
> obeyed.
> Note:- Currently this table will be around ~1TB in size and will hold
> Approx ~3billion rows(post removal of duplicates). But as per business need
> it may grow up to ~45 billion rows in future.
>

I'd keep it under 200 partitions, but my experience might be outdated.


> 3)As the size of the table or each partition is going to be very large and
> this will be a read intensive application,
>

By PK?  If so, partition by PK.


> compressing the historical partition will help us save the storage space
> and will also help the read queries performance. So, Can you please throw
> some light on the compression strategy which we should follow here
> (considering a daily range partition table based on transaction_date as
> partition key)?
>
> Regards
> Lok
>
>
>
>


Re: How to do faster DML

2024-02-04 Thread Greg Sabino Mullane
Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1)Even after partitioning the target table , to speed up the data load on
> this table , Is there an option to disable the primary and foreign keys and
> re-enable them post data load finishes. Will that be a good idea
>

No.

3)As the size of the table or each partition is going to be very large and
> this will be a read intensive application, compressing the historical
> partition will help us save the storage space and will also help the read
> queries performance


I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg


>


Re: How to do faster DML

2024-02-04 Thread Lok P
On Sun, Feb 4, 2024 at 8:14 PM Dennis White  wrote:

> I'm surprised no one has mentioned perhaps it's a good idea to partition
> this table while adding the pk. By your own statements the table is
> difficult to work with as is. Without partitioning the table, row inserts
> would need to walk the pk index and could be a factor. If this is static
> table then perhaps that's ok but if not...
>
> Anyway I don't recall what type the ID was or how it's set but i suggest
> you seriously investigate using it to partition this table into manageable
> smaller tables.
> Best of luck.
>
> On Sun, Feb 4, 2024, 8:00 AM Marcos Pegoraro  wrote:
>
>> insert into mytable2 select * from mytable1 order by ctid limit
>>> 10_000_000 offset 20_000_000;
>>>
>>
>> You can use min/max values grouping them by 10_000_000 records, so you
>> don´t need that offset, then generate commands and run them.
>>
>> select format('insert into mytable2 select * from mytable1 where i
>> between %s and %s;', max(i), min(i)) from t group by i/10_000_000;
>>
>
Thank you so much.

You are correct. It has indeed been difficult to operate on this table.
Every read query is struggling, so partitioning is something we really have
to think about. And we are hoping that Postgres will be able to handle this
scale with a proper partitioning and indexing strategy.

I have a few related questions.

1)Even after partitioning the target table , to speed up the data load on
this table , Is there an option to disable the primary and foreign keys and
re-enable them post data load finishes. Will that be a good idea or will it
be cumbersome/resource intensive to re-enable the constraints , after
persisting all the data in the table?

2)I understand there is no limitation theoretically on the number or size
of partitions a table can have in postgres. But I want to know from experts
here, from their real life experience, if there exists any such thing which
we should take care of before deciding the partitioning strategy, so as to
have the soft limit (both with respect to size and number of partitions)
obeyed.
Note:- Currently this table will be around ~1TB in size and will hold
Approx ~3billion rows(post removal of duplicates). But as per business need
it may grow up to ~45 billion rows in future.

3)As the size of the table or each partition is going to be very large and
this will be a read intensive application, compressing the historical
partition will help us save the storage space and will also help the read
queries performance. So, Can you please throw some light on the compression
strategy which we should follow here (considering a daily range partition
table based on transaction_date as partition key)?

Regards
Lok


Re: How to do faster DML

2024-02-04 Thread Dennis White
I'm surprised no one has mentioned perhaps it's a good idea to partition
this table while adding the pk. By your own statements the table is
difficult to work with as is. Without partitioning the table, row inserts
would need to walk the pk index and could be a factor. If this is static
table then perhaps that's ok but if not...

Anyway I don't recall what type the ID was or how it's set but i suggest
you seriously investigate using it to partition this table into manageable
smaller tables.
Best of luck.

On Sun, Feb 4, 2024, 8:00 AM Marcos Pegoraro  wrote:

> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
>> offset 20_000_000;
>>
>
> You can use min/max values grouping them by 10_000_000 records, so you
> don´t need that offset, then generate commands and run them.
>
> select format('insert into mytable2 select * from mytable1 where i between
> %s and %s;', max(i), min(i)) from t group by i/10_000_000;
>


Re: How to do faster DML

2024-02-04 Thread Marcos Pegoraro
>
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
> offset 20_000_000;
>

You can use min/max values grouping them by 10_000_000 records, so you
don't need that offset, then generate commands and run them.

select format('insert into mytable2 select * from mytable1 where i between
%s and %s;', min(i), max(i)) from t group by i/10_000_000;


Re: How to do faster DML

2024-02-04 Thread Francisco Olarte
Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P  wrote:

> On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte  
> wrote:

> However , as we have ~5billion rows in the base table and out of that , we 
> were expecting almost half i.e. ~2billion would be duplicates. And you said, 
> doing the inserts using the "order by CTID Offset" approach must cause one 
> full sequential scan of the whole table for loading each chunk/10M of rows 
> and that would take a long time I believe.

I did not say "MUST CAUSE". In fact I said I believe it would not.

What I am gonna say ( now ) is test it. Make 1k, 10k, 100k, 1M tables
in a scratch database, explain and test your things there w/ & w/o
index etc.. Not all needed, but testing 100k & 1M in 1k batches could
show you missing quadratic behaviour. Explain would show you
unexpected sorts or scans.

> I am still trying to understand the other approach which you suggested. Not 
> able to understand "you can select where index_col > last order by index_col 
> limit 10M," .
> However, to get the max ID value of the last 10M loaded rows in target, do 
> you say that having an PK index created on that target table column(ID) will 
> help, and we can save the max (ID) value subsequently in another table to 
> fetch and keep loading from the source table (as ID>Max_ID stored in temp 
> table)?

I am a programmer by trade. When faced with problems like these,
unless I find a trivial solution, I tend to make auxiliary programs as
it is much easier for me to make a thousand lines of Perl ( or
python/C/C++ Java ) or a couple hundreds of plpgsql ( if I want to
avoid roundtrips ) than trying to debug complicated SQL only
workflows.

For your kind of problem I would make a program to read the rows and
insert them. As an example, lets say you have a non-unique index on ID
and are going to use the on-conflict-do-nothing route ( I do believe
this would be slow due to the need of having an index on the target
table to support it, but let's assume it is ok ). To do that I may
just do a loop, starting with last_id=-1(some id less than any other
id), selecting a chunk of rows with id>=last_id ordered by id and
inserting them. After doing that I may notice that I do not need the
index if the sort order is right, drop the index and the on-conflict
and just do, for every row, if(id>last_id) insert before storing
last_id=id.
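
A very rough sketch of that kind of keyed, chunked load, here chunking by id
ranges instead of LIMIT so that duplicate ids straddling a chunk boundary
are not an issue (assumes a bigint id and a unique index on the target's id
column; all names are illustrative):

do $$
declare
    last_id bigint;
    max_id  bigint;
    step    constant bigint := 10000000;
begin
    select min(id) - 1, max(id) into last_id, max_id from tab1;
    while last_id < max_id loop
        insert into tab1_new
        select * from tab1
        where id > last_id and id <= last_id + step
        on conflict (id) do nothing;  -- silently drops the duplicate ids
        last_id := last_id + step;
        raise notice 'processed ids up to %', last_id;
        commit;  -- works as long as the DO block is not run inside an outer transaction
    end loop;
end $$;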

Anyway, not knowing the real table characteristics and current usage
patterns I cannot recommend anything concrete.

> Would it be better to do it in one shot only , but by setting a higher value 
> of some parameters like "maintenance_work_mem" or "max_parallel_workers"?

It depends on a lot of unknown ( to us ) things.

Francisco Olarte.




Re: How to do faster DML

2024-02-04 Thread Alban Hertroys


> On 3 Feb 2024, at 13:20, Lok P  wrote:
> 
> Hello All,
> A non partitioned table having ~4.8 billion rows in it and having data size 
> as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has got 
> approx ~1billion+ duplicate rows inserted in it and we want to get the 
> duplicate data removed for this table and create a PK/unique constraint back 
> so as to not have the duplicate values in future. We are struggling to do the 
> same.

~4.8 billion rows of which ~1 billion are duplicates… Wait a minute…

Did you verify that your ID column is larger than 32-bits?
Because if that’s a 32 bit integer, the range of values it can hold is about 
4.3 billion, after which it wraps around.

With ~4.8 billion rows that would result in about ~0.5 billion repeated ID 
values, giving you the reported ~1 billion duplicate ID's.

If that’s the case, your duplicates obviously aren’t really duplicates and you 
require a different type of solution.


> Teammates suggested doing this using CTAS method, i.e. create a new table 
> with the unique record set and then drop the main table. Something as below
> 
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where ID in
>   (select min(ID) from TAB1
>   group by ID having count(ID)>=1 );
> 
> But for the above to work faster , they mentioned to have an index created on 
> the column using which the duplicate check will be performed i.e ID column. 
> So, creating the index itself took ~2hrs+ and the index size now shows as 
> ~116GB.
> 
> Create index idx1 on TAB1(ID)

Are your duplicates exact duplicates? Or is there an order of preference among 
them?
And if so, what really makes those rows unique?

That matters for solutions on how to deduplicate these rows.

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: How to do faster DML

2024-02-04 Thread Peter J. Holzer
On 2024-02-04 02:14:20 +0530, Lok P wrote:
> However , as we have ~5billion rows in the base table and out of that , we 
> were
> expecting almost half i.e. ~2billion would be duplicates.

That's way more than I expected from your original description. And it
of course raises the question whether it's a good idea to just throw
away all that data or if you need to keep that in a normalized way.

> And you said, doing the inserts using the "order by CTID Offset"
> approach must cause one full sequential scan of the whole table for
> loading each chunk/10M of rows and that would take a long time I
> believe.
> 
> I am still trying to understand the other approach which you suggested. Not
> able to understand "you can select where index_col > last order by index_col 
> limit 10M," .
> However, to get the max ID value of the last 10M loaded rows in target, do you
> say that having an PK index created on that target table column(ID) will
> help,

Yes. Getting the maximum value from an index is a very fast operation.
You just have to traverse down the right edge of the tree (or you may
even be able to access the right-most leaf page directly).
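
For example, with an index (or PK) on the id column, something like

explain select max(id) from the_target_table;  -- placeholder name

should show a plan roughly like

  Result
    InitPlan 1 (returns $0)
      ->  Limit
            ->  Index Only Scan Backward using the_pk_index on the_target_table
                  Index Cond: (id IS NOT NULL)

i.e. a single descent down the index instead of a full scan.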

>  and we can save the max (ID) value subsequently in another table to fetch
> and keep loading from the source table (as ID>Max_ID stored in temp table)?

Another table, or a variable in a script. Personally, if I need to do
something repeatedly, I usually write a script in the scripting language
I feel most comfortable in (which has been Python for the last 7 or 8
years, Perl before that), which gives you variables, loops, conditionals
and - above all - repeatability.

> OR
> Would it be better to do it in one shot only , but by setting a higher value 
> of
> some parameters like "maintenance_work_mem" or "max_parallel_workers"?

Hard to say. Normally, processing in fewer, bigger chunks is faster. But
RAM is much faster than disk (even with SSDs), so it might be faster to
make work_mem as large as you can and then use a chunk size which just
fits inside work_mem. Of course finding that sweet spot takes
experimentation, hence time, and it may make little sense to experiment
for 20 hours just to save 40 minutes.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: How to do faster DML

2024-02-03 Thread Lok P
On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte 
wrote:

> On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane 
> wrote:
> ...
> > Given the size of your table, you probably want to divide that up.
> > As long as nothing is changing the original table, you could do:
> >
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 0;
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 10_000_000;
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 20_000_000;
>
> Is it me or does this have the faint smell of quadratic behaviour? I
> mean, you need to read and discard the first 10M to do offset 10M (
> also I believe ctid order is implied in sequential scan, but no so
> sure, if not it would need a full table sort on each pass ).
>
> When doing things like this, I normally have some kind of unique index
> and do it by selecting with limit above the last read value( stored
> when doing it via code, as I normally load first, index later so I
> cannot select max() fast on the target ). Or, with some kind of
> "sparse" index (meaning, groups much less than the batch size ) and a
> little code you can select where index_col > last order by index_col
> limit 10M, remember last received index_col and reselect discarding
> missing ( or just reselect it when doing your on conflict do nothing,
> which also has a fast select max(id) due to the PK, it will work if it
> has an index on id column on the original even if not unique ) to
> avoid that.
>
> Also, I'm not sure if ctid is ordered and you can select where
> ctid>last ordered, if that works it probably is faster for immutable
> origins.
>
> Francisco Olarte.



Thank you.

Yes, I think the "on conflict do nothing;" option looks promising, as it
will drop the duplicate in case of a PK violation but keep the load going
for subsequent rows.

However , as we have ~5billion rows in the base table and out of that , we
were expecting almost half i.e. ~2billion would be duplicates. And you
said, doing the inserts using the "order by CTID Offset" approach must
cause one full sequential scan of the whole table for loading each
chunk/10M of rows and that would take a long time I believe.

I am still trying to understand the other approach which you suggested. Not
able to understand "*you can select where index_col > last order by
index_col limit 10M*".
However, to get the max ID value of the last 10M loaded rows in target, do
you say that having an PK index created on that target table column(ID)
will help, and we can save the max (ID) value subsequently in another table
to fetch and keep loading from the source table (as ID>Max_ID stored in
temp table)?
OR
Would it be better to do it in one shot only , but by setting a higher
value of some parameters like "maintenance_work_mem" or
"max_parallel_workers"?


Re: How to do faster DML

2024-02-03 Thread Peter J. Holzer
On 2024-02-03 19:25:12 +0530, Lok P wrote:
> Apology. One correction, the query is like below. I. E filter will be on on
> ctid which I believe is equivalent of rowid in oracle and we will not need the
> index on Id column then. 
> 
>  But, it still runs long, so thinking any other way to make the duplicate
> removal faster? 
> 
> Also wondering , the index creation which took ~2.5hrs+ , would that have been
> made faster any possible way by allowing more db resource through some session
> level db parameter setting?
> 
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where CTID in
>       (select min(CTID) from TAB1
>       group by ID having count(ID)>=1 );

That »having count(ID)>=1« seems redundant to me. Surely every id which
occurs in the table occurs at least once?

Since you want ID to be unique I assume that it is already almost
unique - so only a small fraction of the ids will be duplicates. So I
would start with creating a list of duplicates:

create table tab1_dups as
select id, count(*) from tab1 group by id having count(*) > 1;

This will still take some time because it needs to build a temporary
structure large enough to hold a count for each individual id. But at
least then you'll have a much smaller table to use for further cleanup.
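
One possible follow-up, sketched without knowing your exact constraints (and
assuming nothing else writes to the table in the meantime): keep one
arbitrary copy of each duplicated id aside, delete the offending ids, and
put the kept copies back:

create table tab1_keep as
select distinct on (id) *
from tab1
where id in (select id from tab1_dups);

delete from tab1
where id in (select id from tab1_dups);

insert into tab1
select * from tab1_keep;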

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: How to do faster DML

2024-02-03 Thread Francisco Olarte
On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane  wrote:
...
> Given the size of your table, you probably want to divide that up.
> As long as nothing is changing the original table, you could do:
>
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 
> offset 0;
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 
> offset 10_000_000;
> insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 
> offset 20_000_000;

Is it me or does this have the faint smell of quadratic behaviour? I
mean, you need to read and discard the first 10M to do offset 10M (
also I believe ctid order is implied in sequential scan, but I'm not so
sure, if not it would need a full table sort on each pass ).

When doing things like this, I normally have some kind of unique index
and do it by selecting with limit above the last read value( stored
when doing it via code, as I normally load first, index later so I
cannot select max() fast on the target ). Or, with some kind of
"sparse" index (meaning, groups much less than the batch size ) and a
little code you can select where index_col > last order by index_col
limit 10M, remember last received index_col and reselect discarding
missing ( or just reselect it when doing your on conflict do nothing,
which also has a fast select max(id) due to the PK, it will work if it
has an index on id column on the original even if not unique ) to
avoid that.

Also, I'm not sure if ctid is ordered and you can select where
ctid>last ordered, if that works it probably is faster for immutable
origins.

Francisco Olarte.




Re: How to do faster DML

2024-02-03 Thread Greg Sabino Mullane
As a general rule, avoid heavy subselects like that. You don't need to
build a full list of duplicates before starting. Another approach:

create table mytable2 (like mytable1);

alter table mytable2 add primary key (id);

insert into mytable2 select * from mytable1 on conflict do nothing;

Given the size of your table, you probably want to divide that up.
As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 10_000_000;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 20_000_000;
etc.

Cheers,
Greg


Re: How to do faster DML

2024-02-03 Thread Lok P
On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson  wrote:

> On Sat, Feb 3, 2024 at 8:55 AM Lok P  wrote:
>
>> Apology. One correction, the query is like below. I. E filter will be on
>> on ctid which I believe is equivalent of rowid in oracle and we will not
>> need the index on Id column then.
>>
>>  But, it still runs long, so thinking any other way to make the duplicate
>> removal faster?
>>
>> Also wondering , the index creation which took ~2.5hrs+ , would that have
>> been made faster any possible way by allowing more db resource through some
>> session level db parameter setting?
>>
>> create table TAB1_New
>> as
>> SELECT  * from TAB1 A
>> where CTID in
>>   (select min(CTID) from TAB1
>>   group by ID having count(ID)>=1 );
>>
>>
>> On Sat, Feb 3, 2024 at 5:50 PM Lok P  wrote:
>>
>>> Hello All,
>>> A non partitioned table having ~4.8 billion rows in it and having data
>>> size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has
>>> got approx ~1billion+ duplicate rows inserted in it and we want to get the
>>> duplicate data removed for this table and create a PK/unique constraint
>>> back so as to not have the duplicate values in future. We are struggling to
>>> do the same.
>>>
>>> Teammates suggested doing this using CTAS method, i.e. create a new
>>> table with the unique record set and then drop the main table. Something as
>>> below
>>>
>>> create table TAB1_New
>>> as
>>> SELECT  * from TAB1 A
>>> where ID in
>>>   (select min(ID) from TAB1
>>>   group by ID having count(ID)>=1 );
>>>
>>> But for the above to work faster , they mentioned to have an index
>>> created on the column using which the duplicate check will be performed i.e
>>> ID column. So, creating the index itself took ~2hrs+ and the index size now
>>> shows as ~116GB.
>>>
>>>
>>> *Create index idx1 on TAB1(ID)*
>>> And now running the SELECT subquery part of the CTAS statement to see if
>>> its giving the correct unique records count. It ran for 2.5 hrs and then we
>>> killed it. Below is the plan for the same.
>>>
>>> explain
>>> *select min(ID) from TAB1 A group by ID having count(ID)>=1*
>>>
>>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>>   Group Key: ID
>>>   Filter: (count(ID) >= 1)
>>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
>>> rows=4883397120 width=14)
>>>
>>> I want to understand if by any way this can be done faster . Also I am
>>> worried that creating PK constraint/index back after deleting the duplicate
>>> is also going to run forever. Is there any way we can make these heavy
>>> operations faster on postgre by facilitating more database resources
>>> through some parameter setup, like parallel hint etc? We have pg_hint_plan
>>> extension added, but not seeing the parallel hint enforced when adding it
>>> to the query.
>>>
>>> In Oracle we have Parallel hints, Direct path read/write for faster
>>> read/write operations, parallel index scan etc. available, if anything
>>> similar to that available in aurora postgre to facilitate more
>>> horsepower and speed up the batch operations. And , how can we monitor
>>> progress of any running query ?
>>>  Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
>>> 256GB RAM. PG version 15.4.
>>>
>>
>  Aurora is not Postgresql, so configurations might not work.  Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately?  Also, what's your work_mem
> and maintenance_work_mem?
>

Thank you.

Below are the values of the default parameters in this instance:

show max_worker_processes;              -- 128
show max_parallel_workers_per_gather;   -- 4
show max_parallel_workers;              -- 32
show max_parallel_maintenance_workers;  -- 2
show maintenance_work_mem;              -- 4155MB
show work_mem;                          -- 8MB
show shared_buffers;                    -- 22029684

When I ran the CTAS queries and the index creation, I did not have a
very clear idea of how these parameters relate to each other or how
each of them helps these operations, but I set a few of them as below
in the same session before triggering those operations.

set max_parallel_workers_per_gather=16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16 GB';
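
As a side note, a sketch of how a long index build could be watched
from a second session (assuming the pg_stat_progress_create_index view,
available since PG 12, is exposed on this Aurora instance):

-- run from another session while the CREATE INDEX is executing
select pid, phase,
       blocks_done, blocks_total,
       tuples_done, tuples_total
from pg_stat_progress_create_index;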

The instance has ~256 GB of memory in total, so how should I adjust or
bump these values when running heavy SELECT queries doing large
sequential scans, large index creation, SELECT queries with heavy
sorting/"order by" operations, or heavy JOINs?

I have not analyzed the table manually, though; seeing the
last_autovacuum and last_autoanalyze columns getting populated in
pg_stat_user_tables, I assumed it is being done automatically.
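
For reference, the kind of check I mean looks something like this (a
sketch; it assumes the table is simply named tab1):

-- last automatic vacuum/analyze timestamps and counts for the table
select relname, last_autovacuum, last_autoanalyze,
       autovacuum_count, autoanalyze_count
from pg_stat_user_tables
where relname = 'tab1';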

By the way, if we run "analyze tab1" on this ~1.5TB table, will that
run for a long time, and will any of the above parameters help to
expedite the ANALYZE operation if I run it manually?

Regards
Lok



Re: How to do faster DML

2024-02-03 Thread Ron Johnson
On Sat, Feb 3, 2024 at 8:55 AM Lok P  wrote:

> Apology. One correction, the query is like below. I. E filter will be on
> on ctid which I believe is equivalent of rowid in oracle and we will not
> need the index on Id column then.
>
>  But, it still runs long, so thinking any other way to make the duplicate
> removal faster?
>
> Also wondering , the index creation which took ~2.5hrs+ , would that have
> been made faster any possible way by allowing more db resource through some
> session level db parameter setting?
>
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where CTID in
>   (select min(CTID) from TAB1
>   group by ID having count(ID)>=1 );
>
>
> On Sat, Feb 3, 2024 at 5:50 PM Lok P  wrote:
>
>> Hello All,
>> A non partitioned table having ~4.8 billion rows in it and having data
>> size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has
>> got approx ~1billion+ duplicate rows inserted in it and we want to get the
>> duplicate data removed for this table and create a PK/unique constraint
>> back so as to not have the duplicate values in future. We are struggling to
>> do the same.
>>
>> Teammates suggested doing this using CTAS method, i.e. create a new table
>> with the unique record set and then drop the main table. Something as below
>>
>> create table TAB1_New
>> as
>> SELECT  * from TAB1 A
>> where ID in
>>   (select min(ID) from TAB1
>>   group by ID having count(ID)>=1 );
>>
>> But for the above to work faster , they mentioned to have an index
>> created on the column using which the duplicate check will be performed i.e
>> ID column. So, creating the index itself took ~2hrs+ and the index size now
>> shows as ~116GB.
>>
>>
>> *Create index idx1 on TAB1(ID)*
>> And now running the SELECT subquery part of the CTAS statement to see if
>> its giving the correct unique records count. It ran for 2.5 hrs and then we
>> killed it. Below is the plan for the same.
>>
>> explain
>> *select min(ID) from TAB1 A group by ID having count(ID)>=1*
>>
>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>   Group Key: ID
>>   Filter: (count(ID) >= 1)
>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
>> rows=4883397120 width=14)
>>
>> I want to understand if by any way this can be done faster . Also I am
>> worried that creating PK constraint/index back after deleting the duplicate
>> is also going to run forever. Is there any way we can make these heavy
>> operations faster on postgre by facilitating more database resources
>> through some parameter setup, like parallel hint etc? We have pg_hint_plan
>> extension added, but not seeing the parallel hint enforced when adding it
>> to the query.
>>
>> In Oracle we have Parallel hints, Direct path read/write for faster
>> read/write operations, parallel index scan etc. available, if anything
>> similar to that available in aurora postgre to facilitate more
>> horsepower and speed up the batch operations. And , how can we monitor
>> progress of any running query ?
>>  Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
>> 256GB RAM. PG version 15.4.
>>
>
Aurora is not PostgreSQL, so configurations might not work. Having said
that...
https://www.postgresql.org/docs/15/how-parallel-query-works.html

And have you analyzed the table lately?  Also, what's your work_mem
and maintenance_work_mem?


Re: How to do faster DML

2024-02-03 Thread Todd Lewis
I copy/pasted your question into ChatGPT, and it gave me 10 specific
suggestions.  Have you tried those?

On Sat, Feb 3, 2024 at 10:55 AM Lok P  wrote:

> Apology. One correction, the query is like below. I. E filter will be on
> on ctid which I believe is equivalent of rowid in oracle and we will not
> need the index on Id column then.
>
>  But, it still runs long, so thinking any other way to make the duplicate
> removal faster?
>
> Also wondering , the index creation which took ~2.5hrs+ , would that have
> been made faster any possible way by allowing more db resource through some
> session level db parameter setting?
>
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where CTID in
>   (select min(CTID) from TAB1
>   group by ID having count(ID)>=1 );
>
>
> On Sat, Feb 3, 2024 at 5:50 PM Lok P  wrote:
>
>> Hello All,
>> A non partitioned table having ~4.8 billion rows in it and having data
>> size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has
>> got approx ~1billion+ duplicate rows inserted in it and we want to get the
>> duplicate data removed for this table and create a PK/unique constraint
>> back so as to not have the duplicate values in future. We are struggling to
>> do the same.
>>
>> Teammates suggested doing this using CTAS method, i.e. create a new table
>> with the unique record set and then drop the main table. Something as below
>>
>> create table TAB1_New
>> as
>> SELECT  * from TAB1 A
>> where ID in
>>   (select min(ID) from TAB1
>>   group by ID having count(ID)>=1 );
>>
>> But for the above to work faster , they mentioned to have an index
>> created on the column using which the duplicate check will be performed i.e
>> ID column. So, creating the index itself took ~2hrs+ and the index size now
>> shows as ~116GB.
>>
>>
>> *Create index idx1 on TAB1(ID)*
>> And now running the SELECT subquery part of the CTAS statement to see if
>> its giving the correct unique records count. It ran for 2.5 hrs and then we
>> killed it. Below is the plan for the same.
>>
>> explain
>> *select min(ID) from TAB1 A group by ID having count(ID)>=1*
>>
>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>   Group Key: ID
>>   Filter: (count(ID) >= 1)
>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
>> rows=4883397120 width=14)
>>
>> I want to understand if by any way this can be done faster . Also I am
>> worried that creating PK constraint/index back after deleting the duplicate
>> is also going to run forever. Is there any way we can make these heavy
>> operations faster on postgre by facilitating more database resources
>> through some parameter setup, like parallel hint etc? We have pg_hint_plan
>> extension added, but not seeing the parallel hint enforced when adding it
>> to the query.
>>
>> In Oracle we have Parallel hints, Direct path read/write for faster
>> read/write operations, parallel index scan etc. available, if anything
>> similar to that available in aurora postgre to facilitate more
>> horsepower and speed up the batch operations. And , how can we monitor
>> progress of any running query ?
>>  Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
>> 256GB RAM. PG version 15.4.
>>
>> Regards
>> Lok
>>
>

-- 
Todd Lewis
tle...@brickabode.com


Re: How to do faster DML

2024-02-03 Thread Lok P
Apology. One correction, the query is like below, i.e. the filter will
be on ctid, which I believe is the equivalent of rowid in Oracle, and
we will not need the index on the Id column then.

But it still runs long, so is there any other way to make the duplicate
removal faster?

Also wondering: the index creation took ~2.5hrs+; could that have been
made faster in any way by allowing more db resources through some
session-level db parameter setting?

create table TAB1_New
as
SELECT  * from TAB1 A
where CTID in
  (select min(CTID) from TAB1
  group by ID having count(ID)>=1 );


On Sat, Feb 3, 2024 at 5:50 PM Lok P  wrote:

> Hello All,
> A non partitioned table having ~4.8 billion rows in it and having data
> size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has
> got approx ~1billion+ duplicate rows inserted in it and we want to get the
> duplicate data removed for this table and create a PK/unique constraint
> back so as to not have the duplicate values in future. We are struggling to
> do the same.
>
> Teammates suggested doing this using CTAS method, i.e. create a new table
> with the unique record set and then drop the main table. Something as below
>
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where ID in
>   (select min(ID) from TAB1
>   group by ID having count(ID)>=1 );
>
> But for the above to work faster , they mentioned to have an index created
> on the column using which the duplicate check will be performed i.e ID
> column. So, creating the index itself took ~2hrs+ and the index size now
> shows as ~116GB.
>
>
> *Create index idx1 on TAB1(ID)*
> And now running the SELECT subquery part of the CTAS statement to see if
> its giving the correct unique records count. It ran for 2.5 hrs and then we
> killed it. Below is the plan for the same.
>
> explain
> *select min(ID) from TAB1 A group by ID having count(ID)>=1*
>
> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>   Group Key: ID
>   Filter: (count(ID) >= 1)
>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
> rows=4883397120 width=14)
>
> I want to understand if by any way this can be done faster . Also I am
> worried that creating PK constraint/index back after deleting the duplicate
> is also going to run forever. Is there any way we can make these heavy
> operations faster on postgre by facilitating more database resources
> through some parameter setup, like parallel hint etc? We have pg_hint_plan
> extension added, but not seeing the parallel hint enforced when adding it
> to the query.
>
> In Oracle we have Parallel hints, Direct path read/write for faster
> read/write operations, parallel index scan etc. available, if anything
> similar to that available in aurora postgre to facilitate more
> horsepower and speed up the batch operations. And , how can we monitor
> progress of any running query ?
>  Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
> 256GB RAM. PG version 15.4.
>
> Regards
> Lok
>