Re: [GENERAL] Multiple indexes, huge table

2012-09-07 Thread Merlin Moncure
On Thu, Sep 6, 2012 at 4:22 PM, Aram Fingal fin...@multifactorial.com wrote:
 I have a table which currently has about 500 million rows.  For the most 
 part, the situation is going to be that I will import a few hundred million 
 more rows from text files once every few months but otherwise there won't be 
 any insert, update or delete queries.  I have created five indexes, some of 
 them multi-column, which make a tremendous difference in performance for the 
 statistical queries which I need to run frequently (seconds versus hours.)  
 When adding data to the table, however, I have found that it is much faster 
 to drop all the indexes, copy the data to the table and then create the 
 indexes again (hours versus days.)  So, my question is whether this is really 
 the best way.  Should I write a script which drops all the indexes, copies 
 the data and then recreates the indexes or is there a better way to do this?

 There are also rare cases where I might want to make a correction.  For 
 example, one of the columns is sample name which is a foreign key to a 
 samples table defined with  ON UPDATE CASCADE.  I decided to change a 
 sample name in the samples table which should affect about 20 million rows 
 out of the previously mentioned 500 million.  That query has now been running 
 for five days and isn't finished yet.

Your case might do well with partitioning, particularly if you are
time bottlenecked during the import.  It will require some careful
thought before implementing, but the general scheme is to insert the
new data into a child table that gets its own index: this prevents you
from having to reindex the whole table.  Partitioning makes other
things more complicated though (like RI).
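
As a rough sketch of that scheme (table and column names here are made
up for illustration, using inheritance-based partitioning):

CREATE TABLE measurements (
    sample_name  text,
    value        numeric,
    imported_on  date
);

-- each bulk import gets its own child table; the CHECK constraint lets
-- constraint exclusion skip it in queries that don't touch that range
CREATE TABLE measurements_2012q3 (
    CHECK (imported_on >= DATE '2012-07-01'
       AND imported_on <  DATE '2012-10-01')
) INHERITS (measurements);

-- load and index only the new child; indexes on the older children
-- are untouched, so there is no whole-table reindex
COPY measurements_2012q3 FROM '/path/to/import.txt';
CREATE INDEX measurements_2012q3_sample_idx
    ON measurements_2012q3 (sample_name);

Note that foreign keys and unique constraints don't span the child
tables, which is the RI complication mentioned above.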

merlin


Re: [GENERAL] Multiple indexes, huge table

2012-09-07 Thread Marti Raudsepp
On Fri, Sep 7, 2012 at 12:22 AM, Aram Fingal fin...@multifactorial.com wrote:
 Should I write a script which drops all the indexes, copies the data and then 
 recreates the indexes or is there a better way to do this?

There's a pg_bulkload extension which does much faster incremental
index updates for large bulk data imports, so you get the best of both
worlds: http://pgbulkload.projects.postgresql.org/

Beware, though, that this is an external add-on and is not as well
tested as core PostgreSQL. I have not tried it myself.

Regards,
Marti


Re: [GENERAL] Multiple indexes, huge table

2012-09-07 Thread Jeff Janes
On Thu, Sep 6, 2012 at 5:12 PM, Alan Hodgson ahodg...@simkin.ca wrote:
 On Thursday, September 06, 2012 05:06:27 PM Jeff Janes wrote:
 For updating 20 million out of 500 million rows, wouldn't a full table
 scan generally be preferable to an index scan anyway?


 Not one table scan for each row updated ...

My understanding is that he was only trying to update one row on the
parent table anyway, which then cascaded to 20,000,000 rows on the
child/fact table.

If you mean one table scan for each of the 20,000,000 rows *of the
child* being updated, that isn't what it does now, index or not.

Even if he were updating 10 rows of the parent table, I think it would
still be the case that if one sequential scan of the child/fact was
faster than one (large, low-cardinality, unclustered) index scan, then
10 sequential scans would be faster than 10 index scans.

Cheers,

Jeff


Re: [GENERAL] Multiple indexes, huge table

2012-09-07 Thread Aram Fingal

On Sep 7, 2012, at 11:15 AM, Marti Raudsepp wrote:

 There's a pg_bulkload extension which does much faster incremental
 index updates for large bulk data imports, so you get best of both
 worlds: http://pgbulkload.projects.postgresql.org/


Thanks, I'll have to check that out.  This is going to be more and more of an 
issue as I work with some specialized techniques related to DNA sequencing.  

-Aram

[GENERAL] Multiple indexes, huge table

2012-09-06 Thread Aram Fingal
I have a table which currently has about 500 million rows.  For the most part, 
the situation is going to be that I will import a few hundred million more rows 
from text files once every few months but otherwise there won't be any insert, 
update or delete queries.  I have created five indexes, some of them 
multi-column, which make a tremendous difference in performance for the 
statistical queries which I need to run frequently (seconds versus hours.)  
When adding data to the table, however, I have found that it is much faster to 
drop all the indexes, copy the data to the table and then create the indexes 
again (hours versus days.)  So, my question is whether this is really the best 
way.  Should I write a script which drops all the indexes, copies the data and 
then recreates the indexes or is there a better way to do this?  

There are also rare cases where I might want to make a correction.  For 
example, one of the columns is sample name which is a foreign key to a samples 
table defined with  ON UPDATE CASCADE.  I decided to change a sample name in 
the samples table which should affect about 20 million rows out of the 
previously mentioned 500 million.  That query has now been running for five 
days and isn't finished yet.  

-Aram

Re: [GENERAL] Multiple indexes, huge table

2012-09-06 Thread Tom Lane
Aram Fingal fin...@multifactorial.com writes:
 I have a table which currently has about 500 million rows.  For the most 
 part, the situation is going to be that I will import a few hundred million 
 more rows from text files once every few months but otherwise there won't be 
 any insert, update or delete queries.  I have created five indexes, some of 
 them multi-column, which make a tremendous difference in performance for the 
 statistical queries which I need to run frequently (seconds versus hours.)  
 When adding data to the table, however, I have found that it is much faster 
 to drop all the indexes, copy the data to the table and then create the 
 indexes again (hours versus days.)  So, my question is whether this is really 
 the best way.  Should I write a script which drops all the indexes, copies 
 the data and then recreates the indexes or is there a better way to do this?  

Yes, that's actually recommended practice for such cases.
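
Something along these lines (index and table names are placeholders;
raising maintenance_work_mem for the session makes the index rebuilds
considerably faster):

SET maintenance_work_mem = '1GB';

BEGIN;
DROP INDEX stats_sample_idx;
DROP INDEX stats_sample_value_idx;   -- ...and the other three
COPY stats FROM '/path/to/new_rows.txt';
CREATE INDEX stats_sample_idx ON stats (sample_name);
CREATE INDEX stats_sample_value_idx ON stats (sample_name, value);
COMMIT;

-- refresh planner statistics after the big load
ANALYZE stats;

Doing the whole thing in one transaction means nobody ever sees the
table without its indexes, and a failure rolls everything back.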

 There are also rare cases where I might want to make a correction.  For 
 example, one of the columns is sample name which is a foreign key to a 
 samples table defined with  ON UPDATE CASCADE.  I decided to change a 
 sample name in the samples table which should affect about 20 million rows 
 out of the previously mentioned 500 million.  That query has now been running 
 for five days and isn't finished yet.  

That sounds like you lack an index on the referencing column of the
foreign key constraint.  Postgres doesn't require you to keep such
an index, but it's a really good idea if you ever update the referenced
column.
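
Concretely, assuming the referencing column in the big table is called
sample_name (names here are guesses, adjust to your schema):

CREATE INDEX bigtable_sample_name_idx ON bigtable (sample_name);

A multi-column index generally only helps here if that column is its
leading column.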

regards, tom lane


Re: [GENERAL] Multiple indexes, huge table

2012-09-06 Thread Aram Fingal

On Sep 6, 2012, at 5:54 PM, Tom Lane wrote:

 That sounds like you lack an index on the referencing column of the
 foreign key constraint.  Postgres doesn't require you to keep such
 an index, but it's a really good idea if you ever update the referenced
 column.


Thanks.  You're right.  That column (which is a foreign key) is a component of 
a multi-column index, but I don't have an index just for it.

-Aram

Re: [GENERAL] Multiple indexes, huge table

2012-09-06 Thread Jeff Janes

 There are also rare cases where I might want to make a correction.  For 
 example, one of the columns is sample name which is a foreign key to a 
 samples table defined with  ON UPDATE CASCADE.  I decided to change a 
 sample name in the samples table which should affect about 20 million rows 
 out of the previously mentioned 500 million.  That query has now been 
 running for five days and isn't finished yet.

 That sounds like you lack an index on the referencing column of the
 foreign key constraint.  Postgres doesn't require you to keep such
 an index, but it's a really good idea if you ever update the referenced
 column.

For updating 20 million out of 500 million rows, wouldn't a full table
scan generally be preferable to an index scan anyway?

But, if he doesn't drop those other indexes during this process, the
maintenance on them is going to kill his performance anyway, just like
it does for bulk loading.  If you figure 20,000,000 * (1 table + 5
index) / 15,000 rpm, it comes out to around 5 days.
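
(That is, 20,000,000 rows times 6 structures is 120,000,000 random
writes; at roughly one per disk revolution, 15,000 per minute, that's
8,000 minutes, or about 5.5 days.)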

Cheers,

Jeff


Re: [GENERAL] Multiple indexes, huge table

2012-09-06 Thread Alan Hodgson
On Thursday, September 06, 2012 05:06:27 PM Jeff Janes wrote:
 For updating 20 million out of 500 million rows, wouldn't a full table
 scan generally be preferable to an index scan anyway?
 

Not one table scan for each row updated ...



Re: [GENERAL] Multiple indexes, huge table

2012-09-06 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 That sounds like you lack an index on the referencing column of the
 foreign key constraint.  Postgres doesn't require you to keep such
 an index, but it's a really good idea if you ever update the referenced
 column.

 For updating 20 million out of 500 million rows, wouldn't a full table
 scan generally be preferable to an index scan anyway?

Foreign key triggers do their checks retail, though, so you really want
the probe for any referencing rows for a particular row-being-updated
to be able to use an index.

(It would be nice if we could replace that with a mass revalidation
once it got to be a big fraction of the table, but we don't have a
mechanism for that.  Yet.)

regards, tom lane

