Re: [PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-13 Thread Dan Langille

> On Feb 11, 2016, at 4:41 PM, Dan Langille  wrote:
> 
>> On Feb 10, 2016, at 5:13 AM, Dan Langille  wrote:
>> 
>>> On Feb 10, 2016, at 2:47 AM, Jeff Janes  wrote:
>>> 
>>> On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille  wrote:
>>>> I have a wee database server which regularly tries to insert 1.5 million or
>>>> even 15 million new rows into a 400 million row table.  Sometimes these
>>>> inserts take hours.
>>>> 
>>>> The actual query that produces the join is fast.  It's the insert which is
>>>> slow.
>>>> 
>>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>>>> DeltaSeq)
>>>> SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
>>>> Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
>>>> batch_testing.DeltaSeq
>>>>  FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
>>>> JOIN Filename ON (batch_testing.Name =
>>>> Filename.Name);
>>>> 
>>>> This is part of the plan: http://img.ly/images/9374145/full  created via
>>>> http://tatiyants.com/pev/#/plans
>>>> 
>>>> This gist contains postgresql.conf, zfs settings, slog, disk partitions.
>>>> 
>>>> https://gist.github.com/dlangille/1a8c8cc62fa13b9f
>>> 
>>> The table you are inserting into has 7 indexes, all of which have to
>>> be maintained.  The index on the sequence column should be efficient
>>> to maintain.  But for the rest, if the inserted rows are not naturally
>>> ordered by any of the indexed columns then it would end up reading 6
>>> random scattered leaf pages in order to insert row pointers.  If none of
>>> those pages are in memory, that is going to be slow to read off from
>>> hdd in single file.  Also, you are going to dirty all of those scattered
>>> pages, and they will be slow to write back to hdd because there
>>> probably won't be much opportunity for write-combining.
>>> 
>>> Do you really need all of those indexes?
>>> 
>>> Won't the index on (jobid, pathid, filenameid) service any query that
>>> (jobid) does, so you can get rid of the latter?
>>> 
>>> And unless you have range queries on fileindex, like "where jobid = 12
>>> and fileindex between 4 and 24", then you should be able to replace
>>> (jobid, fileindex) with (fileindex, jobid) and then get rid of the
>>> stand-alone index on (fileindex).
>>> 
>>> If you add an "order by" to the select statement which orders by the
>>> fields of one of the remaining indexes, then you could make the
>>> maintenance of that index much cheaper.
>> 
>> I will make these changes one-by-one and test each.  This will be 
>> interesting.
> 
> On a test server, the original insert takes about 45 minutes.  I removed all 
> indexes.  25 minutes.
> 
> Thank you.

Today I tackled the production server.  After discussion on the Bacula devel
mailing list (http://marc.info/?l=bacula-devel=145537742804482=2)
I compared my schema to the stock schema provided with Bacula.  Yes, I found
extra indexes.  I saved the existing schema and proceeded to remove the indexes
from prod that are not found in the default.
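
For reference, a minimal sketch of that comparison and cleanup (the index
name below is a placeholder, not one of the actual Bacula index names):

  -- List every index on the file table so it can be diffed against the
  -- stock Bacula schema.
  SELECT indexname, indexdef
    FROM pg_indexes
   WHERE tablename = 'file'
   ORDER BY indexname;

  -- Drop any index that does not appear in the stock schema.
  -- (file_extra_idx is a placeholder name.)
  DROP INDEX IF EXISTS file_extra_idx;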

The query time went from 223 minutes to 4.5 minutes.  That is 50 times faster.

I think I can live with that. :)

Jeff: if you show up at PGCon, dinner is on me.  Thank you.

--
Dan Langille - BSDCan / PGCon
d...@langille.org







Re: [PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-13 Thread Dan Langille
> On Feb 13, 2016, at 10:43 AM, Dan Langille  wrote:
> 
> Today I tackled the production server.  After discussion on the Bacula devel
> mailing list (http://marc.info/?l=bacula-devel=145537742804482=2)
> I compared my schema to the stock schema provided with Bacula.  Yes, I found
> extra indexes.  I saved the existing schema and proceeded to remove the
> indexes from prod that are not found in the default.
> 
> The query time went from 223 minutes to 4.5 minutes.  That is 50 times faster.

The query plans: https://twitter.com/DLangille/status/698528182383804416

--
Dan Langille - BSDCan / PGCon
d...@langille.org





Re: [PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-11 Thread Dan Langille
> On Feb 10, 2016, at 5:13 AM, Dan Langille  wrote:
> 
>> On Feb 10, 2016, at 2:47 AM, Jeff Janes  wrote:
>> 
>> On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille  wrote:
>>> I have a wee database server which regularly tries to insert 1.5 million or
>>> even 15 million new rows into a 400 million row table.  Sometimes these
>>> inserts take hours.
>>> 
>>> The actual query that produces the join is fast.  It's the insert which is
>>> slow.
>>> 
>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>>> DeltaSeq)
>>> SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
>>> Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
>>> batch_testing.DeltaSeq
>>>   FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
>>>  JOIN Filename ON (batch_testing.Name =
>>> Filename.Name);
>>> 
>>> This is part of the plan: http://img.ly/images/9374145/full  created via
>>> http://tatiyants.com/pev/#/plans
>>> 
>>> This gist contains postgresql.conf, zfs settings, slog, disk partitions.
>>> 
>>>  https://gist.github.com/dlangille/1a8c8cc62fa13b9f
>> 
>> The table you are inserting into has 7 indexes, all of which have to
>> be maintained.  The index on the sequence column should be efficient
>> to maintain.  But for the rest, if the inserted rows are not naturally
>> ordered by any of the indexed columns then it would end up reading 6
>> random scattered leaf pages in order to insert row pointers.  If none of
>> those pages are in memory, that is going to be slow to read off from
>> hdd in single file.  Also, you are going to dirty all of those scattered
>> pages, and they will be slow to write back to hdd because there
>> probably won't be much opportunity for write-combining.
>> 
>> Do you really need all of those indexes?
>> 
>> Won't the index on (jobid, pathid, filenameid) service any query that
>> (jobid) does, so you can get rid of the latter?
>> 
>> And unless you have range queries on fileindex, like "where jobid = 12
>> and fileindex between 4 and 24", then you should be able to replace
>> (jobid, fileindex) with (fileindex, jobid) and then get rid of the
>> stand-alone index on (fileindex).
>> 
>> If you add an "order by" to the select statement which orders by the
>> fields of one of the remaining indexes, then you could make the
>> maintenance of that index much cheaper.
> 
> I will make these changes one-by-one and test each.  This will be interesting.

On a test server, the original insert takes about 45 minutes.  I removed all 
indexes.  25 minutes.
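
For reference, this is the usual drop / bulk-load / rebuild pattern; a rough
sketch of what that looks like (the index name below is a placeholder, not
the real schema):

  DROP INDEX IF EXISTS file_placeholder_idx;

  INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
    SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
           Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
           batch_testing.DeltaSeq
      FROM batch_testing
      JOIN Path ON (batch_testing.Path = Path.Path)
      JOIN Filename ON (batch_testing.Name = Filename.Name);

  -- Rebuilding an index after the load is a single sequential pass,
  -- which is typically far cheaper than maintaining it row by row
  -- during the insert.
  CREATE INDEX file_placeholder_idx ON File (JobId, PathId, FilenameId);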

Thank you.

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org









Re: [PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-10 Thread Dan Langille
> On Feb 10, 2016, at 2:47 AM, Jeff Janes  wrote:
> 
> On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille  wrote:
>> I have a wee database server which regularly tries to insert 1.5 million or
>> even 15 million new rows into a 400 million row table.  Sometimes these
>> inserts take hours.
>> 
>> The actual query that produces the join is fast.  It's the insert which is
>> slow.
>> 
>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>> DeltaSeq)
>>  SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
>> Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
>> batch_testing.DeltaSeq
>>FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
>>   JOIN Filename ON (batch_testing.Name =
>> Filename.Name);
>> 
>> This is part of the plan: http://img.ly/images/9374145/full  created via
>> http://tatiyants.com/pev/#/plans
>> 
>> This gist contains postgresql.conf, zfs settings, slog, disk partitions.
>> 
>>   https://gist.github.com/dlangille/1a8c8cc62fa13b9f
> 
> The table you are inserting into has 7 indexes, all of which have to
> be maintained.  The index on the sequence column should be efficient
> to maintain.  But for the rest, if the inserted rows are not naturally
> ordered by any of the indexed columns then it would end up reading 6
> random scattered leaf pages in order to insert row pointers.  If none of
> those pages are in memory, that is going to be slow to read off from
> hdd in single file.  Also, you are going to dirty all of those scattered
> pages, and they will be slow to write back to hdd because there
> probably won't be much opportunity for write-combining.
> 
> Do you really need all of those indexes?
> 
> Won't the index on (jobid, pathid, filenameid) service any query that
> (jobid) does, so you can get rid of the latter?
> 
> And unless you have range queries on fileindex, like "where jobid = 12
> and fileindex between 4 and 24", then you should be able to replace
> (jobid, fileindex) with (fileindex, jobid) and then get rid of the
> stand-alone index on (fileindex).
> 
> If you add an "order by" to the select statement which orders by the
> fields of one of the remaining indexes, then you could make the
> maintenance of that index much cheaper.

I will make these changes one-by-one and test each.  This will be interesting.

> Could you move the indexes for this table to SSD?

Now that's a clever idea.

bacula=# select pg_size_pretty(pg_indexes_size('file'));
 pg_size_pretty 
----------------
 100 GB
(1 row)

bacula=# select pg_size_pretty(pg_table_size('file'));
 pg_size_pretty 
----------------
 63 GB
(1 row)

bacula=#

Not surprising that the indexes are larger than the data.

The SSD is 30GB.  I don't have enough space.  Buying 2x500GB SSDs
would allow me to put all the data onto SSD.  I'm using about 306G for the 
databases now.
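
If a big enough SSD does show up, the mechanics are simple enough; a rough
sketch (the mount point and index name are hypothetical):

  CREATE TABLESPACE fast_ssd LOCATION '/ssd/pgdata';

  -- Move one index at a time onto the SSD-backed tablespace.
  ALTER INDEX file_placeholder_idx SET TABLESPACE fast_ssd;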

> SSD is probably wasted on your WAL.  If your main concern is bulk
> insertions, then WAL is going to be written sequentially with few fsyncs.
> That is ideal for HDD.  Even if you also have smaller transactions,

OK.

> WAL is still sequentially written as long as you have a non-volatile
> cache on your RAID controller which can absorb fsyncs efficiently.

Of note, no RAID controller or non-volatile cache here. I'm running ZFS with 
plain HBA controllers.

Thank you.  I have some interesting changes to test.

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org



[PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-09 Thread Dan Langille
I have a wee database server which regularly tries to insert 1.5 million or 
even 15 million new rows into a 400 million row table.  Sometimes these inserts 
take hours.

The actual query that produces the join is fast.  It's the insert which is slow.

INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
  SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId, 
Filename.FilenameId, batch_testing.LStat, batch_testing.MD5, 
batch_testing.DeltaSeq
FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
   JOIN Filename ON (batch_testing.Name = Filename.Name);

This is part of the plan: http://img.ly/images/9374145/full created via
http://tatiyants.com/pev/#/plans
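
(PEV takes a JSON-format plan, so the input was presumably generated with
something along these lines; the exact flags are an assumption, and note
that EXPLAIN ANALYZE on an INSERT actually executes it:)

  EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
  INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
    SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
           Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
           batch_testing.DeltaSeq
      FROM batch_testing
      JOIN Path ON (batch_testing.Path = Path.Path)
      JOIN Filename ON (batch_testing.Name = Filename.Name);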

This gist contains postgresql.conf, zfs settings, slog, disk partitions.

   https://gist.github.com/dlangille/1a8c8cc62fa13b9f 


I'm tempted to move it to faster hardware, but in case I've missed something 
basic...

Thank you.

--
Dan Langille - BSDCan / PGCon
d...@langille.org








Re: [PERFORM] Running lots of inserts from selects on 9.4.5

2016-02-09 Thread Jeff Janes
On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille  wrote:
> I have a wee database server which regularly tries to insert 1.5 million or
> even 15 million new rows into a 400 million row table.  Sometimes these
> inserts take hours.
>
> The actual query that produces the join is fast.  It's the insert which is
> slow.
>
> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
> DeltaSeq)
>   SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
> Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
> batch_testing.DeltaSeq
> FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
>JOIN Filename ON (batch_testing.Name =
> Filename.Name);
>
> This is part of the plan: http://img.ly/images/9374145/full  created via
> http://tatiyants.com/pev/#/plans
>
> This gist contains postgresql.conf, zfs settings, slog, disk partitions.
>
>https://gist.github.com/dlangille/1a8c8cc62fa13b9f

The table you are inserting into has 7 indexes, all of which have to
be maintained.  The index on the sequence column should be efficient
to maintain.  But for the rest, if the inserted rows are not naturally
ordered by any of the indexed columns then it would end up reading 6
random scattered leaf pages in order to insert row pointers.  If none of
those pages are in memory, that is going to be slow to read off from
hdd in single file.  Also, you are going to dirty all of those scattered
pages, and they will be slow to write back to hdd because there
probably won't be much opportunity for write-combining.

Do you really need all of those indexes?

Won't the index on (jobid, pathid, filenameid) service any query that
(jobid) does, so you can get rid of the latter?

And unless you have range queries on fileindex, like "where jobid = 12
and fileindex between 4 and 24", then you should be able to replace
(jobid, fileindex) with (fileindex, jobid) and then get rid of the
stand-alone index on (fileindex).

If you add an "order by" to the select statement which orders by the
fields of one of the remaining indexes, then you could make the
maintenance of that index much cheaper.
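
A rough sketch of those two changes plus the ordered insert (the index names
here are placeholders, not the actual Bacula index names):

  -- (jobid, pathid, filenameid) already covers queries on jobid alone.
  DROP INDEX IF EXISTS file_jobid_idx;

  -- Swap the column order so the composite index also serves lookups on
  -- fileindex, then drop the stand-alone fileindex index.
  DROP INDEX IF EXISTS file_jobid_fileindex_idx;
  CREATE INDEX file_fileindex_jobid_idx ON File (FileIndex, JobId);
  DROP INDEX IF EXISTS file_fileindex_idx;

  -- Order the bulk insert by the columns of a remaining index so its
  -- leaf pages are touched sequentially rather than at random.
  INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
    SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
           Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
           batch_testing.DeltaSeq
      FROM batch_testing
      JOIN Path ON (batch_testing.Path = Path.Path)
      JOIN Filename ON (batch_testing.Name = Filename.Name)
     ORDER BY batch_testing.JobId, Path.PathId, Filename.FilenameId;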

Could you move the indexes for this table to SSD?

SSD is probably wasted on your WAL.  If your main concern is bulk
insertions, then WAL is going to be written sequentially with few fsyncs.
That is ideal for HDD.  Even if you also have smaller transactions,
WAL is still sequentially written as long as you have a non-volatile
cache on your RAID controller which can absorb fsyncs efficiently.

Cheers,

Jeff

