Re: [GENERAL] Running update in chunks?

2013-01-27 Thread Jasen Betts
On 2013-01-25, Tim Uckun  wrote:
>> I agree that seems like the most likely cause. Each update to the
>> row holding the hstore column requires adding new index entries for
>> all the hstore elements, and autovacuum will need to clean up the
>> old ones in the background. The best solution would be to either
>> normalize the data instead of using hstore, or move the hstore to a
>> separate table which is referenced by some sort of ID from the
>> frequently-updated table.
>
>
> That's very interesting. I can certainly split up the table, no big
> deal there.  So would the index be redone even if I am not updating
> the hstore field itself?

Absolutely! See MVCC.

 http://www.postgresql.org/docs/current/static/mvcc-intro.html


-- 
⚂⚃ 100% natural





Re: [GENERAL] Running update in chunks?

2013-01-26 Thread Tim Uckun
>
> That would be one solution, but I think a better one would be to not
> store "make_id" in "imports" in the first place, but instead to always
> fetch it by joining "imports" to "models" at query time.
>

My problem here is that the incoming data is quite messy, so the join
conditions become weird (lots of ORs and such). A multi-pass approach
seems to work better.
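
To sketch what I mean by multi-pass (purely illustrative; the "makes"
table, the hstore key, and the matching rules here are invented):

-- pass 1: rows with a clean model match
update imports i
   set make_id = md.make_id
  from models md
 where i.model_id = md.id
   and i.make_id is null;

-- pass 2: a looser fallback match for whatever is still unresolved
update imports i
   set make_id = mk.id
  from makes mk
 where i.make_id is null
   and i.data -> 'make' ilike mk.name;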




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Jeff Janes
On Fri, Jan 25, 2013 at 3:38 AM, Tim Uckun  wrote:
>>
>> That suggests (to me, at least) that it is related to index updating. Again,
>> your GIN index seems primary candidate.
>>
>> Try running iostat (I think that's available on a Mac) with/without the
>> fillfactor and with/without the GIN index while you do the updates. It's
>> possible your SSD is just behaving oddly under stress.
>>
>
>
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field


When the row gets updated, it might move somewhere else.  An index
maps data values to row locations, so if the location changes, all
indexes need to be updated, even if the data value for that index did
not change.  (Well, I shouldn't say they *need* to be updated.  The
database could have been designed, with considerable difficulty and
consequences, to leave behind permanent redirect pointers to the new
location.  But it wasn't.)

There is a mechanism called HOT update (Heap-Only Tuple) which can
prevent this under certain conditions.

1) Either none of the fields being updated are indexed, or any that
are both updated and indexed are updated to the value they already
have.

2) There is room for a new copy of the tuple on the same page as the old one.

Lowering the fillfactor helps with requirement 2, especially since
your tuples are probably wide (because of the hstore column) and so
not many fit on a page.

Note that if you update a field to the same value it already has, it
still makes a new copy of the entire tuple anyway.  (It detects that
:old = :new for HOT-eligibility purposes if the field is indexed, but
not for the purpose of suppressing the copy.  And if the tuple needs to
be copied but there is no room on that page, then it isn't eligible for
HOT after all.)

So you should add a where clause to the UPDATE to filter out things
that are unchanged.
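
For example, a sketch against the update shown earlier in this thread
(IS DISTINCT FROM treats NULLs as comparable, so NULL-to-zero counts as
a change):

update cars.imports i
   set make_id = md.make_id
  from cars.models md
 where i.model_id = md.id
   and i.make_id is distinct from md.make_id;  -- skip rows that would not change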


> but it looks like I
> need to split the hstore into a different table.

That would be one solution, but I think a better one would be to not
store "make_id" in "imports" in the first place, but instead to always
fetch it by joining "imports" to "models" at query time.

Cheers,

Jeff




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Albe Laurenz
Tim Uckun wrote:
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field but it looks like I
> need to split the hstore into a different table.

Every UPDATE that is not HOT will create a row version with
a new "row id".  That means that all indexes referencing that
row will have to get updated.

That is consistent with better performance with low fillfactor
(which makes HOT more likely).
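
One way to see how often updates are HOT is to compare the counters in
pg_stat_user_tables (schema and table names as earlier in the thread):

select n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) as hot_pct
  from pg_stat_user_tables
 where schemaname = 'cars' and relname = 'imports';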

Yours,
Laurenz Albe



Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Richard Huxton

On 25/01/13 11:38, Tim Uckun wrote:

>> That suggests (to me, at least) that it is related to index updating. Again,
>> your GIN index seems primary candidate.
>>
>> Try running iostat (I think that's available on a Mac) with/without the
>> fillfactor and with/without the GIN index while you do the updates. It's
>> possible your SSD is just behaving oddly under stress.
>
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field but it looks like I
> need to split the hstore into a different table.

If the row moves to a different block, then it has no choice. The old
index entry will point to an invalid block. There are some optimisations
(HOT - http://pgsql.tapoueh.org/site/html/misc/hot.html) but they rely
on (IIRC) the update staying on the same block and also not updating any
indexed fields (and you were, I think).


A GIN index is very expensive to update compared to btree too.
--
  Richard Huxton
  Archonet Ltd




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Tim Uckun
>
> That suggests (to me, at least) that it is related to index updating. Again,
> your GIN index seems primary candidate.
>
> Try running iostat (I think that's available on a Mac) with/without the
> fillfactor and with/without the GIN index while you do the updates. It's
> possible your SSD is just behaving oddly under stress.
>


I dropped the index and the numbers shot up tenfold or more.  I don't
know why postgres feels the need to update the GIN index on the hstore
field when I am only updating an integer field but it looks like I
need to split the hstore into a different table.




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Richard Huxton

On 25/01/13 08:57, Tim Uckun wrote:

>> What if you do:
>> alter table cars.imports set (fillfactor=50);
>> Before the vacuum full, and then try the update again?
>
> This makes a dramatic difference when combined with a vacuum.
>
> UPDATE 98834
> Time: 3408.210 ms
>
> Ten times faster!

That suggests (to me, at least) that it is related to index updating. 
Again, your GIN index seems primary candidate.


A fillfactor of 50% means row updates probably stay on the same
disk block as their previous version, which means fewer index updates.
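
To reproduce that, roughly (restating the steps already used above):

alter table cars.imports set (fillfactor=50);
vacuum full cars.imports;  -- rewrite the table so existing pages honour the new fillfactor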


Try running iostat (I think that's available on a Mac) with/without the 
fillfactor and with/without the GIN index while you do the updates. It's 
possible your SSD is just behaving oddly under stress.


--
  Richard Huxton
  Archonet Ltd




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Tim Uckun
> I agree that seems like the most likely cause. Each update to the
> row holding the hstore column requires adding new index entries for
> all the hstore elements, and autovacuum will need to clean up the
> old ones in the background. The best solution would be to either
> normalize the data instead of using hstore, or move the hstore to a
> separate table which is referenced by some sort of ID from the
> frequently-updated table.


That's very interesting. I can certainly split up the table, no big
deal there.  So would the index be redone even if I am not updating
the hstore field itself?




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Tim Uckun
relid | 26710
schemaname| cars
relname   | imports
seq_scan  | 280
seq_tup_read  | 25873543
idx_scan  | 4
idx_tup_fetch | 2749
n_tup_ins | 98926
n_tup_upd | 6350466
n_tup_del | 92
n_tup_hot_upd | 625286
n_live_tup| 98834
n_dead_tup| 0
last_vacuum   | 2013-01-25 21:55:36.078614+13
last_autovacuum   | 2013-01-25 21:58:40.850546+13
last_analyze  | 2013-01-25 21:55:36.305967+13
last_autoanalyze  | 2013-01-25 21:51:54.307639+13
vacuum_count  | 6
autovacuum_count  | 32
analyze_count | 6
autoanalyze_count | 25


On Wed, Jan 23, 2013 at 7:50 AM, Steve Crawford
 wrote:
> On 01/21/2013 05:02 PM, Tim Uckun wrote:
>>
>> I already posted the schema earlier. It's a handful of integer fields
>> with one hstore field.
>
>
> Oh well. I can't find it but maybe it got lost in shipping or eaten by a
> spam filter.
>
>
>>
>>
>> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
>>  wrote:
>>>
>>> select * from pg_stat_user_tables where relname='yourtable';
>>
>>
>>
>> Messy output
>
> Don't know if you are in psql but if so, expanded display works for this.
> I.e.:
> steve@[local] => \x
> Expanded display is on.
> steve@[local] => select * from pg_stat_user_tables where relname='footest';
> -[ RECORD 1 ]-+--
> relid | 781691
> schemaname| public
> relname   | footest
> seq_scan  | 3
> seq_tup_read  | 609
> idx_scan  |
> idx_tup_fetch |
> n_tup_ins | 609
> n_tup_upd | 0
> n_tup_del | 0
> n_tup_hot_upd | 0
> n_live_tup| 301
> n_dead_tup| 0
> last_vacuum   |
> last_autovacuum   |
> last_analyze  |
> last_autoanalyze  | 2012-12-19 08:42:23.347368-08
> vacuum_count  | 0
> autovacuum_count  | 0
> analyze_count | 0
> autoanalyze_count | 2
>
>
>
>>
>>
>> "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
>>
>> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
>> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
>> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24
>>
>>
> So at least autovacuum is running (and some manual vacuum and analyze as
> well).
>
> Cheers,
> Steve
>




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Tim Uckun
>
> What if you do:
> alter table cars.imports set (fillfactor=50);
> Before the vacuum full, and then try the update again?


This makes a dramatic difference when combined with a vacuum.

UPDATE 98834
Time: 3408.210 ms

Ten times faster!




Re: [GENERAL] Running update in chunks?

2013-01-25 Thread Tim Uckun
Sorry I haven't been responsive for a little while.

I ran your script, but creating a new schema instead of my existing
schema. My timings were similar to yours (more or less) except for the
vacuums, which took roughly 147891 ms.


On Tue, Jan 22, 2013 at 3:21 PM, Kevin Grittner  wrote:
> Kevin Grittner wrote:
>
>> update imports set make_id = 0
>>
>> Query returned successfully: 98834 rows affected, 45860 ms execution time.
>
> For difficult problems, there is nothing like a self-contained test
> case, that someone else can run to see the issue. Here's a starting
> point:
>
> create extension if not exists hstore;
> create schema cars;
> drop table if exists cars.imports;
> CREATE TABLE cars.imports
> (
> id serial NOT NULL,
> target_id integer,
> batch_id integer,
> make_id integer,
> model_id integer,
> date timestamp without time zone,
> division_id integer,
> dealer_id integer,
> data hstore,
> created_at timestamp without time zone NOT NULL,
> updated_at timestamp without time zone NOT NULL,
> CONSTRAINT imports_pkey PRIMARY KEY (id)
> );
> CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
> CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree 
> (dealer_id);
> CREATE INDEX index_imports_on_division_id ON cars.imports USING btree 
> (division_id);
> CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
> CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
> insert into cars.imports (created_at, updated_at) select now(), now() from 
> (select generate_series(1, 100000)) x;
> vacuum freeze analyze;
> \timing on
> update cars.imports set make_id = 0;
> vacuum analyze;
> update cars.imports set make_id = 0;
>
> ... and here's what I get when I run it on my desktop computer with
> ordinary disk drives and a completely default configuration:
>
> test=# create extension if not exists hstore;
> CREATE EXTENSION
> Time: 48.032 ms
> test=# create schema cars;
> CREATE SCHEMA
> Time: 8.150 ms
> test=# drop table if exists cars.imports;
> NOTICE:  table "imports" does not exist, skipping
> DROP TABLE
> Time: 0.205 ms
> test=# CREATE TABLE cars.imports
> test-# (
> test(# id serial NOT NULL,
> test(# target_id integer,
> test(# batch_id integer,
> test(# make_id integer,
> test(# model_id integer,
> test(# date timestamp without time zone,
> test(# division_id integer,
> test(# dealer_id integer,
> test(# data hstore,
> test(# created_at timestamp without time zone NOT NULL,
> test(# updated_at timestamp without time zone NOT NULL,
> test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
> test(# );
> NOTICE:  CREATE TABLE will create implicit sequence "imports_id_seq" for 
> serial column "imports.id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" 
> for table "imports"
> CREATE TABLE
> Time: 152.677 ms
> test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
> CREATE INDEX
> Time: 6.391 ms
> test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree 
> (dealer_id);
> CREATE INDEX
> Time: 64.668 ms
> test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree 
> (division_id);
> CREATE INDEX
> Time: 65.573 ms
> test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree 
> (make_id);
> CREATE INDEX
> Time: 64.959 ms
> test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree 
> (model_id);
> CREATE INDEX
> Time: 64.906 ms
> test=# insert into cars.imports (created_at, updated_at) select now(), now() 
> from (select generate_series(1, 100000)) x;
> INSERT 0 100000
> Time: 2516.559 ms
> test=# vacuum freeze analyze;
> VACUUM
> Time: 3357.778 ms
> test=# \timing on
> Timing is on.
> test=# update cars.imports set make_id = 0;
> UPDATE 100000
> Time: 2937.241 ms
> test=# vacuum analyze;
> VACUUM
> Time: 2097.426 ms
> test=# update cars.imports set make_id = 0;
> UPDATE 100000
> Time: 3935.939 ms
>
> Ubuntu 12.10
> i7-3770 CPU @ 3.40GHz with 16GB RAM
> Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 
> 2013 x86_64 x86_64 x86_64 GNU/Linux
> A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.
>
>  PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 
> 4.7.2-2ubuntu1) 4.7.2, 64-bit
>
> test=# SELECT name, current_setting(name), source
> test-#   FROM pg_settings
> test-#   WHERE source NOT IN ('default', 'override');
>             name            |  current_setting   |        source
> ----------------------------+--------------------+----------------------
>  application_name           | psql               | client
>  client_encoding            | UTF8               | client
>  DateStyle                  | ISO, MDY           | configuration file
>  default_text_search_config | pg_catalog.english | configuration file
>  lc_messages                | en_US.UTF-8        | configuration file
>  lc_monetary                | en_US.UTF-8        | configuration file
>  lc_numeric                 | en_US.UTF-8        | configuration file
>  lc_time                    | en_US.UTF-8        | configuration file
>  log_timezone               | US/Central         | configuration file
>  max_connections            | 100                | configuration file
>  max_stack_depth            | 2MB                | environment variable
>  shared_buffers             | 32MB               | configuration file
>  TimeZone                   | US/Central         | configuration file
> (13 rows)
>
> I did absolutely no tuning from the default configuration.

Re: [GENERAL] Running update in chunks?

2013-01-23 Thread Kevin Grittner
Jeff Janes wrote:

> one hstore field can easily be equivalent to 50 text fields with
> an index on each one.
> 
> I'm pretty sure that that is your bottleneck.

I agree that seems like the most likely cause. Each update to the
row holding the hstore column requires adding new index entries for
all the hstore elements, and autovacuum will need to clean up the
old ones in the background. The best solution would be to either
normalize the data instead of using hstore, or move the hstore to a
separate table which is referenced by some sort of ID from the
frequently-updated table.
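
A sketch of the second option (the new table and index names are
invented for illustration):

create table cars.import_data (
    import_id integer primary key references cars.imports (id),
    data      hstore
);

create index index_import_data_on_data
    on cars.import_data using gin (data);

-- then migrate the data and drop the hstore column (and its GIN index)
-- from cars.imports, so updates there no longer touch the GIN index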

-Kevin




Re: [GENERAL] Running update in chunks?

2013-01-22 Thread Steve Crawford

On 01/22/2013 10:59 AM, Alvaro Herrera wrote:

> Steve Crawford escribió:
>> On 01/21/2013 05:02 PM, Tim Uckun wrote:
>>> I already posted the schema earlier. It's a handful of integer fields
>>> with one hstore field.
>>
>> Oh well. I can't find it but maybe it got lost in shipping or eaten
>> by a spam filter.
>
> This is what we have the archives for:
>
> http://www.postgresql.org/message-id/flat/50fedf66.7050...@pinpointresearch.com#50fedf66.7050...@pinpointresearch.com
>
> The original message is at the top of the page (obviously).

Didn't notice that the information was over on the github site (which, 
of course, also makes it impossible to search for in my email and 
unavailable to the mail archives for those wishing to view it in the 
future).


Cheers,
Steve





Re: [GENERAL] Running update in chunks?

2013-01-22 Thread Alvaro Herrera
Steve Crawford escribió:
> On 01/21/2013 05:02 PM, Tim Uckun wrote:
> >I already posted the schema earlier. It's a handful of integer fields
> >with one hstore field.
> 
> Oh well. I can't find it but maybe it got lost in shipping or eaten
> by a spam filter.

This is what we have the archives for:

http://www.postgresql.org/message-id/flat/50fedf66.7050...@pinpointresearch.com#50fedf66.7050...@pinpointresearch.com

The original message is at the top of the page (obviously).

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [GENERAL] Running update in chunks?

2013-01-22 Thread Steve Crawford

On 01/21/2013 06:21 PM, Kevin Grittner wrote:

> Kevin Grittner wrote:
> ...
>  shared_buffers | 32MB   | configuration file
> ...
>
> I did absolutely no tuning from the default configuration.

But Tim has his shared_buffers set to 1600kB 
("shared_buffers";"1600kB";"configuration file") or roughly 1/20 of the 
typical default value, which is a very low starting value anyway, on a 
machine populated with 8GB RAM.


I'd like to see how it runs with a more reasonable shared_buffers 
setting. At a very minimum the 32MB default.


Cheers,
Steve





Re: [GENERAL] Running update in chunks?

2013-01-22 Thread Steve Crawford

On 01/21/2013 05:02 PM, Tim Uckun wrote:

> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.


Oh well. I can't find it but maybe it got lost in shipping or eaten by a 
spam filter.





> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
>  wrote:
>> select * from pg_stat_user_tables where relname='yourtable';
>
> Messy output
Don't know if you are in psql but if so, expanded display works for 
this. I.e.:

steve@[local] => \x
Expanded display is on.
steve@[local] => select * from pg_stat_user_tables where relname='footest';
-[ RECORD 1 ]-+--
relid | 781691
schemaname| public
relname   | footest
seq_scan  | 3
seq_tup_read  | 609
idx_scan  |
idx_tup_fetch |
n_tup_ins | 609
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 301
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-12-19 08:42:23.347368-08
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 2




"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24


So at least autovacuum is running (and some manual vacuum and analyze as 
well).


Cheers,
Steve





Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Jeff Janes
On Monday, January 21, 2013, Tim Uckun wrote:

> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.
>


one hstore field can easily be equivalent to 50 text fields with an index
on each one.

I'm pretty sure that that is your bottleneck.

what does \di+ show?


Cheers,

Jeff



Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Jeff Janes
On Monday, January 21, 2013, Tim Uckun wrote:

> > First off, what does it say for rows affected? (Hint, if you really
> > are using a default configuration and it doesn't say 0 rows
> > affected, please show us the actual query used.)
>
> update imports set make_id = null
>
> Query returned successfully: 98834 rows affected, 49673 ms execution time.
>
>
> vacuum analyze imports
>
> Query returned successfully with no result in 4138 ms.
>
> VACUUM FULL imports;
>

What if you do:
alter table cars.imports set (fillfactor=50);
Before the vacuum full, and then try the update again?

Cheers,

Jeff


Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Kevin Grittner
Kevin Grittner wrote:

> update imports set make_id = 0
> 
> Query returned successfully: 98834 rows affected, 45860 ms execution time.

For difficult problems, there is nothing like a self-contained test
case, that someone else can run to see the issue. Here's a starting
point:

create extension if not exists hstore;
create schema cars;
drop table if exists cars.imports;
CREATE TABLE cars.imports
(
id serial NOT NULL,
target_id integer,
batch_id integer,
make_id integer,
model_id integer,
date timestamp without time zone,
division_id integer,
dealer_id integer,
data hstore,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT imports_pkey PRIMARY KEY (id)
);
CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX index_imports_on_division_id ON cars.imports USING btree 
(division_id);
CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
insert into cars.imports (created_at, updated_at) select now(), now() from 
(select generate_series(1, 100000)) x;
vacuum freeze analyze;
\timing on
update cars.imports set make_id = 0;
vacuum analyze;
update cars.imports set make_id = 0;

... and here's what I get when I run it on my desktop computer with
ordinary disk drives and a completely default configuration:

test=# create extension if not exists hstore;
CREATE EXTENSION
Time: 48.032 ms
test=# create schema cars;
CREATE SCHEMA
Time: 8.150 ms
test=# drop table if exists cars.imports;
NOTICE:  table "imports" does not exist, skipping
DROP TABLE
Time: 0.205 ms
test=# CREATE TABLE cars.imports
test-# (
test(# id serial NOT NULL,
test(# target_id integer,
test(# batch_id integer,
test(# make_id integer,
test(# model_id integer,
test(# date timestamp without time zone,
test(# division_id integer,
test(# dealer_id integer,
test(# data hstore,
test(# created_at timestamp without time zone NOT NULL,
test(# updated_at timestamp without time zone NOT NULL,
test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
test(# );
NOTICE:  CREATE TABLE will create implicit sequence "imports_id_seq" for serial 
column "imports.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" 
for table "imports"
CREATE TABLE
Time: 152.677 ms
test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX
Time: 6.391 ms
test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree 
(dealer_id);
CREATE INDEX
Time: 64.668 ms
test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree 
(division_id);
CREATE INDEX
Time: 65.573 ms
test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree 
(make_id);
CREATE INDEX
Time: 64.959 ms
test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree 
(model_id);
CREATE INDEX
Time: 64.906 ms
test=# insert into cars.imports (created_at, updated_at) select now(), now() 
from (select generate_series(1, 100000)) x;
INSERT 0 100000
Time: 2516.559 ms
test=# vacuum freeze analyze;
VACUUM
Time: 3357.778 ms
test=# \timing on
Timing is on.
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 2937.241 ms
test=# vacuum analyze;
VACUUM
Time: 2097.426 ms
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 3935.939 ms

Ubuntu 12.10
i7-3770 CPU @ 3.40GHz with 16GB RAM
Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 
x86_64 x86_64 x86_64 GNU/Linux
A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.

 PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 
4.7.2-2ubuntu1) 4.7.2, 64-bit

test=# SELECT name, current_setting(name), source
test-#   FROM pg_settings
test-#   WHERE source NOT IN ('default', 'override');
            name            |  current_setting   |        source        
----------------------------+--------------------+----------------------
 application_name           | psql               | client
 client_encoding            | UTF8               | client
 DateStyle                  | ISO, MDY           | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 lc_messages                | en_US.UTF-8        | configuration file
 lc_monetary                | en_US.UTF-8        | configuration file
 lc_numeric                 | en_US.UTF-8        | configuration file
 lc_time                    | en_US.UTF-8        | configuration file
 log_timezone               | US/Central         | configuration file
 max_connections            | 100                | configuration file
 max_stack_depth            | 2MB                | environment variable
 shared_buffers             | 32MB               | configuration file
 TimeZone                   | US/Central         | configuration file
(13 rows)

I did absolutely no tuning from the default configuration.

Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
I already posted the schema earlier. It's a handful of integer fields
with one hstore field.


On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
 wrote:
> select * from pg_stat_user_tables where relname='yourtable';



Messy output

"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Steve Crawford

On 01/21/2013 04:15 PM, Steve Crawford wrote:

> On 01/21/2013 03:45 PM, Tim Uckun wrote:
>> Oh I forgot
>> ...
>
> Me, too. I forgot to ask for the table definition. If there are
> variable-length fields like "text" or "varchar", what is the typical
> size of the data.


Also, what is the physical size of the table (\dt+ yourtable)?

Perhaps even the output of
select * from pg_stat_user_tables where relname='yourtable';
might be useful.

Cheers,
Steve




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Steve Crawford

On 01/21/2013 03:45 PM, Tim Uckun wrote:

> Oh I forgot
> ...
> "shared_buffers";"1600kB";"configuration file"


You *reduced* shared buffers to 1.6MB? IIRC the typical default is 32MB 
and the most common adjustment is to *increase* shared buffers. Most of 
my servers are set to 2GB.


Try bumping that up to a reasonable value 
(http://www.postgresql.org/docs/9.2/static/runtime-config-resource.html) 
and share the results. Don't forget to restart PG after changing that 
setting.
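
For example, in postgresql.conf (the value is just a starting point for
an 8GB machine):

shared_buffers = 512MB

Then restart; for a homebrew install that would be something like
pg_ctl -D /usr/local/var/postgres restart.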


Cheers,
Steve




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
Oh I forgot



 SELECT version();

"PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple
clang version 4.1 (tags/Apple/clang-421.11.65) (based on LLVM 3.1svn),
64-bit"


SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');



"application_name";"pgAdmin III - Query Tool";"client"
"bytea_output";"escape";"session"
"client_encoding";"UNICODE";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, DMY";"session"
"default_text_search_config";"pg_catalog.english";"configuration file"
"lc_messages";"en_NZ.UTF-8";"configuration file"
"lc_monetary";"en_NZ.UTF-8";"configuration file"
"lc_numeric";"en_NZ.UTF-8";"configuration file"
"lc_time";"en_NZ.UTF-8";"configuration file"
"log_timezone";"NZ";"configuration file"
"max_connections";"20";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"search_path";"chrysler, public";"session"
"shared_buffers";"1600kB";"configuration file"
"TimeZone";"NZ";"configuration file"




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
> First off, what does it say for rows affected? (Hint, if you really
> are using a default configuration and it doesn't say 0 rows
> affected, please show us the actual query used.)

update imports set make_id = null

Query returned successfully: 98834 rows affected, 49673 ms execution time.


vacuum analyze imports

Query returned successfully with no result in 4138 ms.

VACUUM FULL imports;

Query returned successfully with no result in 38106 ms.

VACUUM FREEZE ANALYZE;

Query returned successfully with no result in 184635 ms

update imports set make_id = 0

Query returned successfully: 98834 rows affected, 45860 ms execution time.


So all the vacuuming saved about four seconds of execution time.

Here is the postgresql.conf, completely untouched from the default
install: https://gist.github.com/4590590




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Kevin Grittner
Kevin Grittner wrote:

> First off, what does it say for rows affected? (Hint, if you really
> are using a default configuration and it doesn't say 0 rows
> affected, please show us the actual query used.)

Never mind that bit -- I got myself confused. Sorry for the noise.

-Kevin




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread patrick keshishian
On Mon, Jan 21, 2013 at 1:48 PM, Tim Uckun  wrote:
>> I'd be curious to see results of the same "update" on a standard HDD
>> vs the SSD, and maybe on a more typical database deployment hardware
>> vs a macbook air.
>>
>
>
> I haven't tried it on any other machine yet.  CREATE TEMP TABLE tt as
> SELECT ...  takes eight seconds so presumably the disk is not the
> choke point.

You are making an assumption that a fresh write is the same as a
re-write. Try the test.

--patrick




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Kevin Grittner
Tim Uckun wrote:

> If you have any suggestions I am all ears. For the purposes of this
> discussion we can narrow down the problem to this update statement.
> 
> Update imports set make_id = null.

Well, that simplifies things.

First off, what does it say for rows affected? (Hint, if you really
are using a default configuration and it doesn't say 0 rows
affected, please show us the actual query used.)

Second, try connecting to the database as a superuser and running:

VACUUM ANALYZE imports;
-- (show us the results)
VACUUM FULL imports;
VACUUM FREEZE ANALYZE;  -- (don't specify a table)

Then try your query and see whether performance is any different.

-Kevin




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Kevin Grittner
Richard Huxton wrote:
> On 21/01/13 20:09, Tim Uckun wrote:

>> Just to close this up and give some guidance to future
>> googlers...

> Careful, future googlers.

+1

>> Conclusion. Updates on postgres are slow

> Nope.

Agreed.

>> (given the default postgresql.conf). I presume this is due to
>> MVCC or the WAL or something and there are probably some things
>> I can do to tweak the conf file to make them go faster but out
>> of the box running an update on a table with lots of rows is
>> going to cost you a lot.

> Unlikely. Do you really think that a PostgreSQL installation
> typically runs 100 times slower on updates than inserts and every
> other user has just said "oh, that's ok then"? Or is it more
> likely that something peculiar is broken on your setup.

As someone who has managed hundreds of databases ranging up to over
3TB and never seen this without some very specific cause, I agree
that there is something wonky on Tim's setup which he hasn't told
us about. Then again, I'm not sure we've pushed hard enough for the
relevant details.

Tim, if you're still interested in resolving this, please post the
results from running the SQL code on this page:

http://wiki.postgresql.org/wiki/Server_Configuration

It might be worthwhile to read through this page:

http://wiki.postgresql.org/wiki/Slow_Query_Questions

... and try some of the ideas there. Base disk performance numbers
would help put the results in perspective.

The cause could be anything from table bloat due to inadequate
vacuuming to hardware problems.

-Kevin




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
> I'd be curious to see results of the same "update" on a standard HDD
> vs the SSD, and maybe on a more typical database deployment hardware
> vs a macbook air.
>


I haven't tried it on any other machine yet.  CREATE TEMP TABLE tt as
SELECT ...  takes eight seconds so presumably the disk is not the
choke point.




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
>
> Nope.

If you have any suggestions I am all ears. For the purposes of this
discussion we can narrow down the problem to this update statement.

Update imports set make_id = null.

There are 98K records in the table. There is no index on the make_id
field. Standard untouched postgresql.conf from the brew install of
postgres.


> Unlikely. Do you really think that a PostgreSQL installation typically runs
> 100 times slower on updates than inserts and every other user has just said
> "oh, that's ok then"? Or is it more likely that something peculiar is broken
> on your setup.

I really don't know. That's why I am here asking.  I don't think
anything in particular is broken with my system.  As mentioned above, the
setup is really simple. Standard postgres install, the default conf
file, update one field on one table. It takes fifty plus seconds.

I concede that if I were to go into the postgresql.conf and make some
changes it would probably run faster (maybe much faster), but I wanted
to exhaust other factors before I went messing with the default
install.




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread patrick keshishian
On Mon, Jan 21, 2013 at 1:23 PM, Richard Huxton  wrote:
> On 21/01/13 20:09, Tim Uckun wrote:
>>
>> Just to close this up and give some guidance to future googlers...
>
> Careful, future googlers.
>
>> Conclusion.  Updates on postgres are slow
>
> Nope.
>
>
>> (given the default
>> postgresql.conf).  I presume this is due to MVCC or the WAL or
>> something and there are probably some things I can do to tweak the
>> conf file to make them go faster but out of the box running an update
>> on a table with lots of rows is going to cost you a lot.
>
> Unlikely. Do you really think that a PostgreSQL installation typically runs
> 100 times slower on updates than inserts and every other user has just said
> "oh, that's ok then"? Or is it more likely that something peculiar is broken
> on your setup.
>
>
>>   Removing the indexes doesn't help that much.
>>
>> Suggestion for the PG team.  Deliver a more realistic postgres.conf by
>> default. The default one seems to be aimed at ten year old PCs with
>> very little RAM and disk space. At least deliver additional conf files
>> for small, medium, large, huge setups.

I'd be curious to see results of the same "update" on a standard HDD
vs the SSD, and maybe on a more typical database deployment hardware
vs a macbook air.

--patrick




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Richard Huxton

On 21/01/13 20:09, Tim Uckun wrote:

> Just to close this up and give some guidance to future googlers...

Careful, future googlers.

> Conclusion.  Updates on postgres are slow

Nope.

> (given the default
> postgresql.conf).  I presume this is due to MVCC or the WAL or
> something and there are probably some things I can do to tweak the
> conf file to make them go faster but out of the box running an update
> on a table with lots of rows is going to cost you a lot.

Unlikely. Do you really think that a PostgreSQL installation typically
runs 100 times slower on updates than inserts and every other user has
just said "oh, that's ok then"? Or is it more likely that something
peculiar is broken on your setup.

>   Removing the indexes doesn't help that much.
>
> Suggestion for the PG team.  Deliver a more realistic postgres.conf by
> default. The default one seems to be aimed at ten year old PCs with
> very little RAM and disk space. At least deliver additional conf files
> for small, medium, large, huge setups.

--
  Richard Huxton




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
Just to close this up and give some guidance to future googlers...

There are no foreign keys. The table is empty when I start. I copy the
data into it from a text file.

Removing indexes didn't help much (made a very slight difference).

Running a query CREATE TEMP TABLE tt AS SELECT ... using a massive
join takes about 8 seconds. I presume that's the baseline for the disk
and RAM given my current postgres configuration.  Note that this is
not a satisfactory option for me because I can't do what I want in one
step (the update I specified is one of many).

Running a very simple update "UPDATE imports set make_id = null"
takes over 50 seconds, so that's the minimum amount of time any update
is going to take.

Running a complex update where I join all the tables together and
update all the fields takes about 106 seconds.

Just running a complex select with the joins takes no time at all.

I tried chunking the updates using chunks of 100 records and 1000
records (where ID between X and Y repeatedly) and it was even slower.
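
The chunked form was roughly this (a sketch; the real boundaries came
from the data):

update imports set make_id = null where id between 1 and 1000;
update imports set make_id = null where id between 1001 and 2000;
-- ... and so on up to max(id), each statement committing separately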

Conclusion.  Updates on postgres are slow (given the default
postgresql.conf).  I presume this is due to MVCC or the WAL or
something and there are probably some things I can do to tweak the
conf file to make them go faster but out of the box running an update
on a table with lots of rows is going to cost you a lot. Removing the
indexes doesn't help that much.

Suggestion for the PG team.  Deliver a more realistic postgres.conf by
default. The default one seems to be aimed at ten year old PCs with
very little RAM and disk space. At least deliver additional conf files
for small, medium, large, huge setups.




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Kevin Grittner
Richard Huxton wrote:

> The only differences I can think of are WAL logging (transaction
> log) and index updates (the temp table has no indexes).

What about foreign keys? Are there any tables which reference the
updated column in a foreign key declaration? Do they have indexes
on that column?

-Kevin




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Richard Huxton

On 21/01/13 10:30, Tim Uckun wrote:

>> Can you try a couple of things just to check timings. Probably worth EXPLAIN
>> ANALYSE.
>>
>> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
>> md.id;
>
> Takes about 300 ms
>
>> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
>> ON i.model_id = md.id;
>
> Takes about 300 ms

OK - so writing all the data takes well under one second but updating
the same amount takes 50 seconds.


The only differences I can think of are WAL logging (transaction log) 
and index updates (the temp table has no indexes).


1. Try CREATE TABLE rather than CREATE TEMP TABLE - if that's still 
quick then it's not the time taken to write WAL.
2. Run the update query against your new tt table and see how long that 
takes.
3. Add indexes and repeat (in particular I'd be suspicious of the gin 
index on "data")


My guess is that it's the time taken to update the "data" index - gin 
indexes can be slow to rebuild (although 50 seconds seems *very* slow). 
If so there are a few options:
1. Split the table and put whatever this "data" is into an import_data 
table - assuming it doesn't change often.
2. Try a fill-factor of 50% or less - keeping the updates on the same 
data page as the original might help
3. Drop the gin index before doing your bulk update and rebuild it at 
the end. This is a common approach with bulk-loading / updates.
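
For option 3, something like:

drop index cars.index_imports_on_data;
-- ... run the bulk update(s) here ...
create index index_imports_on_data on cars.imports using gin (data);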


Oh - I'm assuming you're only updating those rows whose id has changed - 
that seemed to be the suggestion in your first message. If not, simply 
adding "AND make_id <> md.make_id" should help. Also (and you may well 
have considered this) - for a normalised setup you'd just have the 
model-id in "imports" and look up the make-id through the "models" table.


--
  Richard Huxton
  Archonet Ltd




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Tim Uckun
> Can you try a couple of things just to check timings. Probably worth EXPLAIN
> ANALYSE.
>
> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
> md.id;


Takes about 300 ms

>
> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
> ON i.model_id = md.id;

Takes about 300 ms

>
> Now the first one should take half a second judging by your previous
> explain. If the second one takes 50 seconds too then that's just the limit
> of your SSD's write. If it's much faster then something else is happening.




Re: [GENERAL] Running update in chunks?

2013-01-21 Thread Richard Huxton

On 21/01/13 08:04, Tim Uckun wrote:

> This is the query I am running
>
> update cars.imports i
>  set make_id = md.make_id
>  from cars.models md where i.model_id = md.id;
>
> Here is the analyse

Looks like it's the actual update that's taking all the time.

> This query takes fifty seconds on a macbook air with i7 processor and
> eight gigs of RAM and SSD hard drive.  I am using postgres 9.2
> installed with homebrew using the standard conf file.

Can you try a couple of things just to check timings. Probably worth 
EXPLAIN ANALYSE.


SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id = 
md.id;


CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models 
md ON i.model_id = md.id;


Now the first one should take half a second judging by your previous 
explain. If the second one takes 50 seconds too then that's just the 
limit of your SSD's write. If it's much faster then something else is 
happening.



--
  Richard Huxton
  Archonet Ltd

