Re: [PERFORM] Index scan is not working, why??

2010-10-21 Thread Samuel Gendler
Please provide the non-default config options on this host, plus the same from a
host which is using an index scan.  Also, postgresql version, OS,
and all of the other stuff that is asked for in this document:
http://wiki.postgresql.org/wiki/SlowQueryQuestions.  It is impossible to say
why the query planner might be choosing a particular plan without any
insight whatsoever as to how the server is configured.
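
For example, the non-default settings can be pulled straight out of a running
server (a minimal sketch; the pg_settings view is available on any recent
version):

SELECT name, setting, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;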



On Wed, Oct 20, 2010 at 10:25 PM, AI Rumman rumman...@gmail.com wrote:

 I don't know why a seq scan is running on the following query, where the same
 query is giving an index scan on other servers:
 explain analyze
 select *
 from act
 where act.acttype in ( 'Meeting','Call','Task');
   QUERY PLAN
 -----------------------------------------------------------------------------
  Seq Scan on act (cost=0.00..13386.78 rows=259671 width=142) (actual
 time=0.013..484.572 rows=263639 loops=1)
   Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text =
 'Call'::text) OR ((acttype)::text = 'Task'::text))
  Total runtime: 732.956 ms
 (3 rows)

 The above query is giving an index scan on other servers, and even if I rewrite
 the query as follows I get an index scan:
 explain analyze
 select *
 from act
 where act.acttype = 'Meeting'
 or act.acttype = 'Call';
   QUERY PLAN
 -----------------------------------------------------------------------------
  Bitmap Heap Scan on act (cost=17.98..1083.80 rows=2277 width=142) (actual
 time=1.901..9.722 rows=4808 loops=1)
   Recheck Cond: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text =
 'Call'::text))
   ->  BitmapOr  (cost=17.98..17.98 rows=2281 width=0) (actual
 time=1.262..1.262 rows=0 loops=1)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 rows=1141
 width=0) (actual time=0.790..0.790 rows=3181 loops=1)
               Index Cond: ((acttype)::text = 'Meeting'::text)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 rows=1141
 width=0) (actual time=0.469..0.469 rows=1630 loops=1)
               Index Cond: ((acttype)::text = 'Call'::text)
  Total runtime: 14.227 ms
 (8 rows)


 \d act
   Table "public.act"
    Column | Type | Modifiers
  ---------+------+-----------
  actid | integer | not null default 0
  subject | character varying(250) | not null
  semodule | character varying(20) |
  acttype | character varying(200) | not null
  date_start | date | not null
  due_date | date |
  time_start | character varying(50) |
  time_end | character varying(50) |
  sendnotification | character varying(3) | not null default '0'::character
 varying
  duration_hours | character varying(2) |
  duration_minutes | character varying(200) |
  status | character varying(200) |
  eventstatus | character varying(200) |
  priority | character varying(200) |
  location | character varying(150) |
  notime | character varying(3) | not null default '0'::character varying
  visibility | character varying(50) | not null default 'all'::character
 varying
  recurringtype | character varying(200) |
  end_date | date |
  end_time | character varying(50) |
 Indexes:
   act_pkey PRIMARY KEY, btree (actid)
   act_acttype_idx btree (acttype)
   act_date_start_idx btree (date_start)
   act_due_date_idx btree (due_date)
   act_eventstatus_idx btree (eventstatus)
   act_status_idx btree (status)
   act_subject_idx btree (subject)
   act_time_start_idx btree (time_start)

 Any idea, please?



Re: [PERFORM] Index scan is not working, why??

2010-10-21 Thread Scott Marlowe
On Thu, Oct 21, 2010 at 1:51 AM, Samuel Gendler
sgend...@ideasculptor.com wrote:
 please provide non-default config options on this host plus the same from a
 host which is using an index scan, please.  Also, postgresql version, OS,
 and all of the other stuff that is asked for in this
 document: http://wiki.postgresql.org/wiki/SlowQueryQuestions.  It is
 impossible to say why the query planner might be choosing a particular plan
 without any insight whatsoever as to how the server is configured.

I know it's mentioned in that wiki doc, but the DDL for the table and
its indexes, or the output of "\d tablename", is quite useful and should
be included as well.



[PERFORM] Periodically slow inserts

2010-10-21 Thread Gael Le Mignot

Hello,

We are using PostgreSQL for storing data and full-text search indexes
for the website of a daily newspaper. We are very happy overall with the
results, but we have one weird behaviour that we would like to solve.

The problem is when we index objects into the full-text search part of
the database (which is a DELETE and then an INSERT into a specific table):
the INSERT sometimes takes a long time (from 10s to 20s), but the same
insert (and many other similar ones) is usually fast (below 0.2s).

This slowness comes regularly, about every 200 objects indexed,
regardless of the frequency of the inserts. If I reindex one object
every 5 seconds for one hour, or one object every second for 10 minutes,
I get the same kind of results: around 0.5% of the time, indexing took
more than 10s.

The positive point is that this slowness doesn't block the rest of the
queries to the database, but it's still painful to have to wait (even if
only once in a while) for 10s or 20s when the end-user hits the "save"
button.

This slowness is associated with very high IO load on the operating
system. I tried playing with checkpoint parameters (making them more
frequent or less frequent), but I didn't notice any significant
difference.

Do you have any hint on how to smooth the process, so we don't have this
regular huge slowdown?



If you want more details about the setup:

- server is a Xen virtual machine with 8GB of memory, disks being 15000
  rpm SAS disks on RAID 1, and CPU being one core of a Nehalem processor
  (but CPU load is low anyway).

- the database schema is like:

CREATE TABLE sesql_index (
  classname varchar(255),
  id integer,
  created_at timestamp,
  modified_at timestamp,
  created_by integer,
  modified_by integer,
  workflow_state integer,
  site_id integer,
  title_text text,
  title_tsv tsvector,
  subtitle_text text,
  subtitle_tsv tsvector,
  fulltext_text text,
  fulltext_tsv tsvector,
  authors integer[],
  folders integer[],
  [...]
  indexed_at timestamp DEFAULT NOW(),
  PRIMARY KEY (classname, id)
);

CREATE TABLE sesql_author (CHECK (classname = 'Author'), 
  PRIMARY KEY (classname, id)) INHERITS (sesql_index);

CREATE TABLE sesql_program (CHECK (classname = 'Program'), 
  PRIMARY KEY (classname, id)) INHERITS (sesql_index);

CREATE TABLE sesql_default (CHECK (classname = 'Slideshow' OR classname
= 'Book' OR classname = 'Article' OR classname = 'Publication' OR
classname = 'Forum'), PRIMARY KEY (classname, id)) INHERITS (sesql_index);

(with a few other similar tables for different objects).

Inserts/deletes are done directly into the child tables; searches are
done either on the master table (sesql_index) or on the child tables
depending on the use case (but search works fine anyway).

In addition to that we have several indexes, created on each child
table:

CREATE INDEX sesql_default_classname_index ON sesql_default (classname);
CREATE INDEX sesql_default_id_index ON sesql_default (id);
CREATE INDEX sesql_default_created_at_index ON sesql_default (created_at);
CREATE INDEX sesql_default_modified_at_index ON sesql_default (modified_at);
CREATE INDEX sesql_default_created_by_index ON sesql_default (created_by);
CREATE INDEX sesql_default_modified_by_index ON sesql_default (modified_by);
CREATE INDEX sesql_default_workflow_state_index ON sesql_default 
(workflow_state);
CREATE INDEX sesql_default_site_id_index ON sesql_default (site_id);
CREATE INDEX sesql_default_publication_date_index ON sesql_default 
(publication_date);
CREATE INDEX sesql_default_authors_index ON sesql_default USING GIN (authors);
CREATE INDEX sesql_default_folders_index ON sesql_default USING GIN (folders);

And the heavy ones: for each fulltext field we have two columns, the
text and the tsv, with an index on the tsv, and the tsv itself is
updated via a trigger:

CREATE INDEX sesql_default_fulltext_index ON sesql_default USING GIN 
(fulltext_tsv);

CREATE TRIGGER sesql_default_fulltext_update BEFORE INSERT OR UPDATE
ON sesql_default FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(fulltext_tsv, 'public.lem_french', fulltext_text);

Thanks a lot for reading me until here ;)

Regards,

-- 
Gaël Le Mignot - g...@pilotsystems.net
Pilot Systems - 9, rue Desargues - 75011 Paris
Tel : +33 1 44 53 05 55 - www.pilotsystems.net
Gérez vos contacts et vos newsletters : www.cockpit-mailing.com



Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Kenneth Marshall
Hi,

There are a lot of details missing about your system:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

Cheers,
Ken

On Thu, Oct 21, 2010 at 02:25:44PM +0200, Gael Le Mignot wrote:
 
 Hello,
 
 We are  using PostgreSQL for  storing data and full-text  search indexes
 for the website of a daily newspaper. We are very happy overall with the
 results, but we have one weird behaviour that we would like to solve.
 
 ...



Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Leonardo Francalanci
 
 We are  using PostgreSQL for  storing data and  full-text  search indexes
 for the website of a daily newspaper. We are very happy overall with the
 results, but we have one weird behaviour that  we would like to solve.


I think there's a lot of missing info worth knowing:

1) Checkpoint logs? Enable them; maybe the slowness happens
at checkpoints:

log_checkpoints=true

2) How many rows does each table contain? (see the sketch after this list)

3) HW: how many discs do you have, and which controller are you using (and:
does it use a BBU?)

The more you tell the list, the better help you'll get...
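
For (2), a quick way to get approximate per-table row counts (a sketch,
assuming the sesql_* naming from the original post; reltuples is only as
fresh as the last ANALYZE):

SELECT relname, reltuples::bigint AS approx_rows
FROM pg_class
WHERE relkind = 'r' AND relname LIKE 'sesql_%'
ORDER BY relname;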






Re: [PERFORM] Index scan is not working, why??

2010-10-21 Thread Mladen Gogala

AI Rumman wrote:
I don't know why a seq scan is running on the following query, where the
same query is giving an index scan on other servers:

explain analyze
select *
from act
where act.acttype in ( 'Meeting','Call','Task');
  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on act (cost=0.00..13386.78 rows=259671 width=142) (actual 
time=0.013..484.572 rows=263639 loops=1)
  Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 
'Call'::text) OR ((acttype)::text = 'Task'::text))

 Total runtime: 732.956 ms
(3 rows)
Al, what percentage of the rows fit the above criteria? How big are
your histograms?
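
Both are easy to check (a sketch; the table and column names are taken
from the original post):

-- value distribution: what fraction of the table matches?
SELECT acttype, count(*)
FROM act
GROUP BY acttype
ORDER BY count(*) DESC;

-- what the planner's statistics actually hold for the column:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'act' AND attname = 'acttype';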


--
Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com 





[PERFORM] New wiki page on write reliability

2010-10-21 Thread Greg Smith
Now that some of my recent writing has gone from NDA protected to public 
sample, I've added a new page to the PostgreSQL wiki that provides a 
good starting set of resources to learn about an ever popular topic 
here, how write cache problems can lead to database corruption:  
http://wiki.postgresql.org/wiki/Reliable_Writes


Bruce also has a presentation he's been working on that adds pictures 
showing the flow of data through the various cache levels, to help 
people visualize the whole thing, that should get added into there once 
he's finished tweaking it.


I'd like to get some feedback from the members of this list about what's 
still missing after this expanded data dump.  Ultimately I'd like to get 
this page to be an authoritative enough resource that the Reliability 
section of the official documentation could point back to this as a 
recommendation for additional information.  So much of this material 
requires singling out specific vendors and staying up to date with 
hardware changes, both things that the official docs are not a good 
place for.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, PostgreSQL 9.0 High Performance:
http://www.2ndquadrant.com/books





Re: [PERFORM] New wiki page on write reliability

2010-10-21 Thread Brad Nicholson

On 10-10-21 10:08 AM, Greg Smith wrote:
Now that some of my recent writing has gone from NDA protected to 
public sample, I've added a new page to the PostgreSQL wiki that 
provides a good starting set of resources to learn about an ever 
popular topic here, how write cache problems can lead to database 
corruption:  http://wiki.postgresql.org/wiki/Reliable_Writes


Bruce also has a presentation he's been working on that adds pictures 
showing the flow of data through the various cache levels, to help 
people visualize the whole thing, that should get added into there 
once he's finished tweaking it.


I'd like to get some feedback from the members of this list about 
what's still missing after this expanded data dump.  Ultimately I'd 
like to get this page to be an authoritative enough resource that the 
Reliability section of the official documentation could point back 
to this as a recommendation for additional information.  So much of 
this material requires singling out specific vendors and staying up to 
date with hardware changes, both things that the official docs are not 
a good place for.




Looks like a good start.

I think a warning about turning fsync off, the dangers of async_commit, and 
the potential problems with disabling full_page_writes might be worth 
mentioning on this page, unless you want to leave that buried in the 
attached references.


--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.





Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Tom Lane
Gael Le Mignot g...@pilotsystems.net writes:
 The problem is when we index objects into the full-text search part of
 the database (which is a DELETE and then an INSERT into a specific table):
 the INSERT sometimes takes a long time (from 10s to 20s), but the same
 insert (and many other similar ones) is usually fast (below 0.2s).

 This  slowness  comes  regularly,   about  every  200  objects  indexed,
 regardless  of the frequency  of the  inserts.

Hm.  You didn't say which PG version you're using, but if it's >= 8.4,
I think this may be caused by GIN's habit of queuing index insertions
until it's accumulated a reasonable-size batch:
http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE

While you can turn that off, I think that doing so will reduce the
index's search efficiency over time.  It might be better to schedule
regular vacuums on the table so that the work is done by vacuum rather
than foreground queries.
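
For reference, the batching described above is controlled by the index's
fastupdate storage parameter (a sketch; the index and table names are the
ones from the original post):

-- trade insert throughput for predictable latency on one GIN index:
ALTER INDEX sesql_default_fulltext_index SET (fastupdate = off);

-- or leave it on and flush the pending list in the background instead:
VACUUM ANALYZE sesql_default;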

regards, tom lane



Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Leonardo Francalanci

   does it use a BBU?)


Sorry, this was supposed to read "do you have cache on the controller?"; of
course a battery can't change the performance... but you got it anyway...






Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Leonardo Francalanci
 2010-10-21 16:39:15 CEST  LOG:  checkpoint complete: wrote 365 buffers
 (11.9%); 0 transaction log  file(s) added, 0 removed, 3 recycled;
 write=0.403 s, sync=21.312 s,  total=21.829 s


I'm no expert, but isn't 21s to sync 365 buffers a long time?







Re: [PERFORM] New wiki page on write reliability

2010-10-21 Thread Greg Smith

Brad Nicholson wrote:
I think a warning about turning fsync off, the dangers of async_commit, and 
the potential problems with disabling full_page_writes might be worth 
mentioning on this page, unless you want to leave that buried in the 
attached references.


Good idea to highlight that.  What I just did here was point out which 
of the references covered that specific topic, which is as good as I 
could do for now. 

It's hard for me to justify spending time writing more about those when 
they are covered in the attached references, and I personally can't do 
it because of my publishing agreement.  The fact that the information 
about this topic is what ended up being released as the sample material 
from my book is no coincidence--I wanted to be able to share what I'd 
done here as a free community resource because this topic is so 
important to me.  But I can't go much further than what I've already put 
up there myself.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Cédric Villemain
2010/10/21 Leonardo Francalanci m_li...@yahoo.it:
 2010-10-21 16:39:15 CEST  LOG:  checkpoint complete: wrote 365 buffers
 (11.9%); 0 transaction log  file(s) added, 0 removed, 3 recycled;
 write=0.403 s, sync=21.312 s,  total=21.829 s


 I'm no expert, but isn't 21s to sync 365 buffers a big amount of time?

It is.

I suggest looking at the dirty-buffer figures in /proc/meminfo and at the
output of 'iostat -x 2' running for a few moments.
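
For example (a sketch; Dirty and Writeback are the relevant /proc/meminfo
fields on Linux):

watch -n 2 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
iostat -x 2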











-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Greg Smith

Gael Le Mignot wrote:

The delay is the delay of the sync part of
the checkpoints:

2010-10-21 16:39:15 CEST LOG:  checkpoint complete: wrote 365 buffers
(11.9%); 0 transaction log file(s) added, 0 removed, 3 recycled;
write=0.403 s, sync=21.312 s, total=21.829 s

Maybe there is something I misunderstood, but aren't the checkpoints
supposed to run smoothly over the checkpoint_completion_target interval?


Well, first off you have to get the checkpoints spaced out in time 
enough for that to work.  Both checkpoint_segments and possibly 
checkpoint_timeout may also need to be increased in order for the 
checkpoint write spreading code to work.  When I start seeing long sync 
times, I'll usually shoot for 64 segments and 10 minutes for the 
timeout to give that smoothing work some room to do what it's supposed to.
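
In postgresql.conf terms, that starting point looks roughly like this (a
sketch, not tuned values; adjust to the actual write volume):

checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9   # spread writes over most of the interval
log_checkpoints = on                 # keep watching those sync times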


However, only the main checkpoint writes are spread over time.  The hope 
is that by the time the sync phase starts, the operating system will 
have already written most of them out.  Sometimes, particularly in 
servers with lots of RAM for caching writes, this doesn't happen.  In 
that case, you can have gigabytes of data queued up at the beginning of 
the sync phase--which is not spread out at all.


We are currently working on a spread sync feature for PostgreSQL that 
makes this problem better on platforms/filesystems it's possible to 
improve behavior on (you can't do anything about this issue on ext3 for 
example).  I'll be talking about that development at the PgWest 
conference in two weeks:  
https://www.postgresqlconference.org/content/righting-your-writes and 
hope to submit a patch with a first version of this feature to the 
November development CommitFest, in hopes of it making it into version 9.1.


If you can't come up with any solution here and need help with your 
current version sooner than that, we've already backported this 
improvement all the way to V8.3; drop me an off-list note if you want to 
discuss consulting services in this area we have available.  If you're 
lucky, just adjusting the segment and timeout values may be enough for you.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, PostgreSQL 9.0 High Performance
http://www.2ndquadrant.com/books





Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Gael Le Mignot
Hello Tom!

Thu, 21 Oct 2010 10:55:48 -0400, you wrote: 

  Gael Le Mignot g...@pilotsystems.net writes:
  The problem is when we index objects into the full-text search part of
  the database (which is a DELETE and then an INSERT into a specific table):
  the INSERT sometimes takes a long time (from 10s to 20s), but the same
  insert (and many other similar ones) is usually fast (below 0.2s).

  This  slowness  comes  regularly,   about  every  200  objects  indexed,
  regardless  of the frequency  of the  inserts.

  Hm.  You didn't say which PG version you're using, but if it's >= 8.4,
  I think this may be caused by GIN's habit of queuing index insertions
  until it's accumulated a reasonable-size batch:
  http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE

  While you can turn that off, I think that doing so will reduce the
  index's search efficiency over time.  It might be better to schedule
  regular vacuums on the table so that the work is done by vacuum rather
  than foreground queries.

Thanks for your feedback.

It seems to be related: at least, if I increase the work_mem variable,
the slowness becomes bigger (up to 1 minute for a work_mem of 8MB) but
much less frequent (around 0.05% instead of 0.5% of the requests for 8MB
instead of 1MB).

So a big work_mem and a regular vacuum would do the trick, I think. Does
autovacuum trigger the GIN index vacuuming too, or does it require a
manual vacuum?
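
Something like this from the session doing the indexing, plus a periodic
vacuum (a sketch; the 8MB figure is just the value from the test above):

-- raise the GIN pending-list budget only for the indexing session:
SET work_mem = '8MB';
-- ... DELETE + INSERT into the sesql_* table here ...
-- flush the pending list outside the foreground path:
VACUUM sesql_default;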

Regards, 

-- 
Gaël Le Mignot - g...@pilotsystems.net
Pilot Systems - 9, rue Desargues - 75011 Paris
Tel : +33 1 44 53 05 55 - www.pilotsystems.net
Gérez vos contacts et vos newsletters : www.cockpit-mailing.com



Re: [PERFORM] New wiki page on write reliability

2010-10-21 Thread Bruce Momjian
Greg Smith wrote:
 Now that some of my recent writing has gone from NDA protected to public 
 sample, I've added a new page to the PostgreSQL wiki that provides a 
 good starting set of resources to learn about an ever popular topic 
 here, how write cache problems can lead to database corruption:  
 http://wiki.postgresql.org/wiki/Reliable_Writes
 
 Bruce also has a presentation he's been working on that adds pictures 
 showing the flow of data through the various cache levels, to help 
 people visualize the whole thing, that should get added into there once 
 he's finished tweaking it.

My presentation is done and is now on the wiki too:

http://momjian.us/main/writings/pgsql/hw_selection.pdf

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Mladen Gogala

Gael Le Mignot wrote:

Hello,

We are  using PostgreSQL for  storing data and full-text  search indexes
for the website of a daily newspaper. We are very happy overall with the
results, but we have one weird behaviour that we would like to solve.

The problem is when we index objects into the full-text search part of
the database (which is a DELETE and then an INSERT into a specific table):
the INSERT sometimes takes a long time (from 10s to 20s), but the same
insert (and many other similar ones) is usually fast (below 0.2s).
  

Have you tried with strace, just to see where the time is spent?
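
For example, grab the backend PID from the inserting session with
SELECT pg_backend_pid(); and then attach to it (a sketch; 12345 is a
placeholder PID, and -T prints the time spent in each syscall):

strace -T -p 12345 -e trace=write,fsync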

--

Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com 
The Leader in Integrated Media Intelligence Solutions







Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Steve Crawford

On 10/20/2010 09:45 PM, Scott Marlowe wrote:

On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote:
   

On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
 

Ben Chobot wrote:
   

On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:

 

I'm weighing options for a new server. In addition to PostgreSQL, this machine 
will handle some modest Samba and Rsync load.

I will have enough RAM so that virtually all disk-read activity will be cached. 
The average PostgreSQL read activity will be modest - a mix of single-record 
and fairly large (reporting) result-sets. Writes will be modest as well but 
will come in brief (1-5 second) bursts of individual inserts. The rate of 
insert requests will hit 100-200/second for those brief bursts.

So...

Am I likely to be better off putting $$$ toward battery-backup on the RAID or 
toward adding a second RAID-set and splitting off the WAL traffic? Or something 
else?
   

A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
Dedicated WAL spindles are nice and all, but they're still spinning
media. Raid card cache is way faster, and while it's best at bursty
writes, it sounds like bursty writes are precisely what you have.
 

Totally agree!
   

BBU first, more spindles second.
 

Agreed.  Note that while you can get incredible burst performance from
a battery backed cache, due to both caching and writing out of order,
once the throughput begins to saturate at the speed of the disk array,
the BBU cache is now only re-ordering really, as it will eventually
fill up faster than the disks can take the writes, and you'll settle
in at some percentage of your max tps you get for a short benchmark
run.  It's vitally important that once you put a BBU cache in place,
you run a very long running transactional test (pgbench is a simple
one to start with) that floods the io subsystem so you see what your
average throughput is with the WAL and data store getting flooded.  I
know on my system pgbench runs of a few minutes can be 3 or 4 times
faster than runs that last for the better part of an hour.
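
Concretely, that kind of flood test is something along these lines (a
sketch; scale factor and client count are placeholders to size against
the hardware):

createdb bench
pgbench -i -s 100 bench      # initialize a test database (roughly 1.5GB at scale 100)
pgbench -c 8 -T 300 bench    # short run: mostly measures the write cache
pgbench -c 8 -T 3600 bench   # hour-long run: shows the sustained throughput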

   
Thanks for all the replies. This is what I suspected but since I can't 
just buy one of everything to try, I wanted a sanity-check before 
spending the $$$.


I am not too worried about saturating the controller cache as the 
current much lower spec machine can handle the sustained load just fine 
and the bursts are typically only 1-3 seconds long spaced a minute or 
more apart.


Cheers,
Steve






Re: [PERFORM] Periodically slow inserts

2010-10-21 Thread Tom Lane
Gael Le Mignot g...@pilotsystems.net writes:
 Thu, 21 Oct 2010 10:55:48 -0400, you wrote: 
 I think this may be caused by GIN's habit of queuing index insertions
 until it's accumulated a reasonable-size batch:
 http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE

 So a big work_mem and a regular vacuum would do the trick, I think. Does
 autovacuum trigger the GIN index vacuuming too, or does it require a
 manual vacuum?

Autovacuum will handle it too.

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Scott Marlowe wrote:
 On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com 
 wrote:
  On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
  Ben Chobot wrote:
   On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
  
I'm weighing options for a new server. In addition to PostgreSQL, this 
machine will handle some modest Samba and Rsync load.
   
I will have enough RAM so that virtually all disk-read activity will be 
cached. The average PostgreSQL read activity will be modest - a mix of 
single-record and fairly large (reporting) result-sets. Writes will be 
modest as well but will come in brief (1-5 second) bursts of 
individual inserts. The rate of insert requests will hit 
100-200/second for those brief bursts.
   
So...
   
Am I likely to be better off putting $$$ toward battery-backup on the 
RAID or toward adding a second RAID-set and splitting off the WAL 
traffic? Or something else?
  
   A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
   Dedicated WAL spindles are nice and all, but they're still spinning
   media. Raid card cache is way faster, and while it's best at bursty
   writes, it sounds like bursty writes are precisely what you have.
 
  Totally agree!
 
  BBU first, more spindles second.
 
 Agreed.  Note that while you can get incredible burst performance from
 a battery backed cache, due to both caching and writing out of order,
 once the throughput begins to saturate at the speed of the disk array,
 the bbu cache is now only re-ordering really, as it will eventually
 fill up faster than the disks can take the writes, and you'll settle
 in at some percentage of your max tps you get for a short benchmark
 run.  It's vitally important that once you put a BBU cache in place,
 you run a very long running transactional test (pgbench is a simple
 one to start with) that floods the io subsystem so you see what your
 average throughput is with the WAL and data store getting flooded.  I
 know on my system pgbench runs of a few minutes can be 3 or 4 times
 faster than runs that last for the better part of an hour.

With a BBU you can turn off full_page_writes, which should decrease the
WAL traffic.

However, I don't see this mentioned in our documentation.  Should I add
it?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 With a BBU you can turn off full_page_writes
 
My understanding is that that is not without risk.  What happens if
the WAL is written, there is a commit, but the data page has not yet
been written to the controller?  Don't we still have a torn page?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  With a BBU you can turn off full_page_writes
  
 My understanding is that that is not without risk.  What happens if
 the WAL is written, there is a commit, but the data page has not yet
 been written to the controller?  Don't we still have a torn page?

I don't see how full_page_writes affects non-written pages to the
controller.

full_page_writes is designed to guard against a partial write to a
device.  I don't think the raid cache can be partially written to, and
the cache will not be cleared until the drive has fully written the data
to disk.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Bruce Momjian wrote:

With a BBU you can turn off full_page_writes, which should decrease the
WAL traffic.

However, I don't see this mentioned in our documentation.  Should I add
it?
  


What I would like to do is beef up the documentation with some concrete 
examples of how to figure out if your cache and associated write path 
are working reliably or not.  It should be possible to include "does 
this handle full page writes correctly?" in that test suite.  Until we 
have something like that, I'm concerned that bugs in filesystem or 
controller handling may make full_page_writes unsafe even with a BBU, 
and we'd have no way for people to tell if that's true or not.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 full_page_writes is designed to guard against a partial write to a
 device.  I don't think the raid cache can be partially written to
 
So you're confident that an 8kB write to the controller will not be
done as a series of smaller atomic writes by the OS file system?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

Bruce Momjian br...@momjian.us wrote:
 
  

full_page_writes is designed to guard against a partial write to a
device.  I don't think the raid cache can be partially written to

 
So you're confident that an 8kB write to the controller will not be
done as a series of smaller atomic writes by the OS file system?


Sure, that happens.  But if the BBU has gotten an fsync call after the 
8K write, it shouldn't return success until after all 8K are in its 
cache.  That's why full_page_writes should be safe on a system with a BBU, 
as Bruce is suggesting.  But I'd like to see some independent proof of 
that fact, including some targeted tests users can run, before we 
start recommending that practice.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [PERFORM] Slow count(*) again...

2010-10-21 Thread Jesper Krogh

On 2010-10-21 06:47, Scott Carey wrote:

 On a wimpy disk, I/O bound for sure.   But my disks go 1000MB/sec.
 No query can go fast enough for them.  The best I've gotten is
 800MB/sec, on a wide row (average 800 bytes).  Most tables go
 300MB/sec or so.  And with 72GB of RAM, many scans are in-memory
 anyway.


Is it CPU or IO bound while doing it?

Can you scan it faster using "time cat relation-oid.* > /dev/null"?
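
(Here relation-oid refers to the table's data files under the data
directory; a sketch for locating them, with a hypothetical table name:)

-- 9.0+: path of the table's main file, relative to the data directory
SELECT pg_relation_filepath('some_table');
-- older versions: combine SHOW data_directory; with the relfilenode
SELECT relfilenode FROM pg_class WHERE relname = 'some_table';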


 A single SSD with supercapacitor will go about 500MB/sec by itself
 next spring.   I will easily be able to build a system with 2GB/sec
 I/O for under $10k.


What filesystem are you using? Readahead?
Can you try to check the filesystem fragmentation of the table using filefrag?

--
Jesper






Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote: 
 Kevin Grittner wrote:
 
 So you're confident that an 8kB write to the controller will not
 be done as a series of smaller atomic writes by the OS file
 system?
 
 Sure, that happens.  But if the BBU has gotten an fsync call after
 the 8K write, it shouldn't return success until after all 8K are
 in its cache.
 
I'm not concerned about an fsync after the controller has it; I'm
concerned about a system crash in the middle of writing an 8K page
to the controller.  Other than the expected *size* of the window of
time during which you're vulnerable, what does a BBU caching
controller buy you in this regard?  Can't the OS rearrange the
writes of disk sectors after the 8K page is written to the OS cache
so that the window might occasionally be rather large?
 
-Kevin



Re: [PERFORM] How does PG know if data is in memory?

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 gnuo...@rcn.com wrote:
  
  -- None of that is anything for amateurs to play with.
  
  Not to jam a stick in anybody's eye, but shouldn't database pros not
  be amateurs?
  
 While many PostgreSQL installations are managed by professional
 DBAs, or programmers or consultants with a deep enough grasp of the
 issues to tune a knob like that appropriately, PostgreSQL is also
 used in environments without such staff.  In fact, there is pressure
 to make PostgreSQL easier to configure for exactly that reason.  If
 we add more knobs which are this hard to tune correctly, we would
 risk inundation with complaints from people who tried to use it and
 made things worse.

Agreed.  Here is a blog entry that explains some of the tradeoffs of
adding knobs:

http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



[PERFORM] Experiences with running PostgreSQL on Blue Arc Network Storage

2010-10-21 Thread Tim Goodaire

Hello,

Does anyone have any experience with running PostgreSQL on Blue Arc's 
network storage products? In particular, we are interested in the 
Titan and Mercury series:


http://www.bluearc.com/data-storage-products/titan-3000-network-storage-system.shtml
http://www.bluearc.com/data-storage-products/mercury-network-storage-system.shtml

--
Tim Goodaire    416-673-4126    tgood...@ca.afilias.info
Database Team Lead, Afilias Canada Corp.




Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Greg Smith g...@2ndquadrant.com wrote: 
  Kevin Grittner wrote:
  
  So you're confident that an 8kB write to the controller will not
  be done as a series of smaller atomic writes by the OS file
  system?
  
  Sure, that happens.  But if the BBU has gotten an fsync call after
  the 8K write, it shouldn't return success until after all 8K are
  in its cache.
  
 I'm not concerned about an fsync after the controller has it; I'm
 concerned about a system crash in the middle of writing an 8K page
 to the controller.  Other than the expected *size* of the window of
 time during which you're vulnerable, what does a BBU caching
 controller buy you in this regard?  Can't the OS rearrange the
 writes of disk sectors after the 8K page is written to the OS cache
 so that the window might occasionally be rather large?

If the write to the controller fails, the page is not flushed and PG
does not continue: the fsync never happens, and hence PG stops.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 I assume we send a full 8k to the controller, and a failure during
 that write is not registered as a write.
 
On what do you base that assumption?  I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  I assume we send a full 8k to the controller, and a failure during
  that write is not registered as a write.
  
 On what do you base that assumption?  I assume that we send a full
 8K to the OS cache, and the file system writes disk sectors
 according to its own algorithm.  With either platters or BBU cache,
 the data is persisted on fsync; why do you see a risk with one but
 not the other?

Now that is an interesting question.  We write 8k to the kernel, but the
kernel doesn't have to honor those write sizes, so while we probably
can't get a partial 512-byte block written to disk with a BBU (that
isn't cleaned up by the BBU on reboot), we could get some 512-byte
blocks of an 8k page written and others not.

I agree you are right and a BBU does not mean you can safely turn off
full_page_writes.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] Index scan is not working, why??

2010-10-21 Thread Igor Neyman
 

 -Original Message-
 From: AI Rumman [mailto:rumman...@gmail.com] 
 Sent: Thursday, October 21, 2010 1:25 AM
 To: pgsql-performance@postgresql.org
 Subject: Index scan is not working, why??
 
 I don't know why a seq scan is running on the following query 
 where the same query is giving an index scan on other servers:
 explain analyze
 
 select *
 from act
 where act.acttype in ( 'Meeting','Call','Task');
   QUERY PLAN
 ------------------------------------------------------------------------------
 
  Seq Scan on act (cost=0.00..13386.78 rows=259671 width=142) 
 (actual time=0.013..484.572 rows=263639 loops=1)
   Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 
 'Call'::text) OR ((acttype)::text = 'Task'::text))
  Total runtime: 732.956 ms
 (3 rows)
 
 
 The above query is giving an index scan on other servers, and 
 even if I rewrite the query as follows I get an index scan:
 explain analyze
 
 select *
 from act
 where act.acttype = 'Meeting'
 or act.acttype = 'Call';
   QUERY PLAN
 ------------------------------------------------------------------------------
  Bitmap Heap Scan on act (cost=17.98..1083.80 rows=2277 
 width=142) (actual time=1.901..9.722 rows=4808 loops=1)
   Recheck Cond: (((acttype)::text = 'Meeting'::text) OR 
 ((acttype)::text = 'Call'::text))
   ->  BitmapOr  (cost=17.98..17.98 rows=2281 width=0) (actual 
 time=1.262..1.262 rows=0 loops=1)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 
 rows=1141 width=0) (actual time=0.790..0.790 rows=3181 loops=1)
               Index Cond: ((acttype)::text = 'Meeting'::text)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 
 rows=1141 width=0) (actual time=0.469..0.469 rows=1630 loops=1)
               Index Cond: ((acttype)::text = 'Call'::text)
  Total runtime: 14.227 ms
 (8 rows)
 
 

An index scan is not always preferable to a seq scan; it depends on the
selectivity of your query.
When retrieving a substantial portion of a big table, a seq scan is
usually faster, which is why the optimizer chooses it.

Your queries (and possibly the data sets in the tables on different servers)
are not the same.
Your first query (which uses a seq scan) returns 259671 rows, which is
probably a substantial part of the whole table.
Your second query (which uses an index scan) returns only 4808 rows, which
makes index access less costly in this case.
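
A quick way to see the fraction involved (a sketch against the act table
from the original post):

SELECT sum(CASE WHEN acttype IN ('Meeting','Call','Task') THEN 1 ELSE 0 END)::numeric
       / count(*) AS fraction_matched
FROM act;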

Regards,
Igor Neyman



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Andres Freund
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  I assume we send a full 8k to the controller, and a failure during
  that write is not registered as a write.
 
 On what do you base that assumption?  I assume that we send a full
 8K to the OS cache, and the file system writes disk sectors
 according to its own algorithm.  With either platters or BBU cache,
 the data is persisted on fsync; why do you see a risk with one but
 not the other?
At least on linux pages can certainly get written out in < 8kb batches if 
you're under memory pressure.

Andres





Re: [PERFORM] Slow count(*) again...

2010-10-21 Thread Scott Carey

On Oct 21, 2010, at 11:13 AM, Jesper Krogh wrote:

 On 2010-10-21 06:47, Scott Carey wrote:
  On a wimpy disk, I/O bound for
 sure.   But my disks go 1000MB/sec.
 
  No query can go fast enough for them.  The best I've gotten is
 
  800MB/sec, on a wide row (average 800 bytes).  Most tables go
 
  300MB/sec or so.  And with 72GB of RAM, many scans are in-memory
 
  anyway.
 
 
 Is it CPU or IO bound while doing it?
I/O bound with the fio benchmark tool if 16K blocks or greater, CPU bound with 
8K blocks or smaller.  CentOS 5.5.
CPU bound with postgres.   


 Can you scan it faster using "time cat relation-oid.* > /dev/null"?
  

I'm not sure what you mean.  In psql, select * piped to /dev/null is VERY CPU 
bound because of all the formatting.  I haven't toyed with COPY.  Do you mean 
the actual files?  'dd' tests from actual files are similar to fio, but not as 
consistent, and it's hard to add concurrency.  That is faster than postgres.
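
(For what it's worth, COPY avoids the per-row formatting; a sketch, with a
hypothetical table and database name:)

psql -c "COPY some_big_table TO STDOUT" somedb > /dev/null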

 
  A single SSD with supercapacitor will go about 500MB/sec by itself
 
  next spring.   I will easily be able to build a system with 2GB/sec
 
  I/O for under $10k.
 
 
 
  What filesystem are you using? Readahead?
 Can you try to check the filesystem fragmentation of the table using filefrag?
 
XFS, defragmented once a day.  Readahead 40960 (20MB, 1MB per spindle).  Two 
raid 10 arrays of 10 discs each (plus 2 hot spares), with software raid-0 tying 
those together (md, 1MB blocks).  Two Adaptec 5805 (or 5085, the external SAS 
one).  A third raid card for the OS/xlog with 4x10krpm sas drives internal.

Fragmentation quickly takes this down a lot, as do small files and concurrent 
activity, since it's only enough spindles for ~2000 iops.  But it's almost all 
large reporting queries on partitioned tables (500,000 partitions).   A few 
smaller tables are starting to cause too many seeks so those might end up on a 
smaller, high iops tablespace later.

Over time the disks have filled up and there is a significant slowdown in 
sequential transfer at the end of the partition -- 600MB/sec max.  That is 
still CPU bound on most scans, but postgres can go that fast on some scans.

Off topic:
Another interesting feature is how this setup causes the system tables to bloat 
by factors of 2x to 8x each week, and requires frequent vacuum full + reindex 
on several of them, or else they become 1.5GB in size.  Nothing like lots of temp 
table work + hour-long concurrent transactions to make the system catalogs 
bloat.  I suppose with 8.4 many temp tables could be replaced using WITH 
queries, but in other cases analyzing a temp table is the only way to get a 
sane query plan.


 -- 
 Jesper
 
 
 




Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other


I'd like a 10 minute argument please.  I started to write something to 
refute this, only to clarify in my head the sequence of events that 
leads to the most questionable result, where I feel a bit less certain 
than I did before of the safety here.  Here is the worst case I believe 
you're describing:


1) Transaction is written to the WAL and sync'd; client receives 
COMMIT.  Since full_page_writes is off, the data in the WAL consists 
only of the delta of what changed on the page.

2) 8K database page is written to OS cache
3) PG calls fsync to force the database block out
4) OS writes first 4K block of the change to the BBU write cache.  Worst 
case, this fills the cache, and it takes a moment for some random writes 
to process before it has space to buffer again (makes this more likely 
to happen, but it's not required to see the failure case here)

5) Sudden power interruption, second half of the page write is lost
6) Server restarts
7) That 4K write is now replayed from the battery's cache

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
data.  Without a full page write in the WAL, is it always possible to 
restore its original state now?  In theory, I think you can.  Since the 
delta in the WAL should be overwriting all of the bytes that changed 
between the old and new version of the page, applying it on top of any 
of the four possible states here:


1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

Should lead to the same result: an 8K page that includes the change that 
was in the WAL but not onto disk at the point when the crash happened.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
 data.

Right.

 Without a full page write in the WAL, is it always possible to 
 restore its original state now?  In theory, I think you do.  Since the 
 delta in the WAL should be overwriting all of the bytes that changed 
 between the old and new version of the page, applying it on top of any 
 four possible states here:

You've got entirely too simplistic a view of what the delta might be,
I fear.  In particular there are various sorts of changes that involve
inserting the data carried in the WAL record and shifting pre-existing
data around to make room, or removing an item and moving remaining data
around.  If you try to replay that type of action against a torn page,
you'll get corrupted results.

We could possibly redefine the WAL records so that they weren't just the
minimum amount of data but carried every byte that'd changed on the
page, and then I think what you're envisioning would work.  But the
records would be a lot bulkier.  It's not clear to me that this would be
a net savings over the current design, particularly not if there's
a long interval between checkpoints.

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other


Forgot to address this part.  The troublesome sequence if you don't have 
a BBU is:


1) WAL data is written to the OS cache
2) PG calls fsync
3) Data is transferred into the drive's volatile, non battery-backed cache
4) Drive lies about data being on disk, says fsync is done
5) That 8K data page is written out to the OS cache, also with fsync, 
then onto the drive.  It says it has that too.
6) Due to current disk head location, 4KB of the data page gets written 
out before it gets to the WAL data

7) System crashes

Now you're dead.  You've just torn a data page, but not written any of 
the data to the WAL necessary to reconstruct any valid version of that page.


I think Kevin's point here may be that if your fsync isn't reliable, 
you're always in trouble.  But if your fsync is good, even torn pages 
should be repairable by the deltas written to the WAL, as I described in 
the message I just sent before this one.  That's true regardless of 
whether you achieved non-lying fsync with a BBU or by turning a 
drive's write cache off.  There's nothing really special about the BBU 
beyond it being the most common form of reliable write cache that 
works.  You get the same properties at a slower rate with a drive that's 
configured to never lie about writes.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

