Re: [HACKERS] Patch to .gitignore

2013-05-24 Thread amul sul
 >"ctags" and "etags" be part of postgres source tree and its generate some 
output inside them, so I think we must ignore it. 

+1


Regards,
Amul Sul




Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-24 Thread Mathieu Guerin
Hello,

Thanks a lot for your answers.

> You should get it...
> stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.

I don't know why I don't get it. I am on version 9.1...

Moreover, when I mount pg_stat_tmp on a tmpfs, the warning messages
decrease from one per minute to one every five seconds. I don't have any
other warnings in the logs, but the file pg_stat.stat in the mount point
is not created... I tried it before on a test environment and it
worked...

If you have any ideas...
Thanks a lot.
Regards,
Math



2013/5/24 Michael Paquier 

>
>
>
> On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin <
> mathieu.gueri...@gmail.com> wrote:
>
>> What are the consequences? Because this file will be removed if the
>> server reboots.
>>
> Those temporary statistics are stored in the global directory when the
> server shuts down, so the risk here would be to lose a portion of this
> data in the case of a crash, either at the PG or at the OS level.
>
>
>> If we change the parameter stats_temp_directory, is it necessary to
>> reboot the server?
>>
> No, sending SIGHUP to the server is enough.
>
>
>> When I launch a SHOW ALL; command, the parameter stats_temp_directory is
>> not there.
>>
> You should get it...
> stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.
>  --
> Michael
>


Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-24 Thread Mathieu Guerin
About the stats_temp_directory, I didn't run it as root...
Now I'm sure the configurations are correct.

I think I have too much I/O to use stats. I will always get this message...
Maybe I can disable this option.
Do you know what it really impacts?

Thanks.
Math




2013/5/24 Mathieu Guerin 

> Hello,
>
> Thanks a lot for your answers.
>
>
> > You should get it...
> > stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.
>
> I don't know why I don't get it. I am on version 9.1...
>
> Moreover, when I mount pg_stat_tmp on a tmpfs, the warning messages
> decrease from one per minute to one every five seconds. I don't have any
> other warnings in the logs, but the file pg_stat.stat in the mount point
> is not created... I tried it before on a test environment and it
> worked...
>
> If you have any ideas...
> Thanks a lot.
> Regards,
> Math
>
>
>
> 2013/5/24 Michael Paquier 
>
>>
>>
>>
>> On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin <
>> mathieu.gueri...@gmail.com> wrote:
>>
>>> What are the consequences? Because this file will be removed if the
>>> server reboots.
>>>
>> Those temporary statistics are stored in the global directory when the
>> server shuts down, so the risk here would be to lose a portion of this
>> data in the case of a crash, either at the PG or at the OS level.
>>
>>
>>> If we change the parameter stats_temp_directory, is it necessary to
>>> reboot the server?
>>>
>> No, sending SIGHUP to the server is enough.
>>
>>
>>> When I launch a SHOW ALL; command, the parameter stats_temp_directory is
>>> not there.
>>>
>> You should get it...
>> stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.
>>  --
>> Michael
>>
>
>


Re: [HACKERS] Removal of pageinspect--1.0.sql

2013-05-24 Thread Heikki Linnakangas

On 20.05.2013 19:50, Michael Paquier wrote:

The contrib module pageinspect has been upgraded to 1.1, but
pageinspect--1.0.sql is still present in the source code. Shouldn't it be
removed? Please find the patch attached.


Yep. Removed, thanks.

- Heikki




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
Hi Sergey,

Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for the
migration (and it is working fine). The streaming replication is for
hot-standby replication *once migrated*. The thing is, I disable archiving
and set wal_level to minimal when migrating the large portion of data, to
make it faster. Then I switch to wal_level=hot_standby, i.e. the
"production" configuration, and the WAL segment sequence seems to overlap
with the segments generated under the other setting.


On Thu, May 23, 2013 at 7:44 PM, Sergey Konoplev  wrote:

> On Thu, May 23, 2013 at 6:18 AM, German Becker 
> wrote:
> > Let me describe the process I follow to get to this. What I am doing is
> > testing a migration from 8.3 to 9.1. The way I plan to do it is the
> > following.
> > 1) Create the schema
> > 2) Import the biggest tables, which are not updated, only growing, with
> > COPY (this is about 35 GB of data)
> > 3) Import the small, changing part of the data
> >
> > The target system is 9.1 with streaming replication.
> > For these steps, I use a "restore" configuration which, among other
> > things like more work_mem, sets archive_mode=off and wal_level=minimal
> > (attached is the difference between restore and normal).
> > The archive_command is just a cp wrapped in a shell script in case I
> > need to change it.
>
> You cannot migrate between major versions with WAL-based or
> streaming replication.
>
> Use either full dump/restore or schema only dump/restore plus trigger
> based replication (londiste, slony) to migrate data.
>
> --
> Kind regards,
> Sergey Konoplev
> PostgreSQL Consultant and DBA
>
> Profile: http://www.linkedin.com/in/grayhemp
> Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
> Skype: gray-hemp
> Jabber: gray...@gmail.com
>


Re: [HACKERS] Patch to .gitignore

2013-05-24 Thread Tom Lane
Fabrízio de Royes Mello  writes:
> On Fri, May 24, 2013 at 12:04 AM, Christopher Browne 
> wrote:
>> There hasn't been general agreement on the merits of particular .gitignore
>> rules of this sort.

> I agree with you about vim-oriented patterns, because it's a particular
> tool, but "ctags" and "etags" are part of the postgres source tree and
> generate some output inside it, so I think we must ignore it.

[ shrug... ]  Editor backup files get generated inside the source tree
as well.  Chris stated the policy accurately: if you use tools that
leave unexpected files in the source tree, it's up to you to have a
personal .gitignore for those, assuming you want them ignored.  The
project's .gitignore files are only supposed to list files that get
generated by the standard build processes.

FWIW, my personal .gitexclude file looks like

*~
*.orig

where the latter one is for junk generated by "patch".  I believe
patch's extension for saved files varies across versions, so it
wouldn't be too sensible to have an exclusion like that in the
project-wide file.  Note also that I intentionally *don't* have an
exclusion for *.rej --- if any patch hunk failed, I want git to
mention it.  But that's a matter of personal preference.  I rather
imagine that other people configure it differently, and that's fine.
As long as we don't try to put such things in the project-wide
exclusion list, we don't have to have a consensus about it.

regards, tom lane




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
> Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for
> the migration (and it is working fine). The streaming replication is for
> hot-standby replication *once migrated*. The thing is, I disable archiving
> and set wal_level to minimal when migrating the large portion of data, to
> make it faster. Then I switch to wal_level=hot_standby, i.e. the
> "production" configuration, and the WAL segment sequence seems to overlap
> with the segments generated under the other setting.
>

Though, now you understand it's not what it looks like, right? :-)


--
Amit Langote




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Robert Haas
On Thu, May 23, 2013 at 7:27 PM, Greg Smith  wrote:
> I'm working on a new project here that I wanted to announce, just to keep
> from duplicating effort in this area.  I've started to add a cost limit
> delay for regular statements.  The idea is that you set a new
> statement_cost_delay setting before running something, and it will restrict
> total resources the same way autovacuum does.  I'll be happy with it when
> it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.

Cool.  We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.

But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <...> and read_rate_limit = <...>.  This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
"blended" measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.

> Modifying the buffer manager to account for statement-based cost
> accumulation isn't difficult.  The tricky part here is finding the right
> spot to put the delay at.  In the vacuum case, it's easy to insert a call to
> check for a delay after every block of I/O.  It should be possible to find a
> single or small number of spots to put a delay check in the executor.  But I
> expect that every utility command may need to be modified individually to
> find a useful delay point.  This is starting to remind me of the SEPostgres
> refactoring, because all of the per-command uniqueness ends up requiring a
> lot of work to modify in a unified way.

I haven't looked at this in detail, but I would hope it's not that
bad.  For one thing, many DDL commands don't do any significant I/O in
the first place and so can probably be disregarded.  Those that do are
mostly things that rewrite the table and things that build indexes.  I
doubt there are more than 3 or 4 code paths to patch.
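
For anyone following along, a per-statement delay point would presumably
look a lot like vacuum_delay_point(); a minimal sketch, assuming
hypothetical statement_cost_* GUCs and a per-statement cost balance (none
of these names exist today):

/*
 * Hypothetical sketch only, modeled on vacuum_delay_point() in
 * src/backend/commands/vacuum.c.
 */
extern int	statement_cost_delay;	/* proposed GUC, in milliseconds */
extern int	statement_cost_limit;	/* proposed GUC, in cost units */
extern int	StatementCostBalance;	/* accumulated by the buffer manager */

static void
statement_delay_point(void)
{
	if (statement_cost_delay > 0 &&
		StatementCostBalance >= statement_cost_limit)
	{
		int		msec;

		/* Sleep longer the further we overshot the limit, but cap it. */
		msec = statement_cost_delay * StatementCostBalance / statement_cost_limit;
		if (msec > statement_cost_delay * 4)
			msec = statement_cost_delay * 4;

		pg_usleep(msec * 1000L);
		StatementCostBalance = 0;

		/* Allow query cancel while napping. */
		CHECK_FOR_INTERRUPTS();
	}
}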

> The main unintended consequences issue I've found so far is when a cost
> delayed statement holds a heavy lock.  Autovacuum has some protection
> against letting processes with an exclusive lock on a table go to sleep.  It
> won't be easy to do that with arbitrary statements.  There's a certain
> amount of allowing the user to shoot themselves in the foot here that will
> be time consuming (if not impossible) to eliminate.  The person who runs an
> exclusive CLUSTER that's limited by statement_cost_delay may suffer from
> holding the lock too long.  But that might be their intention with setting
> the value.  Hard to idiot proof this without eliminating useful options too.

Well, we *could* have a system where, if someone blocks waiting for a
lock held by a rate-limited process, the rate limits are raised or
abolished.  But I'm pretty sure that's a bad idea.  I think that the
people who want rate limits want them because allowing too much write
(or maybe read?) activity hoses the performance of the entire system,
and that's not going to be any less true if there are multiple jobs
piling up.  Let's say someone has a giant COPY into a huge table, and
CLUSTER blocks behind it, waiting for AccessExclusiveLock.  Well...
making the COPY run faster so that we can hurry up and start
CLUSTER-ing seems pretty clearly wrong.  We want the COPY to run
slower, and we want the CLUSTER to run slower, too.  If we don't want
that, then, as you say, we shouldn't set the GUC in the first place.

Long story short, I'm inclined to define this as expected behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] background processes vs. hot standby

2013-05-24 Thread Robert Haas
CheckRequiredParameterValues() has some code that, when hot standby is
in use, checks the values of max_connections,
max_prepared_transactions, and max_locks_per_transaction against the
master.   The comment says "we must have at least as many backend
slots as the primary" ... but the code no longer enforces that,
because we now compute MaxBackends like this:

MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
GetNumShmemAttachedBgworkers();

If GetNumShmemAttachedBgworkers() returns a lower value on the standby
than it did on the master, then we might well have fewer backend slots
on the standby.  I'm having trouble remembering why it's a problem to
have fewer backend slots on the standby than the master, but if we
need to prevent that then this code is no longer adequate to the task.

The comment doesn't explain why we check max_locks_per_transaction.  I
thought the reason for that check was that we needed to ensure that
there were at least as many lock table slots on the standby as there
were on the master, to prevent bad things from happening later.  That
was already not true, since the existing code didn't enforce any
limitation on autovacuum_max_workers on the standby side.  Maybe that
doesn't matter, since autovacuum workers can't run in hot standby
mode; not sure.  But the addition of background workers to MaxBackends
provides another way for that to be not true.  Here's how we compute
the size of the lock table:

#define NLOCKENTS() \
mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
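
To put rough numbers on that (example values only): with
max_locks_per_transaction = 64, max_connections = 100, three autovacuum
workers plus the launcher, and no prepared transactions or bgworkers,

	NLOCKENTS() = 64 * (104 + 0) = 6656

shared lock table slots; a standby that computes a smaller MaxBackends
silently gets a smaller table than its master.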

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Andres Freund
Hi,

On 2013-05-24 09:48:03 -0400, Robert Haas wrote:
> CheckRequiredParameterValues() has some code that, when hot standby is
> in use, checks the values of max_connections,
> max_prepared_transactions, and max_locks_per_transaction against the
> master.   The comment says "we must have at least as many backend
> slots as the primary" ... but the code no longer enforces that,
> because we now compute MaxBackends like this:
> 
> MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
> GetNumShmemAttachedBgworkers();
> 
> If GetNumShmemAttachedBgworkers() returns a lower value on the standby
> than it did on the master, then we might well have fewer backend slots
> on the standby.  I'm having trouble remembering why it's a problem to
> have fewer backend slots on the standby than the master, but if we
> need to prevent that then this code is no longer adequate to the task.

It's AFAIR important because we need to allocate shared memory which can
keep track of the maximum number of xids (toplevel *
max_non_suboverflowed_subxids) in progress. That's the
KnownAssignedXids* stuff in procarray.c.
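
For reference, the sizing there is derived from MaxBackends, roughly like
this (abridged from the 9.x sources of procarray.c):

/*
 * Abridged from src/backend/storage/ipc/procarray.c: the standby's
 * KnownAssignedXids array must have room for every top-level xid plus
 * every non-suboverflowed subxid the master can have in progress.
 */
#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
#define TOTAL_MAX_CACHED_SUBXIDS \
	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)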

> The comment doesn't explain why we check max_locks_per_transaction.  I
> thought the reason for that check was that we needed to ensure that
> there were at least as many lock table slots on the standby as there
> were on the master, to prevent bad things from happening later.  That
> was already not true, since the existing code didn't enforce any
> limitation on autovacuum_max_workers on the standby side.  Maybe that
> doesn't matter, since autovacuum workers can't run in hot standby
> mode; not sure.  But the addition of background workers to MaxBackends
> provides another way for that to be not true.  Here's how we compute
> the size of the lock table:

Yea, we need it exactly for that reason. I think it's unlikely to cause
actual problems since we only ship access exclusive locks to the standby
and it's hard to see scenarios where we have that many AELs on the
primary. But we probably should fix it anyway.

I think fixing the autovacuum_max_workers case is entirely reasonable
and relatively unlikely to cause problems. I don't think we can easily
do it in a minor release though since I don't see a way to transport
knowledge about it via the WAL without either breaking the WAL format
entirely or changing the meaning of MaxConnections in ControlFile, which
would cause problems with upgrading the primary first.

I am less excited about doing something similar for the background
worker case. Requiring just as many background workers on the standby
sounds like a bad idea to me; there seem to be too many cases where that
doesn't make sense.
I wonder if we shouldn't make background workers use connection slots
from max_connections similar to how superuser_reserved_connections
works. That would mean we don't need to care about it for HS.

Greetings,

Andres Freund




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Jim Nasby

On 5/24/13 8:21 AM, Robert Haas wrote:

On Thu, May 23, 2013 at 7:27 PM, Greg Smith  wrote:

>I'm working on a new project here that I wanted to announce, just to keep
>from duplicating effort in this area.  I've started to add a cost limit
>delay for regular statements.  The idea is that you set a new
>statement_cost_delay setting before running something, and it will restrict
>total resources the same way autovacuum does.  I'll be happy with it when
>it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.

Cool.  We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.

But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <...> and read_rate_limit = <...>.  This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
"blended" measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.


Doesn't that hit the old issue of not knowing if a read came from FS cache or 
disk? I realize that the current cost_delay mechanism suffers from that too, 
but since the API is lower level that restriction is much more apparent.

Instead of KB/s, could we look at how much time one process is spending waiting 
on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait 
time for every request, at least on the most popular OSes?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Jim Nasby

On 5/14/13 8:42 AM, Amit Kapila wrote:

In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0 &&
refcount = 0) buffers to the end of the freelist. I have implemented a new API,
StrategyMoveBufferToFreeListEnd(), to move buffers to the end of the freelist.



Instead of a separate function, would it be better to add an argument to 
StrategyFreeBuffer? ISTM this is similar to the other strategy stuff in the 
buffer manager, so perhaps it should mirror that...
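
Something like this strawman, say (the extra parameter is made up, not
existing code; today's StrategyFreeBuffer() takes only the buffer and
always pushes it onto the head of the freelist):

	void
	StrategyFreeBuffer(volatile BufferDesc *buf, bool at_tail);
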
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-23 22:09:02 -0400, Robert Haas wrote:
> On Thu, May 23, 2013 at 1:51 PM, Andres Freund  wrote:
> > So, what I propose instead is basically:
> > 1) only vacuum non-all-visible pages, even when doing it for
> >anti-wraparound
>
> Check.  We might want an option to force a scan of the whole relation.

Yea, thought of that as well. VACUUM (DEEP) ;).

> > 3) When we cannot mark a page all-visible or we cannot get the cleanup
> >lock, remember the oldest xmin on that page. We could set all visible
> >in the former case, but we want the page to be cleaned up sometime
> >soonish.

> I think you mean "in the latter case" not "in the former case".  If
> not, then I'm confused.

Uh. Yes.

> > We don't even necessarily need to log the hint bits for all items since
> > the redo for all_visible could make sure all items are hinted. The only
> > problem is knowing up to where we can truncate pg_clog...

> [all-visible cannot restore hint bits without FPI because of torn pages]

I haven't thought about this sufficiently yet. I think we might have
a chance of working around this, let me ponder a bit.

But even if that means needing a full-page write via the usual mechanism
for all-visible when any hint bits need to be set, we are still far
ahead of the current state imo.
* cleanup would quite possibly do an FPI shortly after in vacuum
  anyway. If we do it for all-visible, it possibly does not need to be
  done for the cleanup.
* freezing would almost certainly FPI since we do it so much later.
* Not having to rescan the whole heap will be a bigger cost saving...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Thom Brown
On 23 May 2013 10:03, Thom Brown  wrote:
> On 23 May 2013 07:10, Heikki Linnakangas  wrote:
>> Hi,
>>
>> I've been hacking on a tool to allow resynchronizing an old master server
>> after failover. The need to do a full backup/restore has been a common
>> complaint ever since we've had streaming replication. I saw on the wiki that
>> this was discussed in the dev meeting; too bad I couldn't make it.
>>
>> In a nutshell, the idea is to copy everything that has changed between
>> the clusters, like rsync does, but instead of reading through all files, use
>> the WAL to determine what has changed. Here's a somewhat more detailed
>> explanation, from the README:
>>
>> Theory of operation
>> ---
>>
>> The basic idea is to copy everything from the new cluster to old, except for
>> the blocks that we know to be the same.
>>
>> 1. Scan the WAL log of the old cluster, starting from the point where
>> the new cluster's timeline history forked off from the old cluster. For each
>> WAL record, make a note of the data blocks that are touched. This yields a
>> list of all the data blocks that were changed in the old cluster, after the
>> new cluster forked off.
>>
>> 2. Copy all those changed blocks from the new master to the old master.
>>
>> 3. Copy all other files like clog, conf files etc. from the new cluster
>> to old. Everything except the relation files.
>>
>> 4. Apply the WAL from the new master, starting from the checkpoint
>> created at failover. (pg_rewind doesn't actually apply the WAL, it just
>> creates a backup label file indicating that when PostgreSQL is started, it
>> will start replay from that checkpoint and apply all the required WAL)
>>
>>
>> Please take a look: https://github.com/vmware/pg_rewind
>
> 6 instances set up:
>
> [Primary (5530)]
> |
> ---[Standby 1 (5531)]
> |
> ---[Standby 2 (5532)]
> |
> ---[Standby 3 (5533)]
> |
> ---[Standby 4 (5534)]
> |
> ---[Standby 5 (5535)]
>
> 1) Created a table on the primary with some data.
> 2) Promoted Standby 1
> 3) Cleanly shut down Primary
> 4) pg_rewind --target-pgdata=/tmp/primary
> --source-server='host=localhost port=5531 dbname=postgres'
>
> Last common WAL position: 0/30227F8 on timeline 1
> Last common checkpoint at 0/30227F8 on timeline 1
> error reading xlog record: record with zero length at 0/3022860
> Done!
>
> Contents of pg_xlog directory in Primary and Standby 1:
> thom@swift /tmp $ ls -l primary/pg_xlog/
> total 49156
> -rw--- 1 thom users 16777216 May 23 09:52 00010002
> -rw--- 1 thom users 16777216 May 23 09:52 00010003
> -rw--- 1 thom users 16777216 May 23 09:52 00020003
> -rw--- 1 thom users   41 May 23 09:52 0002.history
> drwx-- 2 thom users   80 May 23 09:52 archive_status
> thom@swift /tmp $ ls -l standby1/pg_xlog/
> total 49156
> -rw--- 1 thom users 16777216 May 23 09:49 00010002
> -rw--- 1 thom users 16777216 May 23 09:50 00010003
> -rw--- 1 thom users 16777216 May 23 09:52 00020003
> -rw--- 1 thom users   41 May 23 09:50 0002.history
> drwx-- 2 thom users   80 May 23 09:50 archive_status
>
> 5) Changed recovery.done in primary to point its primary_conninfo port
> to 5531 (that of Standby 1).
> 6) Renamed it to .conf.
> 7) Changed postgresql.conf to set the port back to its original one
> (as pg_rewind has caused it to match that of Standby 1)
> 8) Start Primary
>
> Latest log in primary reads:
>
> LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 
> EDT
> LOG:  entering standby mode
> LOG:  invalid xl_info in checkpoint record
> FATAL:  could not locate required checkpoint record
> HINT:  If you are not restoring from a backup, try removing the file
> "/tmp/primary/backup_label".
> LOG:  startup process (PID 31503) exited with exit code 1
> LOG:  aborting startup due to startup process failure
>
> 9) Okay, so I'll delete that label and try again.  Now all is well:
>
> LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 
> EDT
> LOG:  entering standby mode
> LOG:  database system was not properly shut down; automatic recovery in 
> progress
> LOG:  redo starts at 0/3022828
> LOG:  record with zero length at 0/3041A60
> LOG:  consistent recovery state reached at 0/3041A60
> LOG:  database system is ready to accept read only connections
> LOG:  started streaming WAL from primary at 0/300 on timeline 2
>
> 10) Connect to Standby 1 and insert more rows into our original table.
> 11) Connect to Primary and those rows are appearing.

By the way, without any data inserted I get:

thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary
--source-server='host=localhost port=5531 dbname=postgres' -v
connected to remote server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/0002.hist

Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Andres Freund
On 2013-05-24 10:57:22 -0400, Thom Brown wrote:
> By the way, without any data inserted I get:
> 
> thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary
> --source-server='host=localhost port=5531 dbname=postgres' -v
> connected to remote server
> fetched file "global/pg_control", length 8192
> fetched file "pg_xlog/0002.history", length 41
> Last common WAL position: 0/300 on timeline 1
> pg_rewind: xlogreader.c:214: XLogReadRecord: Assertion `((RecPtr) %
> 8192 >= (((intptr_t) ((sizeof(XLogPageHeaderData))) + ((8) - 1)) &
> ~((intptr_t) ((8) - 1' failed.
> Aborted (core dumped)
> 
> And this magically goes away with -k on initdb.

That looks like a bug independent of -k. pg_rewind tries to read
the beginning of a page but xlogreader doesn't allow that atm. The
likely reason this doesn't happen with -k is that autovacuum probably hints
some pages, which are logged when checksums are enabled.

Should we make that assert
Assert((RecPtr % XLOG_BLCKSZ == 0) || XRecOffIsValid(RecPtr));
instead?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements

2013-05-24 Thread Fabrízio de Royes Mello
Hi all,

I am working on a patch to add support for "IF NOT EXISTS" to the "CREATE"
statements that do not have it yet.

I started with the "DefineStmt" section of "src/backend/parser/gram.y":
- CREATE AGGREGATE [ IF NOT EXISTS ] ...
- CREATE OPERATOR [ IF NOT EXISTS ] ...
- CREATE TYPE [ IF NOT EXISTS ] ... [AS [{ENUM | RANGE}] (...)]
- CREATE TEXT SEARCH {PARSER | DICTIONARY | TEMPLATE | CONFIGURATION} [ IF
NOT EXISTS ] ...
- CREATE COLLATION [ IF NOT EXISTS ] ...

My intention is to cover other CREATE statements too, not just the above.

If there are no objections to this implementation, I'll finish it and send
the patch soon.

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Jim Nasby

On 5/23/13 12:51 PM, Pavan Deolasee wrote:




On Thu, May 23, 2013 at 11:10 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:

On 23.05.2013 07:55, Robert Haas wrote:

On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:

1. Scan the WAL log of the old cluster, starting from the point 
where
the new cluster's timeline history forked off from the old cluster. 
For each
WAL record, make a note of the data blocks that are touched. This 
yields a
list of all the data blocks that were changed in the old cluster, 
after the
new cluster forked off.


Suppose that a transaction is open and has written tuples at the point
where WAL forks.  After WAL forks, the transaction commits.  Then, it
hints some of the tuples that it wrote.  There is no record in WAL
that those blocks are changed, but failing to revert them leads to
data corruption.


Bummer, you're right. Hmm, if you have checksums enabled, however, we'll 
WAL log a full-page every time a page is dirtied for setting a hint bit, which 
fixes the problem. So, there's a caveat with pg_rewind; you must have checksums 
enabled.


I was quite impressed with the idea, but hint bits indeed are a problem. I 
realised the same issue also applies to the other idea that Fujii-san and 
others have suggested about waiting for dirty buffers to be written until the 
WAL is received at the standby. But since that idea would anyways need to be 
implemented in the core, we could teach SetHintBits() to return false unless 
the corresponding commit WAL records are written to the standby first.


Would it be useful to turn this problem around? Heikki's proposal is based on being able 
to track (without fail) all blocks that have been modified; could we instead track blocks 
that we know for certain have NOT been modified? The difference there is that we can be 
more conservative in stating "we know this block is the same"; worst case we 
just do some extra copying.


One possibility would be to use file timestamps. For files that are past a 
certain age on both master and slave, if we force the timestamp on the slave to 
match the timestamp from the master, rsync will be able to safely ignore that 
file. I realize that's not as good as block-level detection, but it's probably 
a tremendous improvement over what we have today. The critical thing in this 
case would be to *guarantee* that the timestamps did not match on modified 
files.

Of course, screwing around with FS timestamps in this manner is pretty grotty, 
at least on a live system. Perhaps there's some way to track that info 
separately and then use it to change file timestamps before running rsync. Or 
if we are able to define a list of files that we think may have changed, we 
just feed that list to rsync and let it do the heavy lifting.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Alvaro Herrera
Andres Freund wrote:

> I wonder if we shouldn't make background workers use connection slots
> from max_connections similar to how superuser_reserved_connections
> works. That would mean we don't need to care about it for HS.

I remember considering this and concluding that it's messy.  Suppose we
decide that the registered bgworker number would be subtracted from
max_connections: if the configuration registers as many bgworkers as
max_connections, then no client connections can take place; if there are
more bgworkers than max_connections, there's going to be errors at
startup because the last few bgworkers cannot start at all (and no
client connections will be allowed).  So users would be forced to change
max_connections according to bgworkers configuration.  That doesn't
sound friendly.

If, instead of subtracting bgworkers from max_connections, we were to
add the number of bgworkers to max_connections, then we're no better
than currently, because the number of bgworkers from the standby would
be different from those in the master, and we'd be back into the problem
of how to ensure that the allowed number of locks meets the restriction.
If you wanted to have more bgworkers in the master than the standby,
you'd have to advise users to increase max_connections in the standby to
fulfill the restriction.

(I currently have no proposal on how to go about solving this problem.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
On Fri, May 24, 2013 at 10:01 AM, Amit Langote wrote:

> > Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for
> > the migration (and it is working fine). The streaming replication is for
> > hot-standby replication *once migrated*. The thing is, I disable archiving
> > and set wal_level to minimal when migrating the large portion of data, to
> > make it faster. Then I switch to wal_level=hot_standby, i.e. the
> > "production" configuration, and the WAL segment sequence seems to overlap
> > with the segments generated under the other setting.
> >
>
> Though, now you understand it's not what it looks like, right? :-)
>
>
> --
> Amit Langote
>

I didn't quite understand what you mean by that... But anyway, do you
people think this sequence number overlap is "normal"?


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 10:53 AM, Andres Freund  wrote:
>> [all-visible cannot restore hint bits without FPI because of torn pages]
>
> I haven't thought about this sufficiently yet. I think we might have
> a chance of working around this, let me ponder a bit.

Yeah.  I too feel like there might be a solution.  But I don't
have something specific in mind, yet anyway.

> But even if that means needing a full-page write via the usual mechanism
> for all-visible when any hint bits need to be set, we are still far
> ahead of the current state imo.
> * cleanup would quite possibly do an FPI shortly after in vacuum
>   anyway. If we do it for all-visible, it possibly does not need to be
>   done for the cleanup.
> * freezing would almost certainly FPI since we do it so much later.
> * Not having to rescan the whole heap will be a bigger cost saving...

The basic problem is that if the data is going to be removed before it
would have gotten frozen, then the extra FPIs are just overhead.  In
effect, we're just deciding to freeze a lot sooner.  And while that
might well be beneficial in some use cases (e.g. the data's already in
cache) it might also not be so beneficial (the table is larger than
cache and would have been dropped before freezing kicked in).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
> I didn't quite understand what you mean by that... But anyway, do you
> people think this sequence number overlap is "normal"?

There is "no overlap" at all. The newer segments that you see are
"pre-allocated" ones. They have not been written to yet.

>From the "ls -l pg_xlog" output that you  sent, it can be seen that
segments starting from 0001000E00A8 through
00010010007E have been pre-allocated (at that point of
time) and 0001000E00A7 is currently being written to. Just
look at the modified times in your "ls -l" listing.
0001000E00A7 has May 22 15:32 (the latest writes seem to
have happened to this segment) whereas pre-allocated ones seem to have
around May 22 12:05 to 12:15 (which are yet to be written to).

Does that help?

--
Amit Langote




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-24 11:29:10 -0400, Robert Haas wrote:
> > But even if that means needing a full-page write via the usual mechanism
> > for all-visible when any hint bits need to be set, we are still far
> > ahead of the current state imo.
> > * cleanup would quite possibly do an FPI shortly after in vacuum
> >   anyway. If we do it for all-visible, it possibly does not need to be
> >   done for the cleanup.
> > * freezing would almost certainly FPI since we do it so much later.
> > * Not having to rescan the whole heap will be a bigger cost saving...
> 
> The basic problem is that if the data is going to be removed before it
> would have gotten frozen, then the extra FPIs are just overhead.  In
> effect, we're just deciding to freeze a lot sooner.

Well, freezing without removing information for debugging.

> And while that
> might well be beneficial in some use cases (e.g. the data's already in
> cache) it might also not be so beneficial (the table is larger than
> cache and would have been dropped before freezing kicked in).

Not sure how caching comes into play here? At this point we know the
page to be in cache already since vacuum is looking at it anyway?

I think it's not really comparable since in those situations we a)
already do an XLogInsert() and b) already dirty the page. So the only
change is that we possibly write an additional full-page image. If
there is actually near-future DML write activity that would make the
all-visible superfluous, that would likely have to FPI anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:29 AM, Robert Haas  wrote:
> On Fri, May 24, 2013 at 10:53 AM, Andres Freund  
> wrote:
>>> [all-visible cannot restore hint bits without FPI because of torn pages]
>>
>> I haven't thought about this sufficiently yet. I think we might have
>> a chance of working around this, let me ponder a bit.
>
> Yeah.  I too feel like there might be a solution.  But I don't know
> have something specific in mind, yet anyway.

One thought I had is that it might be beneficial to freeze when a page
ceases to be all-visible, rather than when it becomes all-visible.
Any operation that makes the page not-all-visible is going to emit an
FPI anyway, so we don't have to worry about torn pages in that case.
Under such a scheme, we'd have to enforce the rule that xmin and xmax
are ignored for any page that is all-visible; and when a page ceases
to be all-visible, we have to go back and really freeze the
pre-existing tuples.  I think we might be able to use the existing
all_visible_cleared/new_all_visible_cleared flags to trigger this
behavior, without adding anything new to WAL at all.
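
In rough pseudocode, the rule might look like this (freeze_all_tuples()
and the exact hook point are invented for illustration, not existing
code):

	/* On any insert/update/delete that touches an all-visible page: */
	if (PageIsAllVisible(page))
	{
		/*
		 * The change emits an FPI anyway (the all_visible_cleared
		 * case), so freezing the pre-existing tuples here carries
		 * no extra torn-page risk; redo would do the same when it
		 * sees all_visible_cleared set.
		 */
		freeze_all_tuples(page);
		PageClearAllVisible(page);
	}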

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:52 AM, Andres Freund  wrote:
>> The basic problem is that if the data is going to be removed before it
>> would have gotten frozen, then the extra FPIs are just overhead.  In
>> effect, we're just deciding to freeze a lot sooner.
>
> Well, freezing without removing information for debugging.

Sure, but what I'm trying to avoid is incurring the WAL cost of
freezing.  If we didn't mind paying that sooner, we could just drop
vacuum_freeze_min/table_age.  But we do mind that.

>> And while that
>> might well be beneficial in some use cases (e.g. the data's already in
>> cache) it might also not be so beneficial (the table is larger than
>> cache and would have been dropped before freezing kicked in).
>
> Not sure how caching comes into play here? At this point we know the
> page to be in cache already since vacuum is looking at it anyway?

OK, true.

> I think it's not really comparable since in those situations we a)
> already do an XLogInsert(). b) already dirty the page. so the only
> change is that we possibly write an additionall full page image. If
> there is actually near future DML write activity that would make the
> all-visible superflous that would have to FPI likely anyway.

Well, if there's near-future write activity, then freezing is pretty
worthless anyway.  What I'm trying to avoid is adding WAL overhead in
the case where there *isn't* any near-future write activity, like
inserting 100MB of data into an existing table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:25 AM, Alvaro Herrera
 wrote:
> Andres Freund escribió:
>> I wonder if we shouldn't make background workers use connection slots
>> from max_connections similar to how superuser_reserved_connections
>> works. That would mean we don't need to care about it for HS.
>
> I remember considering this and concluding that it's messy.  Suppose we
> decide that the registered bgworker number would be subtracted from
> max_connections: if the configuration registers as many bgworkers as
> max_connections, then no client connections can take place; if there are
> more bgworkers than max_connections, there's going to be errors at
> startup because the last few bgworkers cannot start at all (and no
> client connections will be allowed).  So users would be forced to change
> max_connections according to bgworkers configuration.  That doesn't
> sound friendly.

I agree.  To put that more succinctly, if we take that approach, then
max_connections is no longer the maximum number of connections, which
is a POLA violation.

> (I currently have no proposal on how to go about solving this problem.)

If the problem were only with the size of the lock table, I'd be
somewhat inclined to propose ripping out max_locks_per_transaction and
putting in a GUC called max_locks instead.  The current system seems
more confusing than helpful; when the default proves insufficient, the
recourse is usually to figure out how many objects we need to lock,
and then divide by max_connections to figure out how to set
max_locks_per_transaction, which is really backing into the problem
from the wrong end.

But I don't know what to do about the problem of needing to know how
many backends there are.  I agree with Andres that it's not very
friendly to enforce a restriction that all the same worker processes
must be present on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Jim Nasby

On 5/24/13 9:53 AM, Andres Freund wrote:

> > We don't even necessarily need to log the hint bits for all items since
> > the redo for all_visible could make sure all items are hinted. The only
> > problem is knowing up to where we can truncate pg_clog...

> [all-visible cannot restore hint bits without FPI because of torn pages]

I haven't thought about this sufficiently yet. I think we might have
a chance of working around this, let me ponder a bit.

But even if that means needing a full-page write via the usual mechanism
for all-visible when any hint bits need to be set, we are still far
ahead of the current state imo.
* cleanup would quite possibly do an FPI shortly after in vacuum
  anyway. If we do it for all-visible, it possibly does not need to be
  done for the cleanup.
* freezing would almost certainly FPI since we do it so much later.
* Not having to rescan the whole heap will be a bigger cost saving...


Would we only set all the hint bits within vacuum? If so I don't think the WAL 
hit matters at all, because vacuum is almost always a background, throttled 
process.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue & Wed.

2013-05-24 Thread Jim Nasby

I'm moving this to -advocacy, as it seems more appropriate there...

On 5/20/13 10:31 AM, Alfred Perlstein wrote:

Hello PostgreSQL Hackers,

I am now in Ottawa, last week we wrapped up the BSDCon and I was hoping to chat 
with a few Postgresql developers in person about using Postgresql in FreeNAS 
and offering it as an extension to the platform as a plug-in technology.  
Unfortunately due to time constraints I can not attend the entire conference 
and I am only in town until Wednesday at noon.

I'm hoping there's a good time to talk to a few developers about Postgresql + 
FreeNAS before I have to depart back to the bay area.


Were you able to meet up with anyone? If not, you should attend the SF Postgres 
Users Group meeting.


Some info on me:  My name is Alfred Perlstein, I am a FreeBSD developer and 
FreeNAS project lead.  I am the VP of Software Engineering at iXsystems.  I 
have been a fan of Postgresql for many years.  In the early 2000s we built a 
high-speed web tracking application on top of Postgresql and worked closely 
with the community to shake out performance issues and bugs, so closely that 
Tom Lane and Vadim Mikheev had logins on our box.  Since that time I have 
tried to get Postgresql into as many places as possible.

Some info on the topics I wanted to briefly discuss:

1) Using Postgresql as the config store for FreeNAS.
We currently use SQLite.  SQLite fits our needs until we get to the point of 
replication between HA (high availability) units.  Then we are forced to 
manually sync data between configurations.  A discussion on how we might do 
this better using Postgresql, while still maintaining our ease of config export 
(single file) and small footprint would be interesting.


Probably best to discuss in person at SFPUG... :)


2) Postgresql plugin for FreeNAS.
Flip a switch and suddenly your file server is also serving enterprise data.  
We currently have a plug-in architecture, but would like to discuss the 
possibility of a tighter integration so that Postgresql looks like a more 
cohesive addition to FreeNAS.


Doesn't sound like that should be that terribly hard; you'd probably want to 
tweak the stock postgresql.conf for improved performance.


3) Statistic monitoring / EagleEye
In FreeBSD/FreeNAS I have developed a system called EagleEye. EagleEye is a system 
where all MIBs are easily exportable with timestamps in a common format (for now 
YAML & modified CSV), which is then consumed by a utility that can provide 
graphs. The entire point of EagleEye is to eventually upstream the modifications, to 
future-proof statistics tracking in the FreeBSD and FreeNAS systems.  I have 
spoken with some illumos/ZFS developers and they are interested as well.


Are you thinking of storing that data in Postgres? You might find 
http://labs.omniti.com/labs/reconnoiter interesting.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




[HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Andres Freund
Hi,

while thinking about vacuum freeze I noticed that since the checksums
patch visibilitymap_set() does:
/*
 * If data checksums are enabled, we need to protect the heap
 * page from being torn.
 */
if (DataChecksumsEnabled())
{
Page heapPage = BufferGetPage(heapBuf);

/* caller is expected to set PD_ALL_VISIBLE first */
Assert(PageIsAllVisible(heapPage));
PageSetLSN(heapPage, recptr);
}

That pattern looks dangerous. Setting the LSN of the heap page will
prevent the next action from doing an FPI even if one would be required.
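
(For context, the backup-block decision in XLogInsert() boils down to
comparing the page LSN against the redo pointer; abridged from
XLogCheckBuffer() in 9.x-era xlog.c:

	*lsn = PageGetLSN(page);
	if (doPageWrites &&
		PageGetLSN(page) <= RedoRecPtr)
		return true;	/* caller includes a full-page image */

so a page whose LSN was already advanced past RedoRecPtr will not get a
backup block.)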

It's e.g. called like this from lazy_scan_heap:

if (all_visible && !all_visible_according_to_vm)
{
	/*
	 * It should never be the case that the visibility map page is set
	 * while the page-level bit is clear, but the reverse is allowed
	 * (if checksums are not enabled).  Regardless, set both bits
	 * so that we get back in sync.
	 *
	 * NB: If the heap page is all-visible but the VM bit is not set,
	 * we don't need to dirty the heap page.  However, if checksums are
	 * enabled, we do need to make sure that the heap page is dirtied
	 * before passing it to visibilitymap_set(), because it may be
	 * logged.  Given that this situation should only happen in rare
	 * cases after a crash, it is not worth optimizing.
	 */
	PageSetAllVisible(page);
	MarkBufferDirty(buf);
	visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
					  vmbuffer, visibility_cutoff_xid);
}

other callers look similarly dangerous.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
Thanks Amit, I understand now. Is there a way to know/predict how many
preallocated segments there will be at a certain moment? What does it
depend on?


On Fri, May 24, 2013 at 12:46 PM, Amit Langote wrote:

> > I didn't quite understand what you mean by that... But anyway, do you
> > people think this sequence number overlap is "normal"?
>
> There is "no overlap" at all. The newer segments that you see are
> "pre-allocated" ones. They have not been written to yet.
>
> From the "ls -l pg_xlog" output that you  sent, it can be seen that
> segments starting from 0001000E00A8 through
> 00010010007E have been pre-allocated (at that point of
> time) and 0001000E00A7 is currently being written to. Just
> look at the modified times in your "ls -l" listing.
> 0001000E00A7 has May 22 15:32 (the latest writes seem to
> have happened to this segment) whereas pre-allocated ones seem to have
> around May 22 12:05 to 12:15 (which are yet to be written to).
>
> Does that help?
>
> --
> Amit Langote
>


Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Simon Riggs
On 24 May 2013 18:40, Andres Freund  wrote:

> That pattern looks dangerous. Setting the LSN of the heap page will
> prevent the next action from doing an FPI even if one would be required.

Can you be more specific about the danger you see?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parallel Sort

2013-05-24 Thread Jim Nasby

On 5/13/13 9:28 AM, Noah Misch wrote:

It would be great if one client session could take advantage of multiple CPU
cores.  EnterpriseDB wishes to start the trek into this problem space for 9.4
by implementing parallel internal (i.e. not spilling to disk) sort.  This
touches on a notable subset of the infrastructure components we'll need for
parallel general query.  My intent is to map out the key design topics, hear
about critical topics I hadn't considered, and solicit feedback on the quality
of the high-level plan.  Full designs for key pieces will come later.


Have you considered GPU-based sorting? I know there's been discussion in the 
past.

To me, the biggest advantage of GPU sorting is that most of the concerns you've 
laid out go away; a backend that needs to sort just throws data at the GPU to 
do the actual sorting; all the MVCC issues and what not remain within the scope 
of a single backend.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
On Sat, May 25, 2013 at 3:08 AM, German Becker  wrote:
> Thanks Amit, I understand now. Is there a way to know/predict how many
> preallocated segments there will be at a certain moment? What does it
> depend on?

Upthread, Fujii Masao-san suggested what might have happened that
caused these pre-allocated segments to be created. To quote him:

"WAL recycling is performed by checkpoint. Checkpoint always checks
whether there are WAL files no longer  required for crash recovery,
IOW, WAL files which were generated before the prior checkpoint
happened, and then if they are found, checkpoint tries to recycle
them."

Reading here would also help:
http://www.postgresql.org/docs/9.1/static/wal-configuration.html

If you are still using the same values as during this observation,
could you provide values for these postgresql.conf parameters:
checkpoint_segments, checkpoint_timeout, wal_keep_segments?


--
Amit Langote




Re: [HACKERS] Incomplete description of pg_start_backup?

2013-05-24 Thread Dmitry Koterov
I still don't get it.

Suppose we have a data file with blocks of important (non-empty) data:

A B C D

1. I call pg_start_backup().
2. Tar starts to copy block A to the destination archive...
3. During this copying, somebody removes data from a table which is
situated in block B. So this data is a subject for vacuuming, and the block
is marked as free space.
4. Somebody writes data to a table, and this data is placed into the free
space - into block B. This is also added to the WAL (so the data is stored
in two places: in block B and in the WAL).
5. Tar (at last!) finishes copying block A and begins to copy block B.
6. It finishes, then it copies C and D to the archive too.
7. Then we call pg_stop_backup() and also archive the collected WAL (which
contains the new data of block B, as we saw above).

The question is - *where is the OLD data of block B in this scheme?* It
seems it is NOT in the backup! So it cannot be restored. (In the case where
we never overwrite blocks between pg_start_backup...pg_stop_backup, but
always append new data, this is not a problem.) It seems to me this is not
documented at all! That is what my initial e-mail was about.

(I have one hypothesis on that, but I am not sure. Here it is: does vacuum
save ALL deleted data of block B to the WAL at step 3, prior to deletion? If
yes, it is, of course, part of the backup. But that wastes a lot of space...)




On Tue, May 14, 2013 at 6:05 PM, Jeff Janes  wrote:

> On Mon, May 13, 2013 at 4:31 PM, Dmitry Koterov  wrote:
>
>> Could you please provide a bit more detailed explanation on how it works?
>>
>> And how could postgres write in the middle of archiving files during an
>> active pg_start_backup? If it could, there might be a case when part of an
>> archived data file contains overwritten information "from the future",
>>
>
> The data files cannot contain information from the future.  If the backup
> is restored, it must be restored to the time of pg_stop_backup (at least),
> which means the data would at that point be from the past/present, not the
> future.
>
> Cheers,
>
> Jeff
>


Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 10:36 AM, Jim Nasby wrote:

Instead of KB/s, could we look at how much time one process is spending
waiting on IO vs the rest of the cluster? Is it reasonable for us to
measure IO wait time for every request, at least on the most popular OSes?


It's not just an OS specific issue.  The overhead of collecting timing 
data varies massively based on your hardware, which is why there's the 
pg_test_timing tool now to help quantify that.


I have a design I'm working on that exposes the system load to the 
database usefully.  That's what I think people really want if the goal 
is to be adaptive based on what else is going on.  My idea is to use 
what "uptime" collects as a starting useful set of numbers to quantify 
what's going on.  If you have both a short term load measurement and a 
longer term one like uptime provides, you can quantify both the overall 
load and whether it's rising or falling.  I want to swipe some ideas on 
how moving averages are used to determine trend in stock trading 
systems: 
http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html
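
As a minimal sketch of that idea (hypothetical function and thresholds, 
not part of any patch), the same numbers "uptime" reports are enough to 
classify the trend:

#include <stdlib.h>

/* Classify the system load trend from the 1- and 15-minute load
 * averages.  Returns >0 when load is rising, <0 when falling, and 0
 * when it is roughly flat or unknown. */
static int
load_trend(void)
{
    double      avgs[3];

    if (getloadavg(avgs, 3) != 3)
        return 0;               /* no data; assume steady */

    /* a short-term average above the long-term one signals rising
     * load, like a moving-average crossover in trading */
    if (avgs[0] > avgs[2] * 1.1)
        return 1;
    if (avgs[0] < avgs[2] * 0.9)
        return -1;
    return 0;
}

A cost delay point could then scale its sleep time by that result, 
backing off harder while load is rising.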


Dynamic load-sensitive statement limits and autovacuum are completely 
feasible on UNIX-like systems.  The work to insert a cost delay point 
needs to get done before building more complicated logic on top of it 
though, so I'm starting with this part.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
On Fri, May 24, 2013 at 3:24 PM, Amit Langote wrote:

> On Sat, May 25, 2013 at 3:08 AM, German Becker 
> wrote:
> > Thanks Amit, I understand now. Is there a way to know/predict how many
> > prealocated segments will there be in a certain moment? What does it
> deppend
> > on?
>
> Upthread, Fujii Masao-san suggested what might have happened that
> caused these pre-allocated segments to be created. To quote him:
>
> "WAL recycling is performed by checkpoint. Checkpoint always checks
> whether there are WAL files no longer  required for crash recovery,
> IOW, WAL files which were generated before the prior checkpoint
> happened, and then if they are found, checkpoint tries to recycle
> them."
>
> Reading here would also help:
> http://www.postgresql.org/docs/9.1/static/wal-configuration.html
>
> If you are still using the same values as during this observation,
> could you provide values for these postgresql.conf parameters:
> checkpoint_segments, checkpoint_timeout, wal_keep_segments?
>
>
> --
> Amit Langote
>

Amit,

First, thanks for your help and your interest. I think I understand how
checkpoints/WAL segments work. What I didn't understand from the
documentation is the possibility of segments being preallocated. I thought
that the WAL segment with the highest sequence number is the one being
written at the present time, as opposed to a segment allocated to be written
in the future. If you could clarify this point for me, that would be great.

Here are the parameters related to checkpoints:

For "restore" config
checkpoint_segments = 256   # in logfile segments, min 1, 16MB
each
#checkpoint_timeout = 5min  # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 -
1.0
#checkpoint_warning = 30s

#wal_keep_segments = 0  # in logfile segments, 16MB each; 0 disables




For "production" config:
checkpoint_segments = 256   # in logfile segments, min 1, 16MB
each
#checkpoint_timeout = 5min  # range 30s-1h
checkpoint_completion_target = 0.7  # checkpoint target duration, 0.0 -
1.0
wal_keep_segments = 256 # in logfile segments, 16MB each; 0 disables

Thanks again,

Germán


Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 9:21 AM, Robert Haas wrote:


But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit =  and read_rate_limit = .


I already made and lost the argument for doing vacuum in KB/s units, so 
I wasn't planning on putting that in the way of this one.  I still think 
it's possible to switch to real world units and simplify all of those 
parameters.  Maybe I'll get the energy to fight this battle again for 
9.4.  I do have a lot more tuning data from production deployments to 
use as evidence now.


I don't think the UI end changes the bulk of the implementation work 
though.  The time consuming part of this development is inserting all of 
the cost delay hooks and validating they work.  Exactly what parameters 
and logic fires when they are called can easily be refactored later.
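
As a rough illustration of that separation, a KB/s-style limit can sit on 
top of the same delay point a cost-based limit would use; everything here 
is hypothetical naming, not code from any patch:

#include "postgres.h"
#include "miscadmin.h"
#include "portability/instr_time.h"

/* Sleep just long enough to keep bytes_done under limit_kb_per_sec,
 * given the time this statement started counting. */
static void
rate_limit_delay(instr_time start_time, int64 bytes_done,
                 double limit_kb_per_sec)
{
    instr_time  now;
    double      elapsed_sec;
    double      budget_bytes;

    INSTR_TIME_SET_CURRENT(now);
    INSTR_TIME_SUBTRACT(now, start_time);
    elapsed_sec = INSTR_TIME_GET_DOUBLE(now);

    budget_bytes = limit_kb_per_sec * 1024.0 * elapsed_sec;
    if ((double) bytes_done > budget_bytes)
    {
        /* sleep off the excess at the configured rate */
        double      sleep_sec = ((double) bytes_done - budget_bytes) /
                                (limit_kb_per_sec * 1024.0);

        pg_usleep((long) (sleep_sec * 1000000.0));
    }
}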


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Sort

2013-05-24 Thread james
> Have you considered GPU-based sorting? I know there's been discussion
> in the past.


If you use OpenCL, then you can use a CPU driver if there is no GPU, and 
that can allow you to leverage all the CPU cores without having to do 
the multi-thread stuff in the backend.


While the compilation of a specific kernel can be quite expensive, it 
also has the effect of a JIT compiler in terms of system independence.
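
As a sketch of that fallback, using only the standard OpenCL host API 
(error handling abbreviated):

#include <CL/cl.h>

/* Prefer a GPU, but fall back to the OpenCL CPU driver so the same
 * kernel still runs across all CPU cores. */
static cl_device_id
pick_device(void)
{
    cl_platform_id platform;
    cl_device_id   device;

    clGetPlatformIDs(1, &platform, NULL);
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                       1, &device, NULL) != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU,
                       1, &device, NULL);
    return device;
}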




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
Hi,
I'm wondering if it would be OK to change the procedure code before
execution. I'm thinking about magically adding an import at the beginning
of a function.

Currently numeric arguments passed to the procedure are converted into
floats. This is not good, as it causes loss of information.

The proposed solution in the code comment is "maybe use a string?".

I'm thinking about something else. We could convert it into the Decimal
class (http://docs.python.org/2/library/decimal.html) in Python.
Unfortunately this class requires an import like `from decimal import
Decimal` from the standard Python library.

I'm wondering if it would be a good idea to do it like this. It shouldn't
fail even with the trusted version of pl/python, as I'd rather see the
trusted version allow importing packages from the standard library.

regards,
Szymon


Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 10:36 AM, Jim Nasby  wrote:
> Doesn't that hit the old issue of not knowing if a read came from FS cache
> or disk? I realize that the current cost_delay mechanism suffers from that
> too, but since the API is lower level that restriction is much more
> apparent.

Sure, but I think it's still useful despite that limitation.

> Instead of KB/s, could we look at how much time one process is spending
> waiting on IO vs the rest of the cluster? Is it reasonable for us to measure
> IO wait time for every request, at least on the most popular OSes?

I doubt that's going to be very meaningful.  The backend that dirties
the buffer is fairly likely to be different from the backend that
writes it out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Claudio Freire
On Fri, May 24, 2013 at 4:10 PM, Szymon Guz  wrote:
>
> I'm thinking about something else. We could convert it into Decimal
> (http://docs.python.org/2/library/decimal.html) class in Python.
> Unfortunately this class requires import like `from decimal import Decimal`
> from a standard Python library.
>
> I'm wondering if it would be a good idea to do it like this. It shouldn't
> fail even with the trusted version of pl/python, as I'd rather see the
> trusted version to allow importing packages from standard library.


Why would passing a decimal require an import?

The extension itself needs a reference to Decimal, to build them, but
the procedure's context doesn't need to have it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] shmem startup and shutdown hooks

2013-05-24 Thread Alvaro Herrera
Hi,

While going through Andres' BDR code I noticed that it has a shmem
startup hook, which installs an on_shmem_exit() callback to write stuff
at shutdown time.  This works fine but seems somewhat of a hazard: it is
having postmaster do the actual write, which has to access shared memory
while doing it.  Moreover, since the callback is installed to be called
at shmem reset, it means that a process crash causes this code to be
called, which seems similarly broken.

It seems that the right place to do this is checkpointer shutdown, i.e.
when checkpointer is told to close shop it should also invoke various
modules' shutdown callbacks.  There's no hook point there though, so
we'd need to introduce something new for this specific purpose.

We already have pg_stat_statements doing this, viz. pgss_shmem_startup():

    /*
     * If we're in the postmaster (or a standalone backend...), set up a
     * shmem exit hook to dump the statistics to disk.
     */
    if (!IsUnderPostmaster)
        on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);
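
For reference, the surrounding registration pattern looks roughly like 
this (abridged from pg_stat_statements; it also shows why the callback 
ends up running in the postmaster - the startup hook itself runs there 
when shared memory is first created):

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

void
_PG_init(void)
{
    /* ... RequestAddinShmemSpace() etc. ... */
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = pgss_shmem_startup;
}

static void
pgss_shmem_startup(void)
{
    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    /* ... create or attach to the shared state ... */

    if (!IsUnderPostmaster)
        on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);
}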

Also, it'd be good to discuss what should happen when another process
crashes.  Should we attempt to write the data anyway, or should we reset
it just like we do for pgstats?

Thoughts?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
On 24 May 2013 21:14, Claudio Freire  wrote:

> On Fri, May 24, 2013 at 4:10 PM, Szymon Guz  wrote:
> >
> > I'm thinking about something else. We could convert it into Decimal
> > (http://docs.python.org/2/library/decimal.html) class in Python.
> > Unfortunately this class requires import like `from decimal import
> Decimal`
> > from a standard Python library.
> >
> > I'm wondering if it would be a good idea to do it like this. It shouldn't
> > fail even with the trusted version of pl/python, as I'd rather see the
> > trusted version to allow importing packages from standard library.
>
>
> Why would passing a decimal require an import?
>
> The extension itself needs a reference to Decimal, to build them, but
> the procedure's context doesn't need to have it.
>

Hm... maybe you're right. I think I don't fully understand how the
procedures are executed, and I need to read more to get it.


thanks,
Szymon


Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Andres Freund
On 2013-05-24 19:09:57 +0100, Simon Riggs wrote:
> On 24 May 2013 18:40, Andres Freund  wrote:
> 
> > That pattern looks dangerous. Setting the lsn of the heap page will
> > prevent the next action from doing a FPI even if it would be required.
> 
> Can you be more specific about the danger you see?

CHECKPOINT at lsn 0/10;
vacuum starts
vacuum finds page which is all visible
vacuum sets all_visible
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
  vmbuffer, visibility_cutoff_xid);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
  cutoff_xid);
if (DataChecksumsEnabled())
PageSetLSN(heapPage, recptr);

So at this point the *heap* page will have the lsn of the
xl_heap_visible record. Which I thought to be rather dangerous because I
somehow missed the fact that log_heap_visible does:
if (DataChecksumsEnabled())
{
rdata[1].next = &(rdata[2]);

rdata[2].data = NULL;
rdata[2].len = 0;
rdata[2].buffer = heap_buffer;
rdata[2].buffer_std = true;
rdata[2].next = NULL;
}

So. Forget what I said, I was just confused.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Claudio Freire
On Fri, May 24, 2013 at 4:22 PM, Szymon Guz  wrote:
> Hm... maybe you're right. I think I don't understand fully how the
> procedures are executed, and I need to read more to get it.


Well, it's easy.

Instead of PLyFloat_FromNumeric[0], you can make a
PLyDecimal_FromNumeric. There, you'd do this with the Python/C API[1]:

PyObject *decimal = PyImport_ImportModule("decimal");
PyObject *decimal_dict = PyModule_GetDict(decimal);
PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal");

And invoke it with a string rep of your Numeric:

PyObject *value = PyObject_CallFunction(decimal_ctor, "S", string_value);

Add of course all kinds of error checking and reference-counting
boilerplate, and you'd have a very dumb version of it.

To make it more "pro", you'd want to do all that stuff to get
decimal_ctor only at initialization time. Especially since you don't
want to fumble with the import lock right there in _FromNumeric.

And to make it totally "pro", you can even freeze Decimal (using
pyfreeze) if you'd like. I would only do this in contexts where you
don't have a stdlib of course. Not sure whether windows falls into
that category. Linux doesn't.
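
Putting those pieces together, an untested sketch of what such a 
converter could look like next to PLyFloat_FromNumeric (the function name 
is hypothetical; Python 2 API, error paths abbreviated):

static PyObject *
PLyDecimal_FromNumeric(PLyDatumToOb *arg, Datum d)
{
    static PyObject *decimal_ctor = NULL;
    char       *str;
    PyObject   *pystr;
    PyObject   *result;

    if (decimal_ctor == NULL)
    {
        /* done once, so the import lock is not touched on every call */
        PyObject   *mod = PyImport_ImportModule("decimal");

        if (mod == NULL)
            PLy_elog(ERROR, "could not import Python decimal module");
        decimal_ctor = PyObject_GetAttrString(mod, "Decimal");
        Py_DECREF(mod);
        if (decimal_ctor == NULL)
            PLy_elog(ERROR, "decimal module has no Decimal attribute");
    }

    /* go through the text representation to avoid losing precision */
    str = DatumGetCString(DirectFunctionCall1(numeric_out, d));
    pystr = PyString_FromString(str);
    pfree(str);

    result = PyObject_CallFunctionObjArgs(decimal_ctor, pystr, NULL);
    Py_DECREF(pystr);
    return result;
}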


[0] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518
[1] http://docs.python.org/2/c-api/import.html


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Josh Berkus
Andres,

If I understand your solution correctly, though, this doesn't really
help the pathological case for freezing, which is the time-oriented
append-only table.  For data which isn't being used, allvisible won't be
set either because it won't have been read, no?  Is it still cheaper to
set allvisible than vacuum freeze even in that case?

Don't get me wrong, I'm in favor of this if it fixes the other (more
common) cases.  I just want to be clear on the limitations.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
On 24 May 2013 21:46, Claudio Freire  wrote:

> On Fri, May 24, 2013 at 4:22 PM, Szymon Guz  wrote:
> > Hm... maybe you're right. I think I don't understand fully how the
> > procedures are executed, and I need to read more to get it.
>
>
> Well, it's easy.
>
> Instead of PLyFloat_FromNumeric[0], you can make a
> PLyDecimal_FromNumeric. There, you'd do this with the Python/C API[1]:
>
> PyObject *decimal = PyImport_ImportModule("decimal");
> PyObject *decimal_dict = PyModule_GetDict(decimal);
> PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal");
>
> And invoke it with a string rep of your Numeric:
>
> PyObject *value = PyObject_CallFunction(decimal_ctor, "S", string_value);
>
> Add of course all kinds of error checking and reference-counting
> boilerplate, and you'd have a very dumb version of it.
>
> To make it more "pro", you'd want to do all that stuff to get
> decimal_ctor only at initialization time. Especially since you don't
> want to fumble with the import lock right there in _FromNumeric.
>
> And to make it totally "pro", you can even freeze Decimal (using
> pyfreeze) if you'd like. I would only do this in contexts where you
> don't have a stdlib of course. Not sure whether windows falls into
> that category. Linux doesn't.
>
>
> [0]
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518
> [1] http://docs.python.org/2/c-api/import.html
>


Thanks, I will take a look at this, looks pretty easy. However testing on
Windows will be pretty funny :)

thanks,
Szymon


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-24 15:49:31 -0400, Josh Berkus wrote:
> If I understand your solution correctly, though, this doesn't really
> help the pathological case for freezing, which is the time-oriented
> append-only table.  For data which isn't being used, allvisible won't be
> set either because it won't have been read, no?  Is it still cheaper to
> set allvisible than vacuum freeze even in that case?

All-visible is only set by vacuum, and it determines which parts of a
table will be scanned in a non-full-table vacuum. So, since we won't
regularly start vacuum in the insert-only case, there will still be a
batch of work at once. But nearly all of that work is *already*
performed; we would just shift the details of it around for a bit. *But*
since we would now only need to vacuum the non-all-visible part, that
would get noticeably cheaper as well.

I think for that case we should run vacuum more regularly on insert-only
tables, since we currently don't do that regularly enough, which a)
increases the amount of work needed at once and b) prevents index-only
scans from working there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Sort

2013-05-24 Thread Kohei KaiGai
Let me introduce one thing we discussed in the developer meeting at
Ottawa. We reached a consensus that a pluggable exec-node may be useful
for replacing part of an exec-node tree with an alternative one
implemented by an extension; this would allow running something like
"GpuSort" instead of the existing Sort.

http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting#Pluggable_plan.2Fexec_nodes
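
Until such an interface exists, the closest an extension can get today is 
intercepting the plan tree via planner_hook; a rough sketch, with the 
actual GpuSort substitution hand-waved:

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
gpusort_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *stmt;

    if (prev_planner_hook)
        stmt = prev_planner_hook(parse, cursorOptions, boundParams);
    else
        stmt = standard_planner(parse, cursorOptions, boundParams);

    /* walk stmt->planTree here and swap Sort nodes for a custom
     * GpuSort node (hypothetical) */
    return stmt;
}

void
_PG_init(void)
{
    prev_planner_hook = planner_hook;
    planner_hook = gpusort_planner;
}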

2013/5/24 james :
>> Have you considered GPU-based sorting? I know there's been discussion in
>> the past.
>
> If you use OpenCL, then you can use a CPU driver if there is no GPU, and
> that can allow you to leverage all the CPU cores without having to do the
> multi-thread stuff in the backend.
>
> While the compilation of a specific kernel can be quite expensive, it also
> has the effect of a JIT compiler in terms of system independence.
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Simon Riggs
On 24 May 2013 20:26, Andres Freund  wrote:
> On 2013-05-24 19:09:57 +0100, Simon Riggs wrote:
>> On 24 May 2013 18:40, Andres Freund  wrote:
>>
>> > That pattern looks dangerous. Setting the lsn of the heap page will
>> > prevent the next action from doing a FPI even if it would be required.
>>
>> Can you be more specific about the danger you see?
>
> CHECKPOINT at lsn 0/10;
> vacuum starts
> vacuum finds page which is all visible
> vacuum sets all_visible
> PageSetAllVisible(page);
> MarkBufferDirty(buf);
> visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
>   vmbuffer, visibility_cutoff_xid);
> recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
>   cutoff_xid);
> if (DataChecksumsEnabled())
> PageSetLSN(heapPage, recptr);
>
> So at this point the *heap* page will have the lsn of the
> xl_heap_visible record. Which I thought to be rather dangerous because I
> somehow missed the fact that log_heap_visible does:
> if (DataChecksumsEnabled())
> {
> rdata[1].next = &(rdata[2]);
>
> rdata[2].data = NULL;
> rdata[2].len = 0;
> rdata[2].buffer = heap_buffer;
> rdata[2].buffer_std = true;
> rdata[2].next = NULL;
> }
>
> So. Forget what I said, I was just confused.

I think it's perfectly understandable. Robert, Jeff and I discussed
that for a while before we passed it. I'm still not happy with it, and
think it's a pretty confusing section of code with multiple paths
through it, but I just can't see a better way.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incomplete description of pg_start_backup?

2013-05-24 Thread Heikki Linnakangas

On 24.05.2013 14:33, Dmitry Koterov wrote:

I still don't get it.

Suppose we have a data file whose blocks hold important (non-empty) data:

A B C D

1. I call pg_start_backup().
2. Tar starts to copy block A to the destination archive...
3. During this copying, somebody removes data from a table which is
situated in block B. So this data becomes subject to vacuuming, and the
block is marked as free space.
4. Somebody writes data to a table, and this data is placed into the free
space - into block B. It is also added to the WAL (so the data is stored in
two places: in block B and in the WAL).
5. Tar (at last!) finishes copying block A and begins to copy block B.
6. It finishes, then it copies C and D to the archive too.
7. Then we call pg_stop_backup() and also archive the collected WAL (which
contains the new data of block B, as we saw above).

The question is - *where is the OLD data of block B in this scheme?* It
seems it is NOT in the backup!


Correct.


So it cannot be restored.


Right, the old data in block B is gone. The backup is fine; the old data 
in block B is not needed to recover the backup. When the backup is 
restored, WAL replay starts from the checkpoint taken by 
pg_start_backup(), and the first WAL record touching block B after that 
point carries a full-page image of it, so whatever version of B tar 
happened to copy is simply overwritten during recovery.



(In the case where we never overwrite blocks between
pg_start_backup...pg_stop_backup, but always append new data, this
is not a problem.) It seems to me this is not documented at all! That is
what my initial e-mail was about.


When you restore the backup, the database is restored to the state it 
was in when pg_stop_backup() was called. What did you expect?



(I have one hypothesis on that, but I am not sure. Here it is: does vacuum
save ALL deleted data of block B to the WAL at step 3, prior to deletion? If
yes, it is, of course, part of the backup. But that wastes a lot of space...)


Nope, it doesn't do that.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Hannu Krosing
On 05/24/2013 07:00 PM, Robert Haas wrote:
> On Fri, May 24, 2013 at 11:29 AM, Robert Haas  wrote:
>> On Fri, May 24, 2013 at 10:53 AM, Andres Freund  
>> wrote:
 [all-visible cannot restore hint bits without FPI because of torn pages]
>>> I haven't yet thought about this sufficiently yet. I think we might have
>>> a chance of working around this, let me ponder a bit.
>> Yeah.  I too feel like there might be a solution.  But I don't know
>> have something specific in mind, yet anyway.
> One thought I had is that it might be beneficial to freeze when a page
> ceases to be all-visible, rather than when it becomes all-visible.
That is what I aimed to describe in my mail earlier, but your
description is much clearer :)
> Any operation that makes the page not-all-visible is going to emit an
> FPI anyway, so we don't have to worry about torn pages in that case.
> Under such a scheme, we'd have to enforce the rule that xmin and xmax
> are ignored for any page that is all-visible; 
Agreed. We already rely on all-visible pages enough that we
can trust them to be correct. Making that a universal rule should not
add any risks.
The rule "page all-visible ==> assume all tuples frozen" would
also enable VACUUM FREEZE to work only on the
non-all-visible pages.
> and when a page ceases
> to be all-visible, we have to go back and really freeze the
> pre-existing tuples.  
We can do this unconditionally, or in the milder case use
vacuum_freeze_min_age if we want to retain xids for forensic purposes.
> I think we might be able to use the existing
> all_visible_cleared/new_all_visible_cleared flags to trigger this
> behavior, without adding anything new to WAL at all.
This seems to be the easiest approach.
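
To illustrate the idea (a sketch only - assuming the 9.2-era
heap_freeze_tuple() signature, with WAL logging and buffer locking
ignored), the path that clears PD_ALL_VISIBLE would first freeze
everything on the page:

static void
freeze_page_before_clearing_all_visible(Page page,
                                        TransactionId freeze_limit)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    Assert(PageIsAllVisible(page));

    for (off = FirstOffsetNumber; off <= maxoff;
         off = OffsetNumberNext(off))
    {
        ItemId      itemid = PageGetItemId(page, off);

        if (!ItemIdIsNormal(itemid))
            continue;
        /* all-visible guarantees every xmin here is committed and
         * visible to everyone, so freezing cannot change semantics */
        heap_freeze_tuple((HeapTupleHeader) PageGetItem(page, itemid),
                          freeze_limit);
    }
    PageClearAllVisible(page);
}

Since any operation clearing PD_ALL_VISIBLE emits a full-page image
anyway, the freeze piggybacks on it and costs no extra FPI.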

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Amit kapila
On Friday, May 24, 2013 8:22 PM Jim Nasby wrote:
On 5/14/13 8:42 AM, Amit Kapila wrote:
>> In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
>> && refcount = 0) buffers to the end of the freelist. I have implemented a
>> new API, StrategyMoveBufferToFreeListEnd(), to move buffers to the end of
>> the freelist.
>>

> Instead of a separate function, would it be better to add an argument to 
> StrategyFreeBuffer? 

  Yes, it could be done with a parameter which decides whether to put the
buffer at the head or the tail of the freelist.
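
  A minimal sketch of that variant, reusing the existing freelist fields
(locking as in the current StrategyFreeBuffer; the boolean flag is
hypothetical):

void
StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head)
{
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    /* only add it if it's not already on the freelist */
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        if (at_head)
        {
            buf->freeNext = StrategyControl->firstFreeBuffer;
            if (buf->freeNext < 0)
                StrategyControl->lastFreeBuffer = buf->buf_id;
            StrategyControl->firstFreeBuffer = buf->buf_id;
        }
        else
        {
            buf->freeNext = FREENEXT_END_OF_LIST;
            if (StrategyControl->firstFreeBuffer < 0)
                StrategyControl->firstFreeBuffer = buf->buf_id;
            else
                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext =
                    buf->buf_id;
            StrategyControl->lastFreeBuffer = buf->buf_id;
        }
    }

    LWLockRelease(BufFreelistLock);
}
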
  However, currently the main focus is to check in which cases this
optimization can give a benefit.
  Robert ran tests for quite a number of cases where it didn't show any
significant gain.
  I am also trying various configurations to see if it gives any benefit.
  Robert has given some suggestions to change the way the new function is
currently getting called; I will try them and update the results.

  I am not very sure that the default pgbench is a good test scenario for
this optimization.
  If you have any suggestions for tests where it can show a benefit, that
would be great input.

> ISTM this is similar to the other strategy stuff in the buffer manager, so 
> perhaps it should mirror that...

With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers