Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-03-20 Thread Jesper Krogh

On 15/03/14 20:27, Heikki Linnakangas wrote:
That said, I didn't expect the difference to be quite that big when 
you're appending to the end of the table. When the new entries go to 
the end of the posting lists, you only need to recompress and WAL-log 
the last posting list, which is max 256 bytes long. But I guess that's 
still a lot more WAL than in the old format.


That could be optimized, but I figured we can live with it, thanks to 
the fastupdate feature. Fastupdate allows amortizing that cost over 
several insertions. But of course, you explicitly disabled that...


In a concurrent update environment, fastupdate as it is in 9.2 is not 
really useful. It may let you batch up insertions, but you have no 
control over who ends up paying the debt. Doubling the amount of WAL 
from gin-indexing would be pretty tough for us; in 9.2 we generate 
roughly 1TB WAL / day, keeping it
for some weeks to be able to do PITR. The WAL is mainly due to 
gin-index updates as new data is added and needs to be searchable by 
users. We do run gzip, which cuts it down to 25-30% before we keep it 
around for the longer term, but doubling this is going to be a migration 
challenge.


If fast-update could be made to work in an environment where we have 
both users searching the index and manually updating it, and 4+ backend 
processes updating the index concurrently, then it would be a real 
benefit to gain.


The gin index currently contains 70+ million records, with an average 
tsvector of 124 terms.


--
Jesper .. trying to add some real-world info.










--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] issue with gininsert under very high load

2014-02-13 Thread Jesper Krogh

On 14/02/14 00:49, Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

On 2014-02-13 16:15:42 -0500, Tom Lane wrote:

Something like the attached?  Can somebody who's seen this problem confirm
this improves matters?

Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
I think we would at least need to stop using the fast tuple
mechanism during gininsert() if it's already too big and do plain
inserts.

No, because we've already got a process working on cleaning it out.

In any case, this needs some testing to see if it's an improvement
or not.


I have some real-world experience with the fastupdate mechanism. Under 
concurrent load it behaves really badly. Random processes waiting for 
cleanup (or competing with cleanup) are going to see latency spikes, 
because they magically hit that corner. Thus reverting to plain inserts 
when a process cannot add to the pending list will not remove this 
problem, but it will confine it to the process actually doing the cleanup.

The built-in mechanism, where cleanup is a cost paid by the process that 
happened to fill the pending list, is really hard to deal with in 
production. More control would be appreciated, perhaps even an explicit 
flush mechanism. I'd like to batch up inserts during one transaction only 
and flush on commit.

--
Jesper - with fastupdate turned off due to above issues.




Re: [HACKERS] GIN improvements part2: fast scan

2014-02-03 Thread Jesper Krogh

On 03/02/14 02:44, Tomas Vondra wrote:

(2) The question is whether the new patch works fine on rare words. See
 this for comparison of the patches against HEAD:

   http://www.fuzzy.cz/tmp/gin/3-rare-words.png
   http://www.fuzzy.cz/tmp/gin/3-rare-words-new.png

 and this is the comparison of the two patches:

   http://www.fuzzy.cz/tmp/gin/patches-rare-words.png

 That seems fine to me - some queries are slower, but we're talking
 about queries taking 1 or 2 ms, so the measurement error is probably
 the main cause of the differences.

(3) With higher numbers of frequent words, the differences (vs. HEAD or
 the previous patch) are not as dramatic as in (1) - the new patch
 is consistently ~20% faster.

Just thinking: this is about one algorithm being better for frequent words
and another algorithm being better at rare words... We do have
this information (at least for tsvector) in the statistics. Would
it be possible to just call the consistent function more often if the
statistics give signs that a word actually is frequent?
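As an aside, the per-lexeme frequencies that ANALYZE gathers for a tsvector column can be inspected directly; a hedged sketch, where the table and column names are illustrative and the element-statistics columns are only exposed in the pg_stats view on newer releases:

```sql
-- Illustrative only: look up the most common lexemes and their frequencies
-- for a hypothetical documents.body_tsv column.
SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'documents'
  AND attname = 'body_tsv';
```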

Jesper - heavily dependent on tsvector-searches, with both frequent and 
rare words.






Re: [HACKERS] Compression of full-page-writes

2013-10-13 Thread Jesper Krogh

On 11/10/13 19:06, Andres Freund wrote:

On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:

I think it will be difficult to prove, for any compression
algorithm, that it compresses in most scenarios.
In many cases it can happen that the WAL will not be reduced
and tps can also come down if the data is non-compressible, because
any compression algorithm will have to try to compress the data and it
will burn some cpu for that, which in turn will reduce tps.

Then those concepts maybe aren't such a good idea after all. Storing
lots of compressible data in an uncompressed fashion isn't an all that
common usecase. I most certainly don't want postgres to optimize for
blank padded data, especially if it can hurt other scenarios. Just not
enough benefit.
That said, I actually have relatively high hopes for compressing full
page writes. There often enough is lot of repetitiveness between rows on
the same page that it should be useful outside of such strange
scenarios. But maybe pglz is just not a good fit for this, it really
isn't a very good algorithm in this day and age.


Hm. There is a clear benefit for compressible data and clearly
no benefit for incompressible data.

How about letting autovacuum taste the compressibility of
pages on a per relation/index basis and set a flag that triggers
this functionality where it provides a benefit?

That's not hugely more magical than figuring out whether the data ends up
in the heap or in a toast table, as it does now.
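The tasting idea can be sketched outside the server; a minimal illustration of the sampling approach, not PostgreSQL code, and the 25% threshold and sampling stride are arbitrary assumptions:

```python
import os
import zlib

def savings(page: bytes) -> float:
    """Fraction of space saved by compressing one 8kB page at a fast level."""
    return 1.0 - len(zlib.compress(page, 1)) / len(page)

def worth_compressing(pages, threshold=0.25, sample_every=16):
    """Taste every Nth page, as an autovacuum-style sweep might, and vote."""
    sampled = pages[::sample_every]
    return sum(savings(p) for p in sampled) / len(sampled) >= threshold

# Repetitive rows compress well; random bytes do not.
repetitive = [b"2013-10-13|jesper|full-page-write|" * 241 for _ in range(64)]
incompressible = [os.urandom(8192) for _ in range(64)]
print(worth_compressing(repetitive), worth_compressing(incompressible))  # True False
```

The same one-pass taste could set or clear a per-relation flag, which is all the planner-side machinery would need to consult.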

--
Jesper




Re: [HACKERS] [9.4 CF] Free VMs for Reviewers Testers

2013-07-09 Thread Jesper Krogh
 
 The really, really big ones are useful even for pushing limits, such
 as cr1.8xlarge, with 32 CPUs and 244GiB memory.  Current spot instance
 price (the heavily discounted "can die at any time" one) is $0.343/hr.
 Otherwise, it's $3.500/hr.
 

Just keep in mind that the CPUs are similarly throttled:

One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 
Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 
1.7 GHz Xeon processor referenced in our original documentation.

Who knows what that does to memory bandwidth, context switches, etc.

Jesper



Re: [HACKERS] Regarding GIN Fast Update Technique

2013-06-07 Thread Jesper Krogh

On 07/06/13 16:39, Tom Lane wrote:

Amit Langote amitlangot...@gmail.com writes:

Okay, aside from that case, what else would move those to the main
structure? They (the entries in the unsorted pending list) are in the
local memory (work_mem?) of the backend, right?

No.  If they were, it wouldn't be crash-safe.

That's how it is, but if we could push in wishes, then I would
wish that it wouldn't be crash-safe, and instead were flushed by the
backend's commit. The way it currently operates, a random backend
pays the penalty of other backends' pushes to the pending list, and all
queries pay the penalty of searching the pending list.

If the pending list were backend-only, it would not have to be searched
by all queries, since commit would need to flush it, and random backends
wouldn't be penalized. We'd still have the benefit of batching up
gin inserts over multiple changes to the index done within the same
transaction.
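For context, in these releases the shared pending list grows until it reaches work_mem, which is the only knob for how much debt can accumulate before some backend pays it; a hedged sketch (gin_pending_list_limit did not exist yet):

```sql
-- Larger work_mem => bigger pending list => rarer but more painful cleanups;
-- smaller work_mem => more frequent, cheaper cleanups. Value is illustrative.
SET work_mem = '16MB';
```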

Jesper



Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Jesper Krogh

On 12/11/12 05:55, Greg Smith wrote:
The only guarantee I see that we can give for online upgrades is that 
after a VACUUM CHECKSUM sweep is done, and every page is known to both 
have a valid checksum on it and have its checksum bits set, *then* any 
page that doesn't have both set bits and a matching checksum is 
garbage.  Until reaching that point, any old data is suspect.  The 
idea of operating in a "we'll convert on write but never convert old 
pages" mode can't come up with any useful guarantees about data integrity 
that I can see.  As you say, you don't ever gain the ability to tell 
pages that were checksummed but have since been corrupted from ones 
that were corrupt all along in that path.


You're right about that, but I'd just like some rough guard against 
hardware/OS-related data corruption, and that is more likely to hit 
data blocks constantly flying in and out of the system. I'm currently 
running a +2TB database, and the capability to just see some kind of 
corruption earlier rather than later is a major benefit by itself. 
Currently corruption can go undetected if it just happens to hit 
data-only parts of the database.

But I totally agree that the scheme described, with integration into an 
autovacuum process, would be very close to ideal, even on a database 
like the one I'm running.

--
Jesper






Re: [HACKERS] Enabling Checksums

2012-11-08 Thread Jesper Krogh

On 09/11/12 02:01, Jeff Davis wrote:

As I understand it, the main part of the remaining work to be done for
the checksums patch (at least the first commit) is to have a better way
to enable/disable them.

For the sake of simplicity (implementation as well as usability), it
seems like there is agreement that checksums should be enabled or
disabled for the entire instance, not per-table.

I can definitely see that simplicity is an argument here, but
I can easily imagine that some performance-hungry users
would prefer to be able to disable the functionality on a
per-table level: UNCHECKSUMMED TABLES (similar to UNLOGGED TABLES).

I would definitely put our system in state = 2 in your
description if it were available.

--
Jesper




Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-25 Thread Jesper Krogh

 Naturally, there are other compression and delta encoding schemes.  Does
 anyone feel the need to explore further alternatives?
 
 We might eventually find the need for multiple, user-selectable, WAL
 compression strategies.  I don't recommend taking that step yet.
 

My currently implemented compression strategy is to run the WAL segment through 
gzip in the archive command. It compresses pretty nicely and achieves 50%+ in my 
workload (generally closer to 70%).

On a multi-core system it will take more cpu time, but on a different core, and 
not have any effect on tps.

General compression should probably only be applied if it has a positive gain on 
tps.
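A minimal sketch of that strategy, runnable outside PostgreSQL; the paths and the zero-filled fake segment are illustrative assumptions, not my actual setup:

```shell
# Simulate a 16MB WAL segment; in a real archive_command, %p points at one.
seg=/tmp/000000010000000000000001
dd if=/dev/zero of="$seg" bs=1M count=16 2>/dev/null

# The archive-side equivalent of running gzip in archive_command:
mkdir -p /tmp/wal-archive
test ! -f /tmp/wal-archive/$(basename "$seg").gz && \
    gzip --fast < "$seg" > /tmp/wal-archive/$(basename "$seg").gz

# Compare sizes; zero-filled pages compress far better than typical WAL.
ls -l "$seg" /tmp/wal-archive/$(basename "$seg").gz
```

Since the compression runs in the archiver process, the backends generating WAL never wait on it.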

Jesper






Re: [HACKERS] Inserting heap tuples in bulk in COPY

2012-08-09 Thread Jesper Krogh

On 08/08/12 21:34, Robert Haas wrote:

I think we need to implement buffering both to end of statement or end
of transaction, not just one or the other.

Another (not necessarily better) idea is to use a buffer that's part
of the index, like the GIN fastupdate stuff, so that there's no
particular constraint on when the buffer has to be flushed, but
competing index scans may be slower until it is.

Whether it is an implementation artifact or a result of this
approach I don't know. But currently, when the GIN fastupdate
code finally decides to flush the buffer, it is going to stall all
other processes doing updates while doing it. If you only have
one update process then this doesn't matter. But if you're trying to get
user-interactive updates to flow in with batch updates from
background processes, then you'd better kill off this feature,
since you're guaranteed that the user-interactive process is
either going to flush the buffer or wait on someone else doing
it.

I haven't done the benchmarking, but I'm actually fairly sure that
fastupdate isn't faster overall if you bump concurrency slightly and run on
memory- or SSD-based backends, due to this cross-backend contention
on the buffer.

A buffer that is backend-local, so you can use transactions to
batch up changes, would get around this, but that may have another
huge set of consequences I don't know of.

... based on my own real-world experience with this feature.



Re: [HACKERS] Qual evaluation cost estimates for GIN indexes

2012-02-16 Thread Jesper Krogh

Hi.

First, thanks for looking at this. Apart from GIN indexes and
full-text-search being really good in our applications, this also
points to the exact places where they can be improved.

On 2012-02-17 00:15, Tom Lane wrote:

I looked into the complaint here of poor estimation for GIN indexscans:
http://archives.postgresql.org/pgsql-performance/2012-02/msg00028.php


I think this is the exact same issue:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01754.php


At first glance it sounds like a mistake in selectivity estimation,
but it isn't: the rowcount estimates are pretty nearly dead on.
The problem is in the planner's estimate of the cost of executing the
@@ operator.  We have pg_proc.procost set to 1 for ts_match_vq, but
actually it's a good deal more expensive than that.  Some
experimentation suggests that @@ might be about 500 times as expensive
as a simple integer comparison.  I don't propose pushing its procost
up that much, but surely at least 10 would be appropriate, maybe even
100.

However ... if you just alter pg_proc.procost in Marc's example, the
planner *still* picks a seqscan, even though its estimate of the seqscan
cost surely does go up.  The reason is that its estimate of the GIN
indexscan cost goes up just as much, since we charge one qual eval cost
per returned tuple in gincostestimate.  It is easy to tell from the
actual runtimes that that is not what's happening in a GIN indexscan;
we are not re-executing the @@ operator for every tuple.  But the
planner's cost model doesn't know that.


There is something about lossy vs. non-lossy here: if the index result
is lossy, then it would need to execute the @@ operator
on each tuple, de-toast the toasted stuff, and go all the way.

If it isn't, then at least count() on a gin index should be able to
utilize an index-only scan now?

I've had a significant amount of struggle over the years in this
corner, and the patch that went in for gincostestimate brought
a huge set of problems to the ground, but not all.

Other related threads:
http://archives.postgresql.org/pgsql-performance/2010-05/msg00031.php
(ts_match_vq cost in discussion)
http://archives.postgresql.org/pgsql-performance/2010-05/msg00266.php

I don't think I have ever seen the actual run-time of any @@ query
be faster going through the seqscan than going through the index. Not
even if it is pulling nearly all the tuples out.

(test-case that tries to go in that corner).
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php

And I think it is due to a couple of real-world things:
1) The tsvector column is typically toasted.
2) The selected columns are typically in the main table.
3) The gin-index search + pulling the main table is in
fact a measurably cheaper operation than pulling main+toast,
uncompressing toast and applying ts_match_vq, even in the most
favourable case for the seqscan.

Another real-world thing is that since the tsvector column is in toast
and isn't read when performing a bitmap heap scan, in addition
to the decompress cost it is almost never hot in memory either,
causing its actual runtime to be even worse.

The same problems hit an index scan on another key when filtering
on a @@ operator, but I think I got around most of them by bumping
both the cost of @@ and the limit in the query to 10K instead of the 200
actually wanted.
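The cost bump mentioned can be expressed directly; a hedged sketch, with the value 100 following Tom's suggestion above rather than any measured figure:

```sql
-- Make the planner treat @@ (backed by ts_match_vq) as roughly 100x the
-- cost of a simple comparison, instead of the default procost of 1.
ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;
```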

I do think I have been digging sufficiently in this corner and can
fairly easily test and craft test examples that will demonstrate
the challenges (a few are attached in the above links).

Thanks for digging in this corner. Let me know if I can help, although
my actual coding skills are sparse (at best).

--
Jesper



Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-11 Thread Jesper Krogh

On 2012-02-09 22:17, Jesper Krogh wrote:

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right; in my head, everything is wired up against my primary
keys, and of course that isn't the case for the DB. Sorry for the noise.


Ok, but there are still cases where we don't even need to construct
a data tuple at all:

2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable 
where fts @@ to_tsquery('english','test1');

                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=31.24..31.25 rows=1 width=0)
   ->  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         ->  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
               Index Cond: (fts @@ '''test1'''::tsquery)
(5 rows)

Another idea that sprang into my head: indices on (ctid, some mix of 
columns) could actually serve as some kind of vertical partitioning of 
the table.

Whether it will actually be more efficient or not needs to be tested.

Jesper



Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 18:02, Robert Haas wrote:

I don't have any appetite for trying to do anything more with
index-only scans for 9.2, though maybe someone else will think
otherwise.  But I would like very much to get KaiGai's leakproof stuff
committed, and so it seems like a good idea to reconcile the needs of
that machinery with what might eventually be needed here.

Those were a couple of nice cases where index-only scans
could win more than they do today. I have another one here:

2012-02-09 19:17:28.788 jk=# \d testtable
                          Table "public.testtable"
 Column |   Type   |                       Modifiers
--------+----------+--------------------------------------------------------
 id     | integer  | not null default nextval('testtable_id_seq'::regclass)
 fts    | tsvector |
Indexes:
    "prk_idx" UNIQUE, btree (id)
    "fts_id" gin (fts)

2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts 
@@ to_tsquery('english','test1000');

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on testtable  (cost=20.29..161.28 rows=37 width=4)
   Recheck Cond: (fts @@ '''test1000'''::tsquery)
   ->  Bitmap Index Scan on fts_id  (cost=0.00..20.28 rows=37 width=0)
         Index Cond: (fts @@ '''test1000'''::tsquery)
(4 rows)

Time: 0.494 ms
2012-02-09 19:19:52.748 jk=#

In this situation the tuple can be regenerated from the index, but
not from the index satisfying the where clause. This allows significantly
more complex where clauses, and may also benefit situations where
we are only going for one or more of the primary-key/foreign-key columns
for join conditions.

The above situation does not need to involve a gin index: a btree index
where the where clause can be matched up using one index, and the tuple
constructed using another, falls into the same category.


--
Jesper



Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right; in my head, everything is wired up against my primary
keys, and of course that isn't the case for the DB. Sorry for the noise.

--
Jesper



Re: [HACKERS] Group commit, revised

2012-01-29 Thread Jesper Krogh

On 2012-01-29 01:48, Jeff Janes wrote:

I ran three modes, head, head with commit_delay, and the group_commit patch

shared_buffers = 600MB
wal_sync_method=fsync

optionally with:
commit_delay=5
commit_siblings=1

pgbench -i -s40

for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients

ran 5 times each, taking maximum tps from the repeat runs.

The results are impressive.

clients  head   head_commit_delay   group_commit
1        23.9   23.0                23.9
5        31.0   51.3                59.9
10       35.0   56.5                95.7
15       35.6   64.9                136.4
20       34.3   68.7                159.3
25       36.5   64.1                168.8
30       37.2   83.8                71.5

I haven't inspected that deep fall off at 30 clients for the patch.

By way of reference, if I turn off synchronous commit, I get
tps=1245.8, which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.

If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?


Adding to the list of tests that aren't exactly real-world systems, I 
decided to repeat Jeff's tests on an Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
with 4GB of memory and an Intel X25-M 160GB SSD drive underneath.

BaselineCommitdelay Group commit
1   1168.67 1233.33 1212.67
5   2611.33 3022.00 2647.67
10  3044.67 .33 3296.33
15  3153.33 3177.00 3456.67
20  3087.33 3126.33 3618.67
25  2715.00 2359.00 3309.33
30  2736.33 2831.67 2737.67


Numbers are average over 3 runs.

I have set checkpoint_segments to 30; otherwise same configuration as 
Jeff.

Attached is a graph.

A nice conclusion is: group commit outperforms baseline in all runs (on 
this system).

My purpose was actually more to try to quantify the difference between a 
single SSD and a single rotating disk.

--
Jesper
attachment: pgbench.png


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Jesper Krogh

On 2011-12-22 09:42, Florian Weimer wrote:

* David Fetter:


The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

What are the downsides of having full_page_writes enabled, except for
log volume? The manual mentions something about speed, but it is
a bit unclear where that would come from, since the full pages must
be somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:
archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'

It brings along somewhere between 50 and 75% reduction in log volume,
with no cost on the production system (since gzip just occupies one of the
many cores on the system), and can easily keep up even during
quite heavy writes.

Recovery is a bit more tricky, because hooking gunzip into the command there
will cause the system to run a replay log, gunzip, read data, replay log
cycle, where the gunzip could easily be done on the other logfiles while
replay is being done on one.

So a straightforward recovery will cost in recovery time, but that can 
be dealt with.
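On the recovery side, the matching decompression hook would be a restore_command along these lines; a sketch assuming the same /data/wal layout, and it decompresses serially, which is exactly the recovery-time cost described above:

```
restore_command = 'gunzip < /data/wal/%f.gz > %p'
```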


Jesper


Re: [HACKERS] Page Checksums

2011-12-20 Thread Jesper Krogh

On 2011-12-20 18:44, Simon Riggs wrote:

On Mon, Dec 19, 2011 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote:


The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I've had another look at this just to make sure.

Doing this for 9.2 will change the page format, causing every user to
do an unload/reload, with no provided mechanism to do that, whether or
not they use this feature.


How about only calculating the checksum and setting it in the bgwriter, 
just before flushing the buffer off to disk?

Perhaps even let autovacuum do the same if it flushes pages to disk as 
part of the process.

If someone comes along and sets a hint bit, changes data, etc., its only 
job is to clear the checksum to a value meaning we don't have a checksum 
for this page.

Unless the bgwriter becomes bottlenecked by doing it, the impact on 
foreground work should be fairly limited.


Jesper .. just throwing in random thoughts ..


Re: [HACKERS] Page Checksums

2011-12-20 Thread Jesper Krogh

On 2011-12-19 02:55, Greg Stark wrote:

On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh jes...@krogh.cc wrote:

I don't know if it would be seen as a half-baked feature or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just skipping the hint bits entirely.

That was one approach discussed. The problem is that the hint bits are
currently in each heap tuple header which means the checksum code
would have to know a fair bit about the structure of the page format.
Also the closer people looked the more hint bits kept turning up
because the coding pattern had been copied to other places (the page
header has one, and index pointers have a hint bit indicating that the
target tuple is deleted, etc). And to make matters worse skipping
individual bits in varying places quickly becomes a big consumer of
cpu time since it means injecting logic into each iteration of the
checksum loop to mask out the bits.

I do know it is a valid and really relevant point (the cpu time spent),
but here in late 2011 it is a really irritating limitation, since if
there is any resource I have plenty of in the production environment,
then it is cpu time, just not on the single core currently serving the 
client.
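Greg's point about masking can be illustrated with a toy checksum; a sketch under the assumption that hint bits live at fixed byte offsets, which the real page layout does not grant (the offsets below are made up, and finding the real ones is exactly the complication he describes):

```python
import zlib

PAGE_SIZE = 8192
# Hypothetical offsets of bytes containing hint bits; the real layout would
# require parsing tuple headers on every page.
HINT_BYTES = frozenset(range(24, PAGE_SIZE, 128))

def checksum_plain(page: bytes) -> int:
    # One C-speed pass over the whole page.
    return zlib.crc32(page)

def checksum_masked(page: bytes) -> int:
    # Per-byte masking injects logic into the inner loop, which is the
    # CPU cost of skipping hint bits.
    masked = bytes(0 if i in HINT_BYTES else b for i, b in enumerate(page))
    return zlib.crc32(masked)

page = bytearray(PAGE_SIZE)
before = checksum_masked(bytes(page))
page[24] |= 0x01          # flip a "hint bit"
after = checksum_masked(bytes(page))
print(before == after)    # True: masked checksum is stable across hint-bit churn
```

The masked variant stays valid when hint bits churn, but only by paying per-byte work that the plain one avoids.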


Jesper


Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Jesper Krogh

On 2011-12-19 18:08, Marti Raudsepp wrote:

The query uses the gin index and the tsearch ranking function
ts_rank_cd.

Another thought -- have you read about the GIN fast updates feature?
This existed in 9.0 too. Instead of updating the index directly, GIN
appends all changes to a sequential list, which needs to be scanned in
whole for read queries. The periodic autovacuum process has to merge
these values back into the index.

Maybe the solution is to tune autovacuum to run more often on the table.

http://www.postgresql.org/docs/9.1/static/gin-implementation.html


I have to say that I consistently have to turn fastupdate off for
our heavily updated gin indexes. The overall performance gain
may be measurable, but it is not intolerable without it. The spikes seen
from the applications when cleanup happens, either in the foreground
or in the background, are not tolerable (multiple seconds).

I may just not have experimented sufficiently with the various sizes of 
work_mem, but I would indeed love to have a connection-only fastupdate, 
so within a single transaction it could use the fastupdate technique, 
but not stuff index updates onto unrelated queries at random.
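For reference, turning it off is a per-index storage parameter; a sketch with an illustrative index name (the connection-local variant wished for above does not exist):

```sql
-- Disable the pending list for one heavily updated GIN index.
ALTER INDEX body_fts_idx SET (fastupdate = off);

-- Or at creation time:
-- CREATE INDEX body_fts_idx ON documents USING gin (body_tsv)
--     WITH (fastupdate = off);
```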

Jesper


Re: [HACKERS] Page Checksums

2011-12-18 Thread Jesper Krogh

On 2011-12-18 11:19, Heikki Linnakangas wrote:

The patch requires that full page writes be on in order to obviate
this problem by never reading a torn page.


Doesn't help. Hint bit updates are not WAL-logged.


I don't know if it would be seen as a half-baked feature or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just skipping the hint bits entirely.

It would still provide checksumming for the majority of the data sitting
underneath the system, and would still be extremely useful in my
eyes.

Jesper


Re: [HACKERS] SP-GiST versus index-only scans

2011-12-14 Thread Jesper Krogh

On 2011-12-14 19:00, Tom Lane wrote:

So the problem is that we have to either disallow such opclass designs,
or support per-opclass rather than per-index-AM decisions about whether
index-only scans are possible.


Just a quick comment: for some queries, like the famous
select count(*) from table where column @@ to_tsquery('something');
I do think index-only scans make sense even on indices
where the original tuple cannot be re-constructed. This also goes
for gin indices.

.. and yes, I do have a real-world application that would utilize this
(and love it).

Jesper


Re: [HACKERS] SP-GiST versus index-only scans

2011-12-14 Thread Jesper Krogh

On 2011-12-14 19:48, Tom Lane wrote:

Jesper Krogh jes...@krogh.cc writes:

On 2011-12-14 19:00, Tom Lane wrote:

So the problem is that we have to either disallow such opclass designs,
or support per-opclass rather than per-index-AM decisions about whether
index-only scans are possible.

Just a quick comment, for some queries like the famous
select count(*) from table where column @@ to_tsquery('something');
I do think index-only-scans does make sense even on indices
where the original tuple cannot be re-constructed. This also goes
for gin indices as well.

I think this is somewhat wishful thinking unfortunately.  The difficulty
is that if the index isn't capable of reconstructing the original value,
then it's probably giving only an approximate (lossy) answer, which
means we'll have to visit the heap to recheck each result, which
pretty much defeats the purpose of an index-only scan.  So I can't get
excited about contorting things to support this.


I can see that it is hard to generalize, but in the tsvector case we
are indeed not capable of reconstructing the row, since the
positions are not stored in the index; still, the actual lookup is not
lossy, and I'm fairly sure (based on experience) that PG doesn't
revisit heap tuples for rechecking (only for visibility).

--
Jesper



Re: [HACKERS] Is anybody actually using XLR_BKP_REMOVABLE?

2011-12-12 Thread Jesper Krogh
 
 
 So: is there actually any such compression program out there?
 Would anybody really cry if this flag went away?

Perhaps http://pglesslog.projects.postgresql.org/

Jesper



[HACKERS] Accounting for toast in query planner. (gin/gist indexes).

2011-11-30 Thread Jesper Krogh

Hi list.

I have currently hit a problem which I dug into to find the cause of: in
particular, searching in GIN indexes seems in some situations to
unfairly favor sequential scans.

Googling a bit I found this page:
http://postgis.refractions.net/docs/ch06.html#id2635817

It describes the exact problem.

It seems to have been discussed back in the pre-8.1 days and wasn't
solved then; is there a chance someone may address it in 9.2?

http://archives.postgresql.org/pgsql-performance/2005-02/msg00041.php

Would you call it a hard task, or could a fairly naive C coder with
a fair amount of PG experience approach it?

Test-dataset can be created with:

CREATE TABLE ftstest (id serial unique, fts tsvector);

DO
$$DECLARE r RECORD;
BEGIN
FOR r IN SELECT generate_series(1,5000)
LOOP
  INSERT INTO ftstest(fts)
    (SELECT strip(to_tsvector('english', string_agg(test, ' ')))
       FROM (SELECT 'test' || generate_series(1, (random()*1)::int) AS test) AS foo);
END LOOP;
END;
$$;

CREATE INDEX ON ftstest USING gin(fts);
ANALYZE;



2011-11-30 21:13:30.302 jktest=# explain ( buffers on, analyze on )
select count(id) from ftstest where fts @@ to_tsquery('english','test500');

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=122.37..122.38 rows=1 width=4) (actual time=1114.096..1114.097 rows=1 loops=1)
   Buffers: shared hit=13384 read=24445 written=3002
   ->  Seq Scan on ftstest  (cost=0.00..110.50 rows=4748 width=4) (actual time=0.567..1112.447 rows=4748 loops=1)
         Filter: (fts @@ '''test500'''::tsquery)
         Rows Removed by Filter: 252
         Buffers: shared hit=13384 read=24445 written=3002
 Total runtime: 1114.134 ms
(7 rows)

Time: 1114.945 ms
2011-11-30 21:14:30.382 jktest=# set enable_seqscan to off;
SET
Time: 0.132 ms
2011-11-30 21:14:50.965 jktest=# explain ( buffers on, analyze on )
select count(id) from ftstest where fts @@ to_tsquery('english','test500');

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=184.02..184.03 rows=1 width=4) (actual time=2.502..2.502 rows=1 loops=1)
   Buffers: shared hit=1 read=56 written=3
   ->  Bitmap Heap Scan on ftstest  (cost=64.80..172.15 rows=4748 width=4) (actual time=1.160..1.989 rows=4748 loops=1)
         Recheck Cond: (fts @@ '''test500'''::tsquery)
         Buffers: shared hit=1 read=56 written=3
         ->  Bitmap Index Scan on ftstest_fts_idx  (cost=0.00..63.61 rows=4748 width=0) (actual time=1.137..1.137 rows=4748 loops=1)
               Index Cond: (fts @@ '''test500'''::tsquery)
               Buffers: shared hit=1 read=8
 Total runtime: 2.529 ms
(9 rows)

Time: 3.016 ms


--
Jesper



[HACKERS] ginfastupdate.. slow

2011-09-15 Thread Jesper Krogh

Hi List.

This is just an observation; I'll try to reproduce it in a test set later.

I've been trying to performance-tune a database system which does
a lot of updates on GIN indexes. I currently have 24 workers running,
executing quite CPU-intensive stored procedures that help generate
the body for the GIN index (full-text search).

The system is all memory resident for the data that gets computed on
and there is a 1GB BBWC before data hits the disk-system. The iowait
is 5-10% while running.

The system is nearly twice as fast with fastupdate=off as with 
fastupdate=on.

Benchmark done on a 9.0.latest

System AMD Opteron, 4x12 cores @ 2.2ghz, 128GB memory.

It is probably not as surprising as it may seem, since fastupdate is
about batching changes up in a queue for later processing, but when
that later processing happens, concurrency seems to stop.

Is it worth a documentation comment?

--
Jesper



Re: [HACKERS] tsvector concatenation - backend crash

2011-08-27 Thread Jesper Krogh

On 2011-08-26 23:02, Tom Lane wrote:

AFAICT this is a red herring: the bug exists all the way back to where
tsvector_concat was added, in 8.3.  I think the reason that your test
case happens to not crash before this commit is that it changed the sort
ordering rules for lexemes.  As you can see from my minimal example
above, we might need different numbers of pad bytes depending on how the
lexemes sort relative to each other.

Anyway, patch is committed; thanks for the report!

I've just confirmed the fix.. thanks for your prompt action.

--
Jesper




[HACKERS] tsvector concatenation - backend crash

2011-08-25 Thread Jesper Krogh

Hi

The attached SQL files give (at least in my hands) a reliable backend crash
with the stacktrace below, reproduced on both 9.0.4 and HEAD. I'm sorry
I cannot provide a more trimmed-down set of vectors that reproduces the
bug, hence the obfuscated dataset. But even deleting single terms in the
vectors makes the bug go away.

*** glibc detected *** postgres: jk jk [local] SELECT: corrupted 
double-linked list: 0x02279f80 ***

=== Backtrace: =
/lib/libc.so.6(+0x775b6)[0x7fe4db4b25b6]
/lib/libc.so.6(+0x7aa25)[0x7fe4db4b5a25]
/lib/libc.so.6(cfree+0x73)[0x7fe4db4b8e83]
postgres: jk jk [local] SELECT[0x710de5]
postgres: jk jk [local] SELECT(MemoryContextReset+0x2a)[0x71119a]
postgres: jk jk [local] SELECT(ExecScan+0x4a)[0x57887a]
postgres: jk jk [local] SELECT(ExecProcNode+0x238)[0x571708]
postgres: jk jk [local] SELECT(standard_ExecutorRun+0xd2)[0x5705e2]
postgres: jk jk [local] SELECT[0x63c627]
postgres: jk jk [local] SELECT(PortalRun+0x248)[0x63d948]
postgres: jk jk [local] SELECT[0x639fdb]
postgres: jk jk [local] SELECT(PostgresMain+0x547)[0x63af97]
postgres: jk jk [local] SELECT[0x5fb959]
postgres: jk jk [local] SELECT(PostmasterMain+0xa97)[0x5fe137]
postgres: jk jk [local] SELECT(main+0x490)[0x59f4d0]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7fe4db459c4d]
postgres: jk jk [local] SELECT[0x45d569]
=== Memory map: 
0040-008d6000 r-xp  08:01 4071141
/tmp/pgsql/bin/postgres
00ad5000-00ad6000 r--p 004d5000 08:01 4071141
/tmp/pgsql/bin/postgres
00ad6000-00ae2000 rw-p 004d6000 08:01 4071141
/tmp/pgsql/bin/postgres

00ae2000-00b43000 rw-p  00:00 0
0215d000-0227e000 rw-p  00:00 0  
[heap]

7fe4d400-7fe4d4021000 rw-p  00:00 0
7fe4d4021000-7fe4d800 ---p  00:00 0
7fe4d908f000-7fe4d90a5000 r-xp  08:01 4194383
/lib/libgcc_s.so.1
7fe4d90a5000-7fe4d92a4000 ---p 00016000 08:01 4194383
/lib/libgcc_s.so.1
7fe4d92a4000-7fe4d92a5000 r--p 00015000 08:01 4194383
/lib/libgcc_s.so.1
7fe4d92a5000-7fe4d92a6000 rw-p 00016000 08:01 4194383
/lib/libgcc_s.so.1

7fe4d92c1000-7fe4d9342000 rw-p  00:00 0
7fe4d9342000-7fe4db22e000 rw-s  00:04 8716337
/SYSV0052ea91 (deleted)
7fe4db22e000-7fe4db23a000 r-xp  08:01 4194415
/lib/libnss_files-2.11.1.so
7fe4db23a000-7fe4db439000 ---p c000 08:01 4194415
/lib/libnss_files-2.11.1.so
7fe4db439000-7fe4db43a000 r--p b000 08:01 4194415
/lib/libnss_files-2.11.1.so
7fe4db43a000-7fe4db43b000 rw-p c000 08:01 4194415
/lib/libnss_files-2.11.1.so
7fe4db43b000-7fe4db5b5000 r-xp  08:01 4194349
/lib/libc-2.11.1.so
7fe4db5b5000-7fe4db7b4000 ---p 0017a000 08:01 4194349
/lib/libc-2.11.1.so
7fe4db7b4000-7fe4db7b8000 r--p 00179000 08:01 4194349
/lib/libc-2.11.1.so
7fe4db7b8000-7fe4db7b9000 rw-p 0017d000 08:01 4194349
/lib/libc-2.11.1.so

7fe4db7b9000-7fe4db7be000 rw-p  00:00 0
7fe4db7be000-7fe4db84 r-xp  08:01 4194398
/lib/libm-2.11.1.so
7fe4db84-7fe4dba3f000 ---p 00082000 08:01 4194398
/lib/libm-2.11.1.so
7fe4dba3f000-7fe4dba4 r--p 00081000 08:01 4194398
/lib/libm-2.11.1.so
7fe4dba4-7fe4dba41000 rw-p 00082000 08:01 4194398
/lib/libm-2.11.1.so
7fe4dba41000-7fe4dba43000 r-xp  08:01 4194363
/lib/libdl-2.11.1.so
7fe4dba43000-7fe4dbc43000 ---p 2000 08:01 4194363
/lib/libdl-2.11.1.so
7fe4dbc43000-7fe4dbc44000 r--p 2000 08:01 4194363
/lib/libdl-2.11.1.so
7fe4dbc44000-7fe4dbc45000 rw-p 3000 08:01 4194363
/lib/libdl-2.11.1.so
7fe4dbc45000-7fe4dbc65000 r-xp  08:01 4194325
/lib/ld-2.11.1.so

7fe4dbc85000-7fe4dbce7000 rw-p  00:00 0
7fe4dbce7000-7fe4dbd26000 r--p  08:01 5512971
/usr/lib/locale/en_DK.utf8/LC_CTYPE
7fe4dbd26000-7fe4dbe44000 r--p  08:01 5512650
/usr/lib/locale/en_DK.utf8/LC_COLLATE

7fe4dbe44000-7fe4dbe47000 rw-p  00:00 0
7fe4dbe58000-7fe4dbe59000 r--p  08:01 5515083
/usr/lib/locale/en_DK.utf8/LC_TIME
7fe4dbe59000-7fe4dbe5a000 r--p  08:01 5515084
/usr/lib/locale/en_DK.utf8/LC_MONETARY
7fe4dbe5a000-7fe4dbe5b000 r--p  08:01 5640299
/usr/lib/locale/en_DK.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7fe4dbe5b000-7fe4dbe62000 r--s  08:01 5511621
/usr/lib/gconv/gconv-modules.cache

7fe4dbe62000-7fe4dbe64000 rw-p  00:00 0
7fe4dbe64000-7fe4dbe65000 r--p 0001f000 08:01 4194325
/lib/ld-2.11.1.so

Re: [HACKERS] tsvector concatenation - backend crash

2011-08-25 Thread Jesper Krogh

On 2011-08-26 05:28, Tom Lane wrote:

Jesper Krogh jes...@krogh.cc writes:

Attached SQL files gives (at least in my hands) a reliable backend crash
with this stacktrace .. reproduced on both 9.0.4 and HEAD. I'm sorry
I cannot provide a more trimmed down set of vectors the reproduces the
bug, thus
the obsfucated dataset. But even deleting single terms in the vectors make
the bug go away.

Hm ... I can reproduce this on one of my usual machines, but not
another.  What platform are you on exactly?

64 bit Ubuntu Lucid (amd64).

--
Jesper



Re: [HACKERS] mosbench revisited

2011-08-08 Thread Jesper Krogh

On 2011-08-08 15:29, Robert Haas wrote:

On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

Robert Haas robertmh...@gmail.com writes:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read-write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode-i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow having the same sort of mutex contention that's killing the
kernel in this workload.

What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.

Pre-extending in bigger chunks has other benefits
as well, since it helps the filesystem (if it supports extents) get
the data of the relation laid out in sequential order on disk.

On a well-filled relation, running filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as much as 128 extents.

I would suggest 5% of the current relation size, or 25-100MB, whichever
is smallest. That would still keep the size down on small relations.
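One reading of that policy can be sketched as follows (Python; the 100MB cap is my interpretation of the "25-100MB" range, so treat the constants as placeholders rather than a concrete proposal):

```python
MB = 1024 * 1024

def extension_size(rel_bytes: int, cap: int = 100 * MB) -> int:
    """Grow by 5% of the current relation size, but never more than `cap`."""
    return min(rel_bytes * 5 // 100, cap)

# A 100MB relation grows by 5MB; a 10GB relation is capped at 100MB.
assert extension_size(100 * MB) == 5 * MB
assert extension_size(10 * 1024 * MB) == 100 * MB
```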

--
Jesper




Re: [HACKERS] WIP: Fast GiST index build

2011-06-25 Thread Jesper Krogh

On 2011-06-06 09:42, Heikki Linnakangas wrote:
took about 15 hours without the patch, and 2 hours with it. That's 
quite dramatic.


With the presence of robust consumer-class SSD drives, available in
sizes where they can actually fit many database usage scenarios, a new
PostgreSQL version is not likely to hit the streets before 50% of
PostgreSQL users are sitting on some kind of flash-based storage (for
the cases where the entire dataset doesn't fit in memory any more).
The point is:

* Wouldn't it be natural to measure the performance benefits of
  disk-bound tests in an SSD setup?

... my understanding of the fast gi(n|st) index build is that it is
more or less a challenge to transform a lot of random-I/O workload
into something more sequential, and to collapse multiple changes into
fewer.

In terms of random I/O an SSD can easily be 100x better than rotating
drives, and it would be a shame to optimize against that world.

--
Jesper



Re: [HACKERS] FOR KEY LOCK foreign keys

2011-06-20 Thread Jesper Krogh

On 2011-06-20 22:11, Noah Misch wrote:

On Sun, Jun 19, 2011 at 06:30:41PM +0200, Jesper Krogh wrote:

I hope this hasn't been forgotten. But I cant see it has been committed
or moved
into the commitfest process?

If you're asking about that main patch for $SUBJECT rather than those
isolationtester changes specifically, I can't speak to the plans for it.  I
wasn't planning to move the test suite work forward independent of the core
patch it serves, but we could do that if there's another application.

Yes, I was actually asking about the main patch for foreign key locks.

--
Jesper




Re: [HACKERS] FOR KEY LOCK foreign keys

2011-06-19 Thread Jesper Krogh
I hope this hasn't been forgotten, but I can't see that it has been
committed or moved into the commitfest process?

Jesper


On 2011-03-11 16:51, Noah Misch wrote:

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing.  Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions.  The only thing it needs to be
suitable for this work is a way to handle blocked commands.  If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that.  It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch.  I add three isolation test specs:

   fk-contention: blocking-only test case from your blog post
   fk-deadlock: the deadlocking test case I used during patch review
   fk-deadlock2: Joel Jacobson's deadlocking test case

When a spec permutation would have us run a command in a currently-blocked
session, we cannot implement that permutation.  Such permutations represent
impossible real-world scenarios, anyway.  For now, I just explicitly name the
valid permutations in each spec file.  If the test harness detects this problem,
we abort the current test spec.  It might be nicer to instead cancel all
outstanding queries, issue rollbacks in all sessions, and continue with other
permutations.  I hesitated to do that, because we currently leave all
transaction control in the hands of the test spec.

I only support one waiting command at a time.  As long as one commands continues
to wait, I run other commands to completion synchronously.  This decision has no
impact on the current test specs, which all have two sessions.  It avoided a
touchy policy decision concerning deadlock detection.  If two commands have
blocked, it may be that a third command needs to run before they will unblock,
or it may be that the two commands have formed a deadlock.  We won't know for
sure until deadlock_timeout elapses.  If it's possible to run the next step in
the permutation (i.e., it uses a different session from any blocked command), we
can either do so immediately or wait out the deadlock_timeout first.  The latter
slows the test suite, but it makes the output more natural -- more like what one
would typically after running the commands by hand.  If anyone can think of a
sound general policy, that would be helpful.  For now, I've punted.

With a default postgresql.conf, deadlock_timeout constitutes most of the run
time.  Reduce it to 20ms to accelerate things when running the tests repeatedly.

Since timing dictates which query participating in a deadlock will be chosen for
cancellation, the expected outputs bearing deadlock errors are unstable.  I'm
not sure how much it will come up in practice, so I have not included expected
output variations to address this.

I think this will work on Windows as well as pgbench does, but I haven't
verified that.

Sorry for the delay on this.





Re: [HACKERS] pg_upgrade using appname to lock out other users

2011-06-14 Thread Jesper Krogh

On 2011-06-15 05:01, Bruce Momjian wrote:

You might remember we added a postmaster/postgres -b switch to indicate
binary upgrade mode.  The attached patch prevents any client without an
application_name of 'binary-upgrade' from connecting to the cluster
while it is binary upgrade mode.  This helps prevent unauthorized users
from connecting during the upgrade.  This will not help for clusters
that do not have the -b flag, e.g. pre-9.1.

Does this seem useful?  Something for 9.1 or 9.2?

This idea came from Andrew Dunstan via IRC during a pg_upgrade run by
Stephen Frost when some clients accidentally connected.  (Stephen reran
pg_upgrade successfully.)

Couldn't the -b flag also imply a very strict hba.conf configuration that
essentially only lets pg_upgrade in?

--
Jesper



Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Jesper Krogh

On 2011-05-10 14:48, Robert Haas wrote:

We could avoid all of this complexity - and the possibility of pinning
the visibility map page needlessly - by locking the heap buffer first
and then pinning the visibility map page if the heap page is
all-visible.  However, that would involve holding the lock on the heap
buffer across a possible disk I/O to bring the visibility map page
into memory, which is something the existing code tries pretty hard to
avoid.

Assuming that the visibility map would be used for visibility testing,
just taking the lock would effectively mean we want the page in the
buffers, which would not be that bad?

Or what is the downside of keeping it across I/O? Will it block other
processes trying to read it?

--
Jesper



Re: [HACKERS] the big picture for index-only scans

2011-05-10 Thread Jesper Krogh

On 2011-05-11 01:54, Greg Stark wrote:

To be fair about 3/4 of them were actually complaining about the lack
of some global materialized cache of the aggregate value. Covering
index-only scans are only going to be a linear speedup no matter how
large the factor it's not going to turn select count(*) into a O(1)
operation.

Actually, if we could get count(*) into the situation of a
very thin row today, so that the cost of visibility testing didn't
depend hugely on the width of the row any more, then we would be
halfway there in terms of performance optimization.

If rows typically were just tuple headers + a bit more, then it
would be way harder to go down this road and claim good
benefits. But currently the system needs to drag in almost
one page per visibility test from disk on random access to large tables.

I tried to graph the differences of thin vs. wide rows here:
http://shrek.krogh.cc/~jesper/visibillitytesting.pdf

http://postgresql.1045698.n5.nabble.com/Visibillity-testing-some-numbers-on-current-performance-td4284836.html

Moving visibility testing onto the visibility map would, on large
tables, shift the work from being disk-based, as it is now, to being
memory-based.
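The scale of the difference is easy to ballpark (a Python back-of-the-envelope sketch; 8kB pages and one visibility-map bit per heap page are the only assumptions, and the worst-case figure ignores caching):

```python
PAGE = 8192

def heap_bytes_touched(n_tuples: int) -> int:
    # Worst case: every randomly chosen tuple lives on a different page,
    # so each visibility check drags in a whole page.
    return n_tuples * PAGE

def vismap_bytes(table_pages: int) -> int:
    # The visibility map needs only one bit per heap page.
    return table_pages // 8

# Checking 1M random tuples in a 10M-page (~80GB) table:
assert heap_bytes_touched(1_000_000) == 8_192_000_000   # ~8GB of random reads
assert vismap_bytes(10_000_000) == 1_250_000            # ~1.2MB, easily cached
```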

Jesper (it's not a goal, but it would most likely postpone some
people's need to buy a FusionIO card or similar)
--
Jesper


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Jesper Krogh

On 2011-04-25 20:00, Leonardo Francalanci wrote:

 The amount of data loss on a big table will be 1% of the data
 loss caused by truncating the whole table.

 If that 1% is random (not time/transaction related), usually you'd
 rather have an empty table. In other words: is a table that is not
 consistant with anything else in the db useful?

Depends on the application; if it serves pure caching then it is
fully acceptable, and way better than dropping everything.
better than dropping everything.

--
Jesper



Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Jesper Krogh

On 2011-04-18 11:00, Greg Stark wrote:

On Mon, Apr 18, 2011 at 6:25 AM, Jesper Krogh jes...@krogh.cc wrote:

Getting the value for the first sortkey and carrying on a closure
for the rest would mostly (very often) be optimal ?

Well that might depend. The input data to the function might be much
larger than the output. Consider the, quite common, idiom of:

order by case when (complex expresssion) 1 when (complex expression) 2 else 3


How would that expression be relevant? There is only one sort key and no
limit, so no matter what, it should clearly get the full result set in
all cases.



It would also enable a select that has to sortkeys to utilize an
index that only contains the primary sortkey, which is a huge
negative effect of what's being done today.

This is a separate problem entirely. It would be nice to have a
strategy for ordering that can take advantage of partially ordered
results. It's not hard to see how to do the executor side -- it could
keep a tuplesort for each group and truncate it when the group
changes. As usual the hard part is having the planner figure out
*when* to use it. We have a hard enough time calculating ndistinct for
individual columns -- this would require having an idea of how many
values are present for each major key column.


Yes, as with all other cases it would be hard to get the optimum, but
there are also cases where it is straightforward, say when the secondary
sort column has an ndistinct of -1 (or close to it). The current
standard assumption is that two columns are unrelated; that would also
work here (as well as it does in similar places in PG).

Jesper

--
Jesper



Re: [HACKERS] Evaluation of secondary sort key.

2011-04-17 Thread Jesper Krogh

On 2011-04-09 18:54, Tom Lane wrote:

I think that would be a positive disimprovement.  The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

If I could evaluate only the rows needed, then that would also
be a huge win; the case below shifts what definitely shouldn't be a
seq scan into one, due to a secondary sort key.


Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.

Getting the value for the first sort key and carrying a closure
for the rest would mostly (very often) be optimal?

It would also enable a select that has two sort keys to utilize an
index that only contains the primary sort key; not being able to do so
is a huge negative effect of what's being done today.

2011-04-18 07:12:43.931 testdb=# explain select id from testdb.testtable order by id limit 500;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..262.75 rows=500 width=4)
   ->  Index Scan using testtable_pkey on testtable  (cost=0.00..15015567.84 rows=28573400 width=4)
(2 rows)

Time: 1.363 ms
2011-04-18 07:12:52.498 testdb=# explain select id from testdb.testtable order by id,name limit 500;

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=5165456.70..5165457.95 rows=500 width=35)
   ->  Sort  (cost=5165456.70..5236890.20 rows=28573400 width=35)
         Sort Key: id, name
         ->  Seq Scan on testtable  (cost=0.00..3741675.00 rows=28573400 width=35)
(4 rows)

Time: 1.420 ms

Enabling users to sort using multiple keys, without ending up in seq
scans somewhere down the line, seems to require multi-column indexes on
all combinations of columns.


--
Jesper



[HACKERS] Locking when concurrent updated of foreign references

2011-04-11 Thread Jesper Krogh

Hi.

This seem a bit strange to me. In short:

Two tables, where one has a foreign key to the other:

CREATE TABLE test (id SERIAL primary key, data text);
CREATE TABLE testref(id SERIAL primary key, test_id integer references 
test(id) not null, data text);

INSERT INTO test(data) values('something');
INSERT INTO testref(test_id,data) values(1,'something else');
CREATE OR REPLACE FUNCTION upd(data text) RETURNS text as $$ BEGIN 
PERFORM pg_sleep(3); return data; END; $$ LANGUAGE 'plpgsql' immutable;



Then start 2 transactions (from different psql prompts):

TXN1: BEGIN;
TXN1: update test set data = upd('something');
TXN2: BEGIN;
TXN2: update testref set data = upd('something'); (within 3 seconds of
the other update, so they are overlapping in execution)
This one blocks waiting on a lock held by the other transaction, even
after that other command is done.


Whether this is correct or not I'm not sure, but if you remove the
upd() calls in both statements above, so the updates aren't executed
at exactly the same time, then both commands succeed without
interfering with each other.

While waiting on TXN2's update, pg_locks shows that there is a
ShareLock that cannot be granted, and which is only granted once TXN1
commits or rolls back.

It seems as if the lock isn't released when some other process is
actually waiting for it?

It may be related to the plpgsql function; I have a similar one (doing
more clever things than sleep) in the production system, but I cannot
figure out how to get the updates overlapping in execution in other
ways.

This is a hugely trimmed-down example of something I currently see in
a production system (8.4.7), but the above is on 9.1 HEAD.
--
Jesper




Re: [HACKERS] Locking when concurrent updated of foreign references

2011-04-11 Thread Jesper Krogh

On 2011-04-11 20:18, Jesper Krogh wrote:

Hi.

This seem a bit strange to me. In short:

Not any more. I guess what made me a bit confused was that an
update table set key = value would actually block out changes
on tables referencing this tuple, even if the referenced column wasn't
affected by the update.

But when the locking is done at row level, it is correct
to do it that way. It would, though, be nice with a weaker
lock level for that kind of update (I have no clue whether that is
a hard problem).

--
Jesper



Re: [HACKERS] Locking when concurrent updated of foreign references

2011-04-11 Thread Jesper Krogh

On 2011-04-11 23:30, Alvaro Herrera wrote:

Excerpts from Jesper Krogh's message of lun abr 11 17:07:33 -0300 2011:


But when the locking is done row-level then it is correct
to do it that way. It would allthough be nice with a weaker
locklevel for that kind of updates (I have no clue if that is
a hard problem).

http://www.commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks/


That looks exactly like what I have been seeing.

A naive suggestion (at least for part of the problem):
Would it be possible to identify updates that can never
violate any constraints, skip the verification
of foreign keys for those updates, and only take a lock
that blocks concurrent updates of the same tuple?

UPDATE statements that set something which is neither referenced nor a
reference would all be of that type.

It would, though, require the database to examine
the UPDATE statement and, by comparison with the
table definition, figure out which of the columns are
safe to update.

There might actually be a potential speedup, since the update
wouldn't need to visit the foreign table at all.
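
The naive column check suggested above can be sketched like this (a stand-alone Python model, not backend code; a real implementation would consult the catalogs rather than be handed column sets):

```python
def update_is_fk_safe(updated_cols: set,
                      referencing_cols: set,
                      referenced_cols: set) -> bool:
    """True when the UPDATE touches no column that participates in a
    foreign key (on either side), so no FK re-verification or strong
    cross-table lock would be needed."""
    return updated_cols.isdisjoint(referencing_cols | referenced_cols)

# With testref(test_id) referencing test(id): updating only 'data' is
# safe, while updating a key column is not.
assert update_is_fk_safe({"data"}, {"test_id"}, {"id"})
assert not update_is_fk_safe({"test_id", "data"}, {"test_id"}, {"id"})
```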

--
Jesper



Re: [HACKERS] k-neighbourhood search in databases

2011-04-10 Thread Jesper Krogh

On 2011-04-10 12:18, Oleg Bartunov wrote:

Wow, custom solution for 2008 still much faster Denali 2011  solution.
Also, what's about not spatial data types ? In our approach, we can 
provide 
knn for any datatype, which has GiST index and distance method.


Can you share some insight about how it would
work if the distance method is expensive (as in 100ms)?

--
Jesper



[HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

This seems like a place where there is room for improvement.

2011-04-09 15:18:08.016 testdb=# select id from test1 where id < 3 order 
by id;

 id
----
  1
  2
(2 rows)

Time: 0.328 ms
2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id 
integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; 
END; $$ language plpgsql;

CREATE FUNCTION
Time: 12.349 ms
2011-04-09 15:18:22.138 testdb=# select id from test1 where id < 3 order 
by id,testsort(id);

 id
----
  1
  2
(2 rows)

Time: 3001.896 ms

It seems strange that there is a need to evaluate testsort(id) at all in 
this case.


--
Jesper



Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

 
 How would PostgreSQL know that sorting by id leaves no ambiguity for
 the next key to address?

It wouldn't. But it could postpone evaluation until ambiguity was actually met.

Jesper
 



Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

On 2011-04-09 20:00, David Fetter wrote:

Given the horrors query generators perpetrate, it might be worth
dropping provably redundant ORDER BYs on the floor at planning time.

Well, many people add a secondary sort key to their SQL
for the sole purpose of obtaining a consistent result in the
corner cases where the first sort key is ambiguous.

If the first sort key isn't planned to be supported by an index scan,
then you'll end up computing the second sort key for the entire
dataset even if you end up doing a limit 100 at the end.

You can only deem it redundant if there is a primary key in front of
it. A fix could be really valuable in cases where you have an
n_distinct at or near -1 in pg_stats for the column.

--
Jesper



Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

On 2011-04-09 18:54, Tom Lane wrote:

I think that would be a positive disimprovement. The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.

The current approach gives that:

select id from test1 where <some clause that matches, say, a random 10%
by another index>
order by sortfunc1(id),sortfunc2(id) limit 20;

on a table with 100,000 elements will end up applying
both sortfunc1(id) and sortfunc2(id) to all 10,000 matching elements,
even though sortfunc2(id) only brings value for a very small number
of tuples (the ones needed as secondary sort keys for the top 20 within
the dataset).

It might be worth noting in the manual that, if at all possible, you should
stuff sortfunc2(id) into the table as a column (perhaps computed by a
BEFORE trigger), since it might actually be evaluated way more often than
you anticipated.

Thanks a lot for the insight.

--
Jesper



[HACKERS] Visibillity testing - some numbers on current performance.

2011-04-05 Thread Jesper Krogh

Hi.

I initially set out to put some numbers on why the visibility
map was important for select count(*), primarily to give some
feedback to Simon Riggs, who stated:
Your tests and discussion remind me that I haven't yet seen any tests
that show that index-only scans would be useful for performance.

I created a small test script that (on my lousy desktop) created
a bunch of different tables, all with 2.000.000 rows and different
widths. The tests are constructed so that the tuple width of around 1500 bytes
ends up around the memory size of my system, so we eventually have
some numbers for when we're going to disk. My desktop has only a
single SATA 7200 rpm drive, 3GB of memory and no battery-backed
write cache (who would ever run a REAL database on such a system anyway).
(Script, output data and graphs in links at the end of the email.)

On this system, visibility testing takes from:

0.2s to 0.5s in total for 2.000.000 tuples

dependent on tuple size (header + 14 bytes to header + 504 bytes).
This is all in situations where we can assume that hint bits are
set, everything has kicked in and the complete dataset
is memory cached.

This made me conclude that this is just awesome; if I could get
somewhere near these numbers in my production
system, then I would have been hunting performance in totally different
areas.

On the first select after load, still in situations where we're fully
memory resident, the range is instead:

0.6s to 35s in total for 2.000.000 tuples
(first run vs. second run for +10 bytes and +500 bytes respectively).

This degradation can mostly be attributed to concurrent writeout of
hint bits on a single SATA drive (one spindle) - who would sanely run a
db on such a system anyway? This number is probably the least robust
one in the test, but the huge range should cause some concern anyway.

The last range is the one where we're hitting disk, and that is
fairly uninteresting as it can more or less be seen as the speed of
the underlying disk system, where this one is in the low
range. But seen from a sequential-throughput perspective, which is what
this test exercises, it still does about 80MB/s, and an expensive one
will not buy an order of magnitude in this area.

The range is in this case:
1.5s to 42s in total for 2.000.000 tuples,
the first one for a +10 bytes tuple, the last one for +1500 bytes.

The really uninteresting part is that when I add +2500 bytes
to the tuple it goes down to 2.1s, which is due to TOAST kicking in and
the fact that the data I load is highly compressible, so it ends up
filling next to nothing.

Conclusion:
Visibility testing of 2.000.000 tuples takes between 0.2s and 42s; where
your system fits into that range hugely depends on your tuple size, the
amount of bloat on your tables, the amount of memory for caching
compared to the total database size, and whether you have sufficient activity
on your system to keep your active dataset in memory in favor
of background activities.

If your active dataset approaches the cache size of the system, you're
very likely to hit the last part of that range.

So Simon Riggs' question is really good, since the answer is
it depends. It seems plausible that it is really hard to beat the
0.2s-0.5s for 2M tuples we currently see for any kind of memory-resident
dataset.

The approach currently being pursued, splitting off the PD_ALL_VISIBLE
bit and using that for visibility testing, would improve the situation
enormously, making all the disk-bound cases be on the order of

(primary-key-index-size + vm-map-size) / disk-throughput

instead of

main-table-size / disk-throughput

Which for slim tables wouldn't be that much, but for fat tables it
can/would be substantial. But it would be crucial to have the bit set
at all, and the likelihood of that falls with the amount of churn in the
table. The worst-case situation is where the bit is not set
anyway; there the cost would be a primary-index scan worse than
today, and it is unclear whether the best case would be better at all.
But I'm fairly sure that the average case would be quite a lot better
than it is today, just by the likelihood of indexes and vm maps being
in memory (at least for databases hitting disk occasionally).
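
A back-of-envelope sketch of the two disk-bound formulas above, under assumed sizes (the 80MB/s and 32GB heap come from the measurements in this mail; the 2GB index size is made up for illustration):

```python
GB = 1024 ** 3
MB = 1024 ** 2

heap = 32 * GB               # table size mentioned above
pk_index = 2 * GB            # assumption: primary-key index size, not measured
vm_map = heap // (8192 * 8)  # 1 bit per 8KB heap page
throughput = 80 * MB         # sequential read speed from the test

full_scan = heap / throughput                      # main-table-size / throughput
index_vm_scan = (pk_index + vm_map) / throughput   # (index + vm-map) / throughput
print(f"heap scan ~{full_scan:.0f}s, index+vm scan ~{index_vm_scan:.0f}s")
```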

Getting below 0.2s for 2.000.000 tuples would be somewhat nice,
but getting the worst-case numbers an order of magnitude down
would be an enormous benefit to large or infrequently used
databases.

My conclusion is somewhere along the line of:
getting the visibility map crash-safe and updated is not the
primary goal; getting the visibility testing separated
from the varying size of tuples seems to be the key point here,
and doing it by moving the PD_ALL_VISIBLE bit out is definitely
one way of doing it.

Another approach could be to push to TOAST way more aggressively;
this would effectively push the worst-case behaviours down.
(I'll try to do some sane benchmarking around that.)

A third approach could be to do a slim table, with only the relevant
bits from the 

Re: [HACKERS] crash-safe visibility map, take four

2011-03-23 Thread Jesper Krogh

On 2011-03-22 21:43, Robert Haas wrote:

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
  Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.


Looks simple, but there is no benefit on the usage side in the patch yet,
so it isn't really testable? I would love to spend some time testing
when it's doable (even with rough corners).

I'm still a bit puzzled about how it would end up working with a page-level
visibility map bit for index scans. There is a clear drop-off in
usability when the change rate of the table goes up, which may or may not be
relevant, but I cannot really judge, since I haven't even got a ballpark
figure for how much table churn would disable, say, 50% of the usage.

= Really naive suggestion approaching =
Another layout might be to simply drag out t_xmin, t_xmax per row (8 bytes)
into a table by itself. This table would be way bigger than the
one-bit-per-page map, but could be WAL-logged like any other change in
the system.

It would, by definition, make the visibility testing work (way faster
than today), no matter how fast the underlying table changes.
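
A toy model of such a per-row (t_xmin, t_xmax) side table - hypothetical and grossly simplified, since it ignores commit status, hint bits and multixacts:

```python
INVALID = 0  # stands in for InvalidTransactionId (no deleting transaction)

# row id -> (xmin, xmax); 8 bytes per row in the layout suggested above
side_table = {1: (90, INVALID), 2: (90, 120), 3: (130, INVALID)}

def visible(row_id, snapshot_xid):
    # gross simplification: assume every xid <= snapshot_xid is committed
    xmin, xmax = side_table[row_id]
    inserted = xmin <= snapshot_xid
    deleted = xmax != INVALID and xmax <= snapshot_xid
    return inserted and not deleted

# row 2 was deleted by xid 120, row 3 inserted too late for snapshot 125
print([r for r in side_table if visible(r, 125)])  # [1]
```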

State of today (PG 8.4) is that a query like this:
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
---
 69753
(1 row)

Time: 5863.600 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
---
 69753
(1 row)

Time: 659.832 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
---
 69753
(1 row)

Time: 1005.765 ms

Somewhere around 15 µs / tuple (not bad at all).
(the first was probably half warm)

The average number of rows per page is somewhere between 4 and 8 for
this table. Assuming 8, and that the 69K matches are randomly
distributed among the 16M other tuples (a fair assumption in this
case), the 600-1000ms for the warm-cache runs is the time to drag
69753*8192 (page size) = 571MB into memory for visibility testing
alone, all pages being in main memory on a warm cache. Packing 16M
tuples with 8 bytes/tuple in a map would be around 128MB.

Given 8 bytes/row and a random distribution of data, that would
require us to read all 128MB, so a speedup of x4 in this example, but
it would roughly let us count the entire table in the same time.

With regard to disk vs. memory hotness: those 128MB compare to a table
size of 32GB (with a TOAST table next to it of 64GB), which shouldn't
be touched by the above query.
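
The arithmetic in the last two paragraphs, as a quick sketch:

```python
page_size = 8192
matches = 69753            # rows matching the tsquery above
total_rows = 16_000_000    # approximate row count from the text

# near-worst case: every matching row sits on its own heap page
heap_bytes_touched = matches * page_size
side_table_bytes = total_rows * 8   # 8 bytes of (xmin, xmax) per row

print(heap_bytes_touched // 10**6, "MB of heap pages")  # 571 MB
print(side_table_bytes // 10**6, "MB side table")       # 128 MB
```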


The ns/tuple number (today) on a thin table in my system is 
approaching 1ns / tuple.


If the page-level bitmap would get set quite fast on a fairly busy
system anyway, then the above is just noise in the air, but I currently
have no feeling for it, and there is some math in there I have trouble
putting reliable ballpark numbers on.

There is, by all approaches, room for significant improvements in the
visibility testing for a huge range of installations.

Can I drag out numbers on the frozenness of tuples from my current
systems to fill in the discussion? (How?)

--
Jesper



Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Jesper Krogh

Robert Haas wrote:

Right.  Really-lazy vacuum could freeze tuples.  Unlike regular
vacuum, it can also sensibly be done incrementally.  One thing I was
thinking about is counting the number of times that we fetched a tuple
that was older than RecentGlobalXmin and had a committed xmin and an
invalid xmax, but where the page was not PD_ALL_VISIBLE.  If that's
happening a lot, it probably means that some vacuuming would speed
things up, by getting those PD_ALL_VISIBLE bits set.  Perhaps you
could work out some formula where you do a variable amount of
super-lazy vacuuming depending on the number of such tuple fetches.
The trick would be to avoid overdoing it (so that you swamp the I/O
system) or underdoing it (so that the system never converges).  It
would be really nice (for this and for other things) if we had some
way of measuring the I/O saturation of the system, so that we could
automatically adjust the aggressiveness of background processes
accordingly.

Note also that if and when we get index-only scans, making sure the
PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
set is going to be a lot more important.
  


Is it obvious that the visibility map bits should track complete
pages and not individual tuples? If the visibility map tracks at
page level, the benefit falls off on slim tables where you squeeze
200 tuples into each page, and an update rate of 1% lowers
the likelihood even more. (It may be that for slim tables
index-only scans are not as beneficial as for wide tables.)

In connection with the vacuuming discussion - I don't know if it
is there already, but how about opportunistic vacuuming? Say
you have a page that, due to changes in one of the tuples, is
being written out; will it, while being written out anyway, get the
other tuples on the page vacuumed?

It doesn't actually have to hook into the process directly to benefit
the IO usage; if it just gets the opportunity to do it before
the page gets evicted from the OS cache, then it would save a
second read of that page, but it seems way harder to do something
sane around that assumption.

Really lazy vacuums would only benefit really static tables - where
vacuuming is not that big a problem in the first place?


--
Jesper - demonstrating a total lack of insight, I would assume.




Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Jesper Krogh

On 2011-03-17 15:02, Robert Haas wrote:

On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh <jes...@krogh.cc> wrote:

Is it obvious that the visibillity map bits should track complete
pages and not individual tuples? If the visibillity map tracks at
page-level the benefit would fall on slim tables where you squeeze
200 tuples into each page and having an update rate of 1% would
lower the likelyhood even more. (it may be that for slim tables the
index-only-scans are not as benefitial as to wide tables).

I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
say it's 200.  If you track visibility info per tuple rather than per
page, then the size of the visibility map is going to expand by a
factor of 200.  That might decrease contention, but otherwise it's a
bad thing - the whole point of having the visibility map in the first
place is that it's much, much smaller than the heap.  If it were the
same size as the heap, we could just read the heap.  What the map
attempts to accomplish is to allow us, by reading a small number of
pages, to check whether the tuples we're thinking of reading are
likely to be all-visible without actually looking at them.

Yes, that was sort of the math I was trying to make. I do believe,
though, that you have a way better feeling about it. But according
to this:
http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F

the bulk row overhead is around 24 bytes, which with 1 bit per row gives
a size reduction of 1:(24x8) ~ 1:192 worst case. That gives at best 341
tuples/page (where each tuple does not contain any data at all). With
that ratio, the visibility map of a 10GB relation would fill 52MB on
disk (still worst case), and that by itself would by all means be
awesome. (With tuples that small, a 10GB relation would have around 450
million tuples.)

On the 1-bit-per-page side, the best case would be 341 times better than
the above, reducing the size of the visibility map on a 10GB table to
around 152KB, which is extremely small (and thus also awesome). But the
consequence of a single update would be that you lose the visibility map
benefit for 341 tuples in one shot.

Worst-case situations are where we approach 4 tuples per page (before
we hit TOAST), where the ratio of space reduction for 1 bit per tuple
would be 1:(2048x8) ~ 1:16384, and 1 bit per page is 4 times better.
With 1 bit per tuple, the visibility map of a 10GB relation would be
around 610KB; 1 bit per page would then drop it to around 160KB.
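
The map sizes above can be re-derived with a small sketch (assuming 8KB pages and the 24-byte minimum row overhead from the FAQ link):

```python
GB, KB = 1024 ** 3, 1024
page_size, row_overhead = 8192, 24

rel = 10 * GB
# slim-row worst case: 1 bit per tuple
tuples = rel // row_overhead
per_tuple_map = tuples // 8              # bytes
# 1 bit per heap page
per_page_map = (rel // page_size) // 8   # bytes

print(per_tuple_map // 2**20, "MB for 1 bit/tuple")  # ~53 MB (the ~52MB above)
print(per_page_map // KB, "KB for 1 bit/page")       # 160 KB
```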


Can we drag out some average-case numbers on row size in the heap
from some real production systems?

I may have gotten something hugely wrong in the above calculations
and/or missed some important points.

--
Jesper



Re: [HACKERS] Read uncommitted ever possible?

2011-03-10 Thread Jesper Krogh

On 2011-03-10 18:00, Bruce Momjian wrote:


 Dirty reads are unlikely to be implemented. We do have a TODO item
 and wiki page about how to allow index scans without heap access:

 http://wiki.postgresql.org/wiki/Index-only_scans


I think we (the company I work for) would help co-sponsor such
a feature. Would it be OK to add a section on the wiki with
a list of potential sponsors that in total might be able to sponsor
development of such a feature? Then perhaps a developer would
drop by.

... It would be easier if there were a feeling for how much actually
is required.

... Can anyone create wiki accounts?

--
Jesper



Re: [HACKERS] Read uncommitted ever possible?

2011-03-10 Thread Jesper Krogh

On 2011-03-10 19:25, Bruce Momjian wrote:


Sure, anyone can add text to that wiki;  you create a community account
here:

http://www.postgresql.org/community/signup


Suggestion: Add this url to the login box on the wiki.

--
Jesper



Re: [HACKERS] tsearch Parser Hacking

2011-02-17 Thread Jesper Krogh
On 16 Feb 2011, at 23:22, David E. Wheeler da...@kineticode.com wrote:

 On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
 
 IMO, sooner or later we need to trash that code and replace it with
 something a bit more modification-friendly.
 
 We thought about configurable parser, but AFAIR, we didn't get any support 
 for this at that time.
 
 What would it take to change the requirement such that *any* SQL function 
 could be a parser, not only C functions? Maybe require that they turn a 
 nested array of tokens? That way I could just write a function in PL/Perl 
 quite easily.

I had just the same thought in mind. But so far I systematically substitute _
and a few other characters with ł, which doesn't get interpreted as a blank.
More direct control would be appreciated.

Jesper


[HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Jesper Krogh

Hi All.

The NULL element always surprises me in unpleasant ways; my brain simply
can't really understand the logic, so please let me know if this is
one of the cases where I should just spend way more effort on fixing
that instead.


I have a table with a null_frac of 0.5, and I have tested that a where
clause that evaluates to null isn't included in the result:

testdb=# select id from testtable where null @@ to_tsquery('testterm80');
 id
----
(0 rows)

Then I'd expect to have the null_frac taken into account when computing
the estimate for the query:

testdb=# explain select id from testtable where fts @@ 
to_tsquery('testterm80');

  QUERY PLAN
---
 Seq Scan on testtable  (cost=0.00..1985.03 rows=1966 width=4)
   Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

It actually does, though, if I explicitly add the fts is not null
clause to the query:


testdb=# explain select id from testtable where fts @@ 
to_tsquery('testterm80') and fts is not null;

 QUERY PLAN
-
 Bitmap Heap Scan on testtable  (cost=130.34..1735.19 rows=983 width=4)
   Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS 
NOT NULL))
   -  Bitmap Index Scan on testtable_fts_idx  (cost=0.00..130.09 
rows=983 width=0)
 Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts 
IS NOT NULL))

(4 rows)

When something that evaluates to null isn't included in the result,
shouldn't the query planner
then take the null_frac into account when computing the estimate?
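
For reference, the two row estimates in the plans above differ by exactly that factor; a quick sketch:

```python
# The @@-only plan estimated 1966 rows; adding "fts IS NOT NULL" gave 983,
# i.e. the same selectivity scaled by (1 - null_frac) with null_frac = 0.5.
total_rows = 10002
null_frac = 0.5
atat_selectivity = 1966 / total_rows  # selectivity the planner used for @@

plain = round(total_rows * atat_selectivity)
with_not_null = round(total_rows * atat_selectivity * (1 - null_frac))
print(plain, with_not_null)  # 1966 983 - matching the two EXPLAIN outputs
```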

Trying to do the same thing using integers and the < operator does seem to 
take the null_frac into

account.

Below snippet allows to reproduce the dataset.


create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);
CREATE OR REPLACE FUNCTION public.filltable(rows integer)
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
DECLARE
count integer;
BEGIN
count := 0;
LOOP
EXIT WHEN count = rows;
count := count +1;
insert into testtable(document,fts)
  select document, to_tsvector('english', document)
    from (select string_agg(concat, ' ') as document
            from (select concat('testterm' ||
                     generate_series(1, floor(random()*100)::integer))) as foo
         ) as bar;

END LOOP;
RETURN TRUE;
END;
$function$

select filltable(1);
testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing public.testtable
INFO:  testtable: scanned 1835 of 1835 pages, containing 10002 live 
rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows

ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
 null_frac
---
   0.5
(1 row)

... trying with integers:

testdb=# ALTER  TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO:  analyzing public.testtable
INFO:  testtable: scanned 2186 of 2186 pages, containing 10002 live 
rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows

ANALYZE
testdb=# update testtable set testint = null where id %2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing public.testtable
INFO:  testtable: scanned 2282 of 2282 pages, containing 10002 live 
rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows

ANALYZE
testdb=# explain select id from testtable where testint = 50;
 QUERY PLAN
-
 Seq Scan on testtable  (cost=0.00..2407.03 rows=64 width=4)
   Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
 QUERY PLAN
-
 Seq Scan on testtable  (cost=0.00..2407.03 rows=48 width=4)
   Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
  QUERY PLAN
---
 Seq Scan on testtable  (cost=0.00..2407.03 rows=2470 width=4)
   Filter: (testint < 50)
(2 rows)


(found on 8.4 and reproduced on git-head)

Attached patch tries to align the behaviour

Thanks.

--
Jesper
From 4b4be27864f683a9b3464b86d6cfa567f8ab6bd2 Mon Sep 17 00:00:00 2001
From: Jesper Krogh j...@novozymes.com
Date: Thu, 17 Feb 2011 22:21:52 +0100
Subject: [PATCH] Take null_frac into account in calculating selectivity for @@ operator. This makes behaviour align with what  operator does for integers.

---
 src/backend/tsearch/ts_selfuncs.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/tsearch/ts_selfuncs.c b/src

Re: [HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Jesper Krogh

On 2011-02-17 23:20, Tom Lane wrote:

Jesper Krogh <jes...@krogh.cc> writes:

When something evaluates to null isn't included in the result,
shouldn't the query-planner
then take the null_frac into account when computing the estimate?

The proposed patch seems wrong to me: if we're estimating on the basis
of most-common-value fractions, the null_frac is already accounted for,
because it's not part of the MCV selectivity fractions.  IOW, aren't you
double-counting the null fraction?

It might be the wrong place to fix it, but here it seems like we're only
counting MCE frequencies based on non-null elements:

http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

And the testdata confirms the behaviour.

--
Jesper




Re: [HACKERS] wildcard search support for pg_trgm

2011-01-24 Thread Jesper Krogh

On 2011-01-24 16:34, Alexander Korotkov wrote:

Hi!

On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański <wulc...@wulczer.org> wrote:


I see two issues with this patch. First of them is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:

  * relation size: 18MB
  * index size: 109 MB

while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.


Do you sure that you did comparison correctly? The sequence of index
building and data insertion does matter. I tried to build gin index on  5
copies of /usr/share/dict/american-english with patch and got 43 MB index
size.



That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what does that patch
change. I read all the changes in detail and I could't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.


I'll try to write sufficient comment and send new revision of patch.


Would it be hard to make it support n-grams (i.e. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my use case (large substring search in large documents,
e.g. 500 within 3.000).

Larger sizes .. lower sensitivity = faster lookup .. perhaps my logic
is wrong?

Hm .. or will the KNN-GiST stuff help me here by selecting the best
pentuples from the beginning?

The above comment is actually about pg_trgm in general, not about the
wildcard search patch above.
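
For reference, gram extraction itself is trivial for any n (a sketch of the general idea; pg_trgm's actual extraction also pads and normalizes words, which is skipped here):

```python
def ngrams(text, n):
    # every contiguous substring of length n
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# substring search: a candidate document must contain every gram of the needle
needle = ngrams("postgres", 5)
haystack = ngrams("postgresql substring search", 5)
print(needle <= haystack)  # True - "postgres" may occur in the document
```

With larger n there are fewer false candidates per gram, at the price of many more distinct grams in the index.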

--
Jesper




Re: [HACKERS] crash-safe visibility map, take three

2011-01-05 Thread Jesper Krogh

On 2010-11-30 05:57, Robert Haas wrote:

Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment.  Since
this is an important prerequisite to index-only scans, I'm trying
again.


The logic seems to be:

* If the visibility map should be crash-safe, it should be WAL-logged.
* PD_ALL_VISIBLE is currently not WAL-logged when vacuum sets it.
* WAL-logging the visibility map bit is not that bad (small size).
* WAL-logging the PD_ALL_VISIBLE bit could cause WAL records for the entire
  relation to be written out (potentially huge).

Would the problem not be solved by not trying to keep the two bits in
sync, but simply removing the PD_ALL_VISIBLE bit in the page header in
favor of the bit in the visibility map, which is now WAL-logged and thus
safe to trust?

Then vacuum could emit WAL records for setting the visibility map bits,
and changes on the page could clear them?

The question probably boils down to:

Given a crash-safe visibility map, what purpose does the PD_ALL_VISIBLE
bit serve?


I've probably just missed some logic?

Having index-only scans adjustable per table would make quite some sense.

I have a couple of tables with a high turnover rate that never get out
of the OS cache anyway; the benefit of index-only scans there is quite
small, especially if they come with additional overhead on
INSERT/UPDATE/DELETE operations. Whereas I also have huge tables with a
very small amount of changes, where just the saved IO of not having to
go to the heap in some cases would be good.

I could see some benefit in having the index-only scan work on
tuple-level visibility information rather than page-level. That would
require a bigger map (although still less than 1% of the heap size if my
calculations are correct), but it would enable visibility testing of a
tuple without going to the heap even when other (unrelated) changes
happened on the same page.

Jesper

--
Jesper



Re: [HACKERS] crash-safe visibility map, take three

2011-01-05 Thread Jesper Krogh

On 2011-01-06 03:10, Robert Haas wrote:

On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jes...@krogh.cc> wrote:

Given a crash-safe visibility map, what purpose does the PD_ALL_VISIBLE bit
serve?

If we modify a page on which PD_ALL_VISIBLE isn't set, we don't
attempt to update the visibility map.  In theory, this is an important
optimization to reduce contention on the visibility map page, since
there are something like 64K heap pages per visibility map page.  In
practice, I'm not sure in what workloads it matters or by how much


If that is the only reason, I would suggest just making the visibility map
pages more sparse. If you just had 500 or 1K heap pages per visibility map
page, then it would not change a thing for the performance. With 1K heap
pages per VM page, a relation with 100GB of data would have a VM map of
100MB; even 100 heap pages per VM page would still enable the database to
have visibility information for 100GB of data stored in 1GB of memory.
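
A sketch of that sizing (one 8KB VM page holds 8192*8 = 65536 bits, hence the current 64K heap pages covered per VM page):

```python
GB, page_size = 1024 ** 3, 8192

heap_pages = (100 * GB) // page_size
vm_size = {}
for heap_pages_per_vm_page in (65536, 1024, 100):
    vm_pages = -(-heap_pages // heap_pages_per_vm_page)  # ceiling division
    vm_size[heap_pages_per_vm_page] = vm_pages * page_size

for n, size in vm_size.items():
    print(n, "heap pages/VM page ->", size // 2**20, "MB")
# 65536 -> 1 MB, 1024 -> 100 MB, 100 -> 1024 MB
```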

But it would need testing and benchmarking to find out.

Jesper

--
Jesper






[HACKERS] The cost of visibillity testing? (gin-search)

2010-12-21 Thread Jesper Krogh

Hi Hackers.

I have a feeling that GIN is cheating on the visibility checks:

test=# set enable_seqscan = off;
SET
Time: 0.129 ms
test=# select count(id) from fts_test where fts @@ to_tsquery('core');
 count

 158827
(1 row)

Time: 95.530 ms
test=# explain select count(id) from fts_test where fts @@ 
to_tsquery('core');

  QUERY PLAN
--
 Aggregate  (cost=211571.52..211571.53 rows=1 width=4)
   ->  Bitmap Heap Scan on fts_test  (cost=134925.95..211174.01 rows=159004 width=4)
         Recheck Cond: (fts @@ to_tsquery('core'::text))
         ->  Bitmap Index Scan on fts_idx  (cost=0.00..134886.20 rows=159004 width=0)
               Index Cond: (fts @@ to_tsquery('core'::text))
(5 rows)

Time: 0.609 ms

test=# select count(id) from fts_test;
 count

 168556
(1 row)

Time: 164.655 ms

test=# explain select count(id) from fts_test;
   QUERY PLAN

 Aggregate  (cost=1075969.95..1075969.96 rows=1 width=4)
   ->  Seq Scan on fts_test  (cost=100.00..1075548.56 rows=168556 width=4)
(2 rows)

Time: 0.338 ms

This is run multiple times for both queries, and the seqscan of the table
is consistently about 1.8 times more expensive than the fts-scan.
This is all on a fully memory-cached dataset.

The first query should have the cost of the GIN search plus
visibility-testing 158K tuples; the latter should have the cost of
visibility-testing 168K tuples. If we set the cost of actually searching
GIN to 0, the gin-search visibility check comes to about 0.000373 ms/tuple,
where the seq-scan case costs close to 0.001 ms/tuple (close to 3 times
as much).


So I have a strong feeling that GIN is cheating on the visibility tests;
otherwise I have problems imagining how it can ever be faster to execute
than a seq-scan of the table.

Or is a Bitmap Heap Scan simply 3 times faster than a Seq Scan for
visibility-testing?


What have I missed in the logic?

Thanks.

--
Jesper




Re: [HACKERS] The cost of visibillity testing? (gin-search)

2010-12-21 Thread Jesper Krogh

On 2010-12-21 21:28, Andres Freund wrote:

On Tuesday 21 December 2010 20:25:16 Jesper Krogh wrote:
   

What have I missed in the logic?
 

A reproducible testcase ;-)
   

Yes, I did a complete dump/restore of the dataset and the numbers
looked as expected. So table bloat seems to be the problem/challenge.

I must have hit a strange situation where my table bloat was
proportionally significantly higher than my gin-index bloat.

Jesper

--
Jesper



[HACKERS] window function count(*) and limit

2010-10-23 Thread Jesper Krogh

Hi.

I have been puzzled about the evaluation order when using window
functions and limit.

jk=# select * from testtable;
 id | value
+---
  1 | 1
  2 | 2
  3 | 3
  4 | 4
  5 | 5
  6 | 6
  7 | 7
  8 | 8
  9 | 9
 10 |10
(10 rows)

jk=# select id,count(*) over () from testtable where id < 9 limit 3;
 id | count
+---
  1 | 8
  2 | 8
  3 | 8
(3 rows)


So the first element id is definitely picked after the LIMIT 3, whereas
the window function is applied before it. I have been digging in the
documentation but I didn't find this case spelled out.

This behaviour may be correct, but it hugely surprises me...
I expected it to either count to 3 or blow up and tell me that count(*)
wasn't a window function.

It looks like something about the type of the function, where count(*)
is an aggregate and row_number() is a window function. But shouldn't
count(*) also exist as a window type and behave accordingly?

Same goes on for min() max() and other standard aggregates.

.. postgresql 8.4.4 (but I couldn't find anything mentioned in the
8.4.5/9.0 release notes about this).
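For what it's worth, the expected behaviour can be emulated by applying the LIMIT in a subquery first, so the window aggregate only sees the limited rows (a sketch against the testtable above):

```sql
-- Run the LIMIT first in a subquery; the window aggregate then only
-- operates over the three rows that survive it.
SELECT id, count(*) OVER ()
FROM (SELECT id FROM testtable WHERE id < 9 ORDER BY id LIMIT 3) AS sub;
-- each of the 3 returned rows now reports count = 3 instead of 8
```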


Jesper

--
Jesper



Re: [HACKERS] window function count(*) and limit

2010-10-23 Thread Jesper Krogh

On 2010-10-23 18:42, Tom Lane wrote:

Jesper Krogh <jes...@krogh.cc> writes:
   

I have been puzzled about the evaluation order when using window
functions and limit.
 

It's basically FROM -> WHERE -> window functions -> LIMIT.
   

I expected it to either count to 3 or blow up and tell me that count(*)
wasn't a window function.
 

Any aggregate function can be used as a window function.  It just
aggregates over the whole partition (which in this case is all 8
rows that satisfy the WHERE).
   

Thank you for clarifying.

Testing more, I can see that it does that in all situations, so
it is not that confusing anymore. I still think it is hugely
counter-intuitive, and attached is a documentation patch
that should clarify it a bit.

I would just assume the reverse behaviour would be way more
useful for everyone (say, if you want the window function to operate
over the full set, then you wouldn't specify the LIMIT). But that
doesn't help anyone if the SQL spec specifies it otherwise.

As a side note, the query plan for some of them seems quite suboptimal.

# explain select id,last_value(id) over () from testtable order by id asc limit 3;

  QUERY PLAN
---
 Limit  (cost=0.00..1.30 rows=3 width=4)
   ->  WindowAgg  (cost=0.00..6117917.93 rows=14165769 width=4)
         ->  Index Scan using testtable_pkey on testtable  (cost=0.00..5940845.82 rows=14165769 width=4)


The output is essentially the 3 smallest ids plus the largest one in the
table, all of which can be found by a forward and a reverse scan on the
primary-key index, so the plan above is absolutely not the cheapest way
to find the result.
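A hand-written equivalent that lets the planner use both scan directions might look like this (a sketch against the same testtable; not a general rewrite the planner performs):

```sql
-- Fetch the maximum once via its own (backward) index scan, and attach
-- it to the three smallest ids coming from a forward index scan.
SELECT id, (SELECT max(id) FROM testtable) AS last_value
FROM testtable
ORDER BY id ASC
LIMIT 3;
```

Both subplans can be satisfied from the primary-key index without scanning all 14M rows.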

--
Jesper

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 183b72c..7bcbe25 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10684,6 +10684,30 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
clause follows the call; otherwise they act as regular aggregates.
  </para>
 
+  <para>
+   The evaluation order may be tricky for window functions when used in
+   conjunction with a <literal>LIMIT</> clause. The <literal>WHERE</> clause
+   is evaluated first, then the window function, and last the
+   <literal>LIMIT</> is applied. An example:
+  </para>
+
+<screen>
+# select * from testtable;
+ id | value 
++---
+  1 | 1
+  2 | 2
+  3 | 3
+(3 rows)
+
+# select id,max(id) over () from testtable order by id asc limit 2;
+ id | max 
++-
+  1 |   3
+  2 |   3
+(2 rows)
+</screen>
+
+
  <table id="functions-window-table">
   <title>General-Purpose Window Functions</title>
 



[HACKERS] Make tuples_per_page pr. table configureable.

2010-09-21 Thread Jesper Krogh

Hi.

This is a follow up and updated patch on several old discussions:
http://archives.postgresql.org/pgsql-hackers/2009-07/msg01065.php
http://archives.postgresql.org/pgsql-admin/2010-04/msg00164.php
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00831.php
First patch:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg00096.php

Currently the aim for the number of tuples per page is 4, meaning the
individual tuple has to grow to more than 2KB before the tuple toaster
kicks in. This patch makes it tunable on a per-table basis.

The main reasoning is that if people have knowledge about the usage
pattern of their database, they can benefit hugely from tuning
TOAST to be more or less aggressive. This is obviously true if:

* The dataset isn't entirely memory-cached, and
* columns stored in main (and plain visibility checking) are accessed
  more frequently than data columns stored in TOAST.

But even in the case where the dataset is entirely memory-cached, this
tunable can move the database to widely different performance
numbers than currently. This typically happens in cases where only
visibility checks are done (select count(*)) and when aggregates on
columns stored in main are used.

I must admit that I have chosen a poor test data set: based on the
average length of the tuple, the sweet spot is just around the current
default, but constructing a dataset with an average tuple size > 2.5KB
would absolutely benefit. I hope that people can see the benefit
anyway. The dataset is 500,000 records in a table with:

id serial,
code text, (small text block)
entry text (larger text block)

where code is length(code) < 10 and entry:

  avg  | max  | min
---+--+--
 3640.2042755914488171 | 8708 | 1468

The queries are run multiple time and numbers are based on runs where
iowait was 0 while the query executed. So entirely memory and cpu-bound 
numbers:


testdb=# select * from data order by tuples_per_page;
 time_sum_length | time_count | tuples_per_page | main_size | toast_size
-++-+---+
5190.258 | 689.34 |   1 | 1981MB| 0MB
5478.519 |660.841 |   2 | 1894MB| 0MB
9740.768 |481.822 |   3 | 1287MB| 4MB
   12875.479 | 73.895 |(default)  4 | 79MB  | 1226MB
   13082.768 | 58.023 |   8 | 29MB  | 1276MB
(5 rows)

time_sum_length = select sum(length(entry)) from data;
time_count = select count(*) from data;
All timings are in ms.

With this data, the command to set tuples_per_page is:
ALTER TABLE tablename SET (tuples_per_page = X)
where 1 <= X <= 32.
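The main/TOAST split shown in the table above can be reproduced with the stock size functions; a sketch, noting that the reloption syntax itself only exists with this patch applied and that existing rows keep their old layout until rewritten:

```sql
-- Hypothetical usage of the proposed reloption, followed by a size check.
ALTER TABLE data SET (tuples_per_page = 8);
-- Only newly written rows follow the new target, so reload or rewrite
-- the table before comparing sizes.
SELECT pg_size_pretty(pg_relation_size('data'))        AS main_size,
       pg_size_pretty(pg_total_relation_size('data')
                      - pg_relation_size('data'))      AS toast_and_index_size;
```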

The patch really needs some feedback. I've tried to address Tom Lane's
earlier comment about fixing the place where it figures out whether it
needs a TOAST table (and I have actually tested that this works).

While there surely is more that can be done to improve the flexibility
in this area, I do think there is sufficient benefit already.

This is my second shot at coding C, so please let me know if I have been
doing anything wrong. Comments are all welcome.

Thanks.

--
Jesper
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 52b2dc8..ba36923 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -848,6 +848,27 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } ] TABLE [ IF NOT EXISTS ] repl
   </varlistentry>

   <varlistentry>
+    <term><literal>tuples_per_page</> (<type>integer</>)</term>
+    <listitem>
+     <para>
+      The tuples_per_page for a table is an integer between 1 and 32. It will
+      instruct the database to aim for this number of tuples per page (8KB)
+      when updating or inserting rows, thereby tuning how aggressively columns
+      will be compressed and/or transferred to the corresponding TOAST table.
+      Default is 4, which aims for tuple sizes of less than 2KB.
+      Tuning the number of tuples per page up will increase the density
+      of tuples in the main table, giving more speed to queries that only
+      fetch simple values or check visibility, at the cost of slower access
+      to the larger entries. Tuning it down will keep more tuple data in the
+      main table, and thus give faster access to data that would otherwise
+      have been moved to TOAST. This functionality can be viewed as a way
+      to vertically partition data into two files.
+     </para>
+    </listitem>
+   </varlistentry>
+
+
+   <varlistentry>
    <term><literal>autovacuum_enabled</>, <literal>toast.autovacuum_enabled</literal> (<type>boolean</>)</term>
    <listitem>
     <para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 1e619b1..6e6d0eb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -15,6 +15,7 @@
 
 #include "postgres.h"
 

Re: [HACKERS] Make tuples_per_page pr. table configureable.

2010-09-21 Thread Jesper Krogh

On 2010-09-22 04:33, Itagaki Takahiro wrote:

On Wed, Sep 22, 2010 at 3:44 AM, Jesper Krogh <jes...@krogh.cc> wrote:
   

Command to set tuples_per_page is:
ALTER TABLE tablename SET (tuples_per_page = X)
where 1 <= X <= 32.
 

The tuples_per_page means *minimal* number of tuples in a page, right?
A page may contain more tuples when the size of tuples are small enough.
If so, I think the parameter name is confusable because it sounds
*restrict* the number of tuples per page like fillfactor.
min_tuples_per_page might be better for the name.
   


Yes, minimum. I'll change that. The name was picked because it is the
name of the variable in the source code, but that is actually a really
bad argument by itself; min_tuples_per_page is more correct.

I didn't read previous discussions, but did we have consensus that number
of tuples is better than other units for the parameter? For example,
threshold bytes or percentage in a page also seems to be reasonable for me.
   

I thought about changing it to bytes, but my feeling was that it would
make most sense with something that multiplies up to 8KB anyway; otherwise
people could end up with, say, a 5KB target and a table mostly filled
with 3KB of nothing per page. So I tried to describe the small math in
the documentation.

Thanks for your feedback.

--
Jesper



Re: [HACKERS] Configuring synchronous replication

2010-09-17 Thread Jesper Krogh

On 2010-09-17 10:09, Heikki Linnakangas wrote:

 I think it makes most sense to set sync vs. async in the master, and
 the level of synchronicity in the slave. Although I have sympathy
 for the argument that it's simpler if you configure it all from the
 master side as well.


Just a comment as a sysadmin: it would be hugely beneficial if the
master and slaves were all able to run from the exact same configuration
file. That would remove any doubt about the configuration of the
complete cluster when debugging; a slave would be able to just copy over
the master's configuration, etc.

I don't know if it is doable or has any huge downsides.

--
Jesper


Re: [HACKERS] SHOW TABLES

2010-07-15 Thread Jesper Krogh

On 2010-07-15 18:07, Marc G. Fournier wrote:

 On Thu, 15 Jul 2010, Thom Brown wrote:

 If it's only a psql problem, why implement it as SQL?  Is it just
 so we're not adding keywords specifically to psql?  In that case,
 it shouldn't support QUIT.

 Personally, I think this is somethign that should go into the backend
 ... I'd like to be able to write perl scripts that talk to the
 backend without having to remember all the various system tables I
 need to query / join to get the same results as \d gives me in psql
 ... same for any interface language, really ...


Moving it into the backend (together with the other commands) would
also solve this usability issue:

WARNING:  You are connected to a server with major version 8.4,
but your psql client is major version 8.3.  Some backslash commands,
such as \d, might not work properly.

testdb=# \d testtable
ERROR:  column reltriggers does not exist
LINE 1: SELECT relhasindex, relkind, relchecks, reltriggers, relhasr...
                                                ^
--
Jesper


Re: [HACKERS] pessimal trivial-update performance

2010-07-05 Thread Jesper Krogh

On 2010-07-04 06:11, Tom Lane wrote:

Robert Haas <robertmh...@gmail.com> writes:
   

CREATE OR REPLACE FUNCTION update_tab() RETURNS void AS $$
BEGIN
INSERT INTO tab VALUES (0);
FOR i IN 1..10 LOOP
UPDATE tab SET x = x + 1;
END LOOP;
END
$$ LANGUAGE plpgsql;
 

I believe that none of the dead row versions can be vacuumed during this
test.  So yes, it sucks, but is it representative of real-world cases?

   

The problem can generally be described as tuples seeing multiple
updates in the same transaction?

I think that every time PostgreSQL is used with an ORM, there is
a certain amount of multiple updates taking place. I have actually
been reworking the client side to get around multiple updates, since
they popped up in one of my profiling runs. Although the time I
optimized away ended up being both roundtrip time and update time,
having the database do half of it transparently might have been
sufficient to move my bigger problem elsewhere.

To sum up: yes, I think it is indeed a real-world case.
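As a trivial illustration of what the client-side rework amounts to, using the tab table from Robert's test case:

```sql
BEGIN;
-- Instead of many incremental updates of the same row, each of which
-- leaves a dead tuple that cannot be vacuumed mid-transaction ...
--   UPDATE tab SET x = x + 1;   -- repeated 10 times
-- ... issue the single statement they collapse to:
UPDATE tab SET x = x + 10;
COMMIT;
```

This is exactly the kind of collapsing an ORM session layer can do transparently by flushing accumulated changes once at commit.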

Jesper

--
Jesper



Re: [HACKERS] pessimal trivial-update performance

2010-07-05 Thread Jesper Krogh

On 2010-07-05 12:11, Pierre C wrote:


 The problem can generally be written as tuples seeing multiple
 updates in the same transaction?

 I think that every time PostgreSQL is used with an ORM, there is a
 certain amount of multiple updates taking place. I have actually
 been reworking clientside to get around multiple updates, since
 they popped up in one of my profiling runs. Allthough the time I
 optimized away ended being both roundtrip time + update time,
 but having the database do half of it transparently, might have
 been sufficient to get me to have had a bigger problem elsewhere..

 To sum up. Yes I think indeed it is a real-world case.

 Jesper

 On the Python side, elixir and sqlalchemy have an excellent way of
 handling this, basically when you start a transaction, all changes
 are accumulated in a session object and only flushed to the
 database on session commit (which is also generally the transaction
 commit). This has multiple advantages, for instance it is able to
 issue multiple-line statements, updates are only done once, you save
 a lot of roundtrips, etc. Of course it is most of the time not
 compatible with database triggers, so if there are triggers the ORM
 needs to be told about them.


How about unique constraints, foreign key violations and checks? Would
you also postpone those errors to commit time? And transactions with
lots of data?

It doesn't really seem like a net benefit to me, but I can see
applications where it easily will fit.

Jesper


Re: [HACKERS] Admission Control

2010-06-28 Thread Jesper Krogh

On 2010-06-25 22:44, Robert Haas wrote:

On Fri, Jun 25, 2010 at 3:52 PM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
   

Heck, I think an even *more* trivial admission control policy which
limits the number of active database transactions released to
execution might solve a lot of problems.
 

That wouldn't have any benefit over what you can already do with a
connection pooler, though, I think.  In fact, it would probably be
strictly worse, since enlarging the number of backends slows the
system down even if they aren't actually doing anything much.
   


Sorry if I'm asking silly questions, but how do transactions and
connection poolers interact?

Say you have 100 clients all doing fairly inactive database work
in transactions lasting a couple of minutes, at the same time. If I
understand connection poolers, they don't help much in those situations,
whereas an accounting system for limited resources across all backends
definitely would help.

(Yes, it's a real-world application here, whether it is clever or not...)

In a fully web environment where all transactions last 0.1s, a pooler
might make full sense (when traffic goes up).

--
Jesper



Re: [HACKERS] Admission Control

2010-06-28 Thread Jesper Krogh

On 2010-06-28 21:24, Kevin Grittner wrote:

Jesper Krogh <jes...@krogh.cc> wrote:

   

Sorry if I'm asking silly questions, but how does transactions and
connection pooler's interact?
 


That depends a great deal on the pooler and its configuration, as
well as your client architecture.  Our shop gathers up the
information needed for our database transaction and submits it to a
server application which has all the logic needed to use that data
to apply the transaction.  We determined long ago that it is a Very
Bad Idea for us to have an open database transaction which is
waiting for a user to do something before it can proceed.
   


The situation is more:
1) Grab a bunch of data (using pg_try_advisory_lock() to lock out
  other processes from grabbing the same).
2) Process the data (in external software).
3) Push results back into the database, including a flag
   telling that the data has been processed.
4) Release advisory locks.


Step 2 takes somewhere between a couple of seconds and a couple of
minutes, depending on the task to be done.

It might not be optimal, but it is extremely robust and simple
to wrap steps 1 to 4 within a BEGIN / COMMIT block.
On the application side it is really nice not having to deal with
partly processed data in the database, which I avoid by just
keeping the transaction open.
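A minimal sketch of steps 1 to 4; table and column names are made up, and the two-int form of pg_try_advisory_lock is keyed here on a table oid plus row id:

```sql
BEGIN;
-- 1) grab a batch, skipping rows another backend has already locked
--    (note: the lock function may be evaluated for more rows than
--    the LIMIT finally returns)
SELECT id, payload
  FROM work_queue
 WHERE NOT processed
   AND pg_try_advisory_lock('work_queue'::regclass::oid::int, id)
 LIMIT 100;

-- 2) ... process the data in the external application ...

-- 3) push the results back and flag the rows as processed
UPDATE work_queue
   SET processed = true, result = '...'
 WHERE id IN (42, 43);  -- the ids grabbed in step 1

-- 4) release the advisory locks; they are session-level and would
--    otherwise survive the COMMIT
SELECT pg_advisory_unlock_all();
COMMIT;
```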

From my POV, a connection pooler doesn't buy anything, and
I cannot stop all processes from executing at the same time, although
it most likely will not happen. There is no waiting for a user
involved.

And that means somewhere around 100+ backends, although they
are mostly idle, seen from a database perspective.

I have not hit any issues with work_mem being too high, but
I'm absolutely sure that I could flood the system if they happened to
be working at the same time.

Jesper
--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh

On 2010-05-30 20:02, Jan Urbański wrote:

Here's a patch against recent git, but should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will be now using a lot
bigger bucket size and I haven't done any performance impact testing on
it. I updated the initial comment block in compute_tsvector_stats, but
the prose could probably be improved.
   

Just a small follow-up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both the upper and
lower end of the MCE histogram, with a lower cutoff that doesn't
approach 2.

Thanks a lot.

--
Jesper



[HACKERS] bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

2010-05-31 Thread Jesper Krogh

Hi.

The test data a set of generated terms using this perl-script
http://shrek.krogh.cc/~jesper/build-test.pl
and http://shrek.krogh.cc/~jesper/words.txt

I have generated a test dataset with an average tsvector length of
around 250 and 200,000 tuples in the dataset.

Conceptually, searching for the full dataset would always be fastest
solved by a seq-scan. The query planner enforces this so much that not
even enable_seqscan=off can convince it to do something else. So in
the next two EXPLAIN ANALYZEs I run a query matching 98% of the table,
once forced to be a bitmap-index-scan and once as a seqscan.
I would expect the runtime of the seqscan to be shortest and the
bitmap-index-scan to be quite a lot longer, due to random access and
the fact that the index data also needs to be read in from disk.

Both runs are done with a freshly started postgresql backend and
echo 3 > /proc/sys/vm/drop_caches, so OS caching should not come
into play.


ftstest=# EXPLAIN ANALYZE select id from ftstest where body_fts @@ 
to_tsquery('commonterm98');
 QUERY 
PLAN


 Bitmap Heap Scan on ftstest  (cost=6579.81..992733.57 rows=195976 width=4) (actual time=4813.258..7081.277 rows=195976 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm98'::text))
   ->  Bitmap Index Scan on ftstest_gist_idx  (cost=0.00..6530.82 rows=195976 width=0) (actual time=4787.513..4787.513 rows=195976 loops=1)
         Index Cond: (body_fts @@ to_tsquery('commonterm98'::text))
 Total runtime: 7389.346 ms
(5 rows)

ftstest=# set enable_bitmapscan = off;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
ftstest=# set enable_bitmapscan = off;
SET
ftstest=# EXPLAIN ANALYZE select id from ftstest where body_fts @@ 
to_tsquery('commonterm98');

  QUERY PLAN
--
 Seq Scan on ftstest  (cost=0.00..1006314.00 rows=195976 width=4) (actual time=96.077..60092.080 rows=195976 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm98'::text))
 Total runtime: 60436.556 ms
(3 rows)

So searching the full table via a bitmap-index-scan is actually 9 times
cheaper than a seq-scan.  (same on 9.0b1 and 8.4).

Digging more into it reveals that the body_fts tsvector is indeed needed
for the filter in the seqscan. The tsvector data is stored in a TOAST
table, and in the bitmap-index-scan case the executor only needs to read
the main table for checking visibility. In the end it translates to
reading 1.4GB of TOAST data vs. reading 34MB of table data.

Thinking a bit, I don't think this is a particularly rare case, although
the ratio between the tables may be a real corner case. The ratio is not
1:33 in the dataset that looks like the production dataset, but more
like 1:10; but in all cases in production there would be a much higher
cache-hit ratio on the gin-index and the main table pages than on the
TOAST table, so even with a ratio of 1:1 there would most likely be a
real-world benefit.

Would it be possible to implement the filtering using the gin-index and
a subsequent visibility check, as in the index-scan case?

The same ends up being the case for queries ordered by btree indexes and
filtered by gin-indexes.

Jesper
--
Jesper



Re: [HACKERS] bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

2010-05-31 Thread Jesper Krogh

On 2010-05-31 22:09, Tom Lane wrote:

Jesper Krogh <jes...@krogh.cc> writes:
   

Conceptually searching for the full dataset would always be fastest
solved by a seq-scan. The query planner enforces this so much, so not
even enable_seqscan=off can convince it to do something else.
...
Would it be possible to implement the Filtering using the gin-index and
a subsequent visibillity-check as on the index-scan?
 

You're failing to make any sense whatsoever.  If you're reading the full
dataset, there is no filter condition.  If there is a potentially
indexable filter condition, the planner will certainly consider that.
   

Yes, you're totally right on that (about making sense).

But that is because of the simplified use case described. A more
elaborate description: I have a table that has a set of columns attached
to it, typically used for sorting (these columns may also be inferred on
the table by a typical join condition), and a document that is
fts-indexed. So the actual use case is that people query for:

give me the 1000 most recent documents matching term

The term may in some cases be hugely trivial, only filtering away 0.001%
of the dataset, resulting in an index scan on a btree date index with a
filter on the tsvector column for the term.

The term may also be something really specific, only returning a single
document or a few, and just needing a post-sort to get the ordering.
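In SQL the query pattern looks roughly like this (table and column names are invented for illustration):

```sql
-- For a trivial (frequent) term the planner should walk the btree index
-- on the date column backwards and filter on the tsvector; for a rare
-- term a GIN bitmap scan plus an explicit sort wins.
SELECT id
  FROM documents
 WHERE document_tsvector @@ to_tsquery('english', 'someterm')
 ORDER BY published_date DESC
 LIMIT 1000;
```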

But in the case where the query planner falls over to an index scan
on one of the btree indices, it ends up reading from the TOAST data.

Will the planner consider doing the index scan (forward or backwards)
on a btree index and filtering using a gin-index, instead of filtering
directly on the tuple data?
(I haven't been able to force a query plan that looks like that.)


Personally I think the issue here has got more to do with the
non-immutability of the single-argument form of to_tsquery, which means
it gets re-evaluated at every row during a seqscan.  Do your results
change if you work with to_tsquery('english', ...)  (or whatever your
default TS config is)?
   


It is english. And yes, it did indeed change the results: the expensive
case dropped from ~60s to ~28s and the cheap case from ~7.3s to ~4.3s.
It is quite surprising that such a small change can have that huge an
impact. The single-argument version should be forbidden.

But the performance ratio between the two cases is still the same.

The test was actually run with the preliminary gincostestimate patch
from Oleg Bartunov, so the actual cost estimates match way better now,
but that should not impact the actual runtime.

Thanks

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh

On 2010-05-31 20:38, Tom Lane wrote:

Jesper Krogh <jes...@krogh.cc> writes:
   

Just a small follow up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both upper and
lower end of the MCE-histogram with a lower cutoff that doesn't
approach 2.
 

Good.  How much did the ANALYZE time change for your table?
   

1.3m documents.

New code ( 3 runs):
statistics target 1000 = 155s/124s/110s
statictics target 100 = 86s/55s/61s
Old code:
statistics target 1000 = 158s/101s/99s
statistics target 100 = 90s/29s/33s

Somehow I think that the first run is the relevant one; it's pretty much
a cold-disk test, and I wouldn't expect the random sampling of tuples to
have any sane caching effect in a production system. But it looks like
the algorithm is a bit slower.


Thanks again..

Jesper

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jesper Krogh

On 2010-05-29 15:56, Jan Urbański wrote:

On 29/05/10 12:34, Jesper Krogh wrote:
   

On 2010-05-28 23:47, Jan Urbański wrote:
 

On 28/05/10 22:22, Tom Lane wrote:
Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

   

I think you should skip the assumption about stop words; users may
use something where they are needed in the index, or have a language
other than the typical one (and they don't seem to influence the math
that much).
 

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

   

How about setting it to some default in the first analyze round, but
setting it to the count of MCE's with a frequency of 1 in the subsequent
analyze rounds?


Isn't it the same type of logic that is used for collecting statistics
for array-types, say integer-arrays and text arrays?
 

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).
   


In another area, I was thinking about modelling a complete tree
structure where I would like to extract complete sub-branches as int[]
arrays of the node ids in the set and then index them using GIN. That
seems like a really bad idea based on the above information.

Wouldn't it make sense to treat array types like tsvectors?


The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be 100.
   


I know it is not perceived the correct way to do things, but I would
really like to keep the stop words in the dataset and have
something that is robust to that.

There are 2 issues behind that wish. One is that the application
becomes more general: I really cannot stop my users from searching
for stop-words, and they would expect the full set and not the empty
set we get now.

The list of stop words is by no means finite and would very
much depend on the input data set.

I would try to add the stop-words to the dictionary so they still work, but
don't occupy that much space in the actual index. That seems to
solve the same task but with fewer issues for the users and more
generalized code around it.


I can fairly easy try out patches or do other kind of testing.
 

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.
   


Excellent.


testdb=# explain select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored

 QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored

 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id..  trivial 
to extract.


--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 23:47, Jan Urbański wrote:

On 28/05/10 22:22, Tom Lane wrote:
   

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.  Then use that estimate as
the s value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper.  If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.
 

I see what you mean, so the idea would be:

  * assume some value of W as the number of all words in the language
  * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
  * set e = s/10 and w = 1/e, that is 10/s
  * perform LC using that value of w
  * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
  * create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.
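For illustration, the procedure summarized above can be sketched in a few lines; this is a simplification of the Manku & Motwani Lossy Counting algorithm that ts_typanalyze builds on, with parameter names s, e and w taken from the mail (everything else, including the data, is made up for the example):

```python
import math

def lossy_count(stream, s, e):
    """Lossy Counting sketch: keep items with frequency >= (s - e) * n.

    s is the support threshold, e the error bound, w = ceil(1/e) the
    bucket width; pruning happens at every bucket boundary, mirroring
    the "prune the D structure every w lexemes" step above.
    """
    w = int(math.ceil(1.0 / e))   # bucket width
    counts = {}                   # item -> (count, max undercount delta)
    n = 0                         # items seen so far
    bucket = 1                    # current bucket id
    for item in stream:
        n += 1
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket - 1)
        if n % w == 0:            # prune at the bucket boundary
            counts = {k: v for k, v in counts.items() if sum(v) > bucket}
            bucket += 1
    # final cut, as described in the mail: drop elements with f < (s - e) * n
    return {k: f for k, (f, _) in counts.items() if f >= (s - e) * n}
```

On a stream where one element dominates and the rest are singletons, only the dominant element survives the final cut, which is exactly the behaviour being debated in this thread.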
   


I think you should skip the assumption about stop-words; users may
use something where they are needed in the index, or have a language other
than the typical one (and they don't seem to influence the math that much).

Isn't it the same type of logic that is used for collecting statistics for
array-types, say integer-arrays and text arrays?

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
   


I'm not sure I get this one.. does this mean that we prune every time
we have collected 167 new data points? That would seem too often
to me, since that would roughly be once per row.


After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with
f < 0.054 * 35017 * 154, that is f < 291201.37

I wonder what would happen if Jasper's case if we did that... And I
wonder how sound that maths is
   


If it means that I would get an accurate MCE histogram for all
things that have an occurrence of more than 5.4% of the rows
(given the samples chosen), then I think that would be really
reasonable.
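As a sanity check of the arithmetic quoted above (all inputs are taken from the thread: H(W) ~ 6.5, statistics target 100, 35017 rows with 154 lexemes on average; nothing here is measured):

```python
import math

H_W = 6.5                      # harmonic-number estimate from the mail
st = 100                       # statistics target
s = round(H_W / (st + 10), 2)  # support threshold, 0.06 as in the mail
e = s / 10                     # error bound, 0.006
w = math.ceil(1 / e)           # prune roughly every 167 lexemes
N = 35017 * 154                # rows * average lexemes per tsvector
threshold = 0.9 * s * N        # final cut: drop lexemes with f < 0.9*s*N
```

This reproduces both the ~167-lexeme bucket width and the 291201.37 pruning threshold discussed in the thread.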

I can fairly easy try out patches or do other kind of testing.

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 04:47, Tom Lane wrote:

Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list
   


I should have been more precise in the wording. Cranking up the
stats target gave me an overall better plan, but that is because the range
in the MCE histogram where the query plan for my sample
query tipped from a Bitmap Index Scan on the GIN index to an
Index Scan on a btree index actually became reliable.

This is more due to the nature of my application and test queries
than has anything to do with the correctness of the MCE histogram.

So cranking up the statistics target made the problem move
to somewhere it didn't matter that much to me.

--
Jesper



[HACKERS] Statistics for tsvector wildcards. term*

2010-05-29 Thread Jesper Krogh

Hi.

There seems to be an unimplemented area around getting statistics for
wildcard searches done. Wildcards anchored to the left can be matched
up by the gin-index and the ts_match_vq operator:


testdb=# select to_tsvector('project') @@ to_tsquery('proj:*');
 ?column?
--
 t
(1 row)

Searching for project gives me this estimate:
testdb=# explain select id,document_tsvector from efam.reference where document_tsvector @@ to_tsquery('projects') order by id limit 50;

   QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3008.54 rows=50 width=76)
   ->  Index Scan using reference_pkey on reference  (cost=0.00..3762544.72 rows=62531 width=76)
         Filter: (document_tsvector @@ to_tsquery('projects'::text))
(3 rows)

whereas searching for proj:* gives:

testdb=# explain select id,document_tsvector from efam.reference where document_tsvector @@ to_tsquery('proj:*') order by id limit 50;

 QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=73.56..73.58 rows=6 width=76)
   ->  Sort  (cost=73.56..73.58 rows=6 width=76)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.55..73.49 rows=6 width=76)
               Recheck Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.54 rows=6 width=0)
                     Index Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
(7 rows)

There are obvious challenges in getting statistics for submatches where
there is no real information in the pg_stats table, but there will also
be a huge number of cases where a fairly reliable guess can be extracted.

--
Jesper



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-05-27 Thread Jesper Krogh


On 27/05/2010, at 20.00, Josh Berkus j...@agliodbs.com wrote:




Well, maybe I'm confused here, but arranging things so that we NEVER
have to visit the page after initially writing it seems like it's
setting the bar almost impossibly high.


That is the use case, though.  What I've encountered so far at 3 client
sites is tables which are largely append-only, with a few selects and
very few updates (< 2%) on recent data.   In general, once data gets
flushed out of memory, it goes to disk and never gets recalled, and
certainly not written.  Things are hunky-dory until we reach
max_freeze_age, at which point the server has to chew through hundreds
of gigabytes of old data just to freeze them, sometimes bringing the
application to a halt in the process.


The data doesn't get in there in no time; if autovacuum was aware of
inserts too, it would incrementally freeze the table as it grows.


It would still cause it to be read in again, but not in a big chunk.

Couldn't pages that are totally filled by the same transaction be
frozen on the initial write?


Jesper - given my limited knowledge about how it works.




Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-05-26 Thread Jesper Krogh


On 27/05/2010, at 02.48, Robert Haas robertmh...@gmail.com wrote:


On Wed, May 26, 2010 at 8:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Josh Berkus j...@agliodbs.com writes:

How does that get us out of reading and writing old pages, though?


Yeah.  Neither PD_ALL_VISIBLE nor the visibility map are going to solve
your problem, because they cannot become set without having visited the
page.


Well, maybe I'm confused here, but arranging things so that we NEVER
have to visit the page after initially writing it seems like it's
setting the bar almost impossibly high.   Consider a table that is
regularly written but append-only.  Every time autovacuum kicks in,
we'll go and remove any dead tuples and then mark the pages
PD_ALL_VISIBLE and set the visibility map bits, which will cause
subsequent vacuums to ignore the all-visible portions of the table...
until anti-wraparound kicks in, at which point we'll vacuum the entire
table and freeze everything.


Just a thought.  Wouldn't an all-visible bit also enable index-only
scans to some degree?


Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jesper Krogh


On 26/05/2010, at 01.16, Jan Urbański wulc...@wulczer.org wrote:


On 19/05/10 21:01, Jesper Krogh wrote:

The document base is around 350.000 documents and
I have set the statistics target on the tsvector column
to 1000, since 100 seems way off.


So for tsvectors the statistics target means more or less "at any time,
track at most 10 * target lexemes simultaneously", where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors, and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.


I just did an avg(length(document_tsvector)), which is 154.
Doc count is 1.3m now in my sample set.


But the distribution is very flat at the end; the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an
estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows


Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.


I think I would prefer something less magic. I can increase the
statistics target and get more reliable data, but that also increases
the amount of tuples being picked out for analysis, which is really
time consuming.


But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the lower bound; more the lower bound
of the artificial histogram that happened after the last pruning.


I would suggest that the pruning in the end should be aware of this.
Perhaps by keeping track of the least frequent value that never got
pruned and using that as the last pruning and lower bound?


Thanks a lot for the explanation; it explains fairly well why I couldn't
construct a simple test set that had the problem.




So far I have no idea if this is bad or good, so a couple of sample runs
of stuff that is sitting outside the most_common_vals array:

[gathered statistics suck]


So the most_common_vals seems to contain a lot of values that should
never have been kept in favor of other values that are more common.


In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't there seem to be something wrong in how
the statistics are collected?


The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about  
the

algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
ts_typanalyze.c


I'll do some Reading

Jesper


Re: [HACKERS] pg_upgrade - link mode and transaction-wraparound data loss

2010-05-19 Thread Jesper Krogh

On 2010-05-18 18:57, Bruce Momjian wrote:

jes...@krogh.cc wrote:
   

Hi

I tried running pg_upgrade from the current snapshot of postgresql,
upgrading from 8.4.4 to the snapshot version. Everything seemed to look fine
in the process and all that came out were only OKs, but when I tried a
simple query on the database it kept throwing messages like these:

DETAIL:  You might have already suffered transaction-wraparound data loss.
WARNING:  some databases have not been vacuumed in over 2 billion
transactions


The database was around 600GB and it took a couple of minutes to run
pg_upgrade after I had all the binaries in the correct place.

It is not really an easy task to throw around 600GB of data, so I cannot
guarantee that the above is reproducible, but I'll see if I can get time
to try to reproduce it.
 

This certainly should never have happened, so I am guessing it is a bug.
pg_upgrade tries hard to make sure all your datfrozenxid and
relfrozenxid are properly migrated from the old server, and the
transaction id is set properly.  Unfortunately this is the first time I
have heard of such a problem, so I am unclear on its cause.
   


Other people are typically way faster than I am looking into it.
Depesz has produced a full trace to reproduce the problem here:
http://www.depesz.com/index.php/2010/05/19/waiting-for-9-0-pg_upgrade/

Jesper
--
Jesper



[HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-19 Thread Jesper Krogh

Hi.

I am working on getting full-text-search to work and have
come across something I think looks a bit strange.

The document base is around 350.000 documents and
I have set the statistics target on the tsvector column
to 1000, since 100 seems way off.

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing reference
INFO:  reference: scanned 14486 of 14486 pages, containing 350174 live 
rows and 6027 dead rows; 30 rows in sample, 350174 estimated total rows

ANALYZE

Ok, so analyze almost examined all rows. Looking into
most_common_freqs I find:
# select count(unnest) from (select unnest(most_common_freqs) from pg_stats where attname = 'document_tsvector') as foo;

 count
---
  2810
(1 row)


But the distribution is very flat at the end; the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

So far I have no idea if this is bad or good, so here are a couple of
sample runs of stuff that is sitting outside the most_common_vals array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows .. exactly as
predicted, works as expected, but in fact there are 1612 rows that match.

So, analyze has sampled 6 of 7 rows in the table, and this term exists in
1612/350174 rows ~ freq: 0.0046, which is way higher than the lower bound
of 1.00189e-05 .. or it should have been sitting around the center of the
2810 values of the histogram collected.

So the most_common_vals seems to contain a lot of values that should
never have been kept in favor of other values that are more common.

In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't there seem to be something wrong in how
the statistics are collected?

# select version();
  version
---
 PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC 
gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit



Jesper
--
Jesper



Re: [HACKERS] pg_upgrade - link mode and transaction-wraparound data loss

2010-05-18 Thread Jesper Krogh

On 2010-05-18 20:52, Bruce Momjian wrote:

This line above looks very odd because I didn't think the template0
datfrozenxid could be advanced.  Can I see the output of this query:

SELECT datname, datfrozenxid, datallowconn FROM pg_database;

   


Only from the old database:
data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database;
  datname  | datfrozenxid | datallowconn
---+--+--
 template0 |   2073823552 | f
 postgres  |   2023820521 | t
 data  |   2023782337 | t
 jk|   2023822188 | t
 template1 |   2073823552 | t
 workqueue |   2023822188 | t
(6 rows)


I am wondering if you set datallowconn for template0 to 'true'.


From this database I cannot give any more results; I ran some other
queries and then restarted postgres, and subsequently it seemed
totally broken. I'm in the process of running the test over again, but
it'll take a while before the data is in.  I'll report back.

--
Jesper



[HACKERS] BYTEA / DBD::Pg change in 9.0 beta

2010-05-18 Thread Jesper Krogh

Hi.

I'm trying to do a test move of one of our applications onto 9.0beta1.
We use storable and serializes data into a bytea column in the database.
This script uses that:

#!/usr/bin/perl
use strict;
use warnings;
use Storable;
use DBI;
use DBD::Pg;
use Data::Dumper;

my $dbh = DBI->connect("dbi:Pg:dbname=testdb", "", "", {AutoCommit => 1});
my $sql = <<END
create table testtable (id serial, testbytea bytea);
END
;

eval {
    $dbh->do($sql);
};
$dbh->do("delete from testtable");
my $href = { this => 1, that => 2 };
print "Before: " . Dumper($href) . "\n";
my $sth = $dbh->prepare("insert into testtable (testbytea) values (?)");
my $frozen = Storable::nfreeze($href);
$sth->bind_param(1, $frozen, { pg_type => DBD::Pg::PG_BYTEA });
$sth->execute;
$sth = $dbh->prepare("select testbytea from testtable");
$sth->execute();
my $row = $sth->fetchrow_hashref();
my $href2 = Storable::thaw($row->{testbytea});

print Dumper($href2);


Running it against 8.4 gives:

$ perl bin/test-bytea
NOTICE:  CREATE TABLE will create implicit sequence testtable_id_seq1 
for serial column testtable.id
DBD::Pg::db do failed: ERROR:  relation testtable already exists at 
bin/efam/test-bytea line 16.

Before: $VAR1 = {
          'that' => '2',
          'this' => '1'
        };

$VAR1 = {
          'that' => '2',
          'this' => '1'
        };

Whereas 9.0beta1 gives:
$ perl bin/test-bytea
NOTICE:  CREATE TABLE will create implicit sequence testtable_id_seq1 
for serial column testtable.id
DBD::Pg::db do failed: ERROR:  relation testtable already exists at 
bin/efam/test-bytea line 16.

Before: $VAR1 = {
          'that' => '2',
          'this' => '1'
        };

Storable binary image v60.48 more recent than I am (v2.7) at 
../../lib/Storable.pm (autosplit into ../../lib/auto/Storable/thaw.al) 
line 366, at bin/test-bytea line 28


Inspecting the data, it seems that it is the insert that does something
to the data:


8.4
 id |   testbytea
+
  9 | 
\005\007\003\000\000\000\002\012\0012\000\000\000\004that\012\0011\000\000\000\004this

(1 row)

9.0beta1
 id |  testbytea
+--
  3 | \x05070300020a01320004746861740a0131000474686973
(1 row)


Jesper



Re: [HACKERS] pg_upgrade - link mode and transaction-wraparound data loss

2010-05-18 Thread Jesper Krogh

On 2010-05-18 21:56, Bruce Momjian wrote:

Jesper Krogh wrote:
   

On 2010-05-18 20:52, Bruce Momjian wrote:
 

This line above looks very odd because I didn't think the template0
datfrozenxid could be advanced.  Can I see the output of this query:

SELECT datname, datfrozenxid, datallowconn FROM pg_database;


   

Only from the old database:
data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database;
datname  | datfrozenxid | datallowconn
---+--+--
   template0 |   2073823552 | f
   postgres  |   2023820521 | t
   data  |   2023782337 | t
   jk|   2023822188 | t
   template1 |   2073823552 | t
   workqueue |   2023822188 | t
(6 rows)
 

OK, datallowconn = false is right for template0, but I am still confused
how it got set to that high value.
   


This is the production system. I have absolutely no indications that
anything should be wrong in there. It has run rock-solid since it got
migrated (dump/restore) to 8.4, about 7 months ago. So I am a bit
scared to hear you say that it seems wrong (but that cannot be
attributed to pg_upgrade).


OK, thanks.  This does seem odd.  Frankly, having template0's
datfrozenxid be wrong would not cause any kind of instability because
template0 is used only by pg_dump, so I am wondering if something else
is seriously wrong.
   

I also think that something was seriously wrong with the pg_upgrade'd
version. I'll try to reproduce it and be a bit more careful in tracking
the steps this time.

--
Jesper



Re: [HACKERS] BYTEA / DBD::Pg change in 9.0 beta

2010-05-18 Thread Jesper Krogh

On 2010-05-18 23:12, Alex Hunsaker wrote:

set bytea_output 'escape';


That was it. Knowing what the problem was I had no problem finding it in 
the release notes.


May I ask what the reason is for breaking the compatibility?
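For code that has to cope with both servers, the two output formats can be told apart mechanically: 9.0's new hex format always starts with the two characters backslash-x. A minimal decoding sketch (hex path only; the pre-9.0 escape format's octal decoding is deliberately left out, and the sample value is the one shown in the 9.0 output earlier in this thread):

```python
import binascii

def decode_bytea_hex(text):
    """Decode a bytea value returned in the hex format new in 9.0.

    Hex values always start with '\\x'; anything else is assumed to be
    the old escape format, which a real client would decode
    octal-escape by octal-escape (omitted here).
    """
    if text.startswith('\\x'):
        return binascii.unhexlify(text[2:])
    raise ValueError("not hex-format bytea; escape decoding not implemented")

# the value shown in the 9.0 output above
raw = decode_bytea_hex('\\x05070300020a01320004746861740a0131000474686973')
```

The decoded bytes contain the Storable-frozen keys 'that' and 'this', matching the 8.4 escape-format dump of the same row.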

--
Jesper



Re: [HACKERS] pg_migrator to /contrib in a later 9.0 beta

2010-05-05 Thread Jesper Krogh

On 2010-05-03 23:09, Bruce Momjian wrote:

Robert Haas wrote:
   

On Sun, May 2, 2010 at 3:45 PM, Dimitri Fontainedfonta...@hi-media.com  wrote:
 

Now you tell me how awful this idea really is :)
   

I'm not sure I can count that high.  :-)
 

While I can't improve on Robert's reply, I can supply a PDF about how
pg_migrator works:

http://momjian.us/main/presentations/technical.html#pg_migrator

   

There is a huge number of users to whom pg_migrator is at least
as big a feature as HS+SR is.

The last dump/restore was a 24-hour process in one of our installations.
I think it was due to inefficiency in handling BYTEA types in the
process (but I'm not sure).

But I'm one of the few guys who seem to have an infinite amount of
time for reading mailing lists, and without the knowledge from
reading this list I would never have run pg_migrator on my production
data if I had to pick it up from pgFoundry.

Just my 0.25€

Jesper
--
Jesper



Re: [HACKERS] pg_migrator to /contrib in a later 9.0 beta

2010-05-05 Thread Jesper Krogh

On 2010-05-06 01:45, Bruce Momjian wrote:

Jesper Krogh wrote:
   

On 2010-05-03 23:09, Bruce Momjian wrote:
 

Robert Haas wrote:

   

On Sun, May 2, 2010 at 3:45 PM, Dimitri Fontainedfonta...@hi-media.com   
wrote:

 

Now you tell me how awful this idea really is :)

   

I'm not sure I can count that high.  :-)

 

While I can't improve on Robert's reply, I can supply a PDF about how
pg_migrator works:

http://momjian.us/main/presentations/technical.html#pg_migrator


   

There is a huge number of users to whom pg_migrator is at least
as big a feature as HS+SR is.

The last dump/restore was a 24-hour process in one of our installations.
I think it was due to inefficiency in handling BYTEA types in the
process (but I'm not sure).

But I'm one of the few guys who seem to have an infinite amount of
time for reading mailing lists, and without the knowledge from
reading this list I would never have run pg_migrator on my production
data if I had to pick it up from pgFoundry.
 

So, did you use copy or link mode, and how fast was the pg_migrator
upgrade?

   

I did go the painful way (dump+restore) at that point in time.
It was an 8.1 -> 8.3 migration. Since then the data has grown, and
dump/restore is even less favorable for the 8.3 -> 9.0 migration.

So in general the pg_migrator way seems to be the only way to avoid
the Slony way, which is orders of magnitude more complicated.

Q: I read your pdf; why aren't statistics copied over? It seems to be the
last part missing from doing an upgrade in a few minutes.

Jesper
--
Jesper



Re: [HACKERS] pg_migrator to /contrib in a later 9.0 beta

2010-05-05 Thread Jesper Krogh

On 2010-05-06 06:41, Alvaro Herrera wrote:

Excerpts from Jesper Krogh's message of jue may 06 00:32:09 -0400 2010:

   

Q: I read your pdf; why aren't statistics copied over? It seems to be the
last part missing from doing an upgrade in a few minutes.
 

Seems fraught with peril, and a bit pointless.  What's so bad about having to
run ANALYZE afterwards?
   


There is nothing directly bad about it.. but:

It's just an extra step that might be overlooked and is absolutely
required.

I should have written:
Why aren't statistics copied over, or why doesn't pg_migrator run ANALYZE
by itself?

The database (of a reasonable size) is useless until statistics are
available.


I guess it is because pg_dump/restore doesn't do it either.

Jesper
--
Jesper



[HACKERS] Using GIN/Gist to search the union of two indexes?

2010-03-05 Thread Jesper Krogh
Hi.

How complicated would it be to make postgresql-fts search the union of
several GIN/GiST indexes?

The use-case is that you have two tables:

tablea(id serial, tableb_id int, text tsvector);
and
tableb(id serial, text tsvector);
and indices on both tsvectors.

The typical query would join the two tables on the key:

select id from tablea,tableb where tablea.tableb_id = tableb.id;

And then filter the results on the fts-indexes:

select id from tablea,tableb where tablea.tableb_id = tableb.id and
tablea.text @@ to_tsquery('ftsquery') or tableb.text @@
to_tsquery('ftsquery');

This one is doable .. using some mocking of queries in the
application. But if ftsquery is:

terma & termb

where terma only exists in tablea and termb only exists
in tableb, then it doesn't work. The path would seem to be to not use
the indexes.

I guess it would be something like a new ts_match_vq() that can take
more than one vector and get the underlying logic to do the union at
search time?

Can someone with more insight into the code tell me if it is perceived a
hard task to do?

Thanks.
Jesper
-- 
Jesper



[HACKERS] Make TOAST_TUPLES_PER_PAGE configurable per table.

2010-02-01 Thread Jesper Krogh
Hi

This is my first attempt to hack PostgreSQL (even C, actually), so bear
with me on obvious mistakes.

I've had a wish to be able to teach Postgres a bit more about how to
store its data on disk. Our system is a typical web-based system where
all access can more or less be divided into 2 categories:
"list views", which are overviews, counts, aggregates on simple values
with 50..200 rows per page, and
"detail views", which are more or less all data from a single row
combined with aggregates of relations and similar.

Based on this knowledge I know that there is a significant amount of
data stored inline in tuples and read off disk for the listings
that is never needed. At the moment it'll try to compress and get below
pagesize/4 ~ 2KB/tuple before it gets to TOASTing the large values.

Looking at the current implementation it seems to do the right thing,
since the large, variable-length attributes are the most likely to
not be shown on listing pages anyway, but it is not aggressive enough
(in my view, for all common web things), so this patch makes
TOAST_TUPLES_PER_PAGE configurable per table (the desired tuple density
on the main storage).

This patch enables users to set TOAST_TUPLES_PER_PAGE with

ALTER TABLE table SET (tuples_per_page = X); .. currently with 1 <= X <= 32;
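To get a feel for what the knob does, the per-tuple size target it implies can be approximated like this (a deliberately simplified model for illustration: the real TOAST_TUPLE_THRESHOLD arithmetic in the backend also subtracts page and tuple header overhead, which is ignored here):

```python
BLCKSZ = 8192  # default PostgreSQL block size in bytes

def approx_toast_threshold(tuples_per_page):
    """Rough per-tuple byte budget before inline compression/TOASTing.

    Simplified: divides the page evenly among the desired number of
    tuples, ignoring the header overhead the real backend macro
    accounts for.
    """
    if not 1 <= tuples_per_page <= 32:
        raise ValueError("tuples_per_page must be in 1..32")
    return BLCKSZ // tuples_per_page

# the built-in default of 4 tuples/page gives the familiar ~2 KB target
```

So tuples_per_page=8, as used in the testtext8 example below, roughly halves the inline budget relative to the default, pushing more column data out to the TOAST table.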

ftstest=# create table testtext8(id SERIAL,col text) with
(tuples_per_page=8);
NOTICE:  CREATE TABLE will create implicit sequence testtext8_id_seq
for serial column testtext8.id
CREATE TABLE
ftstest=# create table testtext2(id SERIAL,col text) with
(tuples_per_page=2);
NOTICE:  CREATE TABLE will create implicit sequence testtext2_id_seq
for serial column testtext2.id
CREATE TABLE
ftstest=# insert into testtext8(col) (select (select
array_to_string(array_agg(chr((random()*95+30)::integer)),'') from
generate_series(1,3000)) as testtext from generate_series(1,5));
INSERT 0 5
ftstest=# insert into testtext2(col) (select (select
array_to_string(array_agg(chr((random()*95+30)::integer)),'') from
generate_series(1,3000)) as testtext from generate_series(1,5));
INSERT 0 5
ftstest=# \timing
### Here I stop PG and echo 3 > /proc/sys/vm/drop_caches
ftstest=# select count(id) from testtext2;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
ftstest=# select count(id) from testtext2;
 count
---
 5
(1 row)

Time: 4613.044 ms
ftstest=# select count(id) from testtext8;
 count
---
 5
(1 row)

Time: 318.743 ms

This obviously comes with a drawback if I actually need the data.

ftstest=# select max(length(col)) from testtext2;
 max
--
 3000
(1 row)

Time: 1445.016 ms
ftstest=# select max(length(col)) from testtext8;
 max
--
 3000
(1 row)

Time: 4184.994 ms

          relation           |  size
-----------------------------+---------
 pg_toast.pg_toast_1450869   | 195 MB
 public.testtext2            | 195 MB
 public.testtext8            | 2552 kB


No documentation on the patch. I'll do that a bit later.

Generally speaking, if you have some knowledge about the access
patterns of your data, this patch can enable you to teach PostgreSQL
to take advantage of it. In my situation I would estimate that the
production set would be able to drop a couple of GB from main memory
(leaving room for more index pages and such).


Thanks in advance.

-- 
Jesper Krogh
diff -rc ../postgresql-8.5alpha3.orig/src/backend/access/common/reloptions.c ./src/backend/access/common/reloptions.c
*** ../postgresql-8.5alpha3.orig/src/backend/access/common/reloptions.c	2009-08-27 19:18:44.0 +0200
--- ./src/backend/access/common/reloptions.c	2010-02-01 21:12:41.0 +0100
***
*** 15,20 
--- 15,21 
  
  #include "postgres.h"
  
+ #include "access/tuptoaster.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
  #include "access/nbtree.h"
***
*** 157,162 
--- 158,170 
  			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
  		}, -1, 0, 20
  	},
+ 	{
+ 		{
+ 			"tuples_per_page",
+ 			"Desired number of tuples per page (worst-case)",
+ 			RELOPT_KIND_HEAP
+ 		}, TOAST_TUPLES_PER_PAGE, 1, 32
+ 	},
  	/* list terminator */
  	{{NULL}}
  };
***
*** 1074,1079 
--- 1082,1088 
  	int			numoptions;
  	static const relopt_parse_elt tab[] = {
  		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
+ 		{"tuples_per_page", RELOPT_TYPE_INT, offsetof(StdRdOptions, tuples_per_page)},
  		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
  		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
  		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
diff -rc ../postgresql-8.5alpha3.orig/src/backend/access/heap/tuptoaster.c ./src/backend/access/heap/tuptoaster.c
*** ../postgresql-8.5alpha3.orig/src/backend/access/heap

Re: [HACKERS] Make TOAST_TUPLES_PER_PAGE configurable per table.

2010-02-01 Thread Jesper Krogh
Tom Lane wrote:
 Jesper Krogh jes...@krogh.cc writes:
 This patch enables users to set TOAST_TUPLES_PER_PAGE with
 ALTER TABLE table SET (tuples_per_page = X); .. currently with 1 <= X
 <= 32;
 
 It's not clear to me that fiddling with that is useful unless the toast
 tuple size also changes; and unfortunately changing that is much harder,
 because it's wired into the addressing of toast data.  See also these
 notes:
 
  * XXX while these can be modified without initdb, some thought needs to be
  * given to needs_toast_table() in toasting.c before unleashing random
  * changes.  Also see LOBLKSIZE in large_object.h, which can *not* be
  * changed without initdb.

I can see that needs_toast_table() might need some changes, since it also
uses TOAST_TUPLE_THRESHOLD, and might benefit from being aware of when a
toast table is triggered.

There might be more benefit in changing the toast tuple size (I don't
have enough insight to see that), but even without it I can get a
speedup of x10 on a simple test and permanently get the system to use
its cache for more commonly used data than these attributes that are
rarely used.

Ultimately I would like an infinite amount of configurability, since I
have tables that consist only of simple values where 50% is really
rarely used and 50% is used very often. But just changing
TOAST_TUPLES_PER_PAGE as above can easily increase my tuple density
from 6/page to 40-60/page, which translates directly into:
* Less data to read when accessing the tuples.
* Less data to cache that is rarely used.

Whereas on the table with simple values I might at best be able to
double the tuple density.

But yes, it isn't a silver bullet; it requires knowledge of the access
patterns of the data.

What kinds of arguments/tests/benchmarks are required to demonstrate the
usefulness of fiddling with this parameter?

A real-world database in our environment has:
12M rows with an average text length of ~2KB; the directly
toastable set is 5GB, which is really rarely used, but the webapp is
doing random reads for the presence/counts of these rows.
Another table has ~700M rows with a size of 135GB, where around
120GB is of the really rarely used type (but it takes time to compute, so
it makes sense to spend dead disk on it).

So based on the benchmark provided in this email I think that it can
significantly change the ratio of cache hits/misses for the application
(which has 64GB of dedicated memory).

-- 
Jesper
