Re: [HACKERS] a tsearch issue

2011-11-06 Thread Sushant Sinha
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
 Hello
 
 I found an interesting issue when I checked tsearch prefix searching.
 
 We use an ispell-based dictionary
 
 CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
 CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
 ALTER TEXT SEARCH CONFIGURATION cs
ALTER MAPPING FOR word, asciiword WITH cspell, simple;
 
 Then I created a table
 
 postgres=# create table n(a varchar);
 CREATE TABLE
 postgres=# insert into n values('Stěhule'),('Chromečka');
 INSERT 0 2
 postgres=# select * from n;
  a
 ───
  Stěhule
  Chromečka
 (2 rows)
 
 and I tested a prefix searching:
 
 I found the following issue
 
 postgres=# select * from n where to_tsvector('cs', a) @@
 to_tsquery('cs','Stě:*') ;
  a
 ───
 (0 rows)

Most likely you are hitting this problem:
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01347.php

'Stě' may be a stopword in Czech.
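
A quick way to check that (just a sketch, assuming the cspell/simple setup from
your example) is to ask the dictionaries directly with ts_lexize():

postgres=# select ts_lexize('cspell', 'stě');
postgres=# select ts_lexize('simple', 'stě');

An empty array ({}) means the dictionary treats the word as a stopword, while
NULL means the word is simply unknown to it.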

 I expected one row. The problem is in the transformation of the word 'Stě'
 
 postgres=# select * from ts_debug('cs','Stě:*') ;
 ─[ RECORD 1 ]┬──
 alias│ word
 description  │ Word, all letters
 token│ Stě
 dictionaries │ {cspell,simple}
 dictionary   │ cspell
 lexemes  │ {sto}
 ─[ RECORD 2 ]┼──
 alias│ blank
 description  │ Space symbols
 token│ :*
 dictionaries │ {}
 dictionary   │ [null]
 lexemes  │ [null]
 

':*' is only meaningful to to_tsquery; ts_debug just invokes the parser,
so this is not a correct check of the prefix behaviour.
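
For instance (using the english configuration purely as an illustration):

postgres=# select to_tsquery('english', 'car:*');

should come back as the prefix query 'car':*, while ts_debug will keep
reporting ':*' as a blank token.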

-Sushant.




Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 I'm a couple of days away from posting a much better principled
 implementation of pg_stat_statements normalisation. To normalise, we
 perform a selective serialisation of the query tree, which is hashed.

That seems like an interesting approach, and definitely a lot less of
a kluge than what Tomas suggested.  Are you intending to hash the raw
grammar output tree, the parse analysis result, the rewriter result,
or the actual plan tree?  I don't necessarily have a preconceived notion
about which is best (I can think of pluses and minuses for each), but
we should hear your explanation of which one you chose and what your
reasoning was.

 ... It also does things
 like intelligently distinguishing between queries with different
 limit/offset constant values, as these constants are deemed to be
 differentiators of queries for our purposes. A guiding principle that
 I've followed is that anything that could result in a different plan
 is a differentiator of queries.

This claim seems like bunk, unless you're hashing the plan tree,
in which case it's tautological.  As an example, switching from
"where x > 1" to "where x > 2" could make the planner change plans,
if there's a sufficiently large difference in the selectivity of the two
cases (or even if there's a very small difference, but we're right at
the tipping point where the estimated costs are equal).  There's no way
to know this in advance unless you care to duplicate all the planner
cost estimation logic.  And I don't think you actually want to classify
"where x > 1" and "where x > 2" differently just because they *might*
give different plans in corner cases.

I'm not real sure whether it's better to classify on the basis of
similar plans or similar original queries, anyway.  This seems like
something that could be open for debate about use-cases.

 There will be additional infrastructure added to the parser to support
 normalisation of query strings for the patch I'll be submitting (that
 obviously won't be supported in the version that builds against
 existing Postgres versions that I'll make available). Essentially,
 I'll be adding a length field to certain nodes,

This seems like a good way to get your patch rejected: adding overhead
to the core system for the benefit of a feature that not everyone cares
about is problematic.  Why do you need it anyway?  Surely it's possible
to determine the length of a literal token after the fact.

More generally, if you're hashing anything later than the raw grammar
tree, I think that generating a suitable representation of the queries
represented by a single hash entry is going to be problematic anyway.
There could be significant differences --- much more than just the
values of constants --- between query strings that end up being
semantically the same.  Or for that matter we could have identical query
strings that wind up being considered different because of the action of
search_path or other context.

It might be that the path of least resistance is to document that we
select one of the actual query strings at random to represent all the
queries that went into the same hash entry, and not even bother with
trying to strip out constants.  The effort required to do that seems
well out of proportion to the reward, if we can't do a perfect job of
representing the common aspects of the queries.

regards, tom lane



Re: [HACKERS] [GENERAL] Strange problem with create table as select * from table;

2011-11-06 Thread Tom Lane
hubert depesz lubaczewski dep...@depesz.com writes:
 Any chance of getting the fix in patch format so we could test it on
 this system?

http://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=23998fe99c1220ba3a9eefee194e37ec1f14ae07

regards, tom lane



[HACKERS] Re: [GENERAL] Strange problem with create table as select * from table;

2011-11-06 Thread hubert depesz lubaczewski
On Fri, Nov 04, 2011 at 09:04:02PM -0400, Tom Lane wrote:
 that.  And that they are the only rows that, in addition to the above
 conditions, contain data fields wide enough to require out-of-line
 toasting.

checked lengths of the text/varchar columns in database.

there are 16 such columns in the table.
full report of lengths is in
http://www.depesz.com/various/lengths.report.gz

it was obtained using:
select length( first_text_column ) as length_1, count(*) from 
etsy_v2.receipts group by 1 order by 1;
and so on for every text column, and at the end I also made summary of
sum-of-lengths.

there is also:
http://www.depesz.com/various/lengths2.report.gz
which has the same summary, but only of the damaged rows.

As you can see the length of columns is not really special - somewhere
in the middle of all other rows. summarized length is also not special
in any way.

 These conditions together are enough to break the assumption in
 toast_insert_or_update that the old and new tuples must have the same
 value of t_hoff.  But it can only happen when the source tuple is an
 original on-disk tuple, which explains why only INSERT ... SELECT *
 causes the problem, not any variants that require projection of a new
 column set.  When it does happen, toast_insert_or_update correctly
 computes the required size of the new tuple ... but then it tells
 heap_fill_tuple to fill the data part at offset olddata->t_hoff, which
 is wrong (too small) and so the nulls bitmap that heap_fill_tuple
 concurrently constructs will overwrite the first few data bytes.  In
 your example, the table contains 49 columns so the nulls bitmap requires
 7 bytes, just enough to overwrite the first 6 data bytes as observed.
 (In fact, given the values we see being filled in, I can confidently say
 that you have two added-since-creation null columns, no more, no less.)
 
 I can reproduce the problem with the attached test case (using the
 regression database).  With asserts enabled, the 
   Assert(new_len == olddata->t_hoff);
 fails.  With asserts off, corrupt data.

How can I make the onek table for the test? Is it a standard table from
something?

 This is trivial to fix, now that we know there's a problem --- the
 function is only using that assumption to save itself a couple lines
 of code.  Penny wise, pound foolish :-(

Any chance of getting the fix in patch format so we could test it on
this system?

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/



Re: [HACKERS] [GENERAL] Strange problem with create table as select * from table;

2011-11-06 Thread Tom Lane
hubert depesz lubaczewski dep...@depesz.com writes:
 checked lengths of the text/varchar columns in database.

 there are 16 such columns in the table.
 full report of lengths is in
 http://www.depesz.com/various/lengths.report.gz

 it was obtained using:
 select length( first_text_column ) as length_1, count(*) from 
 etsy_v2.receipts group by 1 order by 1;
 and so on for every text column, and at the end I also made summary of
 sum-of-lengths.

BTW, that probably doesn't prove a lot since it takes no account of
compression, and different values could be compressible to varying
degrees.  It'd be more useful to look at pg_column_size() numbers
to get an idea of whether toasting happened or not.
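
For instance, something along these lines (reusing your first_text_column
placeholder) would do it:

select pg_column_size(first_text_column) as stored_bytes,
       length(first_text_column) as chars,
       count(*)
  from etsy_v2.receipts
 group by 1, 2
 order by 1 desc;

pg_column_size() reports the stored (possibly compressed) size, so comparing
it against length() gives a rough idea of how compressible each value was.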

regards, tom lane



[HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
à  a

WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Ġ  G

WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Š  S

===

I've dug around through the unaccent code a bit and I've noticed that
the sscanf it does when reading the file is producing some odd output.



Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Gah! Accidentally hit Send. Let me finish that last message before
sending this time!


G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
à      a
       
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Ġ      G
       
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Š      S
       
===

I've dug around through the unaccent.c code a bit and I've noticed
that the sscanf it does when reading the file is producing some odd
output. I've tried with a minimal example using the same sort of
sscanf code reading from the same unaccent.rules file, but the minimal
example doesn't produce the same output.

I put some elog debugging lines into unaccent.c and found that sscanf
sometimes reads the scanned line by finding only one byte for
the source character rather than the two required for the complete
UTF-8 code point. It appears that the following characters are causing
the problem, along with the code points and such:

'Å' = 'A' | c3,85 = 41
'à' = 'a' | c3,a0 = 61
'ą' = 'a' | c4,85 = 61
'Ġ' = 'G' | c4,a0 = 47
'Ņ' = 'N' | c5,85 = 4e
'Š' = 'S' | c5,a0 = 53

In each case, one byte was being read in the source string rather than
two, leading to the duplicate TO warnings above. This later leads to
the characters that produced the warning being ignored when unaccent
is called and left in the output.

I haven't been able to reproduce in a smaller example, and haven't
been able to reproduce on a CentOS server, so at this point I'm at a
loss as to the problem.

Anybody got any ideas?

Cheers



Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Dimitri Fontaine
Tomas Vondra t...@fuzzy.cz writes:
 If it was possible to compare the actual plan (or at least a hash of the
 plan), and keeping one record for each plan, that'd be extremely nice.
 You'd see that the query was executed with 3 different plans, number of
 calls, average duration etc.

I like that idea.  How does that integrate with current efforts?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] proposal: psql concise mode

2011-11-06 Thread Dickson S. Guedes
2011/11/5 Josh Kupershmidt schmi...@gmail.com:
 I'd like to propose a concise mode for psql, which users might turn
 on via a \pset option. Concise mode would affect only the output of
 psql's backslash commands. For output results which have some all-NULL
 columns, as in:

 test=# \d+ foo
                          Table "public.foo"
  Column |  Type   | Modifiers | Storage | Stats target | Description
 --------+---------+-----------+---------+--------------+-------------
  a      | integer |           | plain   |              |
  b      | integer |           | plain   |              |
 Has OIDs: no

 Concise mode would simply omit the all-NULL columns, so that the
 output would look like this:

 test=# \d+ foo
                          Table "public.foo"
  Column |  Type   | Storage
 --------+---------+---------
  a      | integer | plain
  b      | integer | plain
 Has OIDs: no

Using your example, what if column 'b' has a comment and 'a' does not? How
will the above output be displayed?

-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br



Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Florian Pflug
On Nov 6, 2011, at 18:43, J Smith wrote:
 I put some elog debugging lines into unaccent.c and found that sscanf
 sometimes reads the scanned line by finding only one byte for
 the source character rather than the two required for the complete
 UTF-8 code point. It appears that the following characters are causing
 the problem, along with the code points and such:
 
 'Å' = 'A' | c3,85 = 41
 'à' = 'a' | c3,a0 = 61
 'ą' = 'a' | c4,85 = 61
 'Ġ' = 'G' | c4,a0 = 47
 'Ņ' = 'N' | c5,85 = 4e
 'Š' = 'S' | c5,a0 = 53
 
 In each case, one byte was being read in the source string rather than
 two, leading to the duplicate TO warnings above. This later leads to
 the characters that produced the warning being ignored when unaccent
 is called and left in the output.

What's the locale of the database you're seeing this in, and which charset
does it use?

I think scanf() uses isspace() and friends, and last time I looked the
locale definitions were all pretty bogus on OSX. So maybe scanf() somehow
decides that 0xA0 is whitespace.

 I haven't been able to reproduce in a smaller example, and haven't
 been able to reproduce on a CentOS server, so at this point I'm at a
 loss as to the problem.

Have you tried to set the same locale as postgres (using setlocale()) in
your tests?

best regards,
Florian Pflug




Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Tomas Vondra
On 6 November 2011, 16:08, Tom Lane wrote:
 Peter Geoghegan pe...@2ndquadrant.com writes:
 I'm a couple of days away from posting a much better principled
 implementation of pg_stat_statements normalisation. To normalise, we
 perform a selective serialisation of the query tree, which is hashed.

 That seems like an interesting approach, and definitely a lot less of
 a kluge than what Tomas suggested.  Are you intending to hash the raw
 grammar output tree, the parse analysis result, the rewriter result,
 or the actual plan tree?  I don't necessarily have a preconceived notion
 about which is best (I can think of pluses and minuses for each), but
 we should hear your explanation of which one you chose and what your
 reasoning was.

Could you describe the pluses and minuses? My understanding is that the
later the hash is computed, the more it reflects how the queries were
actually executed. Would it make sense to turn this into a GUC and leave
the decision up to the user, something like

  pg_stat_statements.hash_phase = {grammar|analysis|rewriter|plan}

So that the user could decide how coarse the output should be?

 I'm not real sure whether it's better to classify on the basis of
 similar plans or similar original queries, anyway.  This seems like
 something that could be open for debate about use-cases.

Well, that's a really tricky question - there can be different queries with
the same plan. I think that grouping queries solely by the plan is not
very useful, so the original query should be involved somehow.

What about using two hashes - hash of the grammar tree (grammar_hash) and
hash of the rewriter output (rewriter_hash). The pg_stat_statements would
then group the queries by the (grammar_hash, rewriter_hash) pair and
include those two columns into the output.

So I could select the rows with the same grammar_hash to see observed
plans for the given query, or select rows with the same rewriter_hash to
see queries leading to that particular plan.
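
Purely to illustrate the idea (grammar_hash and rewriter_hash are hypothetical
columns here, nothing that exists today), usage might look like:

-- all rewritten forms / plans observed for one original query
select rewriter_hash, calls, total_time
  from pg_stat_statements
 where grammar_hash = '...';

-- all original queries that ended up as one particular rewritten form
select query, calls
  from pg_stat_statements
 where rewriter_hash = '...';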

To make this actually usable it's important to provide access to the
plans, so that the user can get rewriter_hash and get the plan somehow.
This is not needed for grammar_hash, because the query string will be
there, but the actual plan might change (due to autoanalyze etc.).

But maybe this is severe over-engineering and it's far too complicated.

 It might be that the path of least resistance is to document that we
 select one of the actual query strings at random to represent all the
 queries that went into the same hash entry, and not even bother with
 trying to strip out constants.  The effort required to do that seems
 well out of proportion to the reward, if we can't do a perfect job of
 representing the common aspects of the queries.

Yes, once we have the hashes we can surely use a random query string with
the values included. But it'd be nice to have the actual plans stored
somewhere, so that it's possible to see them later.

Tomas




Re: [HACKERS] Multiple queries in transit

2011-11-06 Thread Jeroen Vermeulen

On 2011-11-03 17:26, Marko Kreen wrote:

On Mon, Oct 31, 2011 at 7:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Can't you do that today with a multi-command string submitted to
PQsendQuery, followed by multiple calls to PQgetResult?


It's more annoying to do error handling on that, plus it still keeps the
blocking behaviour, just with larger blocks.


You can combine multi-command query strings with nonblocking mode, 
without any change in libpq itself.


In fact that's exactly what the libpqxx pipeline class does.  So if 
you're working in C++, you already have this feature at your disposal.




Also I would ask for the opposite feature: multiple rows in flight.
That means that when the server is sending a big resultset,
the app can process it row-by-row (or 10 rows at a time)
without stopping the stream and re-requesting.


Cursors.
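
That is, the row-by-row (or N-rows-at-a-time) pattern is already available
today - a minimal sketch, with a made-up big_table:

begin;
declare big_rs cursor for select * from big_table;
fetch 10 from big_rs;   -- process this batch, then fetch the next one
fetch 10 from big_rs;
close big_rs;
commit;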


Jeroen



Re: [HACKERS] foreign key locks, 2nd attempt

2011-11-06 Thread Jeroen Vermeulen

On 2011-11-04 01:12, Alvaro Herrera wrote:


I would like some opinions on the ideas on this patch, and on the patch
itself.  If someone wants more discussion on implementation details of
each part of the patch, I'm happy to provide a textual description --
please just ask.


Jumping in a bit late here, but thanks for working on this: it looks 
like it could solve some annoying problems for us.


I do find myself idly wondering if those problems couldn't be made to go 
away more simply given some kind of “I will never ever update this key” 
constraint.  I'm having trouble picturing the possible lock interactions 
as it is.  :-)



Jeroen



Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Greg Smith

On 11/06/2011 01:08 PM, Tom Lane wrote:

Peter Geoghegan pe...@2ndquadrant.com writes:
   

... It also does things
like intelligently distinguishing between queries with different
limit/offset constant values, as these constants are deemed to be
differentiators of queries for our purposes. A guiding principle that
I've followed is that anything that could result in a different plan
is a differentiator of queries.
 

This claim seems like bunk, unless you're hashing the plan tree,
in which case it's tautological.


Peter's patch adds a planner hook and hashes at that level.  Since this 
cat is rapidly escaping its bag and impacting other contributors, we 
might as well share the work in progress early if anyone has a burning 
desire to look at the code.  A diff against the version I've been 
internally reviewing in prep for community submission is at 
https://github.com/greg2ndQuadrant/postgres/compare/master...pg_stat_statements_norm  
It's easier to browse it than to ask probing questions about it, I think.  Apologies 
to Peter for sharing his work before he was completely ready; there is a 
list of known problems with it.  I also regret the thread hijack here.


The first chunk of code is a Python based regression test program, and 
an open item here is the best way to turn that into a standard 
regression test set.



There will be additional infrastructure added to the parser to support
normalisation of query strings for the patch I'll be submitting (that
obviously won't be supported in the version that builds against
existing Postgres versions that I'll make available). Essentially,
I'll be adding a length field to certain nodes,
 

This seems like a good way to get your patch rejected: adding overhead
to the core system for the benefit of a feature that not everyone cares
about is problematic.


Struggling around this area is the main reason this code hasn't been 
submitted yet.  To step back for a moment, there are basically three 
options here that any code like this can take, in regards to storing the 
processed query name used as the key:


1) Use the first instance of that query seen as the reference version
2) Use the last instance seen
3) Transform the text of the query in a way that's stable across all 
duplicates of that statement, and output that


Downstream tools operating on this data, things that will sample 
pg_stat_statements multiple times, need some sort of stable query text 
they can operate on.  It really needs to survive server restart too.  
Neither (1) nor (2) seem like usable solutions.  Both have been 
implemented already in Peter's patch, with (2) being what's in the 
current patch.  How best to do (3) instead is not obvious though.


Doing the query matching operating at the planner level seems effective 
at side-stepping the problem of needing to parse the SQL, which is where 
most implementations of this idea get stuck doing fragile 
transformations.  My own first try at this back in September and Tomas's 
patch both fall into that category.  That part of Peter's work seems to 
work as expected.  That still leaves the problem of parsed query -> 
stable normalized string.  I think that is an easier one to solve than 
directly taking on the entirety of query text -> stable normalized 
string though.  Peter has an idea he's exploring for how to implement 
that, and any amount of overhead it adds to people who don't use this 
feature is an obvious concern.  There are certainly use cases that don't 
care about this problem, ones that would happily take (1) or (2).  I'd 
rather not ship yet another "not quite right for everyone" version of 
pg_stat_statements again though; only solving too limited of a use case 
is the big problem with the one that's already there.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [HACKERS] Measuring relation free space

2011-11-06 Thread Magnus Hagander
On Sun, Nov 6, 2011 at 04:08, Greg Smith g...@2ndquadrant.com wrote:
 Attached patch adds a new function to the pageinspect extension for
 measuring total free space, in either tables or indexes.  It returns the
 free space as a percentage, so higher numbers mean more bloat.  After trying
 a couple of ways to quantify it, I've found this particular measure
 correlates well with the nastiest bloat issues I've run into in production
 recently.  For example, an index that had swelled to over 5X its original
 size due to autovacuum issues registered at 0.86 on this scale.  I could
 easily see people putting an alert at something like 0.40 and picking
 candidates to reindex based on it triggering.  That would be about a million
 times smarter than how I've been muddling through this class of problems so
 far.

 Code by Jaime Casanova, based on a prototype by me.  Thanks to attendees and
 sponsors of the PgWest conference for helping to fund some deeper
 exploration of this idea.

Looks pretty useful.

One quick stylistic comment - we don't generally use * 1.0 to turn
an int into a double - just use a cast.


 -Given this is doing a full table scan, should it hook into a ring buffer to
 keep from trashing the buffer cache?  Or might it loop over the relation in
 a different way all together?  I was thinking about eyeing the FSM instead
 at one point, didn't explore that yet.  There's certainly a few ways to
 approach this, we just aimed at the easiest way to get a working starter
 implementation, and associated results to compare others against.

Hooking into a ring buffer seems like almost a requirement before you
can run this on a larger production system, doesn't it? I don't know
how hard that is code-wise, but it certainly seems worthwhile.


 -Should there be a non-superuser version of this?  We'd certainly need to
 get a less cache demolishing version before that would seem wise.

Not sure that's necessary - at least not for now. Many other
diagnostics functions are already superuser only...


 -There were related things in the pageinspect module, but a case could be
 made for this being a core function instead.  It's a bit more likely to be
 used in production than the rest of that extension.

A case can be made for a lot of things in contrib to be in core ;) I
say let's keep it in pageinspect, but then also have you finish off
that split up the contrib patch :-)


 -What if anything related to TOAST should this handle?

Similar data for TOAST relations would be interesting, no? But that's
easily done from userspace by just querying the toast table
specifically, I assume?
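
Something like the following, presumably - where relation_free_space() is just
a stand-in for whatever name (and argument type) the patch actually exposes,
and 'my_table' is whatever table you're interested in:

-- measure the toast relation backing a given table directly
select relation_free_space(reltoastrelid)
  from pg_class
 where oid = 'my_table'::regclass
   and reltoastrelid <> 0;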


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] proposal: psql concise mode

2011-11-06 Thread Josh Kupershmidt
On Sun, Nov 6, 2011 at 1:16 PM, Dickson S. Guedes lis...@guedesoft.net wrote:
 test=# \d+ foo
                          Table "public.foo"
  Column |  Type   | Storage
 --------+---------+---------
  a      | integer | plain
  b      | integer | plain
 Has OIDs: no

 Using your example, what if column 'b' has a comment and 'a' not? How
 the above output will be displayed?

Then the comments would be displayed as they previously were, like so:

          Table "public.foo"
 Column |  Type   | Storage | Description
--------+---------+---------+--------------
 a      | integer | plain   |
 b      | integer | plain   | some comment
Has OIDs: no

Josh



Re: [HACKERS] Measuring relation free space

2011-11-06 Thread Bernd Helmle



--On 6. November 2011 01:08:11 -0200 Greg Smith g...@2ndquadrant.com wrote:


Attached patch adds a new function to the pageinspect extension for measuring
total free space, in either tables or indexes.


I wonder if that should be done in the pgstattuple module, which outputs some 
similar numbers.


--
Thanks

Bernd



Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Peter Geoghegan
On 6 November 2011 15:08, Tom Lane t...@sss.pgh.pa.us wrote:
 Are you intending to hash the raw
 grammar output tree, the parse analysis result, the rewriter result,
 or the actual plan tree?

I'm hashing the Query tree from a planner plugin (function assigned to
planner_hook), immediately prior to it being passed to
standard_planner. A major consideration was backwards compatibility;
at one point, it looked like this all could be done without adding any
new infrastructure, with perhaps an officially sanctioned 9.2 version
of pg_stat_statements that could be built against earlier pg versions.
 Other than that, it seems intuitively obvious that this should happen
as late as possible in the parsing stage - any transformation
performed prior to planning was considered essential to the query,
even if that potentially meant that successive calls of the same query
string were considered different due to external factors. These things
probably don't come up too often in practice, but I think that they're
nice to have, as they prevent the DBA from looking at statistics that
aren't actually representative.

 A guiding principle that
 I've followed is that anything that could result in a different plan
 is a differentiator of queries.

 This claim seems like bunk, unless you're hashing the plan tree,
 in which case it's tautological.

I think I may have been unclear, for which I apologise. Certainly, if
the principle I (mis)described was followed to the letter, I'd end up
with something that was exactly the same as what we already have. I
merely meant to suggest, with an awareness of the fact that I was
saying something slightly controversial, that because the planner is
/invariably/ sensitive to changes in limit/offset constants,
particularly large changes, it made sense to have those constants as
differentiators, and it certainly made sense to have a limit n
differentiate the same query with and without a limit n. I do of
course appreciate that the use of a different constant in quals could
result in a different plan being generated due to their selectivity
estimates varying.

 I'm not real sure whether it's better to classify on the basis of
 similar plans or similar original queries, anyway.  This seems like
 something that could be open for debate about use-cases.

Indeed - it's generally difficult to reason about what behaviour is
optimal when attempting to anticipate the needs of every Postgres DBA.
In this case, I myself lean strongly towards similar original queries,
because classifying on the basis of similar plans isn't likely to be
nearly as actionable, and there are really no obvious precedents to
draw on - how can it be turned into a useful tool?

 There will be additional infrastructure added to the parser to support
 normalisation of query strings for the patch I'll be submitting (that
 obviously won't be supported in the version that builds against
 existing Postgres versions that I'll make available). Essentially,
 I'll be adding a length field to certain nodes,

 This seems like a good way to get your patch rejected: adding overhead
 to the core system for the benefit of a feature that not everyone cares
 about is problematic.

It's problematic, but I believe that it can be justified. Without
being glib, exactly no one cares about the location of tokens that
correspond to Const nodes until they have an error that occurs outside
the parser that is attributable to a node/corresponding token, which,
in a production environment, could take a long time. All I'm doing is
moving slightly more towards the default Bison representation of
YYLTYPE, where the column and row of both the beginning and end of
each token are stored. I'm storing location and length rather than
just location, which is a modest change. Not everyone cares about this
feature, but plenty do. It will be particularly useful to have the
Query representation stable, even to the point where it is stable
between pg_stat_statements_reset() calls - third party tools can rely
on the stability of the query string.

 Why do you need it anyway?  Surely it's possible
 to determine the length of a literal token after the fact.

Probably, but not in a manner that I deem to be well-principled. I
want to push the onus on keeping bookkeeping for the replacement of
tokens into the parser, which is authoritative - otherwise, I'll end
up with a kludge that is liable to fall out of sync. The community
will have the opportunity to consider if that trade-off makes sense.

 More generally, if you're hashing anything later than the raw grammar
 tree, I think that generating a suitable representation of the queries
 represented by a single hash entry is going to be problematic anyway.
 There could be significant differences --- much more than just the
 values of constants --- between query strings that end up being
 semantically the same.  Or for that matter we could have identical query
 strings that wind up being considered different because of the action of
 

Re: [HACKERS] [PATCH] optional cleaning queries stored in pg_stat_statements

2011-11-06 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 On 6 November 2011 15:08, Tom Lane t...@sss.pgh.pa.us wrote:
 Are you intending to hash the raw
 grammar output tree, the parse analysis result, the rewriter result,
 or the actual plan tree?

 I'm hashing the Query tree from a planner plugin (function assigned to
 planner_hook), immediately prior to it being passed to
 standard_planner.

IOW, the rewriter result.  Why that choice?  That seems like about the
least useful of the four possibilities, since it provides no certainty
about the plan while also being as far from the original text as you
can get (thus making the problem of what to display pretty hard).

 A major consideration was backwards compatibility;

This is not a consideration that the community is likely to weigh
heavily, or indeed at all.  We aren't going to back-port this feature
into prior release branches, and we aren't going to want to contort its
definition to make that easier.  If 2ndquadrant wants to ship a
back-branch version with the feature, you can certainly also back-port
a patch that adds a hook where you need it, if you need a new hook.

But frankly I'm a bit surprised that you're not hashing the query plan,
since you could get that in the ExecutorStart hook that
pg_stat_statements already uses.  With a hook placed somewhere else, you
add additional implementation problems of matching up the calls to that
hook with later calls to the executor hook.

 I'm not real sure whether it's better to classify on the basis of
 similar plans or similar original queries, anyway.  This seems like
 something that could be open for debate about use-cases.

 Indeed - it's generally difficult to reason about what behaviour is
 optimal when attempting to anticipate the needs of every Postgres DBA.
 In this case, I myself lean strongly towards similar original queries,
 because classifying on the basis of similar plans isn't likely to be
 nearly as actionable, and there are really no obvious precedents to
 draw on - how can it be turned into a useful tool?

Hm, well, if that's what you think, why so late in the sequence?
Hashing the raw grammar output would be the best way to identify similar
original queries, ISTM.  You'd also have a better chance of getting
people to hold still for the extra overhead fields, if they didn't
need to propagate further than that representation.

Basically, I think there are fairly clear arguments for using a hash of
the raw grammar output, if you lean towards the classify by original
query viewpoint; or a hash of the plan tree, if you lean towards the
classify by plan viewpoint.  I don't see what use-cases would make it
optimal to hash one of the intermediate stages.  I was hoping you had an
argument stronger than "it'll be easy to back-port" for that.

regards, tom lane



Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug f...@phlo.org wrote:

 What's the locale of the database you're seeing this in, and which charset
 does it use?

 I think scanf() uses isspace() and friends, and last time I looked the
 locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
 decides that 0xA0 is whitespace.


Ah, that does it: the locale I was using in the test code was just
plain ol' C locale, while in the database it was en_CA.UTF-8. Changing
the locale in my test code produced the wonky results. I should have
figured it was a locale problem. Sure enough, in a UTF-8 locale, it
believes that both 0xa0 and 0x85 are spaces. Pretty wonky behaviour
indeed.

Apparently this is a known OSX issue that has its roots in an older
version of FreeBSD's libc I guess, eh? I've found various bug reports
that allude to the problem and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.

I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data



Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Tom Lane
J Smith dark.panda+li...@gmail.com writes:
 I've attached a patch against master for unaccent.c that uses swscanf
 along with char2wchar and wchar2char instead of sscanf directly to
 initialize the unaccent extension and it appears to fix the problem in
 both the master and 9.1 branches.

swscanf doesn't seem like an acceptable approach: it's a function that
is relied on nowhere else in PG, so it adds new portability risks of its
own.  It doesn't exist on some platforms that we support (like the one
I'm typing this message on) and there's no real good reason to assume
that it's not broken in its own ways on others.

If you really want to pursue this, I'd suggest parsing the line
manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
be very many more lines than what you've got here.

However, the bigger picture is that OS X's UTF8 locales are broken
through-and-through, and most of their other problems are not feasible
to work around.  So basically you can't use them for anything
interesting, and it's not clear that it's worth putting any time into
solving individual problems.  In the particular case here, the issue
presumably is that sscanf is relying on isspace() ... but we rely on
isspace() directly, in quite a lot of places, so how much is it going
to fix to dodge it right here?

regards, tom lane



[HACKERS] btree gist known problems

2011-11-06 Thread Jeff Davis
I'm looking for known problem areas in btree_gist. I see:

http://archives.postgresql.org/message-id/8973.1286841...@sss.pgh.pa.us

With Range Types, I'm anticipating that btree_gist will become more
important, so I'd like to know what bugs are holding it back.

So, anything else come to mind? Or does btree_gist just need a good
review? Or have the problems been fixed?

Regards,
Jeff Davis




Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On 2011-11-06, at 7:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 swscanf doesn't seem like an acceptable approach: it's a function that
 is relied on nowhere else in PG, so it adds new portability risks of its
 own.  It doesn't exist on some platforms that we support (like the one
 I'm typing this message on) and there's no real good reason to assume
 that it's not broken in its own ways on others.

 If you really want to pursue this, I'd suggest parsing the line
 manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
 be very many more lines than what you've got here.

 However, the bigger picture is that OS X's UTF8 locales are broken
 through-and-through, and most of their other problems are not feasible
 to work around.  So basically you can't use them for anything
 interesting, and it's not clear that it's worth putting any time into
 solving individual problems.  In the particular case here, the issue
 presumably is that sscanf is relying on isspace() ... but we rely on
 isspace() directly, in quite a lot of places, so how much is it going
 to fix to dodge it right here?

regards, tom lane

There are some fixes for isspace and friends that I've seen python
using so perhaps in those cases a similar fix could be applied. For
instance, maybe something like the code around line 674 here:

http://svn.python.org/view/python/trunk/Include/pyport.h?revision=81029&view=markup

Perhaps that would be suitable on OSX at least in the case of isspace
et al. As far as I can tell scanf doesn't seem to use isspace on my
system, so that would only be a partial fix for this and whatever other
situations isspace is used in. (I'm on a mobile now so I can't check at the
moment.)

This isn't really a huge deal for me but I'll try to get a chance to
write up a little parser anyways just for kicks.

Cheers



Re: [HACKERS] Measuring relation free space

2011-11-06 Thread Mark Kirkwood

On 07/11/11 10:20, Bernd Helmle wrote:



--On 6. November 2011 01:08:11 -0200 Greg Smith g...@2ndquadrant.com 
wrote:


Attached patch adds a new function to the pageinspect extension for 
measuring

total free space, in either tables or indexes.


I wonder if that should be done in the pgstattuple module, which 
output some similar numbers.




Not meaning to disparage Greg's effort in any way, but I was thinking 
the same thing about pg_freespacemap. I have not checked what - if any - 
differences there are in output, but it would be interesting to compare 
which of the various (3 at present) extensions with slightly overlapping 
areas of functionality should perhaps be merged.


I am guessing (at this point very much guessing) that pg_freespacemap 
may return its data faster, as pageinspect is gonna have to grovel 
through all the pages for itself (whereas pg_freespacemap relies on 
using info from the ... free space map fork).
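
For comparison, pg_freespacemap already lets you read the FSM contents from
SQL via its pg_freespace() function, e.g. (with a made-up table name, and
assuming the default 8 kB block size):

select * from pg_freespace('some_table');

select sum(avail)::float8 / (count(*) * 8192) as approx_free_fraction
  from pg_freespace('some_table');

though of course that only reflects what vacuum has recorded in the map, not
a fresh scan of the pages.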


regards

Mark




[HACKERS] [patch] Include detailed information about a row failing a CHECK constraint into the error message

2011-11-06 Thread Jan Kundrát
Hi,
when I insert/update many rows at once using INSERT ... SELECT into a
table which has plenty of CHECK constraints, the error message that
Postgres returns has no indication of which row failed the constraint
check. The attached patch tries to provide information in a similar way
to how duplicate items in a UNIQUE constraint are handled.

Originally, I tried to simply check the new row's t_ctid, but it was
always (0,0) -- I guess that's expected, maybe it's still in memory at
that time and maybe such nodes don't have a ctid assigned yet.

Please let me know if this patch is suitable for inclusion. It's based
on REL9_0_STABLE, because that's the version I'm running.

I'd like to thank intgr on IRC for his feedback when I was wondering
about the t_ctid.

With kind regards,
Jan

-- 
Trojita, a fast e-mail client -- http://trojita.flaska.net/
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 504f4de..eb4f595 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1364,10 +1364,32 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		const char *failed;
 
 		if ((failed = ExecRelCheck(resultRelInfo, slot, estate)) != NULL)
+		{
+			StringInfoData buf;
+			int natts = rel->rd_att->natts;
+			int i;
+			initStringInfo(&buf);
+			for (i = 0; i < natts; ++i)
+			{
+				char *val;
+				Oid foutoid;
+				bool typisvarlena;
+				getTypeOutputInfo(rel->rd_att->attrs[i]->atttypid, &foutoid, &typisvarlena);
+				if (slot->tts_isnull[i])
+					val = NULL;
+				else
+					val = OidOutputFunctionCall(foutoid, slot->tts_values[i]);
+				if (i > 0)
+					appendStringInfoString(&buf, ", ");
+				appendStringInfoString(&buf, val);
+			}
 			ereport(ERROR,
 					(errcode(ERRCODE_CHECK_VIOLATION),
 					 errmsg("new row for relation \"%s\" violates check constraint \"%s\"",
-							RelationGetRelationName(rel), failed)));
+							RelationGetRelationName(rel), failed),
+					 errdetail("New row with data (%s) violates check constraint \"%s\".",
+							   buf.data, failed)));
+		}
 	}
 }
 


signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Alright, I wrote up another patch that uses strchr to parse out the
lines of the unaccent.rules file, foregoing sscanf completely.
Hopefully this looks a bit better than using swscanf.

As for the other problems with isspace and such on OSX, it might be
worth looking at the python portability fixes. I played briefly with
the isspace and friends macros they have and they looked okay, but I
certainly can't speak for how well they'd work for the rest of the
PostgreSQL code base.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data
