Re: [HACKERS] pg_trgm

2010-05-29 Thread Tatsuo Ishii
  It's not a practical solution for people working with prebuilt Postgres
  versions, which is most people.  I don't object to finding a way to
  provide a not-space behavior instead of an is-alnum behavior,
  but as noted upthread a GUC isn't the right way.  How do you feel
  about a new set of functions with an additional flag argument of
  some sort?
 
 Let me see how many functions we need to create...

After thinking a little bit more, I think the following patch would not
break existing behavior and also handles the multibyte + C locale case. What
do you think?

*** trgm_op.c~  2009-06-11 23:48:51.0 +0900
--- trgm_op.c   2010-05-29 17:07:28.0 +0900
***
*** 59,65 
  }
  
  #ifdef KEEPONLYALNUM
! #define iswordchr(c)  (t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)  (!t_isspace(c))
  #endif
--- 59,67 
  }
  
  #ifdef KEEPONLYALNUM
! #define iswordchr(c)  (lc_ctype_is_c()? \
!                        ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
!                        (t_isalpha(c) || t_isdigit(c)))
  #else
  #define iswordchr(c)  (!t_isspace(c))
  #endif

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 23:47, Jan Urbański wrote:

On 28/05/10 22:22, Tom Lane wrote:
   

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.  Then use that estimate as
the s value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper.  If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.
 

I see what you mean, so the idea would be:

  * assume some value of W as the number of all words in the language
  * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
  * set e = s/10 and w = 1/e, that is 10/s
  * perform LC using that value of w
  * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
  * create the MCELEM entries as (item, f/N)
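
For reference, a minimal standalone sketch of the Lossy Counting bookkeeping
described above (hypothetical names and a naive linear lookup purely for
illustration; the real ts_typanalyze.c code uses a hash table keyed by the
lexeme):

#include <string.h>

#define MAX_TRACKED 4096

typedef struct
{
	const char *lexeme;			/* pointer kept as-is for the sketch */
	int			count;			/* f: observed occurrences */
	int			delta;			/* maximum possible undercount */
} LCEntry;

static LCEntry tracked[MAX_TRACKED];
static int	n_tracked = 0;

/* Feed one lexeme into the structure; b is the current bucket number. */
static void
lc_add(const char *lexeme, int b)
{
	int			i;

	for (i = 0; i < n_tracked; i++)
	{
		if (strcmp(tracked[i].lexeme, lexeme) == 0)
		{
			tracked[i].count++;
			return;
		}
	}
	if (n_tracked < MAX_TRACKED)
	{
		tracked[n_tracked].lexeme = lexeme;
		tracked[n_tracked].count = 1;
		tracked[n_tracked].delta = b - 1;
		n_tracked++;
	}
}

/* At every bucket boundary (i.e. every w lexemes), drop unlikely entries. */
static void
lc_prune(int b)
{
	int			i,
				kept = 0;

	for (i = 0; i < n_tracked; i++)
	{
		if (tracked[i].count + tracked[i].delta > b)
			tracked[kept++] = tracked[i];
	}
	n_tracked = kept;
}

The final step then keeps only entries with count >= (s - e) * N, which is
the pruning the numbers below are about.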

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.
   


I think you should skip the assumption about stop-words; users may
use something where they are needed in the index, or have a language
other than the typical one (and they don't seem to influence the math that much).

Isn't it the same type of logic that is used for collecting statistics
for array types, say integer arrays and text arrays?

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
   


I'm not sure I get this one... does this mean that we prune every time
we have collected 167 new data points? That would seem too often
to me, since that would roughly be once per row.


After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154, that is f < 291201.37

I wonder what would happen in Jesper's case if we did that... And I
wonder how sound that maths is
   


If it means that I would get an accurate MCE-histogram for all
things that occur in more than 5.4% of the rows
(given the samples chosen), then I think that would be really
reasonable.

I can fairly easily try out patches or do other kinds of testing.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 04:47, Tom Lane wrote:

Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list
   


I should have been more precise in the wording. Cranking up the
stats target gave me an overall better plan, but that is because the
range in the MCE histogram where the query plan for my sample
query tipped from a Bitmap Index Scan on the GIN index to an
Index Scan on a btree index actually became reliable.

This is due more to the nature of my application and test queries
than to the correctness of the MCE histogram.

So cranking up the statistics target moved the problem to
somewhere it didn't matter that much to me.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Statistics for tsvector wildcards. term*

2010-05-29 Thread Jesper Krogh

Hi.

There seems to be an unimplemented area around getting statistics for
wildcard searches. Wildcards anchored to the left can be matched
by the GIN index and the ts_match_vq operator:


testdb=# select to_tsvector('project') @@ to_tsquery('proj:*');
 ?column?
----------
 t
(1 row)

Searching for project gives me this estimate:
testdb=# explain select id,document_tsvector from efam.reference where 
document_tsvector @@ to_tsquery('projects') order by id limit 50;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=0.00..3008.54 rows=50 width=76)
   ->  Index Scan using reference_pkey on reference
       (cost=0.00..3762544.72 rows=62531 width=76)
         Filter: (document_tsvector @@ to_tsquery('projects'::text))
(3 rows)

whereas searching for proj:* gives:

testdb=# explain select id,document_tsvector from efam.reference where 
document_tsvector @@ to_tsquery('proj:*') order by id limit 50;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=73.56..73.58 rows=6 width=76)
   ->  Sort  (cost=73.56..73.58 rows=6 width=76)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.55..73.49 rows=6 width=76)
               Recheck Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.54 rows=6 width=0)
                     Index Cond: (document_tsvector @@ to_tsquery('proj:*'::text))

(7 rows)

There are obvious challenges in getting statistics for submatches where
there is no real information in the pg_stats table, but there will also
be a huge number of cases where a fairly reliable guess can be extracted.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 12:34, Jesper Krogh wrote:
 On 2010-05-28 23:47, Jan Urbański wrote:
 On 28/05/10 22:22, Tom Lane wrote:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.

 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.

 I think you should skip the assumption about stop-words, users may
 use something where they are needed in the index or have a language
 than the typical.  (and they dont seem to influcence the math that much).

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

 Isn't it the same type of logic that is used for collecting statistics
 for
 array-types, say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell,
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

 We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

 
 Im not sure I get this one.. does this mean that we prune everytime
 we have collected 167 new datapoints .. that would seem too often
 for me since that would roughly be once per row.

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly +top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law is talking about words in
documents in general. Normally a word like "the" would appear lots of
times in a document, but (even ignoring the fact that it's a stopword
and so won't appear at all) in a tsvector it will be present only once.
This may or may not be a problem; I'm not sure if such squashing of
occurrences as tsvectors do skews the distribution away from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

 After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

 So assuming that on average a tsvector has 154 elements and that we went
 through 35017 rows (as it would be in Jesper's case, before he raised
 the stats target from 100 to 1000), we will remove lexemes with f <
 0.054 * 35017 * 154, that is f < 291201.37

 I wonder what would happen in Jesper's case if we did that... And I
 wonder how sound that maths is

 
 If it means that I would get an accurate MCE-histogram for all
 things that occur in more than 5.4% of the rows
 (given the samples chosen), then I think that would be really
 reasonable.

Here's what the spreadsheet spat out.

The variables are:
 * the statistics target
 * top stopwords
 * error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think of it as the smudge
factor determining how much we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you would write out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. We want to discard words with frequency lower than that of
the (X + Y)'th word, and then we probably want to have some breathing
space as well. That cutoff frequency is called s in the LC algorithm.

Error factor determines the relation between s and e, since apparently
we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger
the bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. tsvector length and rows
in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f < (s-e)N
and N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be 100.

With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

 I can fairly easy try out patches or do other kind of testing.

I'll try to come up 

Re: [HACKERS] pg_trgm

2010-05-29 Thread Greg Stark
On Sat, May 29, 2010 at 9:13 AM, Tatsuo Ishii is...@postgresql.org wrote:
 ! #define iswordchr(c)  (lc_ctype_is_c()? \
 !                                                               ((*(c)  
 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \


Surely isspace(c) will always be false for non-ascii characters in C locale?

Now it might be sensible to just treat any non-ascii character as a
word-character in addition to alpha and digits, so what might make
sense is

   (t_isalpha(c) || t_isdigit(c)) || (lc_ctype_is_c() && *(c) > 0x80)
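
For concreteness, a sketch of what that variant of pg_trgm's iswordchr macro
could look like (untested, just restating the suggestion above;
lc_ctype_is_c(), t_isalpha(), t_isdigit() and t_isspace() are the existing
backend helpers):

#ifdef KEEPONLYALNUM
/*
 * Alphanumerics are word characters as before; additionally, when the
 * ctype locale is C, treat any byte with the high bit set (i.e. part of
 * a multibyte character) as a word character too.
 */
#define iswordchr(c) \
	(t_isalpha(c) || t_isdigit(c) || \
	 (lc_ctype_is_c() && (*(const unsigned char *) (c)) >= 0x80))
#else
#define iswordchr(c)	(!t_isspace(c))
#endif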

Though I wonder whether it wouldn't be generally more useful to users
to provide the non-space version as an option. I could see that being
useful for people in other circumstances aside from working around
this locale problem.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm

2010-05-29 Thread Tom Lane
Tatsuo Ishii is...@postgresql.org writes:
 After thinking a little bit more, I think the following patch would not
 break existing behavior and also handles the multibyte + C locale case. What
 do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable.  You need to provide
a way for the user to control the behavior.  (Once you've done that,
I think it can be just either alnum or !isspace, but maybe some
other behaviors would be interesting.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański <wulc...@wulczer.org> writes:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.

 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.

 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there.  It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law.  The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.0001.  That's pretty crude
of course but it seems like the right ballpark.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański <wulc...@wulczer.org> writes:
 Hm, I am now thinking that maybe this theory is flawed, because tsvecors
 contain only *unique* words, and Zipf's law is talking about words in
 documents in general. Normally a word like the would appear lots of
 times in a document, but (even ignoring the fact that it's a stopword
 and so won't appear at all) in a tsvector it will be present only once.
 This may or may not be a problem, not sure if such squashing of
 occurences as tsvectors do skewes the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents.  In any case we are not really depending on Zipf's
law heavily with this approach.  The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:09, Tom Lane wrote:
 Jan Urbański <wulc...@wulczer.org> writes:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.
 
 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.
 
 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
 
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and final prune value
of 6787.

Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański <wulc...@wulczer.org> writes:
 On 29/05/10 17:09, Tom Lane wrote:
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.

 Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25.  That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.
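
A quick numeric check of that 0.07 figure: with frequencies F1/k for
k = 1..W, the total is F1 * H(W), so requiring the total to come out to 1.0
gives F1 = 1/H(W), which for a vocabulary of about a million words is
roughly 0.07 (standalone throwaway code, not part of any patch; W = 1e6 is
an assumption):

#include <stdio.h>

int
main(void)
{
	double		h = 0.0;
	int			k;
	const int	W = 1000000;	/* assumed vocabulary size */

	for (k = 1; k <= W; k++)
		h += 1.0 / k;

	/* prints roughly: H(1000000) = 14.3927, F1 = 1/H = 0.0695 */
	printf("H(%d) = %.4f, F1 = 1/H = %.4f\n", W, h, 1.0 / h);
	return 0;
}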

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:34, Tom Lane wrote:
 Jan Urbański <wulc...@wulczer.org> writes:
 On 29/05/10 17:09, Tom Lane wrote:
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.
 
 Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
 
 Um, apparently I can't do simple arithmetic first thing in the morning
 either, cause I got my number wrong too ;-)
 
 After a bit more research: if you use the basic form of Zipf's law
 with a 1/k distribution, the first frequency has to be about 0.07
 to make the total come out to 1.0 for a reasonable number of words.
 So we could use s = 0.07 / K when we wanted a final list of K words.
 Some people (including the LC paper) prefer a higher exponent, ie
 1/k^S with S around 1.25.  That makes the F1 value around 0.22 which
 seems awfully high for the type of data we're working with, so I think
 the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

freq(K) = 1 / (K * H(W)), where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is
the number of words in English

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/2 * n^-1 - 1/12 * n^-2 + 1/120 * n^-4 -
O(n^-6)

Assuming 1 million words in English and the big-O term in the harmonic
expansion to be negligible, we get H(1e6) = 14.3927, which would make the
frequency of the K'th word 1/(14.3927 * K), that is 0.06948 / K (let's say
0.07 / K).

Which brings me to the same result as yours, which in turn reassures me
a lot ;) My previous result was wrong because I used the wrong logarithm
base, go figure.

So with this, for statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154*35017 lexemes in
the input the bucket width and the final pruning value depend only on
the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 =
1428 lexemes and would not discard anything from the result. If we want
e to be s/2 we'd prune every 2857 lexemes and discard lexemes with
counts < 1887. For s/3, s/4 etc. the numbers look like this:

  e    bucket width   prune below
 s/1        1428             0
 s/2        2857          1887
 s/3        4285          2516
 s/4        5714          2831
 s/5        7142          3019
 s/6        8571          3145
 s/7       10000          3235
 s/8       11428          3302
 s/9       12857          3355

s/2 or s/3 look reasonable.
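
For reference, the bucket widths and prune counts above follow directly from
w = 1/e and the f < (s - e) * N rule; a throwaway check using the values
assumed in this thread (s = 0.0007, N = 154 * 35017):

#include <stdio.h>

int
main(void)
{
	const double s = 0.0007;			/* predicted frequency of the 100th word */
	const double N = 154.0 * 35017.0;	/* lexemes per tsvector * rows sampled */
	int			d;

	for (d = 1; d <= 9; d++)
	{
		double		e = s / d;

		/* the table above shows these truncated to integers */
		printf("e = s/%d: bucket width %.1f, prune below %.1f\n",
			   d, 1.0 / e, (s - e) * N);
	}
	return 0;
}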

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański <wulc...@wulczer.org> writes:
 [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10.  Note the stated
assumption e << s.

 So, should I just write a patch that sets the bucket width and pruning
 count using 0.07 as the assumed frequency of the most common word and
 epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PG 9.0 release timetable

2010-05-29 Thread Bruce Momjian
Assuming we want a release Postgres 9.0 by mid-August, here is how the
timetable would look:

    Need RC release to be stable for 1-2 weeks before final
        RC must be released by August 1
    Beta must be stable for 2-3 weeks before RC
        Stable beta must be released by early July

So, we have 5-6 weeks to get a stable beta.  Looking at the open issues:


http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues

it looks like we are doing OK, but we must continue progressing.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance problem in textanycat/anytextcat

2010-05-29 Thread Robert Haas
On Mon, May 17, 2010 at 9:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, May 17, 2010 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Perhaps this is a backpatchable bug fix.  Comments?

 I can't say whether this is safe enough to back-patch, but the way
 this is set up, don't we also need to fix some catalog entries and, if
 yes, isn't that problematic?

 The only catalog entries at issue, AFAICT, are the textanycat/anytextcat
 ones.  I am not sure whether we should attempt to back-patch changes for
 them, but this patch wouldn't make the situation in the back branches
 worse.  In particular, if we apply this patch but don't change the
 catalog entries, then nothing would change at all about the problematic
 cases, because the planner would decide it couldn't safely inline the
 function.  The only cases where inlining will happen is where the
 expression's apparent volatility stays the same or decreases, so as far
 as that issue is concerned this patch will never make CREATE INDEX
 reject a case it would have accepted otherwise.  The patch *will* make
 CREATE INDEX reject cases with volatile default arguments hiding under
 non-volatile functions, but that's got nothing to do with any built-in
 functions; and that's the case I claim is clearly a bug fix.

This is still on the 9.0 open items list, but ISTM you fixed it with
two commits on May 27th.  Is that correct?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 9.0 release timetable

2010-05-29 Thread Robert Haas
On Sat, May 29, 2010 at 4:19 PM, Bruce Momjian br...@momjian.us wrote:
 Assuming we want a release Postgres 9.0 by mid-August, here is how the
 timetable would look:

        Need RC release to be stable for 1-2 weeks before final
                RC must be released by August 1
        Beta must be stable for 2-3 weeks before RC
                Stable beta must be released by early July

 So, we have 5-6 weeks to get a stable beta.  Looking at the open issues:

        
 http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues

 it looks like we are doing OK, but we must continue progressing.

This is a really short list.  Several of these items have already been
fixed, and others have been discussed extensively and are just a
question of making a final decision.  The thorniest question we have
yet to resolve is what to do about max_standby_delay - I think we need
Tom and Heikki to review this patch by Simon:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01666.php

The real question in terms of release, I think, is how long we want to
wait for more bugs to be found, and/or how much time do we want to
allow for Tom and others to do further review of the code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance problem in textanycat/anytextcat

2010-05-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 This is still on the 9.0 open items list, but ISTM you fixed it with
 two commits on May 27th.  Is that correct?

Oh, sorry, forgot to update the open items.  Done now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 9.0 release timetable

2010-05-29 Thread Robert Haas
On Sat, May 29, 2010 at 5:09 PM, Robert Haas robertmh...@gmail.com wrote:
 This is a really short list.

Thoughts on a few of the remaining items:

Type Mismatch Error in Set Returning Functions - tgl says this is a
deliberate change per link I just added to the wiki.  do we think more
is required here to prevent cranky users?
Should we revert the default output format for bytea to the old style
before shipping 9.0.0? - Consensus seems to be no, thus no action is
required.
don't rename index columns behavior has already broken JDBC - As I
understand it, this is not a code issue, but just something that
driver authors need to be aware of.
Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
the machine just ran out of disk space - not sure we need to do
anything here.
move 'long long' check to c.h - Is this perhaps addressed by Michael
Meskes commits on May 25th?
Mergejoin null handling - I think this is done:
http://archives.postgresql.org/pgsql-committers/2010-05/msg00332.php
Timeline for removal of older than 7.4 links to docs - link on the
wiki page is broken and this doesn't seem like a 9.0 issue anyway.
suggest we remove it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 9.0 release timetable

2010-05-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Thoughts on a few of the remaining items:

 Should we revert the default output format for bytea to the old style
 before shipping 9.0.0? - Consensus seems to be no, thus no action is
 required.

I think we should leave that there for awhile, though I agree it's
likely that the final decision will be no change.

 don't rename index columns behavior has already broken JDBC - As I
 understand it, this is not a code issue, but just something that
 driver authors need to be aware of.

There had been a section on the page about information we needed to
communicate to third-party authors.  Someone seems to have removed
that, but that seems like where this belongs.

 Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
 the machine just ran out of disk space - not sure we need to do
 anything here.

It's a bit weird though, because UpdateControlFile should always update
in place; why would there be any risk of out of disk space?  I would
like to find out exactly what happened, though I have no clear ideas
how to investigate it.

 move 'long long' check to c.h - Is this perhaps addressed by Michael
 Meskes commits on May 25th?
 Mergejoin null handling - I think this is done:

Yup, both done, I moved them.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread Robert Haas
On Fri, May 28, 2010 at 10:32 PM, Bruce Momjian br...@momjian.us wrote:
 Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  Currently, the check for exclusion constraints performs a sanity check
  that's slightly too strict -- it assumes that a tuple will conflict with
  itself. That is not always the case: the operator might be , in
  which case it's perfectly valid for the search for conflicts to not find
  itself.

  This patch simply removes that sanity check, and leaves a comment in
  place.

 I'm a bit uncomfortable with removing the sanity check; it seems like a
 good thing to have, especially since this code hasn't even made it out
 of beta yet.  AFAIK the  case is purely hypothetical, because we
 have no index opclasses supporting such an operator, no?  How about just
 documenting that we'd need to remove the sanity check if we ever did add
 support for such a case?

 Done, with attached, applied patch.

The only disadvantage I see of just documenting this is that someone
might write a user-defined index opclass that works like this, and
they won't be able to use this until at least 9.1 (or at least, not
without patching the source).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 The only disadvantage I see of just documenting this is that someone
 might write a user-defined index opclass that works like this, and
 they won't be able to use this until at least 9.1 (or at least, not
 without patching the source).

I don't actually think that anyone's very likely to write a <>-like index
operator.  It's approximately useless to use an index for such a query.

Or, to put it differently: if nobody's done that in the past twenty
years, why is it likely to happen before 9.1?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread David Fetter
On Sat, May 29, 2010 at 06:11:57PM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  The only disadvantage I see of just documenting this is that
  someone might write a user-defined index opclass that works like
  this, and they won't be able to use this until at least 9.1 (or at
  least, not without patching the source).
 
  I don't actually think that anyone's very likely to write a <>-like
 index operator.  It's approximately useless to use an index for such
 a query.
 
 Or, to put it differently: if nobody's done that in the past twenty
 years, why is it likely to happen before 9.1?

Because there's a fundamentally new way to use them now, namely with
exclusion constraints :)

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regression testing for psql

2010-05-29 Thread Selena Deckelmann
On Wed, May 26, 2010 at 6:25 PM, Stephen Frost sfr...@snowman.net wrote:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 There might be some value in psql backslash command tests that
 are designed to depend on just one or a few tables (or other appropriate
 objects).

 Updated, much much smaller, patch attached.  Also available, again, at
 http://snowman.net/~sfrost/psql-regress-help.patch

 Basically, I removed anything that would produce data directly from
 the catalogs by trying to find a 'none' object which matched.  This
  still goes through a lot of the same setup and query, it's just that
 there aren't any results.

Is this something to be added to 2010-07 commitfest?

-selena


-- 
http://chesnok.com/daily - me

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm

2010-05-29 Thread Tatsuo Ishii
 This is still ignoring the point: arbitrarily changing the module's
 longstanding standard behavior isn't acceptable.  You need to provide
 a way for the user to control the behavior.  (Once you've done that,
 I think it can be just either alnum or !isspace, but maybe some
 other behaviors would be interesting.)

To be honest I don't know what the module's longstanding standard
behavior should be. It's not documented anywhere. If you mean that it is
whatever the current implementation is, then any effort to touch the
module should be prohibited.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm

2010-05-29 Thread Tatsuo Ishii
  Wait. This works fine for me with stock pg_trgm. locale is C and
  encoding is UTF8. What version of PostgreSQL are you using? Mine is
  8.4.4.
 
 This is in 9.0, because 8.4 doesn't recognize the \u escape syntax.  If
 you run this in 8.4, you're just comparing a sequence of ASCII letters
 and digits.

Hum. Still I prefer 8.4's behavior since anything is better than
returning NaN. It seems 9.0 does not have any escape route for
multibyte+C locale users.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specification for Trusted PLs?

2010-05-29 Thread Bruce Momjian
Robert Haas wrote:
 On Sat, May 22, 2010 at 4:53 PM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
  2010/5/21 Jan Wieck janwi...@yahoo.com:
  The original idea was that a trusted language does not allow an 
  unprivileged
   user to gain access to any object or data he does not have access to
  without that language.
 
  This does not include data transformation functionality, like string
  processing or the like. As long as the user had legitimate access to the
  input datum, then every derived form thereof is OK.
 
  I find the current doc enough, add this prose from Jan as a comment
  might help people perhaps.
 
 Yeah, Jan's description is very clear and to the point.

The attached, applied patch clarifies the meaning of trusted language
in the documentation using Jan's description.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
Index: doc/src/sgml/xplang.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/xplang.sgml,v
retrieving revision 1.37
diff -c -c -r1.37 xplang.sgml
*** doc/src/sgml/xplang.sgml	3 Apr 2010 07:22:56 -0000	1.37
--- doc/src/sgml/xplang.sgml	30 May 2010 02:21:53 -0000
***************
*** 151,158 ****
  <optional>VALIDATOR <replaceable>validator_function_name</replaceable></optional> ;
  </synopsis>
      The optional key word <literal>TRUSTED</literal> specifies that
!     ordinary database users that have no superuser privileges should
!     be allowed to use this language to create functions and trigger
      procedures. Since PL functions are executed inside the database
      server, the <literal>TRUSTED</literal> flag should only be given
      for languages that do not allow access to database server
--- 151,160 ----
  <optional>VALIDATOR <replaceable>validator_function_name</replaceable></optional> ;
  </synopsis>
      The optional key word <literal>TRUSTED</literal> specifies that
!     the language does not grant access to data that the user would
!     not otherwise have.  Trusted languages are designed for ordinary
!     database users (those without superuser privilege) and allows them
!     to safely create of functions and trigger
      procedures. Since PL functions are executed inside the database
      server, the <literal>TRUSTED</literal> flag should only be given
      for languages that do not allow access to database server
Index: doc/src/sgml/ref/create_language.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/create_language.sgml,v
retrieving revision 1.50
diff -c -c -r1.50 create_language.sgml
*** doc/src/sgml/ref/create_language.sgml	3 Apr 2010 07:22:58 -0000	1.50
--- doc/src/sgml/ref/create_language.sgml	30 May 2010 02:21:53 -0000
***************
*** 104,114 ****
  
     <listitem>
      <para>
!      <literal>TRUSTED</literal> specifies that
!      the language is safe, that is, it does not offer an
!      unprivileged user any functionality to bypass access
!      restrictions. If this key word is omitted when registering the
!      language, only users with the
       <productname>PostgreSQL</productname> superuser privilege can
       use this language to create new functions.
      </para>
--- 104,113 ----
  
     <listitem>
      <para>
!      <literal>TRUSTED</literal> specifies that the language does
!      not grant access to data that the user would not otherwise
!      have.  If this key word is omitted
!      when registering the language, only users with the
       <productname>PostgreSQL</productname> superuser privilege can
       use this language to create new functions.
      </para>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up

2010-05-29 Thread Bruce Momjian

Added to TODO:

Consider a faster CRC32 algorithm

* http://archives.postgresql.org/pgsql-hackers/2010-05/msg01112.php 

---

Andres Freund wrote:
 Hi,
 
 I started to analyze XLogInsert because it was the major bottleneck when 
 creating some materialized view/cached tables/whatever.
 Analyzing it I could see that the content of the COMP_CRC32 macro was taking most 
 of the time, which isn't immediately obvious when you profile because it 
 obviously doesn't show up as a separate function.
 I first put it into functions to make it easier to profile. I couldn't 
 measure 
 any difference for COPY, CTAS and a simple pgbench run on 3 kinds of hardware 
 (Core2, older Xeon, older Sparc systems).
 
 I looked a bit around for faster implementations of CRC32 and found one in 
 zlib. After adapting it (pg uses a slightly different, non-inverted
 computation) I found that it increases the speed of the CRC32 calculation
 itself 3-fold.
 It does that by using not one lookup table but four (one for each byte
 of a word). Those four calculations are independent and thus are considerably 
 faster on somewhat recent hardware.
 Also it does memory lookups in 4 byte steps instead of 1 byte as the pg 
 version (that's only about an 8% benefit in itself).
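
 For readers unfamiliar with the trick, here is a minimal standalone
 illustration of the four-table ("slicing-by-4") idea, using the reflected
 zlib polynomial 0xEDB88320. This is not the pg_crc code nor the patch
 (PostgreSQL's CRC differs in details, e.g. it is non-inverted); it only
 shows why four independent table lookups per 32-bit word beat one lookup
 per byte:

 #include <stddef.h>
 #include <stdint.h>

 static uint32_t crc_tab[4][256];

 static void
 crc_init_tables(void)
 {
 	int			i,
 				t;

 	for (i = 0; i < 256; i++)
 	{
 		uint32_t	c = (uint32_t) i;
 		int			k;

 		for (k = 0; k < 8; k++)
 			c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
 		crc_tab[0][i] = c;
 	}
 	/* crc_tab[t][i] is crc_tab[t-1][i] extended by one extra zero byte */
 	for (i = 0; i < 256; i++)
 		for (t = 1; t < 4; t++)
 			crc_tab[t][i] = crc_tab[0][crc_tab[t - 1][i] & 0xFF] ^
 				(crc_tab[t - 1][i] >> 8);
 }

 static uint32_t
 crc32_sliced(const unsigned char *p, size_t len)
 {
 	uint32_t	crc = 0xFFFFFFFFu;

 	/* main loop: four independent lookups per 4 input bytes */
 	while (len >= 4)
 	{
 		crc ^= (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
 			((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
 		crc = crc_tab[3][crc & 0xFF] ^
 			crc_tab[2][(crc >> 8) & 0xFF] ^
 			crc_tab[1][(crc >> 16) & 0xFF] ^
 			crc_tab[0][crc >> 24];
 		p += 4;
 		len -= 4;
 	}
 	/* tail: classic one-byte-at-a-time loop */
 	while (len-- > 0)
 		crc = crc_tab[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
 	return crc ^ 0xFFFFFFFFu;
 }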
 
 I wrote a preliminary patch which includes both, the original implementation 
 and the new one switchable via an #define.
 
 
 I tested performance differences in a small number of scenarios:
 - CTAS/INSERT ... SELECT (8-30%)
 - COPY (3-20%)
 - pgbench (no real difference unless directly after a checkpoint)
 
 Setup:
 
 CREATE TABLE blub (ai int, bi int, aibi int);
 CREATE TABLE speedtest (ai int, bi int, aibi int);
 
 
 INSERT ... SELECT:
 
 Statement:
 INSERT INTO blub SELECT a.i, b.i, a.i *b.i FROM generate_series(1, 1) 
 a(i), generate_series(1, 1000) b(i);
 
 legacy crc:
 
 11526.588
 11406.518
 11412.182
 11430.245
 
 zlib:
 9977.394
 9945.408
 9840.907
 9842.875
 
 
 COPY:
 Statement:
 ('blub' enlarged here 4 times, as otherwise the variances were too large)
 
 COPY blub TO '/tmp/b' BINARY;
 ...
 CHECKPOINT;TRUNCATE speedtest; COPY speedtest FROM '/tmp/b' BINARY;
 
 legacy:
 44835.840
 44832.876
 
 zlib:
 39530.549
 39365.109
 39295.167
 
 The performance differences are bigger if the table rows are significantly 
 bigger. 
 
 Do you think something like that is sensible? If yes, I will make it into a 
 proper patch and such.
 
 Thanks,
 
 Andres
 
 INSERT ... SELECT profile before patch:
 
 20.22% postgres  postgres   [.] comp_crc32
  5.77% postgres  postgres   [.] XLogInsert
  5.55% postgres  postgres   [.] LWLockAcquire
  5.21% postgres  [kernel.   [k] copy_user_generic_string
  4.64% postgres  postgres   [.] LWLockRelease
  4.39% postgres  postgres   [.] ReadBuffer_common
  2.75% postgres  postgres   [.] heap_insert
  2.22% postgres  libc-2.1   [.] memcpy
  2.09% postgres  postgres   [.] UnlockReleaseBuffer
  1.85% postgres  postgres   [.] hash_any
  1.77% postgres  [kernel.   [k] clear_page_c
  1.69% postgres  postgres   [.] hash_search_with_hash_value
  1.61% postgres  postgres   [.] heapgettup_pagemode
  1.50% postgres  postgres   [.] PageAddItem
  1.42% postgres  postgres   [.] MarkBufferDirty
  1.28% postgres  postgres   [.] RelationGetBufferForTuple
  1.15% postgres  postgres   [.] ExecModifyTable
  1.06% postgres  postgres   [.] RelationPutHeapTuple
 
 
 After:
 
  9.97% postgres  postgres   [.] comp_crc32
  5.95% postgres  [kernel.   [k] copy_user_generic_string
  5.94% postgres  postgres   [.] LWLockAcquire
  5.64% postgres  postgres   [.] XLogInsert
  5.11% postgres  postgres   [.] LWLockRelease
  4.63% postgres  postgres   [.] ReadBuffer_common
  3.45% postgres  postgres   [.] heap_insert
  2.54% postgres  libc-2.1   [.] memcpy
  2.03% postgres  postgres   [.] UnlockReleaseBuffer
  1.94% postgres  postgres   [.] hash_search_with_hash_value
  1.84% postgres  postgres   [.] hash_any
  1.73% postgres  [kernel.   [k] clear_page_c
  1.68% postgres  postgres   [.] PageAddItem
  1.62% postgres  postgres   [.] heapgettup_pagemode
  1.52% postgres  postgres   [.] RelationGetBufferForTuple
  1.47% postgres  postgres   [.] MarkBufferDirty
  1.30% postgres  postgres   [.] ExecModifyTable
  1.23% postgres  postgres   [.] RelationPutHeapTuple

[ Attachment, skipping... ]

 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread Robert Haas
On Sat, May 29, 2010 at 6:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 The only disadvantage I see of just documenting this is that someone
 might write a user-defined index opclass that works like this, and
 they won't be able to use this until at least 9.1 (or at least, not
 without patching the source).

  I don't actually think that anyone's very likely to write a <>-like index
 operator.  It's approximately useless to use an index for such a query.

 Or, to put it differently: if nobody's done that in the past twenty
 years, why is it likely to happen before 9.1?

Hmm.  Well suppose we bet a dollar on whether that will happen or not.
 In fact, if you promise not to read
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
make it two dollars.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up

2010-05-29 Thread Greg Stark
This sounds familiar. If you search back in the archives around 2004
or so I think you'll find a similar discussion when we replaced the
crc32 implementation with what we have now. We put a fair amount of
effort into searching for faster implementations so if you've found
one 3x faster I'm pretty startled. Are you sure it's faster on all
architectures and not a win sometimes and a loss other times? And are
you sure it's faster in our use case where we're crcing small
sequences of data often and not crcing a large block?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up

2010-05-29 Thread Greg Stark
On Sun, May 30, 2010 at 3:56 AM, Greg Stark gsst...@mit.edu wrote:
 This sounds familiar. If you search back in the archives around 2004
 or so I think you'll find a similar discussion when we replaced the
 crc32 implementation with what we have now.

Fwiw here's the thread (from 2005):

http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811
-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-05-29 Thread Fujii Masao
On Fri, May 28, 2010 at 11:12 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, May 27, 2010 at 11:13 PM, Robert Haas robertmh...@gmail.com wrote:
 I guess this happens because the frequency of checkpoint on the standby is
 too lower than that on the master. In the master, checkpoint occurs for 
 every
 consumption of three segments because of checkpoint_segments = 3. On the
 other hand, in the standby, only checkpoint_timeout has effect, so 
 checkpoint
 occurs for every 30 minutes because of checkpoint_timeout = 30min.

 The walreceiver should signal the bgwriter to start checkpoint if it has
 received more than checkpoint_segments WAL files, like normal processing?

 Is this also an issue when using log shipping, or just with SR?

  When using log shipping, checkpoint_segments never triggers a
  checkpoint. So recovery after the standby crashes might take unexpectedly
  long, since the redo starting point might be old.

 But in file-based log shipping, since WAL files don't accumulate in
 pg_xlog directory on the standby, even if the frequency of checkpoint
 is very low, pg_xlog will not be filled with many WAL files. That
 accumulation occurs only when using SR.

 If we should avoid low frequency of checkpoint itself rather than
 accumulation of WAL files, the bgwriter instead of the walreceiver
 should check if we've consumed too much WAL, I think. Thought?

I attached the patch, which changes the startup process so that it signals
the bgwriter to perform a restartpoint if we've already replayed too many WAL
files. This lets checkpoint_segments trigger a restartpoint.

Is this patch worth applying for 9.0? If not, I'll add it to the next CF.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 508,513  static bool reachedMinRecoveryPoint = false;
--- 508,516 
  
  static bool InRedo = false;
  
+ /* We've already launched bgwriter to perform restartpoint? */
+ static bool bgwriterLaunched = false;
+ 
  /*
   * Information logged when we detect a change in one of the parameters
   * important for Hot Standby.
***
*** 550,555  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
--- 553,559 
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  XLogRecPtr *lsn, BkpBlock *bkpb);
  static bool AdvanceXLInsertBuffer(bool new_segment);
+ static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
  static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  	   bool find_free, int *max_advance,
***
*** 1554,1567  AdvanceXLInsertBuffer(bool new_segment)
  /*
   * Check whether we've consumed enough xlog space that a checkpoint is needed.
   *
!  * Caller must have just finished filling the open log file (so that
!  * openLogId/openLogSeg are valid).  We measure the distance from RedoRecPtr
!  * to the open log file and see if that exceeds CheckPointSegments.
   *
   * Note: it is caller's responsibility that RedoRecPtr is up-to-date.
   */
  static bool
! XLogCheckpointNeeded(void)
  {
  	/*
  	 * A straight computation of segment number could overflow 32 bits. Rather
--- 1558,1571 
  /*
   * Check whether we've consumed enough xlog space that a checkpoint is needed.
   *
!  * Caller must have just finished filling or reading the log file (so that
!  * the given logid/logseg are valid).  We measure the distance from RedoRecPtr
!  * to the log file and see if that exceeds CheckPointSegments.
   *
   * Note: it is caller's responsibility that RedoRecPtr is up-to-date.
   */
  static bool
! XLogCheckpointNeeded(uint32 logid, uint32 logseg)
  {
  	/*
  	 * A straight computation of segment number could overflow 32 bits. Rather
***
*** 1577,1584  XLogCheckpointNeeded(void)
  	old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
  		(RedoRecPtr.xrecoff / XLogSegSize);
  	old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! 	new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg;
! 	new_highbits = openLogId / XLogSegSize;
  	if (new_highbits != old_highbits ||
  		new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
  		return true;
--- 1581,1588 
  	old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
  		(RedoRecPtr.xrecoff / XLogSegSize);
  	old_highbits = RedoRecPtr.xlogid / XLogSegSize;
! 	new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg;
! 	new_highbits = logid / XLogSegSize;
  	if (new_highbits != old_highbits ||
  		new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
  		return true;
***
*** 1782,1791  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
   * update RedoRecPtr and recheck.
   */
  if (IsUnderPostmaster 
! 	

Re: [HACKERS] PG 9.0 release timetable

2010-05-29 Thread Robert Haas
On Sat, May 29, 2010 at 5:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Thoughts on a few of the remaining items:

 Should we revert the default output format for bytea to the old style
 before shipping 9.0.0? - Consensus seems to be no, thus no action is
 required.

 I think we should leave that there for awhile, though I agree it's
 likely that the final decision will be no change.

 don't rename index columns behavior has already broken JDBC - As I
 understand it, this is not a code issue, but just something that
 driver authors need to be aware of.

 There had been a section on the page about information we needed to
 communicate to third-party authors.  Someone seems to have removed
 that, but that seems like where this belongs.

 Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
 the machine just ran out of disk space - not sure we need to do
 anything here.

 It's a bit weird though, because UpdateControlFile should always update
 in place; why would there be any risk of out of disk space?  I would
 like to find out exactly what happened, though I have no clear ideas
 how to investigate it.

Well, I think at a minimum the first two of these need to go into a
section that is not called code: the first is just a decision we
might change our mind about, and the second is a communication issue,
not a code issue.  I'd argue that the third one is probably not
something we're going to hold up the release for, either, and
therefore while it might belong on a list of known open bugs it
doesn't really belong on a list of 9.0 open items.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Is there anyway to get list of table name, before raw parser is analyze?

2010-05-29 Thread Mohammad Heykal Abdillah
Hi all,

Right now I am trying to understand how the SQL parser works.

My question: is there any way to get the list of table names and their
attributes before the raw parse tree is analyzed?

Because I would like to understand how PostgreSQL breaks down the asterisk
in the target list, especially in the natural join case, where PostgreSQL
can give a query result that has unambiguous attributes and matches the
right join key.

Thank You.

-- 
Mohammad Heykal Abdillah heykal.abdil...@gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, May 29, 2010 at 6:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Or, to put it differently: if nobody's done that in the past twenty
 years, why is it likely to happen before 9.1?

 Hmm.  Well suppose we bet a dollar on whether that will happen or not.
  In fact, if you promise not to read
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
 make it two dollars.

Yes, I've seen Jeff's example.  It's a cute hack but somehow I doubt
that there is going to be a land rush to implement such things.
Can you point to any pre-existing example where anyone actually asked
for the ability to do that?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up

2010-05-29 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Sun, May 30, 2010 at 3:56 AM, Greg Stark gsst...@mit.edu wrote:
 This sounds familiar. If you search back in the archives around 2004
 or so I think you'll find a similar discussion when we replaced the
 crc32 implementation with what we have now.

 Fwiw here's the thread (from 2005):
 http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811

I read through that thread and couldn't find much discussion of
alternative CRC implementations --- we spent all our time on arguing
about whether we needed 64-bit CRC or not.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small exclusion constraints patch

2010-05-29 Thread Robert Haas
On Sat, May 29, 2010 at 11:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, May 29, 2010 at 6:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Or, to put it differently: if nobody's done that in the past twenty
 years, why is it likely to happen before 9.1?

 Hmm.  Well suppose we bet a dollar on whether that will happen or not.
  In fact, if you promise not to read
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
 make it two dollars.

 Yes, I've seen Jeff's example.  It's a cute hack but somehow I doubt
 that there is going to be a land rush to implement such things.
 Can you point to any pre-existing example where anyone actually asked
 for the ability to do that?

I've often wished for the ability to constrain a table to hold just one
row, so I don't find that use case implausible at all.  As to whether
Jeff's use case is a cute hack or something that people will really
want to do, I think the jury's still out on that one.  But I don't
think we should make it not work unless we have a concrete reason, and
I haven't heard one yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is there anyway to get list of table name, before raw parser is analyze?

2010-05-29 Thread Andrew Dunstan



Mohammad Heykal Abdillah wrote:

Hi all,

Right now I am trying to understand how the SQL parser works.


My question: is there any way to get the list of table names and their
attributes before the raw parse tree is analyzed?

Because I would like to understand how PostgreSQL breaks down the asterisk
in the target list, especially in the natural join case, where PostgreSQL
can give a query result that has unambiguous attributes and matches the
right join key.


  


AIUI, expanding '*' is not done in the parser. The parser just makes an 
A_star node.


See src/backend/gram.y for the parser spec.
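
To make that concrete: in the raw parse tree a SELECT * target is just a 
ResTarget whose val is a ColumnRef whose fields list can end in an A_Star 
node; no column names exist at that point. The node shapes are roughly as 
follows (cf. src/include/nodes/parsenodes.h, where NodeTag and List are 
defined):

typedef struct A_Star
{
	NodeTag		type;			/* T_A_Star */
} A_Star;

typedef struct ColumnRef
{
	NodeTag		type;			/* T_ColumnRef */
	List	   *fields;			/* field name Strings, possibly ending in A_Star */
	int			location;		/* token location, or -1 if unknown */
} ColumnRef;

Turning the A_Star into actual column names requires looking the relation 
up in the catalogs, which is why it happens during parse analysis 
(src/backend/parser/parse_target.c) rather than in gram.y.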

Maybe you need to explain what exactly you're trying to do.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is there anyway to get list of table name, before raw parser is analyze?

2010-05-29 Thread Mohammad Heykal Abdillah
On Min, 2010-05-30 at 00:44 -0400, Andrew Dunstan wrote: 
 
 Mohammad Heykal Abdillah wrote:
  Hi all,
 
  Right now I am trying to understand how the SQL parser works.
 
  My question: is there any way to get the list of table names and their
  attributes before the raw parse tree is analyzed?
 
  Because I would like to understand how PostgreSQL breaks down the asterisk
  in the target list, especially in the natural join case, where PostgreSQL
  can give a query result that has unambiguous attributes and matches the
  right join key.
 
 

 
 AIUI, expanding '*' is not done in the parser. The parser just makes an 
 A_star node.
 
 See src/backend/gram.y for the parser spec.
 
 Maybe you need to explain what exactly you're trying to do.
 
 cheers
 
 andrew

Yes, that I know: expanding '*' is done in the analyzer part. What I am
trying to do is move the expansion to before the raw parse tree produced
by gram.y is processed by the analyzer. Like this:

sql query -> gram.y -> raw_parse_tree -> (expand the '*') -> analyze

In this way the analyzer only receives the parse tree with the '*' already
expanded and won't expand the '*' again. I am asking this because I don't
know how the analyzer scans the database so it can return the list of
attributes of a table. I have read the source code and tried manually to
replicate the process (from what I got, it is the RTE function), with no
result.

Thank You.
-- 
Mohammad Heykal Abdillah heykal.abdil...@gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers