Re: [HACKERS] Collect frequency statistics for arrays

2012-03-12 Thread Alexander Korotkov
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  True. If (max count - min count + 1) is small, enumerating the frequencies
  is both a more compact and a more precise representation. At the same time,
  if (max count - min count + 1) is large, we can run out of
  statistics_target with such a representation. We could use the same
  representation of the count distribution as for scalar column values: MCV
  and HISTOGRAM, but that would require an additional statkind and statistics
  slot. Perhaps you have better ideas?

 I wasn't thinking of introducing two different representations,
 but just trimming the histogram length when it's larger than necessary.

 On reflection my idea above is wrong; for example assume that we have a
 column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
 what I said, we'd reduce the histogram to {1,2}, which might accurately
 capture the set of lengths present but fails to show that 1 is much more
 common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
 would capture the situation perfectly in one-tenth the space that the
 current logic does.

 More generally, by limiting the histogram to statistics_target entries,
 we are already accepting errors of up to 1/(2*statistics_target) in the
 accuracy of the bin-boundary representation.  What the above example
 shows is that sometimes we could meet the same accuracy requirement with
 fewer entries.  I'm not sure how this could be mechanized but it seems
 worth thinking about.


I can propose following representation of histogram.

If (max_count - min_count + 1) <= statistics_target then
1) store max_count and min_count in stavalues
2) store frequencies from min_count to max_count in numvalues

If (max_count - min_count + 1) > statistics_target then
store the histogram in the current manner. I think in this case there are
unlikely to be many repeating values.

I can propose a patch which changes the histogram representation to this.

Comments?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-08 Thread Noah Misch
On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
 Alexander Korotkov aekorot...@gmail.com writes:
  True. If (max count - min count + 1) is small, enumerating the frequencies
  is both a more compact and a more precise representation. At the same time,
  if (max count - min count + 1) is large, we can run out of
  statistics_target with such a representation. We could use the same
  representation of the count distribution as for scalar column values: MCV
  and HISTOGRAM, but that would require an additional statkind and statistics
  slot. Perhaps you have better ideas?
 
 I wasn't thinking of introducing two different representations,
 but just trimming the histogram length when it's larger than necessary.
 
 On reflection my idea above is wrong; for example assume that we have a
 column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
 what I said, we'd reduce the histogram to {1,2}, which might accurately
 capture the set of lengths present but fails to show that 1 is much more
 common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
 would capture the situation perfectly in one-tenth the space that the
 current logic does.

Granted.  When the next sample finds 899/101 instead, though, the optimization
vanishes.  You save 90% of the space, perhaps 10% of the time.  If you want to
materially narrow typical statistics, Alexander's proposal looks like the way
to go.  I'd guess array columns always having DEC <= default_statistics_target
are common enough to make that representation the dominant representation, if
not the only necessary representation.



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-08 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
 On reflection my idea above is wrong; for example assume that we have a
 column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
 what I said, we'd reduce the histogram to {1,2}, which might accurately
 capture the set of lengths present but fails to show that 1 is much more
 common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
 would capture the situation perfectly in one-tenth the space that the
 current logic does.

 Granted.  When the next sample finds 899/101 instead, though, the optimization
 vanishes.

No, you missed my next point.  That example shows that sometimes a
smaller histogram can represent the situation with zero error, but in
all cases a smaller histogram can represent the situation with perhaps
more error than a larger one.  Since we already have a defined error
tolerance, we should try to generate a histogram that is as small as
possible while still not exceeding the error tolerance.

Now, it might be that doing that is computationally impractical, or
too complicated to be reasonable.  But it seems to me to be worth
looking into.

 If you want to materially narrow typical statistics, Alexander's
 proposal looks like the way to go.  I'd guess array columns always
 having DEC <= default_statistics_target are common enough to make that
 representation the dominant representation, if not the only necessary
 representation.

Well, I don't want to have two representations; I don't think it's worth
the complexity.  But certainly we could consider switching to a
different representation if it seems likely to usually be smaller.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-08 Thread Noah Misch
On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
  On reflection my idea above is wrong; for example assume that we have a
  column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
  what I said, we'd reduce the histogram to {1,2}, which might accurately
  capture the set of lengths present but fails to show that 1 is much more
  common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
  would capture the situation perfectly in one-tenth the space that the
  current logic does.
 
  Granted.  When the next sample finds 899/101 instead, though, the
  optimization vanishes.
 
 No, you missed my next point.  That example shows that sometimes a
 smaller histogram can represent the situation with zero error, but in
 all cases a smaller histogram can represent the situation with perhaps
 more error than a larger one.  Since we already have a defined error
 tolerance, we should try to generate a histogram that is as small as
 possible while still not exceeding the error tolerance.
 
 Now, it might be that doing that is computationally impractical, or
 too complicated to be reasonable.  But it seems to me to be worth
 looking into.

Yes, I did miss your point.

One characteristic favoring this approach is its equal applicability to both
STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.

  If you want to materially narrow typical statistics, Alexander's
  proposal looks like the way to go.  I'd guess array columns always
  having DEC <= default_statistics_target are common enough to make that
  representation the dominant representation, if not the only necessary
  representation.
 
 Well, I don't want to have two representations; I don't think it's worth
 the complexity.  But certainly we could consider switching to a
 different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical
length ranges among arrays in your applications on which you use arr &&
const, arr <@ const or arr @> const searches?



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-07 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Couldn't we reduce the histogram size when there aren't many
 different counts?
 
 It seems fairly obvious to me that we could bound the histogram
 size with (max count - min count + 1), but maybe something even
 tighter would work; or maybe I'm missing something and this would
 sacrifice accuracy.

 True. If (max count - min count + 1) is small, enumerating the frequencies
 is both a more compact and a more precise representation. At the same time,
 if (max count - min count + 1) is large, we can run out of
 statistics_target with such a representation. We could use the same
 representation of the count distribution as for scalar column values: MCV
 and HISTOGRAM, but that would require an additional statkind and statistics
 slot. Perhaps you have better ideas?

I wasn't thinking of introducing two different representations,
but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a
column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
what I said, we'd reduce the histogram to {1,2}, which might accurately
capture the set of lengths present but fails to show that 1 is much more
common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
would capture the situation perfectly in one-tenth the space that the
current logic does.

More generally, by limiting the histogram to statistics_target entries,
we are already accepting errors of up to 1/(2*statistics_target) in the
accuracy of the bin-boundary representation.  What the above example
shows is that sometimes we could meet the same accuracy requirement with
fewer entries.  I'm not sure how this could be mechanized but it seems
worth thinking about.
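
[A concrete reading of that bound, added for clarity: a 10-entry
histogram accepts bin-boundary errors up to 1/(2*10) = 5%.  In the
900/100 example the true frequencies are 90% and 10%, both multiples of
the 10% bin width, so the 10-entry histogram {1,1,1,1,1,1,1,1,1,2}
meets the tolerance with zero error.]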

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-05 Thread Alexander Korotkov
On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 BTW, one other thing about the count histogram: seems like we are
 frequently generating uselessly large ones.  For instance, do ANALYZE
 in the regression database and then run

 select tablename,attname,elem_count_histogram from pg_stats
  where elem_count_histogram is not null;

 You get lots of entries that look like this:

  pg_proc | proallargtypes |
 {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
  pg_proc | proargmodes|
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.6}
  pg_proc | proargnames|
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
  pg_proc | proconfig  |
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
  pg_class| reloptions |
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

 which seems to me to be a rather useless expenditure of space.
 Couldn't we reduce the histogram size when there aren't many
 different counts?

 It seems fairly obvious to me that we could bound the histogram
 size with (max count - min count + 1), but maybe something even
 tighter would work; or maybe I'm missing something and this would
 sacrifice accuracy.


True. If (max count - min count + 1) is small, enumerating the frequencies
is both a more compact and a more precise representation. At the same time,
if (max count - min count + 1) is large, we can run out of
statistics_target with such a representation. We could use the same
representation of the count distribution as for scalar column values: MCV
and HISTOGRAM, but that would require an additional statkind and statistics
slot. Perhaps you have better ideas?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Alexander Korotkov
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 1. I'm still unhappy about the loop that fills the count histogram,
 as I noted earlier today.  It at least needs a decent comment and some
 overflow protection, and I'm not entirely convinced that it doesn't have
 more bugs than the overflow issue.


The attached patch is focused on fixing this. The frac variable overflow is
avoided by making it int64. I hope the comments clarify things. In general
this loop copies the behaviour of the histogram-constructing loop of the
compute_scalar_stats function, but instead of a values array we have an
array of unique DECs and their frequencies.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/utils/adt/array_typanalyze.c
--- b/src/backend/utils/adt/array_typanalyze.c
***
*** 581,587  compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			DECountItem **sorted_count_items;
  			int			count_item_index;
  			int			delta;
! 			int			frac;
  			float4	   *hist;
  
  			/* num_hist must be at least 2 for the loop below to work */
--- 581,587 
  			DECountItem **sorted_count_items;
  			int			count_item_index;
  			int			delta;
! 			int64		frac;
  			float4	   *hist;
  
  			/* num_hist must be at least 2 for the loop below to work */
***
*** 612,633  compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			hist[num_hist] = (double) element_no / (double) nonnull_cnt;
  
  			/*
! 			 * Construct the histogram.
! 			 *
! 			 * XXX this needs work: frac could overflow, and it's not clear
! 			 * how or why the code works.  Even if it does work, it needs
! 			 * documented.
  			 */
  			delta = analyzed_rows - 1;
  			count_item_index = 0;
! 			frac = sorted_count_items[0]->frequency * (num_hist - 1);
  			for (i = 0; i < num_hist; i++)
  			{
  				while (frac <= 0)
  				{
  					count_item_index++;
  					Assert(count_item_index < count_items_count);
! 					frac += sorted_count_items[count_item_index]->frequency * (num_hist - 1);
  				}
  				hist[i] = sorted_count_items[count_item_index]->count;
  frac -= delta;
--- 612,642 
  			hist[num_hist] = (double) element_no / (double) nonnull_cnt;
  
  			/*
! 			 * Construct the histogram of DECs. The object of this loop is to
! 			 * copy the min and max DECs and evenly-spaced DECs in between
! 			 * ("space" here is the number of arrays corresponding to a DEC).
! 			 * If we imagine an ordered array of DECs in which each input
! 			 * array has a corresponding DEC item, the i'th histogram value
! 			 * will be DECs[i * (analyzed_rows - 1) / (num_hist - 1)]. But
! 			 * instead of such an array we have sorted_count_items, which
! 			 * holds the unique DEC values with their frequencies. The frac
! 			 * variable can be thought of as (index in DECs of the next
! 			 * sorted_count_items element - index in DECs of the last
! 			 * histogram value) * (num_hist - 1); a non-positive fraction
! 			 * tells us to advance to the next sorted_count_items entry.
  			 */
  			delta = analyzed_rows - 1;
  			count_item_index = 0;
! 			frac = (int64) sorted_count_items[0]->frequency *
! 				   (int64) (num_hist - 1);
  			for (i = 0; i < num_hist; i++)
  			{
  				while (frac <= 0)
  				{
  					count_item_index++;
  					Assert(count_item_index < count_items_count);
! 					frac += (int64) sorted_count_items[count_item_index]->frequency *
! 						(int64) (num_hist - 1);
  				}
  				hist[i] = sorted_count_items[count_item_index]->count;
  frac -= delta;
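
[A hand trace of this loop, scaled down from Tom's 900/100 example -- my
illustration, not from the original mail.  Take analyzed_rows = 10
(9 arrays with DEC 1, 1 array with DEC 2) and num_hist = 10:

    delta = 9;  frac = 9 * (10 - 1) = 81
    i = 0..8 : frac > 0, so hist[i] = 1; frac -= 9   (81, 72, ..., 9, 0)
    i = 9    : frac <= 0, so advance to the second item and
               frac += 1 * 9 = 9; then hist[9] = 2; frac -= 9 = 0

giving hist = {1,1,1,1,1,1,1,1,1,2} and finishing exactly on the last
sorted_count_items entry, which is what Tom's trailing Assert elsewhere
in the thread checks.]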



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Alexander Korotkov
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 2. The tests in the above-mentioned message show that in most cases
 where mcelem_array_contained_selec falls through to the rough
 estimate, the resulting rowcount estimate is just 1, ie we are coming
 out with very small selectivities.  Although that path will now only be
 taken when there are no stats, it seems like we'd be better off to
 return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
 must be something wrong with the rough estimate logic.  Could you
 recheck that?


I think the problem with the rough estimate is that the assumption of
independent occurrences of items is very unsuitable even for a rough
estimate. The following example shows that the rough estimate really does
work when item occurrences are independent.

Generate test table where item occurrences are really independent.

test=# create table test as select ('{'||(select string_agg(s,',') from
(select case when (t*0 + random()) < 0.1 then i::text else null end from
generate_series(1,100) i) as x(s))||'}')::int[] AS val  from
generate_series(1,10000) t;

SELECT 10000

test=# analyze test;
ANALYZE

Do some test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

  QUERY PLAN


--
 Seq Scan on test  (cost=0.00..239.00 rows=151 width=61) (actual
time=0.325..32.556 rows=163 loops=1)
   Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 32.806 ms
(4 rows)
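
[For reference -- my own back-of-envelope, not from the original mail:
under the independence assumption, a row matches val <@ const only if
none of the 40 items outside the constant are present, so

    P(match) = (1 - 0.1)^40 ~= 0.0148
    expected rows = 0.0148 * 10000 ~= 148

which agrees with the planner's estimates (151 here, 148 from the rough
estimate below) and reasonably with the actual 163.]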

Delete DECHIST statistics.

test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 =
null, stavalues1 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind1 = 5;
UPDATE 0
test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 =
null, stavalues2 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind2 = 5;
UPDATE 0
test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 =
null, stavalues3 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind3 = 5;
UPDATE 0
test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 =
null, stavalues4 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind4 = 5;
UPDATE 1
test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 =
null, stavalues5 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind5 = 5;
UPDATE 0

Do another test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

  QUERY PLAN


--
 Seq Scan on test  (cost=0.00..239.00 rows=148 width=61) (actual
time=0.332..32.952 rows=163 loops=1)
   Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 33.225 ms
(4 rows)

In this particular case the rough estimate is quite accurate. But in most
cases it behaves really badly. That is why I started to invent
calc_distr etc. So, I think returning DEFAULT_CONTAIN_SEL is OK unless
we have some better ideas.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 2. The tests in the above-mentioned message show that in most cases
 where mcelem_array_contained_selec falls through to the rough
 estimate, the resulting rowcount estimate is just 1, ie we are coming
 out with very small selectivities.  Although that path will now only be
 taken when there are no stats, it seems like we'd be better off to
 return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
 must be something wrong with the rough estimate logic.  Could you
 recheck that?

 I think the problem with the rough estimate is that the assumption of
 independent occurrences of items is very unsuitable even for a rough
 estimate. The following example shows that the rough estimate really does
 work when item occurrences are independent. ...
 In this particular case the rough estimate is quite accurate. But in most
 cases it behaves really badly. That is why I started to invent
 calc_distr etc. So, I think returning DEFAULT_CONTAIN_SEL is OK unless
 we have some better ideas.

OK.  Looking again at that code, I notice that it also punts and returns
DEFAULT_CONTAIN_SEL if it's not given MCELEM stats, which it more or
less has to because without even a minfreq the whole calculation is just
hot air.  And there are no plausible scenarios where compute_array_stats
would produce an MCELEM slot but no count histogram.  So that says there
is no point in sweating over this case, unless you have an idea how to
produce useful results without MCELEM.

So I think it's sufficient to punt at the top of the function if no
histogram, and take out the various attempts to cope with the case.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 1. I'm still unhappy about the loop that fills the count histogram,
 as I noted earlier today.  It at least needs a decent comment and some
 overflow protection, and I'm not entirely convinced that it doesn't have
 more bugs than the overflow issue.

 The attached patch is focused on fixing this. The frac variable overflow is
 avoided by making it int64. I hope the comments clarify things. In general
 this loop copies the behaviour of the histogram-constructing loop of the
 compute_scalar_stats function, but instead of a values array we have an
 array of unique DECs and their frequencies.

OK, I reworked this a bit and committed it.  Thanks.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Tom Lane
BTW, one other thing about the count histogram: seems like we are
frequently generating uselessly large ones.  For instance, do ANALYZE
in the regression database and then run

select tablename,attname,elem_count_histogram from pg_stats
  where elem_count_histogram is not null;

You get lots of entries that look like this:

 pg_proc | proallargtypes | 
{1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
 pg_proc | proargmodes| 
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.6}
 pg_proc | proargnames| 
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
 pg_proc | proconfig  | 
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
 pg_class| reloptions | 
{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

which seems to me to be a rather useless expenditure of space.
Couldn't we reduce the histogram size when there aren't many
different counts?

It seems fairly obvious to me that we could bound the histogram
size with (max count - min count + 1), but maybe something even
tighter would work; or maybe I'm missing something and this would
sacrifice accuracy.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-03 Thread Tom Lane
... BTW, could you explain exactly how that "Fill histogram by hashtab"
loop works?  It's way too magic for my taste, and does in fact have bugs
in the currently submitted patch.  I've reworked it to this:

/* Fill histogram by hashtab. */
delta = analyzed_rows - 1;
count_item_index = 0;
frac = sorted_count_items[0]->frequency * (num_hist - 1);
for (i = 0; i < num_hist; i++)
{
    while (frac <= 0)
    {
        count_item_index++;
        Assert(count_item_index < count_items_count);
        frac += sorted_count_items[count_item_index]->frequency *
                (num_hist - 1);
    }
    hist[i] = sorted_count_items[count_item_index]->count;
    frac -= delta;
}
Assert(count_item_index == count_items_count - 1);

The asserts don't fire in any test case I've tried, which seems to
indicate that it *does* work in the sense that the first histogram entry
is always the smallest count and the last histogram entry is always
the largest one.  But it's extremely unclear why it manages to stop
exactly at the last count_items array entry, or for that matter why it's
generating a representative histogram at all.  I'm suspicious that the
-1 bits represent off-by-one bugs.

I also don't especially like the fact that frac is capable of
overflowing (since worst case frequency is 300 * 10000 and worst case
num_hist is 10000, with the current limits on statistics_target).
We could work around that by doing the frac arithmetic in int64, but I
wonder whether that couldn't be avoided.  In any case, first I'd like
an explanation why this code works at all.
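
[Spelling that overflow out -- my arithmetic: the worst case is

    frac = 300 * 10000 * (10000 - 1) ~= 3.0e10  >  2^31 - 1 ~= 2.1e9

so a 32-bit int frac can indeed wrap.]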

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-03 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 [ array statistics patch ]

I've committed this after a fair amount of editorialization.  There are
still some loose ends to deal with, but I felt it was ready to go into
the tree for wider testing.

The main thing I changed that wasn't in the nature of cleanup/bugfixing
was that I revised the effort-limiting logic in
mcelem_array_contained_selec.  The submitted code basically just punted
if the estimated work was too much, but as was already noted in
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php
that can result in really bad estimates.  What I did instead is
something closer to Robert's original suggestion: trim the number of
element values taken into consideration from the array constant to a
value that fits within the desired effort limit.  If we consider just
the N most common values from the array constant, we still get a pretty
good estimate (since the trimmed N will still be close to 100 for the
values we're talking about).
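
[A minimal pseudo-C sketch of that trimming -- the names here are mine,
not the committed code:

    /* consider only the N most frequent elements of the constant */
    qsort(const_elems, n_const, sizeof(Elem), cmp_by_freq_desc);
    n_used = Min(n_const, effort_limit);
    selec = contained_selec_over(const_elems, n_used);

Since the most common values dominate the estimate, dropping the rare
tail changes the result little.]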

I redid the tests in the above-mentioned message and see no cases where
the estimate is off by more than a factor of 2, and very few where it's
off by more than 20%, so this seems to work pretty well now.

The remaining loose ends IMO are:

1. I'm still unhappy about the loop that fills the count histogram,
as I noted earlier today.  It at least needs a decent comment and some
overflow protection, and I'm not entirely convinced that it doesn't have
more bugs than the overflow issue.

2. The tests in the above-mentioned message show that in most cases
where mcelem_array_contained_selec falls through to the rough
estimate, the resulting rowcount estimate is just 1, ie we are coming
out with very small selectivities.  Although that path will now only be
taken when there are no stats, it seems like we'd be better off to
return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
must be something wrong with the rough estimate logic.  Could you
recheck that?

3. As I mentioned yesterday, I think it'd be a good idea to make some
provisions to reduce the width of pg_statistic rows for array columns
by not storing the scalar-style and/or array-style stats, if the DBA
knows that they're not going to be useful for a particular column.
I have not done anything about that.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-02 Thread Tom Lane
Still working through this patch ... there are some things that bother
me about the entries being made in pg_statistic:

1. You re-used STATISTIC_KIND_MCELEM for something that, while similar
to tsvector's usage, is not the same.  In particular, tsvector adds two
extra elements to the stanumbers array, but this patch adds four (and
doesn't bother documenting that in pg_statistic.h).  I don't think it's
acceptable to re-use the same stakind value for something that's not
following the same specification.  I see Nathan complained of this way,
way upthread, but nothing was done about it.

I think we should either assign a different stakind code for this
definition, or change things around so that tsvector actually is using
the same stats kind definition as arrays are.  (We could get away with
redefining the tsvector stats format, because pg_upgrade doesn't try to
copy pg_statistic rows to the new database.)  Now, of the two new
elements added by the patch, it seems to me to be perfectly reasonable
to add a null-element frequency to the kind specification; the fact that
it wasn't there to start with is kind of an oversight born of the fact
that tsvectors don't contain any null lexemes.  But the other new
element is the average distinct element count, which really does not
belong here at all, as it is *entirely* unrelated to element
frequencies.  It seems to me that that more nearly belongs in the
element-count histogram slot.  So my preference is to align the two
definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
to tsvector's usage (where it'll always be zero) and getting rid of the
average distinct element count here.

2. I think STATISTIC_KIND_LENGTH_HISTOGRAM is badly named and
confusingly documented.  The stats are not about anything I would call a
length --- rather we're considering the counts of numbers of distinct
element values present in each array value.  An ideal name perhaps would
be STATISTIC_KIND_DISTINCT_ELEMENTS_COUNT_HISTOGRAM, but of course
that's unreasonably long.  Considering the way that the existing stats
kind names are abbreviated, maybe STATISTIC_KIND_DECHIST would do.
Anybody have a better idea?

3. I also find it a bit odd that you chose to store the length (count)
histogram as an integer array in stavalues.  Usually we've put such data
in stanumbers.  That would make the entries float4 not integer, but that
doesn't seem unreasonable to me --- values would still be exact up to
2^24 or so on typical machines, and if we ever do have values larger
than that, it seems to me that having headroom to go above 2^32 would
be a good thing.  In any case, if we're going to move the average
distinct-element count over here, that would have to go into stanumbers.
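
[My summary sketch of the MCELEM stanumbers layouts under discussion,
with N = length of stavalues:

    tsvector today    : { f1 .. fN, min_freq, max_freq }
    patch as submitted: { f1 .. fN, min_freq, max_freq, null_freq,
                          avg_distinct }
    proposal          : { f1 .. fN, min_freq, max_freq, null_freq }
                        (avg distinct element count moves to the
                         count-histogram slot)

Since N is known from stavalues, the trailing entries are unambiguous.]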

Comments?

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-02 Thread Tom Lane
I wrote:
 ... So my preference is to align the two
 definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
 to tsvector's usage (where it'll always be zero) and getting rid of the
 average distinct element count here.

Actually, there's a way we can do this without code changes in the
tsvector stuff.  Since the number of MCELEM stanumber items that provide
frequencies of stavalue items is obviously equal to the length of
stavalues, we could define stanumbers as containing those matching
entries, then two min/max entries, then an *optional* entry for the
frequency of null elements (with the frequency presumed to be zero if
omitted).  This'd be non-ambiguous given access to stavalues.  I'm not
sure though if making the null frequency optional wouldn't introduce
complexity elsewhere that outweighs not having to touch the tsvector
code.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 1:19 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 That seems like a pretty narrow, uncommon use-case.  Also, to get
 accurate stats for such queries that way, you'd need really enormous
 histograms.  I doubt that the existing parameters for histogram size
 will permit meaningful estimation of more than the first array entry
 (since we don't make the histogram any larger than we do for a scalar
 column).

 The real point here is that the fact that we're storing btree-style
 stats for arrays is an accident, backed into by having added btree
 comparators for arrays plus analyze.c's habit of applying default
 scalar-oriented analysis functions to any type without an explicit
 typanalyze entry.  I don't recall that we ever thought hard about
 it or showed that those stats were worth anything.


 OK. I don't object to removing btree stats from arrays.
  What do you think about the pg_stats view in this case? Should it combine
  the values histogram and the array length histogram in a single column,
  as we do for MCV and MCELEM?


Btree statistics for arrays and the additional statistics slot are removed
in the attached version of the patch. The pg_stats view is untouched for now.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.13.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Robert Haas
On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Nathan Boley npbo...@gmail.com writes:
 On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

 If I understand your suggestion, queries of the form

 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
      AND x <= ARRAY[ 1, 2, 3, 1000];

 would no longer use an index. Is that correct?

 No, just that we'd no longer have statistics relevant to that, and would
 have to fall back on default selectivity assumptions.  Do you think that
 such applications are so common as to justify bloating pg_statistic for
 everybody that uses arrays?

I confess I am nervous about ripping this out.  I am pretty sure we
will get complaints about it.  Performance optimizations that benefit
group A at the expense of group B are always iffy, and I'm not sure
the case of using an array as a path indicator is as uncommon as you
seem to think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Alvaro Herrera

Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
 On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:

  No, just that we'd no longer have statistics relevant to that, and would
  have to fall back on default selectivity assumptions.  Do you think that
  such applications are so common as to justify bloating pg_statistic for
  everybody that uses arrays?
 
 I confess I am nervous about ripping this out.  I am pretty sure we
 will get complaints about it.  Performance optimizations that benefit
 group A at the expense of group B are always iffy, and I'm not sure
 the case of using an array as a path indicator is as uncommon as you
 seem to think.

Maybe we should keep it as an option.  I do think it's quite uncommon,
but for those rare users, it'd be good to provide the capability while
not bloating everyone else's stat catalog.  The thing is, people using
arrays as path indicators and such are likely using relatively small
arrays; people storing real data are likely to store much bigger arrays.
Just a hunch though.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
 On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I confess I am nervous about ripping this out.  I am pretty sure we
 will get complaints about it.  Performance optimizations that benefit
 group A at the expense of group B are always iffy, and I'm not sure
 the case of using an array as a path indicator is as uncommon as you
 seem to think.

 Maybe we should keep it as an option.

How would we make it optional?  There's noplace I can think of to stick
such a knob ...

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Alvaro Herrera

Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012:
 
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Robert Haas's message of jue mar 01 12:00:08 -0300 2012:
  On Wed, Feb 29, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I confess I am nervous about ripping this out.  I am pretty sure we
  will get complaints about it.  Performance optimizations that benefit
  group A at the expense of group B are always iffy, and I'm not sure
  the case of using an array as a path indicator is as uncommon as you
  seem to think.
 
  Maybe we should keep it as an option.
 
 How would we make it optional?  There's noplace I can think of to stick
 such a knob ...

Uhm, attoptions?

alter table foo alter column bar set extended_array_stats to on
or something like that?
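
[For concreteness, with the existing attribute-options syntax that would
look something like this -- option name and values are hypothetical:

    ALTER TABLE foo ALTER COLUMN bar SET (array_stats = 'both');

with values such as 'btree', 'element', 'both' or 'none', per Tom's
follow-up below.]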

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
 What about MCV's? Will those be removed as well?

 Sure.  Those seem even less useful.

Ya, this will destroy the performance of several queries without some
heavy tweaking.

Maybe this is bad design, but I've gotten in the habit of storing
sequences as arrays and I commonly join on them. I looked through my
code this morning, and I only have one 'range' query ( of the form
described up-thread ), but there are tons of the form

SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array;

I can provide some examples if that would make my argument more compelling.

Sorry to be difficult,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Tom Lane
Nathan Boley npbo...@gmail.com writes:
 Maybe this is bad design, but I've gotten in the habit of storing
 sequences as arrays and I commonly join on them. I looked through my
 code this morning, and I only have one 'range' query ( of the form
 described up-thread ), but there are tons of the form

 SELECT att1, attb2 FROM rela, relb where rela.seq_array_1 = relb.seq_array;

What do you mean by storing sequences as arrays?  Can you demonstrate
that the existing stats are relevant at all to the query you're worried
about?

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue mar 01 18:51:38 -0300 2012:
 How would we make it optional?  There's noplace I can think of to stick
 such a knob ...

 Uhm, attoptions?

Oh, I had forgotten we had that mechanism already.  Yeah, that might
work.  I'm a bit tempted to design the option setting so that you can
select whether to keep the btree stats, the new stats, or both or
neither --- after all, there are probably plenty of databases where
nobody cares about the array-containment operators either.

That leaves the question of which setting should be the default ...

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
[ sorry Tom, reply all this time... ]

 What do you mean by storing sequences as arrays?

So, a simple example is, for transcripts ( sequences of DNA that are
turned into proteins ), we store each of the connected components as
an array of the form:

exon_type in [1,6]
splice_type = [1,3]

and then the array elements are

[ exon_type, splice_type, exon_type ]

~ 99% of the elements are of the form [ [1,3], 1, [1,3] ],

so I almost always get a hash or merge join ( correctly ) but for the
rare junction types ( which are usually more interesting as well ) I
correctly get nest loops with an index scan.

 Can you demonstrate
 that the existing stats are relevant at all to the query you're worried
 about?

Well, if we didn't have mcv's and just relied on ndistinct to estimate
the '=' selectivities, either my low selectivity quals would use the
index, or my high selectivity quals would use a table scan, either of
which would be wrong.

I guess I could wipe out the stats and get some real numbers tonight,
but I can't see how the planner would be able to distinguish *without*
mcv's...

Best,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:
 I've attached a new version that includes the UINT64_FMT fix, some edits of
 your newest comments, and a rerun of pgindent on the new files.  I see no
 other issues precluding commit, so I am marking the patch Ready for
 Committer.
 If I made any of the comments worse, please post another update.

 Changes look reasonable to me. Thanks!

I am starting to look at this patch now.  I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff.  The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider.  I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.

In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.  The pg_statistic rows for array columns
 tend to be unreasonably wide already, and as-is this patch will make
 them quite a lot wider.  I think it requires more than a little bit of
 evidence to continue storing stats that seem to have only small
 probability of usefulness.

 In particular, if we didn't store that stuff, we'd not need to widen the
 number of columns in pg_statistic, which would noticeably reduce both
 the footprint of the patch and the probability of breaking external
 code.


Initially, I used existing slots for new statistics, but I've changed this
after the first review:
http://archives.postgresql.org/pgsql-hackers/2011-07/msg00780.php

Probably, btree statistics really do matter for some sorts of arrays? For
example, arrays representing paths in a tree: we could request a subtree
with a range query on such arrays.
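
[For example -- my illustration of that use-case, table and data
hypothetical -- with materialized paths stored as integer arrays, the
subtree under node {1,3} is a btree range scan:

    SELECT * FROM tree_table
    WHERE path >= ARRAY[1,3] AND path < ARRAY[1,4];

which is the kind of query the scalar-style histogram could help
estimate.]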

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,

 Probably, btree statistics really do matter for some sorts of arrays? For
 example, arrays representing paths in a tree: we could request a subtree
 with a range query on such arrays.

That seems like a pretty narrow, uncommon use-case.  Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms.  I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).

The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry.  I don't recall that we ever thought hard about
it or showed that those stats were worth anything.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I am starting to look at this patch now.  I'm wondering exactly why the
  decision was made to continue storing btree-style statistics for arrays,

  Probably, btree statistics really do matter for some sorts of arrays? For
  example, arrays representing paths in a tree: we could request a subtree
  with a range query on such arrays.

 That seems like a pretty narrow, uncommon use-case.  Also, to get
 accurate stats for such queries that way, you'd need really enormous
 histograms.  I doubt that the existing parameters for histogram size
 will permit meaningful estimation of more than the first array entry
 (since we don't make the histogram any larger than we do for a scalar
 column).

 The real point here is that the fact that we're storing btree-style
 stats for arrays is an accident, backed into by having added btree
 comparators for arrays plus analyze.c's habit of applying default
 scalar-oriented analysis functions to any type without an explicit
 typanalyze entry.  I don't recall that we ever thought hard about
 it or showed that those stats were worth anything.


OK. I don't object to removing btree stats from arrays.
What do you think about the pg_stats view in this case? Should it combine
the values histogram and the array length histogram in a single column,
as we do for MCV and MCELEM?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:
 I've attached a new version that includes the UINT64_FMT fix, some edits of
 your newest comments, and a rerun of pgindent on the new files.  I see no
 other issues precluding commit, so I am marking the patch Ready for
 Committer.
 If I made any of the comments worse, please post another update.

 Changes look reasonable to me. Thanks!

 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

 If I understand your suggestion, queries of the form

 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
  AND x <= ARRAY[ 1, 2, 3, 1000];

would no longer use an index. Is that correct?

Are you suggesting removing MCV's in lieu of MCE's as well?

-Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Tom Lane
Nathan Boley npbo...@gmail.com writes:
 On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

 If I understand your suggestion, queries of the form

 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
  AND x <= ARRAY[ 1, 2, 3, 1000];

 would no longer use an index. Is that correct?

No, just that we'd no longer have statistics relevant to that, and would
have to fall back on default selectivity assumptions.  Do you think that
such applications are so common as to justify bloating pg_statistic for
everybody that uses arrays?

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Nathan Boley npbo...@gmail.com writes:
 On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

 If I understand your suggestion, queries of the form

 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
      AND x <= ARRAY[ 1, 2, 3, 1000];

 would no longer use an index. Is that correct?

 No, just that we'd no longer have statistics relevant to that, and would
 have to fall back on default selectivity assumptions.

Which, currently, would mean queries of that form would typically use
a table scan, right?

 Do you think that
 such applications are so common as to justify bloating pg_statistic for
 everybody that uses arrays?

I have no idea, but it seems like it will be a substantial regression
for the people that are.

What about MCV's? Will those be removed as well?

Best,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Tom Lane
Nathan Boley npbo...@gmail.com writes:
 On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Nathan Boley npbo...@gmail.com writes:
 If I understand your suggestion, queries of the form
 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
      AND x <= ARRAY[ 1, 2, 3, 1000];
 would no longer use an index. Is that correct?

 No, just that we'd no longer have statistics relevant to that, and would
 have to fall back on default selectivity assumptions.

 Which, currently, would mean queries of that form would typically use
 a table scan, right?

No, it doesn't.

 What about MCV's? Will those be removed as well?

Sure.  Those seem even less useful.

regards, tom lane



Re: [HACKERS] Collect frequency statistics for arrays

2012-01-23 Thread Noah Misch
On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote:
 Updated patch is attached. I've updated the comment of
 mcelem_array_contained_selec with a more detailed description of the
 probability distribution assumption. Also, I found that the rest behaviour
 is better described by a Poisson distribution; relevant changes were made.

Thanks.  That makes more of the math clear to me.  I do not follow all of it,
but I feel that the comments now have enough information that I could go about
doing so.

 + /* Take care about events with low probabilities. */
 + if (rest > DEFAULT_CONTAIN_SEL)
 + {

Why the change from rest > 0 to this in the latest version?

 + /* emit some statistics for debug purposes */
 + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
 +      "# elements = %llu, hashtable size = %d, usable entries = %d",
 +      num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT.  (I introduced that error in v0.10.)


I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files.  I see no
other issues precluding commit, so I am marking the patch Ready for Committer.
If I made any of the comments worse, please post another update.

Thanks,
nm


arrayanalyze-0.13.patch.gz
Description: application/gunzip



Re: [HACKERS] Collect frequency statistics for arrays

2012-01-23 Thread Alexander Korotkov
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:

  + /* Take care about events with low probabilities. */
  + if (rest > DEFAULT_CONTAIN_SEL)
  + {

 Why the change from rest > 0 to this in the latest version?

Earlier, addition of the rest distribution required O(m) time. Now there is
a more accurate and proven estimate, but it takes O(m^2) time. It doesn't
make the overall asymptotic time worse, but it is significant. That's why I
decided to skip it for low values of rest, which don't change the
distribution significantly.



  + /* emit some statistics for debug purposes */
  + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
  +      "# elements = %llu, hashtable size = %d, usable entries = %d",
  +      num_mcelem, bucket_width, element_no, i, track_len);

 That should be UINT64_FMT.  (I introduced that error in v0.10.)


 I've attached a new version that includes the UINT64_FMT fix, some edits of
 your newest comments, and a rerun of pgindent on the new files.  I see no
 other issues precluding commit, so I am marking the patch Ready for
 Committer.

Great!


 If I made any of the comments worse, please post another update.

Changes look reasonable to me. Thanks!

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-22 Thread Alexander Korotkov
Hi!

Updated patch is attached. I've updated the comment of
mcelem_array_contained_selec with a more detailed description of the
probability distribution assumption. Also, I found that the rest behaviour
is better described by a Poisson distribution; relevant changes were made.

On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch n...@leadboat.com wrote:

 By summary frequency of elements, do you mean literally P_0 + P_1 ... +
 P_N?
 If so, I can follow the above argument for column && const and column <@
 const, but not for column @> const.  For column @> const, selectivity
 cannot exceed the smallest frequency among const elements.  A number of
 high-frequency elements will drive up the sum of the frequencies without
 changing the true selectivity much at all.

Referencing the summary frequency is not really correct. It would be more
correct to reference the number of elements in const. When there are many
elements in const, column @> const selectivity tends to be close to 0 and
column <@ const tends to be close to 1. Surely, that holds when elements
have middling frequencies (not very close to 0 and not very close to 1).
I've replaced summary frequency of elements with number of elements.
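
To make the tendency concrete, here is a toy model (purely illustrative,
not from the patch): N equally likely element types, const listing m of
them, and arrays of k independently drawn elements.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		N = 100.0;	/* element types in the column */
	double		k = 3.0;	/* elements per array */
	double		p = 0.3;	/* per-element occurrence frequency */
	int			m;

	/* P(column <@ const) ~ (m/N)^k rises toward 1 as m approaches N */
	for (m = 10; m <= 100; m += 45)
		printf("m = %3d:  P(<@) ~ %.4f\n", m, pow(m / N, k));

	/* P(column @> const) <= p^m collapses toward 0 as m grows */
	for (m = 1; m <= 7; m += 3)
		printf("m = %3d:  P(@>) <= %.5f\n", m, pow(p, m));
	return 0;
}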

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.12.patch.gz
Description: GNU Zip compressed data


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-17 Thread Alexander Korotkov
Hi!

Thanks for your fixes to the patch. They look correct to me. I made some
more fixes in the patch. The proof of some concepts is still needed; I'm
going to provide it in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch n...@leadboat.com wrote:

  I'm not sure about a shared lossy counting module, because the shared
  part of the code would be relatively small. The part of the
  compute_array_stats function which takes care of array decompression,
  distinct occurrence calculation, the distinct element count histogram,
  packing statistics slots etc. is much larger than the lossy counting
  algorithm itself. Maybe there are other opinions in the community?

 True; it would probably increase total lines of code.  The benefit, if any,
 lies in separation of concerns; the business of implementing this algorithm
 is quite different from the other roles of these typanalyze functions.  I
 won't insist that you try it, though.

I'd prefer to try it as a separate patch.


+ /*
+  * The probability of no occurrence of the events which form the rest
+  * probability has a limit of exp(-rest) when the number of events goes
+  * to infinity. Another simplification is to replace those events with
+  * one event of probability (1 - exp(-rest)).
+  */
+ rest = 1.0f - exp(-rest);

  What is the name of the underlying concept in probability theory?
 
  The closest concept to the calculated distribution is the multinomial
  distribution. But it's not exactly the same, because the multinomial
  distribution gives the probability of a particular count of each event's
  occurrence, not the probability of summary occurrence. Actually, the
  distribution is calculated just from the assumption of event
  independence. The closest concept to the rest probability is
  approximation by an exponential distribution. It's quite a rough
  approximation, but I can't invent something better with low calculation
  complexity.

 Do you have a URL of a tutorial or paper that explains the method in more
 detail?  If, rather, this is a novel synthesis, could you write a proof to
 include in the comments?

Unfortunately, I don't have a relevant paper for it. I'll try to search for
one. Otherwise I'll try to write a proof.


 + /*
 +  * Array selectivity estimation based on most common elements statistics
 +  * for the column <@ const case. The assumption that element occurrences
 +  * are independent implies a certain distribution of array lengths.
 +  * Typically the real distribution of lengths is significantly different
 +  * from it. For example, even if we have a set of arrays with 1 integer
 +  * element in range [0;10] each, element occurrences are not independent.
 +  * Because in the case of independence we
   Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
   '{6,19,4}' cannot appear?
 
  I refer to a column where only one element exists per value, i.e. the
  only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}',
  '{7}', '{8}', '{9}', '{10}'. That is a corner case. But a similar
  situation occurs when, for example, we have a distribution of distinct
  element counts between 1 and 3. It significantly differs from the
  distribution under independent occurrence.

 Oh, I think I see now.  If each element 1..10 had frequency 0.1
 independently, column values would have exactly one distinct element just
 39% of the time?

Yes, that's right.
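
The figure is easy to verify under the stated assumption (a standalone
check, not code from the patch):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	int			n = 10;		/* element types */
	double		p = 0.1;	/* independent per-element frequency */

	/* binomial: P(exactly one of n independent events occurs) */
	printf("%.3f\n", n * p * pow(1.0 - p, n - 1));	/* prints 0.387 */
	return 0;
}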


 If probability theory has a prototypical problem resembling this, it would
 be nice to include a URL to a thorough discussion thereof.  I could not
 think of the search terms to find one, though.

Actually, using both the distinct element count histogram and the element
occurrence frequencies produces some probability distribution (which is more
complex than just independent element occurrence). If the real distribution
is close to this distribution, then the estimate is accurate. I haven't met
such distributions in papers; actually, I've just invented it in my attempts
to do accurate column <@ const estimation at least in simple cases. I'll
try to search for similar things in papers, but I doubt I'll succeed.
Otherwise I'll try to write a more detailed proof.


  *** /dev/null
  --- b/src/backend/utils/adt/array_selfuncs.c

  + Selectivity
  + calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
  + Oid operator)
  + {
  + Oid elemtype;
  + Selectivity selec;
  + TypeCacheEntry *typentry;
  + Datum  *hist;
  + int nhist;
  + FunctionCallInfoData cmpfunc;
  +
  + elemtype = get_base_element_type(vardata->vartype);
  +
  +
  + /* Get default comparison function */
  + typentry = lookup_type_cache(elemtype,
  +    TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
  +
  + /* Handle only "=" operator. Return default selectivity in other cases. */
  + if (operator != 

Re: [HACKERS] Collect frequency statistics for arrays

2012-01-17 Thread Noah Misch
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:
 Thanks for your fixes to the patch. They look correct to me. I made some
 more fixes in the patch. The proof of some concepts is still needed; I'm
 going to provide it in a few days.

Your further fixes look good.  Could you also answer my question about the
header comment of mcelem_array_contained_selec()?

/*
 * Estimate selectivity of column <@ const based on most common element
 * statistics.  Independent element occurrence would imply a particular
 * distribution of distinct element counts among matching rows.  Real data
 * usually falsifies that assumption.  For example, in a set of 1-element
 * integer arrays having elements in the range [0;10], element occurrences are
 * not independent.  If they were, a sufficiently-large set would include all
 * distinct element counts 0 through 11.  We correct for this using the
 * histogram of distinct element counts.
 *
 * In the column @> const and column && const cases, we usually have
 * const with low summary frequency of elements (otherwise we have
 * selectivity close to 0 or 1 correspondingly).  That's why the effect of
 * dependence related to distinct element counts distribution is negligible
 * there.  In the column <@ const case, summary frequency of elements is
 * high (otherwise we have selectivity close to 0).  That's why we should do
 * correction due to array distinct element counts distribution.
 */

By summary frequency of elements, do you mean literally P_0 + P_1 ... + P_N?
If so, I can follow the above argument for column && const and column <@
const, but not for column @> const.  For column @> const, selectivity
cannot exceed the smallest frequency among const elements.  A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.

Thanks,
nm


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-07 Thread Alexander Korotkov
Hi!

A patch in which most of the issues are fixed is attached.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch n...@leadboat.com wrote:
 I find distressing the thought of having two copies of the lossy sampling
 code, each implementing the algorithm with different variable names and
 levels of generality.  We might someday extend this to hstore, and then
 we'd have yet another copy.  Tom commented[1] that ts_typanalyze() and
 array_typanalyze() should remain distinct, and I agree.  However, they
 could call a shared counting module.  Is that practical?  Possible API:

 typedef struct LossyCountCtl;
 LossyCountCtl *LossyCountStart(float s,
                                float epsilon,
                                int2 typlen,
                                bool typbyval,
                                Oid eqfunc); /* + hash func, a few others */
 void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
 TrackItem **LossyCountGetAll(LossyCountCtl *ctl);

 [1]
http://archives.postgresql.org/message-id/12406.1298055...@sss.pgh.pa.us

I'm not sure about a shared lossy counting module, because the shared part
of the code would be relatively small. The part of the compute_array_stats
function which takes care of array decompression, distinct occurrence
calculation, the distinct element count histogram, packing statistics slots
etc. is much larger than the lossy counting algorithm itself. Maybe there
are other opinions in the community?
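
For reference, the core that such a shared module would wrap is small.
Below is a simplified standalone sketch of lossy counting (Manku &
Motwani), using int values instead of Datums and a linear-scan table
instead of a hash table; the names are illustrative, not the proposed API:

#include <stdio.h>

#define MAX_TRACK 1024

typedef struct
{
	int			value;
	long		count;			/* occurrences observed since tracked */
	long		delta;			/* maximum possible undercount */
} TrackItem;

static TrackItem track[MAX_TRACK];
static int	track_len = 0;
static long seen = 0;

static void
lossy_count_add(int value, long bucket_width)
{
	long		b_current;
	int			i,
				found = 0;

	seen++;
	b_current = (seen + bucket_width - 1) / bucket_width;

	for (i = 0; i < track_len; i++)
	{
		if (track[i].value == value)
		{
			track[i].count++;
			found = 1;
			break;
		}
	}
	if (!found && track_len < MAX_TRACK)
	{
		track[track_len].value = value;
		track[track_len].count = 1;
		track[track_len].delta = b_current - 1;
		track_len++;
	}

	/* at each bucket boundary, evict entries that may be infrequent */
	if (seen % bucket_width == 0)
	{
		int			kept = 0;

		for (i = 0; i < track_len; i++)
			if (track[i].count + track[i].delta > b_current)
				track[kept++] = track[i];
		track_len = kept;
	}
}

int
main(void)
{
	int			stream[] = {1, 1, 2, 1, 3, 1, 2, 1};
	int			i;

	for (i = 0; i < 8; i++)
		lossy_count_add(stream[i], 4);	/* bucket width = ceil(1/epsilon) */
	for (i = 0; i < track_len; i++)
		printf("value %d seen at least %ld times\n",
			   track[i].value, track[i].count);
	return 0;
}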

 I think this is an improvement, but some code out there may rely on the
 ability to get stakind = 4 data from the most_common_vals column.  We'll
 need to mention this in the release notes as an incompatibility.

I'm not sure I understand the mechanism of release notes. Does it require
something in the patch itself?

  + /*
  +  * Let there be n independent events with probabilities p. This function
  +  * calculates the probabilities of exactly k of the events occurring,
  +  * for k in [0;m]. Imagine a matrix M of (n + 1) x (m + 1) size. Element
  +  * M[i,j] denotes the probability that exactly j of the first i events
  +  * occur. Obviously M[0,0] = 1. Each next event increases the total
  +  * number of occurred events if it occurs, and leaves that number
  +  * unchanged if it doesn't. So, by the law of total probability:
  +  * M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
  +  * for i > 0, j > 0, and M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
  +  * Also there could be some events with low probabilities. Their
  +  * summary probability is passed in the rest parameter.
  +  */
  + static float *
  + calc_distr(float *p, int n, int m, float rest)
  + {

  + /* Take care of events with low probabilities. */
  + if (rest > 0.0f)
  + {
  + /*
  +  * The probability of no occurrence of the events which form the rest
  +  * probability has a limit of exp(-rest) when the number of events goes
  +  * to infinity. Another simplification is to replace those events with
  +  * one event of probability (1 - exp(-rest)).
  +  */
  + rest = 1.0f - exp(-rest);

 What is the name of the underlying concept in probability theory?

The closest concept to the calculated distribution is the multinomial
distribution. But it's not exactly the same, because the multinomial
distribution gives the probability of a particular count of each event's
occurrence, not the probability of summary occurrence. Actually, the
distribution is calculated just from the assumption of event independence.
The closest concept to the rest probability is approximation by an
exponential distribution. It's quite a rough approximation, but I can't
invent something better with low calculation complexity.
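
The recurrence is compact enough to sketch standalone (a simplified
illustration with doubles and calloc rather than the patch's actual
calc_distr; the function name here is made up):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* probabilities of exactly k of n independent events occurring, k in [0;m] */
static double *
calc_distr_sketch(const double *p, int n, int m, double rest)
{
	double	   *row = calloc(m + 1, sizeof(double));
	int			i,
				j;

	row[0] = 1.0;				/* M[0,0] = 1 */
	for (i = 0; i < n; i++)
	{
		/* sweep j downward so the single row can be updated in place */
		for (j = m; j > 0; j--)
			row[j] = row[j] * (1.0 - p[i]) + row[j - 1] * p[i];
		row[0] *= 1.0 - p[i];
	}

	/*
	 * Fold the low-probability leftovers into one pseudo-event: as their
	 * number grows with total probability "rest" fixed, P(none occurs)
	 * approaches exp(-rest), hence one event of probability 1 - exp(-rest).
	 */
	if (rest > 0.0)
	{
		double		pr = 1.0 - exp(-rest);

		for (j = m; j > 0; j--)
			row[j] = row[j] * (1.0 - pr) + row[j - 1] * pr;
		row[0] *= 1.0 - pr;
	}
	return row;					/* caller frees */
}

int
main(void)
{
	double		p[3] = {0.1, 0.1, 0.1};
	double	   *d = calc_distr_sketch(p, 3, 3, 0.0);
	int			k;

	for (k = 0; k <= 3; k++)
		printf("P(exactly %d) = %.4f\n", k, d[k]);	/* 0.729 0.243 0.027 0.001 */
	free(d);
	return 0;
}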

  + /*
  +  * Array selectivity estimation based on most common elements statistics
  +  * for the column <@ const case. The assumption that element occurrences
  +  * are independent implies a certain distribution of array lengths.
  +  * Typically the real distribution of lengths is significantly different
  +  * from it. For example, even if we have a set of arrays with 1 integer
  +  * element in range [0;10] each, element occurrences are not independent.
  +  * Because in the case of independence we

 Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
 '{6,19,4}' cannot appear?

I refer to a column where only one element exists per value, i.e. the only
possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}',
'{8}', '{9}', '{10}'. That is a corner case. But a similar situation occurs
when, for example, we have a distribution of distinct element counts between
1 and 3. It significantly differs from the distribution under independent
occurrence.

  +  * have probabilities of length of 0, 1, 2 etc. In the column @> const
  +  * and column && const cases we usually have const with low summary
  +  * frequency of elements (otherwise we have selectivity close to 0 or 1
  +  * correspondingly). That's why effect of dependence related to lengths
  +  

Re: [HACKERS] Collect frequency statistics for arrays

2012-01-06 Thread Noah Misch
Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote:
 On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:
  +  *We set s to be the estimated frequency of the K'th element in a
  +  *natural language's frequency table, where K is the target number of
  +  *entries in the MCELEM array. We assume that the distribution of
  +  *element frequencies follows Zipf's law with an exponent of 1.
  +  *
  +  *Assuming Zipfian distribution, the frequency of the K'th element is
  +  *equal to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is
  +  *the number of elements in the language.  Putting W as one million,
  +  *we get roughly 0.07/K. This gives s = 0.07/K.  We set epsilon = s/10,
  +  *which gives bucket width w = K/0.007 and maximum expected hashtable
  +  *size of about 1000 * K.
 
 These last two paragraphs, adapted from ts_typanalyze.c, assume natural
 language documents.  To what extent do these parameter choices remain sensible
 for arbitrary data such as users may place in arrays?  In any event, we need a
 different justification, even if it's just a hand-wavy justification.
 
 If I'm following this correctly, this choice of s makes the algorithm
 guaranteed to find only elements constituting >= 7% of the input elements.
 Incidentally, isn't that far too high for natural language documents?  If the
 English word "the" only constitutes 7% of typical documents, then this s
 value would permit us to discard practically every word; we'd be left with
 words read while filling the last bucket and words that happened to repeat
 exceedingly often in the column.  I haven't tried to make a test case to
 observe this problem; am I missing something?  (This question is largely
 orthogonal to your patch.)

No, we'll find elements of frequency at least 0.07/(default_statistics_target
* 10) -- in the default configuration, 0.007%.  Also, ts_typanalyze() counts
the number of documents that contain one or more instances of each lexeme,
ignoring the number of appearances within each document.  The word "the" may
constitute 7% of a typical document, but it will appear at least once in
nearly 100% of documents.  Therefore, this s value is adequate even for the
pathological case of each document containing just one lexeme.
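
The arithmetic behind those constants is easy to verify (a standalone
back-of-envelope check; K = 1000 assumes the default statistics target of
100, times 10):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* H(10^6) ~ ln(10^6) + Euler's constant ~ 14.39, so 1/H(W) ~ 0.07 */
	double		H_W = log(1e6) + 0.5772156649;
	int			K = 1000;
	double		s = 0.07 / K;
	double		epsilon = s / 10.0;

	printf("1/H(W)       = %.4f\n", 1.0 / H_W);		/* ~0.0695 */
	printf("s            = %g\n", s);			/* 7e-05, i.e. 0.007% */
	printf("bucket width = %.0f\n", 1.0 / epsilon);	/* ~142857 = K/0.007 */
	return 0;
}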

  +  *
  +  *Note: in the above discussion, s, epsilon, and f/N are in terms of an
  +  *element's frequency as a fraction of all elements seen in the input.
  +  *However, what we actually want to store in the finished pg_statistic
  +  *entry is each element's frequency as a fraction of all rows that it
  +  *occurs in. Elements might be repeated in the same array. Since the
  +  *operators @>, &&, <@ care only about element occurrence itself and
  +  *not about occurrence count, the function takes additional care about
  +  *uniqueness of counting. Also we need to change the divisor from N to
  +  *nonnull_cnt to get the number we want.
 
 On the same tangent, why does ts_typanalyze() not deduplicate the same way?
 The @@ operator has the same property.

Answer: to_tsvector() will have already done so.

  +   /*
  +    * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
  +    * comment above.
  +    */
  +   bucket_width = num_mcelem * 1000 / 7;
 
 The addend mentioned is not present in the code or discussed in the comment
 above.  (I see the comment is copied verbatim from ts_typanalyze(), where the
 addend *is* present, though again the preceding comment says nothing of it.)

The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-03 Thread Alexander Korotkov
Hi!

Thanks for your great work reviewing this patch. Now I'm trying to find the
memory corruption bug. Unfortunately, it doesn't appear on my system. Can
you check whether this bug remains in the attached version of the patch? If
so, please provide me with information about the system you're running
(processor, OS etc.).

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.8.patch.gz
Description: GNU Zip compressed data


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-03 Thread Noah Misch
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
 Thanks for your great work reviewing this patch. Now I'm trying to find the
 memory corruption bug. Unfortunately, it doesn't appear on my system. Can
 you check whether this bug remains in the attached version of the patch? If
 so, please provide me with information about the system you're running
 (processor, OS etc.).

I get the same diagnostic from this version.  Opteron processor, operating
system is Ubuntu 8.04 (64-bit).  You're using --enable-cassert, right?


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-03 Thread Alexander Korotkov
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch n...@leadboat.com wrote:

 On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
  Thanks for your great work reviewing this patch. Now I'm trying to find
  the memory corruption bug. Unfortunately, it doesn't appear on my system.
  Can you check whether this bug remains in the attached version of the
  patch? If so, please provide me with information about the system you're
  running (processor, OS etc.).

 I get the same diagnostic from this version.  Opteron processor, operating
 system is Ubuntu 8.04 (64-bit).  You're using --enable-cassert, right?

Oh, actually no. Thanks for pointing that out.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2011-12-27 Thread Noah Misch
On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:
 On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley npbo...@gmail.com wrote:
 
  FYI, I've added myself as the reviewer for the current commitfest.
 
 How is the review going now?

I will examine this patch within the week.


Re: [HACKERS] Collect frequency statistics for arrays

2011-12-20 Thread Alexander Korotkov
Hi!

On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley npbo...@gmail.com wrote:

 FYI, I've added myself as the reviewer for the current commitfest.

How is the review going now?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2011-11-15 Thread Nathan Boley
 Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley


Re: [HACKERS] Collect frequency statistics for arrays

2011-11-09 Thread Alexander Korotkov
Rebased with head.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.7.patch.gz
Description: GNU Zip compressed data
