subject:"\[HACKERS\] WIP\: Fast GiST index build"


On 26.08.2011 17:18, Alexander Korotkov wrote:

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a test
server.



1) I've updated links to the datasets on the wiki page.
2) Script for index quality testing fastbuild_test.php is in the attachment.
In order to run it you need PHP with pdo and pdo_pgsql modules. Also
plantuner moduler is required (it is used to force planer to use specific
index). After running that script following query returns relative score of
index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
assumed to be 1.
3) Patch which makes plantuner work with HEAD is also in attachment.
4) Patch with my split algorithm implementation is attached. Now it's form
is appropriate only for testing purposes.
5) For indexes creation I use simple script which is attached as
'indexes.sql'. Also, similar script with different index names I'm running
with my split patch.

Feel free to ask questions about all this stuff.


Thanks. Meanwhile, I hacked together my own set of test scripts, and let 
them run over the weekend. I'm still running tests with ordered data, 
but here are some preliminary results:


   testname   |   nrows   |duration | accesses
-+---+-+--
 points unordered auto   | 25000 | 08:08:39.174956 |  3757848
 points unordered buffered   | 25000 | 09:29:16.47012  |  4049832
 points unordered unbuffered | 25000 | 03:48:10.999861 |  4564986

As you can see, the results are very disappointing :-(. The buffered 
builds take a lot *longer* than unbuffered ones. I was expecting the 
buffering to be very helpful at least in these unordered tests. On the 
positive side, the buffering made index quality somewhat better 
(accesses column, smaller is better), but that's not what we're aiming at.


What's going on here? This data set was large enough to not fit in RAM, 
the table was about 8.5 GB in size (and I think the index is even larger 
than that), and the box has 4GB of RAM. Does the buffering only help 
with even larger indexes that exceed the cache size even more?



Test methodology


These tests consist of creating a gist index using the point datatype.

 Table public.points
  Column |  Type   | Modifiers
+-+---
  x  | integer |
  y  | integer |

CREATE INDEX testindex ON points_ordered USING gist (point(x,y)) WITH 
(buffering = 'on');


The points in the table are uniformly distributed. In the 'unordered' 
tests, they are in random order. The ordered tests use the exact same 
data, but sorted by x, y coordinates.


The 'accesses' column measures the quality of the produced index. 
Smaller is better. It is calculated by performing a million queries on 
the table, selecting points within a small square at evenly spaced 
locations. Like:


(SELECT COUNT(*) FROM points WHERE point(x,y) @ box(point(xx-20, 
yy-20), point(xx+20, yy+20)));


The number of index pages touched by those queries are count from 
pg_statio_user_indexes, and that number is reported in the 'accesses' 
column.


I've attached the whole script used. Pass the number of rows to use in 
the test as argument, and the script does the rest.


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com



rungisttests.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build


On 30.08.2011 12:08, Heikki Linnakangas wrote:

What's going on here? This data set was large enough to not fit in RAM,
the table was about 8.5 GB in size (and I think the index is even larger
than that), and the box has 4GB of RAM. Does the buffering only help
with even larger indexes that exceed the cache size even more?


The tests are still running, so I decided to try oprofile. The build is 
in the final emptying phase, according to the log, and has been for over 
half an hour now. Oprofile output looks very interesting:


samples  %image name   symbol name
228590   30.3045  postgres pg_qsort
200558   26.5882  postgres gistBuffersFreeBlocksCmp
49397 6.5486  postgres gistchoose
45563 6.0403  postgres gist_box_penalty
31425 4.1661  postgres AllocSetAlloc
24182 3.2058  postgres FunctionCall3Coll
22671 3.0055  postgres rt_box_union
20147 2.6709  postgres gistpenalty
17007 2.2546  postgres DirectFunctionCall2Coll
15790 2.0933  no-vmlinux   /no-vmlinux
14148 1.8756  postgres XLogInsert
10612 1.4068  postgres gistdentryinit
10542 1.3976  postgres MemoryContextAlloc
9466  1.2549  postgres FunctionCall1Coll
9190  1.2183  postgres gist_box_decompress
6681  0.8857  postgres med3
4941  0.6550  libc-2.12.so isnanf

So, over 50% of the CPU time is spent in choosing a block from the 
temporary files. That should be pretty easy to improve..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-30 Thread Alexander Korotkov

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 So, over 50% of the CPU time is spent in choosing a block from the
 temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-30 Thread Alexander Korotkov

On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:


 Thanks. Meanwhile, I hacked together my own set of test scripts, and let
 them run over the weekend. I'm still running tests with ordered data, but
 here are some preliminary results:

   testname   |   nrows   |duration | accesses
 -+**---+-+**--
  points unordered auto   | 25000 | 08:08:39.174956 |  3757848
  points unordered buffered   | 25000 | 09:29:16.47012  |  4049832
  points unordered unbuffered | 25000 | 03:48:10.999861 |  4564986

 As you can see, the results are very disappointing :-(. The buffered builds
 take a lot *longer* than unbuffered ones. I was expecting the buffering to
 be very helpful at least in these unordered tests. On the positive side, the
 buffering made index quality somewhat better (accesses column, smaller is
 better), but that's not what we're aiming at.

 What's going on here? This data set was large enough to not fit in RAM, the
 table was about 8.5 GB in size (and I think the index is even larger than
 that), and the box has 4GB of RAM. Does the buffering only help with even
 larger indexes that exceed the cache size even more?

This seems pretty strange for me. Time of unbuffered index build shows that
there is not bottleneck at IO. That radically differs from my
experiments. I'm going to try your test script on my test setup.
While I have only express assumption that random function appears to be
somewhat bad. Thereby unordered dataset behave like the ordered one. Can you
rerun tests on your test setup with dataset generation on the backend like
this?
CREATE TABLE points AS (SELECT point(random(), random() FROM
generate_series(1,1000));

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build


On 30.08.2011 13:29, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..


Yes, probably we can just remove free blocks sorting.


I'm re-running the tests with that change now. It seems like using the 
list of free blocks as a simple stack would be better anyway, it 
probably yields a better cache hit ratio when we re-use blocks that have 
just been released.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build


On 30.08.2011 13:38, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:



Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data, but
here are some preliminary results:

   testname   |   nrows   |duration | accesses
-+**---+-+**--
  points unordered auto   | 25000 | 08:08:39.174956 |  3757848
  points unordered buffered   | 25000 | 09:29:16.47012  |  4049832
  points unordered unbuffered | 25000 | 03:48:10.999861 |  4564986

As you can see, the results are very disappointing :-(. The buffered builds
take a lot *longer* than unbuffered ones. I was expecting the buffering to
be very helpful at least in these unordered tests. On the positive side, the
buffering made index quality somewhat better (accesses column, smaller is
better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM, the
table was about 8.5 GB in size (and I think the index is even larger than
that), and the box has 4GB of RAM. Does the buffering only help with even
larger indexes that exceed the cache size even more?


This seems pretty strange for me. Time of unbuffered index build shows that
there is not bottleneck at IO. That radically differs from my
experiments. I'm going to try your test script on my test setup.
While I have only express assumption that random function appears to be
somewhat bad. Thereby unordered dataset behave like the ordered one.


Oh. Doing a simple SELECT * FROM points LIMIT 10, it looks pretty 
random to me. The data should be uniformly distributed in a rectangle 
from (0, 0) to (10, 10).



 Can you
rerun tests on your test setup with dataset generation on the backend like
this?
CREATE TABLE points AS (SELECT point(random(), random() FROM
generate_series(1,1000));


Ok, I'll queue up that test after the ones I'm running now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build


On 30.08.2011 13:29, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..


Yes, probably we can just remove free blocks sorting.


Ok, the first results are in for that:

 testname  |   nrows   |duration | accesses
---+---+-+--
 points unordered buffered | 25000 | 06:00:23.707579 |  4049832

From the previous test runs, the unbuffered index build took under 4 
hours, so even though this is a lot better than with the sorting, it's 
still a loss compared to non-buffered build.


I had vmstat running during most of this index build. At a quick glance, 
it doesn't seem to be CPU bound anymore. I suspect the buffers in the 
temporary file gets very fragmented. Or, we're reading it in backwards 
order because the buffers work in a LIFO fashion. The system seems to be 
doing about 5 MB/s of I/O during the build, which sounds like a figure 
you'd get for more or less random I/O.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-30 Thread Alexander Korotkov

On Tue, Aug 30, 2011 at 9:29 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 30.08.2011 13:29, Alexander Korotkov wrote:

 On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas
 heikki.linnakangas@**enterprisedb.comheikki.linnakan...@enterprisedb.com
  wrote:

  So, over 50% of the CPU time is spent in choosing a block from the
 temporary files. That should be pretty easy to improve..

  Yes, probably we can just remove free blocks sorting.


 Ok, the first results are in for that:

 testname  |   nrows   |duration | accesses
 ---+--**-+-+--**
  points unordered buffered | 25000 | 06:00:23.707579 |  4049832

 From the previous test runs, the unbuffered index build took under 4 hours,
 so even though this is a lot better than with the sorting, it's still a loss
 compared to non-buffered build.

 I had vmstat running during most of this index build. At a quick glance, it
 doesn't seem to be CPU bound anymore. I suspect the buffers in the temporary
 file gets very fragmented. Or, we're reading it in backwards order because
 the buffers work in a LIFO fashion. The system seems to be doing about 5
 MB/s of I/O during the build, which sounds like a figure you'd get for more
 or less random I/O.


So, we still have two questions:
1) Why buffering build algorithm doesn't show any benefit on these tests?
2) Why test results on your test setup differs from test results on my test
setup?

I can propose following answers now:
1) I think it's because high overlaps in the tree. As I mentioned before
high overlaps can cause only fraction of the tree to be used for actual
inserts. For comparison, with my split algorithm (which produce almost no
overlaps on uniform dataset) buffering index build took 4 hours, while
regular build is still running (already more than 8 days = 192 hours)!
2) Probably it's because different behavour of OS cache. For example, on my
test setup OS displace unused pages from cache too slowly. Thereby buffering
algorithm showed benefit nevertheless.

Also it seems to me that I start to understand problem of new linear
splitting algorithm. On dataset with 1M rows it produces almost no overlaps
while it produces significant overlaps already on 10M rows (drama!).
Probably nobody tested it on large enough datasets (neither while original
research or before commit). I'll dig it in more details and provide some
testing results.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-29 Thread Heikki Linnakangas


On 26.08.2011 17:18, Alexander Korotkov wrote:

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a test
server.



1) I've updated links to the datasets on the wiki page.
2) Script for index quality testing fastbuild_test.php is in the attachment.
In order to run it you need PHP with pdo and pdo_pgsql modules. Also
plantuner moduler is required (it is used to force planer to use specific
index). After running that script following query returns relative score of
index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
assumed to be 1.
3) Patch which makes plantuner work with HEAD is also in attachment.
4) Patch with my split algorithm implementation is attached. Now it's form
is appropriate only for testing purposes.
5) For indexes creation I use simple script which is attached as
'indexes.sql'. Also, similar script with different index names I'm running
with my split patch.

Feel free to ask questions about all this stuff.


Thanks! Meanwhile, I hacked together a script of my own to do 
performance testing. I let it run over the weekend, but I just realized 
that I forgot to vacuum the test tables so the results are not worth 
much. I'm rerunning them now, stay tuned!


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-26 Thread Alexander Korotkov

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Could you share the test scripts, patches and data sets etc. needed to
 reproduce the tests you've been running? I'd like to try them out on a test
 server.


1) I've updated links to the datasets on the wiki page.
2) Script for index quality testing fastbuild_test.php is in the attachment.
In order to run it you need PHP with pdo and pdo_pgsql modules. Also
plantuner moduler is required (it is used to force planer to use specific
index). After running that script following query returns relative score of
index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
assumed to be 1.
3) Patch which makes plantuner work with HEAD is also in attachment.
4) Patch with my split algorithm implementation is attached. Now it's form
is appropriate only for testing purposes.
5) For indexes creation I use simple script which is attached as
'indexes.sql'. Also, similar script with different index names I'm running
with my split patch.

Feel free to ask questions about all this stuff.

--
With best regards,
Alexander Korotkov.


fastbuild_test.php.gz
Description: GNU Zip compressed data


plantuner.patch.gz
Description: GNU Zip compressed data


my_split.patch.gz
Description: GNU Zip compressed data
select pg_stat_statements_reset();
set log_statement_stats = on;
set synchronize_seqscans = off;

create index uniform_idx1 on uniform using gist(point) with (buffering=on);
create index uniform_idx2 on uniform using gist(point) with (buffering=auto);
create index uniform_idx3 on uniform using gist(point) with (buffering=off);

create index usnoa2_idx1 on usnoa2 using gist(point) with (buffering=on);
create index usnoa2_idx2 on usnoa2 using gist(point) with (buffering=auto);
create index usnoa2_idx3 on usnoa2 using gist(point) with (buffering=off);

create index usnoa2_shuffled_idx1 on usnoa2_shuffled using gist(point) with (buffering=on);
create index usnoa2_shuffled_idx2 on usnoa2_shuffled using gist(point) with (buffering=auto);
create index usnoa2_shuffled_idx3 on usnoa2_shuffled using gist(point) with (buffering=off);


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-26 Thread Alexander Korotkov

On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:


 In the tests on the first version of patch I found index quality of
 regular
 build much better than it of buffering build (without neighborrelocation).
 Now it's similar, though it's because index quality of regular index build
 become worse. There by in current tests regular index build is faster than
 in previous. I see following possible causes of it:
  1) I didn't save source random data. So, now it's a new random data.
 2) Some environment parameters of my test setup may alters, though I
 doubt.
 Despite these possible explanation it seems quite strange for me.


 That's pretty surprising. Assuming the data is truly random, I wouldn't
 expect a big difference in the index quality of one random data set over
 another. If the index quality depends so much on, say, the distribution of
 the few first tuples that are inserted to it, that's a quite interesting
 find on its own, and merits some further research.

Yeah, it's pretty strange. Using same random datasets in different tests can
help to exclude onepossible cause of difference.


  In order to compare index build methods on more qualitative indexes, I've
 tried to build indexes with my double sorting split method (see:
 http://syrcose.ispras.ru/2011/**files/SYRCoSE2011_Proceedings.**
 pdf#page=36http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36).
 So
 on uniform dataset search is faster in about 10 times! And, as it was
 expected, regular index build becomes much slower. It runs more than 60
 hours and while only 50% of index is complete (estimated by file sizes).

 Also, automatic switching to buffering build shows better index quality
 results in all the tests. While it's hard for me to explain that.


 Hmm, makes me a bit uneasy that we're testing with a modified page
 splitting algorithm. But if the new algorithm is that good, could you post
 that as a separate patch, please?

I've post it in another message and I will try to get it into more
appropriate form. Let me clarify this a little. I don't think my split
algorithm is 10 times better than state of the art algorithms. I think that
currently used new linear split shows unreasonably bad results in may cases.
For example, uniformly distributed data is pretty easy case. And with almost
any splitting algorithm we can get index with almost zero overlaps. But new
linear split produces huge overlaps in this case. That's why I decided to
make some experiments with another split algorithm.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-25 Thread Heikki Linnakangas


On 24.08.2011 16:57, Alexander Korotkov wrote:

I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
There are not all the results I planned for the first chunk because it takes
more time than I expect.
Some notes about it.

Now I see two causes which accelerate regular build of GiST indexes:
1) As it was noted before regular index build of pretty ordered dataset is
fast.
2) I found that worse index is faster to build. I mean worse index is index
with higher overlaps. Function gistchoose selects the first index tuple with
zero penalty if any. Thus, with higher overlap in root page only few index
tuples of it will be choosed for insert. And, recursively, only small part
of the tree will be used for actual inserts. And that part of tree can
easier fit to the cache. Thus, high overlaps  makes inserts cheaper as much
as searches expensiver.


As an extreme case, a trivial penalty function that just always returns 
0 will make index build fast - but the index will be useless for querying.



In the tests on the first version of patch I found index quality of regular
build much better than it of buffering build (without neighborrelocation).
Now it's similar, though it's because index quality of regular index build
become worse. There by in current tests regular index build is faster than
in previous. I see following possible causes of it:
  1) I didn't save source random data. So, now it's a new random data.
2) Some environment parameters of my test setup may alters, though I doubt.
Despite these possible explanation it seems quite strange for me.


That's pretty surprising. Assuming the data is truly random, I wouldn't 
expect a big difference in the index quality of one random data set over 
another. If the index quality depends so much on, say, the distribution 
of the few first tuples that are inserted to it, that's a quite 
interesting find on its own, and merits some further research.



In order to compare index build methods on more qualitative indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
on uniform dataset search is faster in about 10 times! And, as it was
expected, regular index build becomes much slower. It runs more than 60
hours and while only 50% of index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests. While it's hard for me to explain that.


Hmm, makes me a bit uneasy that we're testing with a modified page 
splitting algorithm. But if the new algorithm is that good, could you 
post that as a separate patch, please?


That said, I don't see any new evidence that the buffering build 
algorithm would be significantly worse. There's the case of ordered data 
that we already knew about, and will have to just accept for now.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-24 Thread Alexander Korotkov

I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
There are not all the results I planned for the first chunk because it takes
more time than I expect.
Some notes about it.

Now I see two causes which accelerate regular build of GiST indexes:
1) As it was noted before regular index build of pretty ordered dataset is
fast.
2) I found that worse index is faster to build. I mean worse index is index
with higher overlaps. Function gistchoose selects the first index tuple with
zero penalty if any. Thus, with higher overlap in root page only few index
tuples of it will be choosed for insert. And, recursively, only small part
of the tree will be used for actual inserts. And that part of tree can
easier fit to the cache. Thus, high overlaps  makes inserts cheaper as much
as searches expensiver.

In the tests on the first version of patch I found index quality of regular
build much better than it of buffering build (without neighborrelocation).
Now it's similar, though it's because index quality of regular index build
become worse. There by in current tests regular index build is faster than
in previous. I see following possible causes of it:
 1) I didn't save source random data. So, now it's a new random data.
2) Some environment parameters of my test setup may alters, though I doubt.
Despite these possible explanation it seems quite strange for me.

In order to compare index build methods on more qualitative indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
on uniform dataset search is faster in about 10 times! And, as it was
expected, regular index build becomes much slower. It runs more than 60
hours and while only 50% of index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests. While it's hard for me to explain that.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-22 Thread Alexander Korotkov

On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
aekorot...@gmail.comwrote:

 On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas 
 heikki.linnakan...@enterprisedb.com wrote:

 On 16.08.2011 22:10, Heikki Linnakangas wrote:

 Here's an version of the patch with a bunch of minor changes:


 And here it really is, this time with an attachment...

 Thanks a lot. I'm going to start rerunning the tests now.


First bunch of test results will be available soon (tests running and
results processing take some time). While there is a patch with few small
bugfixes.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.14.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-17 Thread Alexander Korotkov

On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 16.08.2011 22:10, Heikki Linnakangas wrote:

 Here's an version of the patch with a bunch of minor changes:


 And here it really is, this time with an attachment...

Thanks a lot. I'm going to start rerunning the tests now.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov

I found that I forgot to remove levelstep and buffersize from reloptions.c.
Updated patch is attached.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.14.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas


Looking at the calculation of levelStep:


+   /*
+* Calculate levelStep by available amount of memory. We should be able 
to
+* load into main memory one page of each underlying node buffer (which
+* are in levelStep below). That give constraint over
+* maintenance_work_mem. Also we should be able to have subtree of
+* levelStep level in cache. That give constraint over
+* effective_cache_size.
+*
+* i'th underlying level of sub-tree can consists of
+* i^maxIndexTuplesPerPage pages at maximum. So, subtree of levelStep
+* levels can't be greater then 2 * maxIndexTuplesPerPage ^ levelStep
+* pages. We use some more reserve due to we probably can't take whole
+* effective cache and use formula 4 * maxIndexTuplesPerPage ^ 
levelStep =
+* effectiveCache. We use similar logic with maintenance_work_mem. We
+* should be able to store at least last pages of all buffers where we 
are
+* emptying current buffer to.
+*/
+   effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
+ effective_cache_size);
+   levelStep = (int) log((double) effectiveMemory / 4.0) /
+   log((double) maxIndexTuplesPerPage);
+


I can see that that's equal to the formula given in the paper, 
log_B(M/4B), but I couldn't see any explanation for that formula in the 
paper. Your explanation makes sense, but where did it come from?


It seems a bit pessimistic: while it's true that the a subtree can't be 
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter 
upper bound on it. The max size of a subtree of depth n can be 
calculated as the geometric series:


r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but 
for a large r and small n (which is typical), the 2 * 
maxIndexTuplesPerPage^levelStep formula gives a value that's almost 
twice as large as the real max size of a subtree.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov

On Tue, Aug 16, 2011 at 4:04 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 I can see that that's equal to the formula given in the paper, log_B(M/4B),
 but I couldn't see any explanation for that formula in the paper. Your
 explanation makes sense, but where did it come from?

I didn't find it too. But it has to reservse memory for both sub-tree and
active buffers. While we'are reserving memory for sub-tree in
effective_cache_size and memory for last pages of buffers in
maintenance_work_mem.


 It seems a bit pessimistic: while it's true that the a subtree can't be
 larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter
 upper bound on it. The max size of a subtree of depth n can be calculated as
 the geometric series:

 r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

 where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but for
 a large r and small n (which is typical), the 2 * 
 maxIndexTuplesPerPage^**levelStep
 formula gives a value that's almost twice as large as the real max size of a
 subtree.

Thus, we can calculate:
levelstep = min(log_r(1 + effective_cache_size_in_pages*(r - 1)) - 1,
log_r(maintenance_work_mem_in_pages - 1))
and get more precise result. But also we need at least very rough estimate
of memory occupied by node buffers hash tab and path items.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas

Why is there ever a buffer on the root node? It seems like a waste of 
time to load N tuples from the heap into the root buffer, only to empty 
the buffer after it fills up. You might as well pull tuples directly 
from the heap.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov

On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Why is there ever a buffer on the root node? It seems like a waste of time
 to load N tuples from the heap into the root buffer, only to empty the
 buffer after it fills up. You might as well pull tuples directly from the
 heap.

Yes, seems reasonable. Buffer on the root node was in the paper. But now I
don't see the need of it too.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas


On 16.08.2011 21:46, Alexander Korotkov wrote:

On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Why is there ever a buffer on the root node? It seems like a waste of time
to load N tuples from the heap into the root buffer, only to empty the
buffer after it fills up. You might as well pull tuples directly from the
heap.


Yes, seems reasonable. Buffer on the root node was in the paper. But now I
don't see the need of it too.


Here's an version of the patch with a bunch of minor changes:

* No more buffer on root node. Aside from the root buffer being 
pointless, this simplifies gistRelocateBuildBuffersOnSplit slightly as 
it doesn't need the special case for root block anymore.


* Moved the code to create new root item from gistplacetopage() to 
gistRelocateBuildBuffersOnSplit(). Seems better to keep the 
buffering-related code away from the normal codepath, for the sake of 
readability.


* Changed the levelStep calculation to use the more accurate upper bound 
on subtree size that we discussed.


* Changed the levelStep calculation so that it doesn't do just 
min(maintenance_work_mem, effective_cache_size) and calculate the 
levelStep from that. Maintenance_work_mem matters determines the max. 
number of page buffers that can be held in memory at a time, while 
effective_cache_size determines the max size of the subtree. Those are 
subtly different things.


* Renamed NodeBuffer to GISTNodeBuffer, to avoid cluttering the namespace

* Plus misc comment, whitespace, formatting and naming changes.

I think this patch is in pretty good shape now. Could you re-run the 
performance tests you have on the wiki page, please, to make sure the 
performance hasn't regressed? It would also be nice to get some testing 
on the levelStep and pagesPerBuffer estimates, and the point where we 
switch to the buffering method. I'm particularly interested to know if 
there's any corner-cases with very skewed data distributions or strange 
GUC settings, where the estimates fails badly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-14 Thread Alexander Korotkov

Hi!

Thank you for your notes.

On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  [ new patch ]

 Some random comments:

 - It appears that the noFollowFight flag is really supposed to be
 called noFollowRight.

Fixed.

- In gist_private.h you've written halt-filled where you really mean
 half-filled.

Fixed.


 - It seems like you're using reloptions to set parameters that are
 only going to do anything at index creation time.  IIUC, BUFFERING,
 LEVELSTEP and BUFFERSIZE have no permanent meaning for that index;
 they're just used ephemerally while constructing it.  If we're going
 to expose such things as options, maybe they should be GUCs, not
 reloptions.

Having these as index parameters may be helpful when you reindex. It's
likely that you would like to rebuild it with same parameters as it was
created. Actually, we have the same situation with FILLFACTOR: it is used
only during index creation.


 - Function names should begin with gist or some other, appropriate
 prefix, especially if they are non-static.  decreasePathRefcount(),
 getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
 getNodeBufferBusySize() violate this rule, and it might be good to
 change the static functions to follow it, too, just for consistency,
 and to avoid renaming things if something that's currently static
 later needs to be made non-static.

Fixed.


 - validateBufferOption needs to use ereport(), not elog().

Fixed.


 - This needs a bit of attention:

 +   /* TODO: Write the WAL record */
 +   if (RelationNeedsWAL(state-r))
 +   recptr = gistXLogSplit(state-r-rd_node,
 blkno, is_leaf,
 +   dist,
 oldrlink, oldnsn, InvalidBuffer, true);
 +   else
 +   recptr = GetXLogRecPtrForTemp();
 +

 I don't think the comment matches the code, since gistXLogSplit() does
 in fact call XLogInsert(). Also, you should probably move the

RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
 single caller of gistXLogSplit() is going to need to do the same
 dance.


 - In gistBufferingPlaceToPage, you've got a series of loops that look like
 this:

 +   for (ptr = dist; ptr; ptr = ptr-next)

 The first loop allocates a bunch of buffers.  The second loop sets up
 downlink pointers.   Then there's some other code.  Then there's a
 third loop, which adds items to each page in turn and sets up right
 links.  Then there's a fourth loop, which marks all those buffers
 dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
 all the LSNs and TLIs, and a sixth loop, which does
 UnlockReleaseBuffer() on each valid buffer in the list.  All of this
 seems like it could be simplified.  In particular, the third and
 fourth loops can certainly be merged - you should set the dirty bit at
 the same time you're adding items to the page.  And the fifth and
 sixth loops can also be merged.  You certainly don't need to set all
 the LSNs and TLIs before releasing any of the buffer locks  pins.
 I'm not sure if there's any more merging that can be done than that,
 but you might want to have a look.

 I'm also wondering how long this linked list can be.  It's not good to
 have large numbers of buffers locked for a long period of time.  At
 the very least, some comments are in order here.

 Another general comment about this function is that it seems like it
 is backwards.  The overall flow of the function is:

 if (is_split)
 {
/* complicated stuff */
 }
 else
 {
/* simple case */
 }

 It seems like it might be better to flip that around and do this:

 if (!is_split)
 {
/* simple case */
return result;
 }
 /* complicated stuff */

 It's easier to read and avoids having the indentation get too deep.

 - As I look at this more, I see that a lot of the logic in
 gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
 would be nice to move the common bits to common subroutines that both
 functions can call, instead of duplicating the code.

 - On a related note, gistBufferingBuildPlaceToPage needs to do
 START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
 sequence, as gistplacetopage() does.

While, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage().
Now I'm trying some more refactoring.


 - gistFindCorrectParent() seems to rely heavily on the assumption that
 there's no concurrent activity going on in this index.  Otherwise,
 it's got to be unsafe to release the buffer lock before using the
 answer the function computes.  Some kind of comment seems like it
 would be a good idea.

Corresponding comment was added.


 - On a more algorithmic note, I don't really understand why we attach
 buffers to all pages on a level or none of them.  If it isn't
 necessary to have buffers on every internal page in the tree, why do
 we have

Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Heikki Linnakangas


On 11.08.2011 23:30, Alexander Korotkov wrote:

On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


On 10.08.2011 22:44, Alexander Korotkov wrote:


Manual and readme updates.



Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users?
They've been useful during testing, but I'm thinking we should be able to
guess good enough values for them automatically, and just remove the
options. It's pretty much impossible for a user to tune them correctly, it
would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should
still process the first 1 or so tuples with simple inserts. That way we
always have a sample of tuples to calculate the average tuple size from.
It's plausible that if the input data is ordered, looking at the first N
tuples will give skewed sample, but I don't think there's much danger of
that in practice. Even if the data is ordered, the length of GiST tuples
shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong?
How sensitive is the algorithm to that? Or will we run out of memory? Would
it be feasible to adjust those in the middle of the index build, if we e.g
exceed the estimated memory usage greatly?



I see the following risks.

For levelstep:
Too small: not so great IO benefit as can be
Too large:
   1) If subtree doesn't fit effective_cache, much more IO then should be
(because of cache misses during buffer emptying)
   2) If last pages of buffers don't fit to maintenance_work_mem, possible
OOM


Hmm, we could avoid running out of memory if we used a LRU cache 
replacement policy on the buffer pages, instead of explicitly unloading 
the buffers. 1) would still apply, though.



For buffersize:
Too small: less IO benefit, becuse buffer size is relatively small in
comparison with sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) then can be
with same IO benefit.

Thereby I propose following.
1) Too large levelstep is greatest risk. Let's use pessimistic estimate for
it. Pessimistic estimate has following logic:
largest sub-tree =  maximal tuples per page =  minimal tuple size
Thereby always using minimal tuple size in levelstep calculation we exclude
greatest risks.
2) Risks of buffersize are comparable and not too critical. Thats why I
propose to use size of first 1 tuples for estimate.


Yep, sounds reasonable.

I think it would also be fairly simple to decrease levelstep and/or 
adjust buffersize on-the-fly. The trick would be in figuring out the 
heuristics on when to do that.


Another thing occurred to me while looking at the buffer emptying 
process: At the moment, we stop emptying after we've flushed 1/2 buffer 
size worth of tuples. The point of that is to avoid overfilling a 
lower-level buffer, in the case that the tuples we emptied all landed on 
the same lower-level buffer. Wouldn't it be fairly simple to detect that 
case explicitly, and stop the emptying process only if one of the 
lower-level buffers really fills up? That should be more efficient, as 
you would have swap between different subtrees less often.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Alexander Korotkov

On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 I think it would also be fairly simple to decrease levelstep and/or adjust
 buffersize on-the-fly. The trick would be in figuring out the heuristics on
 when to do that.

I would be simple to decrease levelstep to the it's divider. It seems quite
hard to dicrease it, for example, from 3 to 2. Also, it's pretty hard to
detect that sub-tree actually doen't fit to the cache. I don't see much
difficulties in buffersize runtime tuning.


 Another thing occurred to me while looking at the buffer emptying process:
 At the moment, we stop emptying after we've flushed 1/2 buffer size worth of
 tuples. The point of that is to avoid overfilling a lower-level buffer, in
 the case that the tuples we emptied all landed on the same lower-level
 buffer. Wouldn't it be fairly simple to detect that case explicitly, and
 stop the emptying process only if one of the lower-level buffers really
 fills up? That should be more efficient, as you would have swap between
 different subtrees less often.

 Yes, it seems reasonable to me.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Robert Haas

On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 [ new patch ]

Some random comments:

- It appears that the noFollowFight flag is really supposed to be
called noFollowRight.

- In gist_private.h you've written halt-filled where you really mean
half-filled.

- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time.  IIUC, BUFFERING,
LEVELSTEP and BUFFERSIZE have no permanent meaning for that index;
they're just used ephemerally while constructing it.  If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.

- Function names should begin with gist or some other, appropriate
prefix, especially if they are non-static.  decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.

- validateBufferOption needs to use ereport(), not elog().

- This needs a bit of attention:

+   /* TODO: Write the WAL record */
+   if (RelationNeedsWAL(state-r))
+   recptr = gistXLogSplit(state-r-rd_node,
blkno, is_leaf,
+   dist,
oldrlink, oldnsn, InvalidBuffer, true);
+   else
+   recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert().  Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+   for (ptr = dist; ptr; ptr = ptr-next)

The first loop allocates a bunch of buffers.  The second loop sets up
downlink pointers.   Then there's some other code.  Then there's a
third loop, which adds items to each page in turn and sets up right
links.  Then there's a fourth loop, which marks all those buffers
dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list.  All of this
seems like it could be simplified.  In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page.  And the fifth and
sixth loops can also be merged.  You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks  pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be.  It's not good to
have large numbers of buffers locked for a long period of time.  At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards.  The overall flow of the function is:

if (is_split)
{
/* complicated stuff */
}
else
{
/* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
/* simple case */
return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.

- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index.  Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes.  Some kind of comment seems like it
would be a good idea.

- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them.  If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Heikki Linnakangas


Split of an internal node works like this:

1. Gather all the existing tuples on the page, plus the new tuple being 
inserted.

2. Call picksplit on the tuples, to divide them into pages
3. Go through all tuples on the buffer associated with the page, and 
divide them into buffers on the new pages. This is done by calling 
penalty function on each buffered tuple.


I wonder if it would be better for index quality to pass the buffered 
tuples to picksplit in the 2nd step, so that they too can affect the 
split decision. Maybe it doesn't make much difference in practice..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Alexander Korotkov

On Thu, Aug 11, 2011 at 10:21 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Split of an internal node works like this:

 1. Gather all the existing tuples on the page, plus the new tuple being
 inserted.
 2. Call picksplit on the tuples, to divide them into pages
 3. Go through all tuples on the buffer associated with the page, and divide
 them into buffers on the new pages. This is done by calling penalty function
 on each buffered tuple.

 I wonder if it would be better for index quality to pass the buffered
 tuples to picksplit in the 2nd step, so that they too can affect the split
 decision. Maybe it doesn't make much difference in practice..


I had this idea. But:
1) Buffer contain much more tuples than page plus new tuple.
2) Picksplit method can easily be quadratic for example.

Let's see the complexity of picksplit algorithms:
1) geometric datatypes (point, box etc) - O(n) (BTW, I have serious doubts
about it, i.e. O(n*log(n)) algorithm can be in times better in many cases)
2) pg_trgm and fts - O(n^2)
3) seg - O(n*log(n))
4) cube - O(n^2)

Thus, I believe such feature should be an optional. We can try it as add-on
patch.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Alexander Korotkov

On Wed, Aug 10, 2011 at 11:45 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 unloadNodeBuffers() is now dead code.

processEmptyingStack calls it.

LEAF_PAGES_STATS_* are unused now.

Removed.


 Should avoid calling smgrnblocks() on every tuple, the overhead of that
 could add up.

Now calling at each BUFFERING_MODE_SWITCH_CHECK_STEP(256) tuples.


 I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage(**)
 with the gistplacetopage() function used in the main codepath? There's very
 little difference between them, and it would be nice to maintain just one
 function. At the very least I think there should be a comment in both along
 the lines of NOTE: if you change this function, make sure you update 
 (the other function) as well!

I doubt they can be effectively merged, but will try.


 In gistbuild(), in the final emptying stage, there's this special-case
 handling for the root block before looping through the buffers in the
 buffersOnLevels lists:

 for (;;)
{
nodeBuffer = getNodeBuffer(gfbb,
 buildstate.giststate, GIST_ROOT_BLKNO,

 InvalidOffsetNumber, NULL, false);
if (!nodeBuffer || nodeBuffer-blocksCount = 0)
break;
MemoryContextSwitchTo(gfbb-**context);
gfbb-bufferEmptyingStack = lcons(nodeBuffer,
 gfbb-bufferEmptyingStack);
MemoryContextSwitchTo(**buildstate.tmpCtx);
processEmptyingStack(**buildstate.giststate,
 insertstate);
}


 What's the purpose of that? Wouldn't the loop through buffersOnLevels lists
 take care of the root node too?

I was worried about node buffer deletion from list while scanning that
list gistbuild(). That's why I avoided deletion from lists.
Now I've added additional check for root node in loop over list.


 The calculations in initBuffering() desperately need comments. As does the
 rest of the code too, but the heuristics in that function are particularly
 hard to understand without some explanation.

Some comments were added. I'm working on more of them.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.13.0.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Heikki Linnakangas


On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.


Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users? 
They've been useful during testing, but I'm thinking we should be able 
to guess good enough values for them automatically, and just remove the 
options. It's pretty much impossible for a user to tune them correctly, 
it would require deep knowledge of the buffering algorithm.


I'm thinking that even when you explicitly turn buffering on, we should 
still process the first 1 or so tuples with simple inserts. That way 
we always have a sample of tuples to calculate the average tuple size 
from. It's plausible that if the input data is ordered, looking at the 
first N tuples will give skewed sample, but I don't think there's much 
danger of that in practice. Even if the data is ordered, the length of 
GiST tuples shouldn't vary much.


What happens if we get the levelstep and pagesPerBuffer estimates wrong? 
How sensitive is the algorithm to that? Or will we run out of memory? 
Would it be feasible to adjust those in the middle of the index build, 
if we e.g exceed the estimated memory usage greatly?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Heikki Linnakangas


On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.


I went through these, and did some editing and rewording. Attached is an 
updated README, and an updated patch of the doc changes. Let me know if 
I screwed up something.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 78171cf..244a2e2 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -642,6 +642,35 @@ my_distance(PG_FUNCTION_ARGS)
 
   /variablelist
 
+ sect2 id=gist-buffering-build
+  titleGiST buffering build/title
+  para
+   Building large GiST indexes that don't fit in cache by simply inserting
+   all the tuples tends to be slow, because if the index tuples are scattered
+   across the index, a large fraction of the insertions need to perform
+   I/O. The exception is well-ordered datasets, where the part of the index
+   where new insertions go to stays well cached.
+   PostgreSQL from version 9.2 supports a more efficient method to build
+   GiST indexes based on buffering, which can dramatically reduce number of
+   random I/O needed.
+  /para
+
+  para
+   However, buffering index build needs to call the functionpenalty/
+   function more often, which consumes some extra CPU resources. Also, it can
+   infuence the quality of the produced index, in both positive and negative
+   directions. That influence depends on various factors, like the
+   distribution of the input data and operator class implementation.
+  /para
+
+  para
+   By default, the index build switches to the buffering method when the
+   index size reaches xref linkend=guc-effective-cache-size. It can
+   be manually turned on or off by the literalBUFFERING/literal parameter
+   to the CREATE INDEX clause.
+  /para
+
+ /sect2
 /sect1
 
 sect1 id=gist-examples
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 1a1e8d6..1b2969e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -341,6 +341,52 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ replaceable class=parametername/
/varlistentry
 
/variablelist
+   para
+GiST indexes accept the following parameter:
+   /para
+
+   variablelist
+
+   varlistentry
+termliteralBUFFERING//term
+listitem
+para
+ Determines whether the buffering build technique described in
+ xref linkend=gist-buffering-build is used to build the index. With
+ literalOFF/ it is disabled, with literalON/ it is enabled, and
+ with literalAUTO/ (default) it is initially disabled, but turned on
+ on-the-fly once the index size reaches xref linkend=guc-effective-cache-size.
+/para
+/listitem
+   /varlistentry
+
+   varlistentry
+termliteralLEVELSTEP//term
+listitem
+para
+ In buffering build buffers located at tree levels i * literalLEVELSTEP/, 
+ i  0 (we use upward level numbering, level = 0 corresponds to leaf pages).
+ By default literalLEVELSTEP/ is calculated so that sub-tree
+ of literalLEVELSTEP/ height fits xref linkend=guc-effective-cache-size
+ and xref linkend=guc-maintenance-work-mem.
+/para
+/listitem
+   /varlistentry
+
+   varlistentry
+termliteralBUFFERSIZE//term
+listitem
+para
+ Maximum size of node buffer in pages. By default it is calculated so that
+ half emptying of node buffer fill in average one page per underlying node
+ buffer. This ratio guarantees effective IO usage. In some cases lower
+ literalBUFFERSIZE/ can give comparable IO economy with less CPU
+ overhead.
+/para
+/listitem
+   /varlistentry
+
+   /variablelist
   /refsect2
 
   refsect2 id=SQL-CREATEINDEX-CONCURRENTLY
src/backend/access/gist/README

GiST Indexing
=

This directory contains an implementation of GiST indexing for Postgres.

GiST stands for Generalized Search Tree. It was introduced in the seminal paper
Generalized Search Trees for Database Systems, 1995, Joseph M. Hellerstein,
Jeffrey F. Naughton, Avi Pfeffer:

http://www.sai.msu.su/~megera/postgres/gist/papers/gist.ps

and implemented by J. Hellerstein and P. Aoki in an early version of
PostgreSQL (more details are available from The GiST Indexing Project
at Berkeley at http://gist.cs.berkeley.edu/). As a university
project it had a limited number of features and was in rare use.

The current implementation of GiST supports:

  * Variable length keys
  * Composite keys (multi-key)
  * Ordered search (nearest-neighbor search)
  * provides NULL-safe interface to GiST core
  * Concurrency
  * Recovery support via WAL logging
  * Buffering build algorithm

The support for concurrency implemented in PostgreSQL was developed based on
the paper Access Methods for Next-Generation Database Systems by
Marcel Kornaker:


http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz

Buffering build algorithm for GiST was

Re: [HACKERS] WIP: Fast GiST index build

2011-08-11 Thread Alexander Korotkov

On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 10.08.2011 22:44, Alexander Korotkov wrote:

 Manual and readme updates.


 Thanks, I'm reviewing these now.

 Do we want to expose the level-step and buffersize parameters to users?
 They've been useful during testing, but I'm thinking we should be able to
 guess good enough values for them automatically, and just remove the
 options. It's pretty much impossible for a user to tune them correctly, it
 would require deep knowledge of the buffering algorithm.

 I'm thinking that even when you explicitly turn buffering on, we should
 still process the first 1 or so tuples with simple inserts. That way we
 always have a sample of tuples to calculate the average tuple size from.
 It's plausible that if the input data is ordered, looking at the first N
 tuples will give skewed sample, but I don't think there's much danger of
 that in practice. Even if the data is ordered, the length of GiST tuples
 shouldn't vary much.

 What happens if we get the levelstep and pagesPerBuffer estimates wrong?
 How sensitive is the algorithm to that? Or will we run out of memory? Would
 it be feasible to adjust those in the middle of the index build, if we e.g
 exceed the estimated memory usage greatly?


I see the following risks.

For levelstep:
Too small: not so great IO benefit as can be
Too large:
  1) If subtree doesn't fit effective_cache, much more IO then should be
(because of cache misses during buffer emptying)
  2) If last pages of buffers don't fit to maintenance_work_mem, possible
OOM

For buffersize:
Too small: less IO benefit, becuse buffer size is relatively small in
comparison with sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) then can be
with same IO benefit.

Thereby I propose following.
1) Too large levelstep is greatest risk. Let's use pessimistic estimate for
it. Pessimistic estimate has following logic:
largest sub-tree = maximal tuples per page = minimal tuple size
Thereby always using minimal tuple size in levelstep calculation we exclude
greatest risks.
2) Risks of buffersize are comparable and not too critical. Thats why I
propose to use size of first 1 tuples for estimate.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: Fast GiST index build

2011-08-10 Thread Alexander Korotkov

Hi!

Here is last verion of the patch.
List of changes:
1) Neighbor relocation and prefetch were removed. They will be supplied as
separate patches.
2) Final emptying now using standart lists of all buffers by levels.
3) Automatic switching again use simple comparison of index size and
effective_cache_size.
4) Some renames. In particular GISTLoadedPartItem
to GISTBufferingInsertStack.
5) Some comments were corrected and some were added.
6) pgindent
7) rebased with head

Readme update and user documentation coming soon.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.11.0.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-10 Thread Alexander Korotkov

Manual and readme updates.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.12.0.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-10 Thread Heikki Linnakangas


On 10.08.2011 13:19, Alexander Korotkov wrote:

Hi!

Here is last verion of the patch.
List of changes:
1) Neighbor relocation and prefetch were removed. They will be supplied as
separate patches.


unloadNodeBuffers() is now dead code.


2) Final emptying now using standart lists of all buffers by levels.
3) Automatic switching again use simple comparison of index size and
effective_cache_size.


LEAF_PAGES_STATS_* are unused now. Should avoid calling smgrnblocks() on 
every tuple, the overhead of that could add up.



4) Some renames. In particular GISTLoadedPartItem
to GISTBufferingInsertStack.
5) Some comments were corrected and some were added.
6) pgindent
7) rebased with head

Readme update and user documentation coming soon.


I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage() 
with the gistplacetopage() function used in the main codepath? There's 
very little difference between them, and it would be nice to maintain 
just one function. At the very least I think there should be a comment 
in both along the lines of NOTE: if you change this function, make sure 
you update  (the other function) as well!


In gistbuild(), in the final emptying stage, there's this special-case 
handling for the root block before looping through the buffers in the 
buffersOnLevels lists:



for (;;)
{
nodeBuffer = getNodeBuffer(gfbb, buildstate.giststate, 
GIST_ROOT_BLKNO,
   
InvalidOffsetNumber, NULL, false);
if (!nodeBuffer || nodeBuffer-blocksCount = 0)
break;
MemoryContextSwitchTo(gfbb-context);
gfbb-bufferEmptyingStack = lcons(nodeBuffer, 
gfbb-bufferEmptyingStack);
MemoryContextSwitchTo(buildstate.tmpCtx);
processEmptyingStack(buildstate.giststate, 
insertstate);
}


What's the purpose of that? Wouldn't the loop through buffersOnLevels 
lists take care of the root node too?


The calculations in initBuffering() desperately need comments. As does 
the rest of the code too, but the heuristics in that function are 
particularly hard to understand without some explanation.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Heikki Linnakangas


On 07.08.2011 22:28, Alexander Korotkov wrote:

There is last version of patch. There is the list of most significant
changes in comparison with your version of patch:
1) Reference counting of path items was added. It should helps against
possible accumulation of path items.


Ok.


2) Neighbor relocation was added.


Ok. I think we're going to need some sort of a heuristic on when to 
enable neighbor relocation. If I remember the performance tests 
correctly, it improves the quality of the resulting index, but incurs a 
significant CPU overhead.


Actually, looking at the performance numbers on the wiki page again 
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it 
looks like neighbor relocation doesn't help very much with the index 
quality - sometimes it even results in a slightly worse index. Based on 
those results, shouldn't we just remove it? Or is there some other data 
set where it helps significantly?



3) Subtree prefetching was added.


I'm inclined to leave out the prefetching code for now. Unless you have 
some performance numbers that show that it's critical for the overall 
performance. But I don't think that was the case, it's just an 
additional optimization for servers with big RAID arrays. So, please 
separate that into an add-on patch. It needs to be performance tests and 
reviewed separately.



4) Final emptying algorithm was reverted to the original one. My
experiments shows that typical number of passes in final emptying in
your version of patch is about 5. It may be significant itself. Also I
haven't estimate of number of passes and haven't guarantees that it will
not be high in some corner cases. I.e. I prefer more predictable
single-pass algorithm in spite of it's a little more complex.


I was trying to get rid of that complexity during index build. Some 
extra code in the final pass would be easier to understand than extra 
work that needs to be done through the index build. It's not a huge 
amount of code, but still.


I'm not worried about the extra CPU overhead of scanning the data 
structures at the final pass. I guess in my patch you had to do extra 
I/O as well, because the buffers were not emptied in strict top-down 
order, so let's avoid that. How about:


Track all buffers in the lists, not only those that are non-empty. Add 
the buffer to the right list at getNodeBuffer(). That way in the final 
stage, you need to scan through all buffers instead of just the 
non-empty ones. But the overhead of that to should be minimal in 
practice, scanning some in-memory data structures is pretty cheap 
compared to building an index. That way you wouldn't need to maintain 
the lists during the index build, except for adding each buffer to 
correct lists in getNodeBuffer().


BTW, please use List for the linked lists. No need to re-implement the 
wheel.



5) Unloading even last page of node buffer from main memory to the disk.
Imagine that that with levelstep = 1 each inner node has buffer. It
seems to me that keeping one page of each buffer in memory may be memory
consuming.

Open items I see at this moment:
1) I dislike my switching to buffering build method because it's based
on very unproven assumptions. And I didn't more reliable assumptions in
scientific papers while. I would like to replace it with something much
simplier. For example, switching to buffering build when regular build
actually starts to produce a lot of IO. For this approach implementation
I need to somehow detect actual IO (not just buffer read but miss of OS
cache).


Yeah, that's a surprisingly hard problem. I don't much like the method 
used in the patch either.



2) I'm worrying about possible size of nodeBuffersTab and path items. If
we imagine extremely large tree with levelstep = 1 size of this
datastructures will be singnificant. And it's hard to predict that size
without knowing of tree size.


I'm not very worried about that in practice. If you have a very large 
index, you presumably have a fair amount of memory too. Otherwise the 
machine is horrendously underpowered to build or do anything useful with 
the index anyway. Nevertheless it would nice to have some figures on 
that. If you have, say, an index of 1 TB in size, how much memory will 
building the index need?


Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious 
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like 
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the 
normal insertion code. The loaded part nomenclature is obsolete, as 
the patch doesn't explicitly load parts of the tree into memory anymore. 
Think about the names of other structs, variables and functions too, 
GISTLoadedPartItem just caught my eye first but there's probably others 
that could have better names.

* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler

Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Alexander Korotkov

On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to enable
neighbor relocation. If I remember the performance tests correctly, it
improves the quality of the resulting index, but incurs a significant CPU
overhead.

Actually, looking at the performance numbers on the wiki page again (
http://wiki.postgresql.org/**wiki/Fast_GiST_index_build_**GSoC_2011http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011),
it looks like neighbor relocation doesn't help very much with the index
quality - sometimes it even results in a slightly worse index. Based on
those results, shouldn't we just remove it? Or is there some other data set
where it helps significantly?

Oh, actually I didn't add some results with neighborrelocation = off. I
would like to rerun some tests with current version of patch.

3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have
some performance numbers that show that it's critical for the overall
performance. But I don't think that was the case, it's just an additional
optimization for servers with big RAID arrays. So, please separate that into
an add-on patch. It needs to be performance tests and reviewed separately.

I though that prefetch helps even on separate hard disks by ordering of IOs.

4) Final emptying algorithm was reverted to the original one. My
experiments shows that typical number of passes in final emptying in
your version of patch is about 5. It may be significant itself. Also I
haven't estimate of number of passes and haven't guarantees that it will
not be high in some corner cases. I.e. I prefer more predictable
single-pass algorithm in spite of it's a little more complex.

I was trying to get rid of that complexity during index build. Some extra
code in the final pass would be easier to understand than extra work that
needs to be done through the index build. It's not a huge amount of code,
but still.

I'm not worried about the extra CPU overhead of scanning the data
structures at the final pass. I guess in my patch you had to do extra I/O as
well, because the buffers were not emptied in strict top-down order, so
let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add the
buffer to the right list at getNodeBuffer(). That way in the final stage,
you need to scan through all buffers instead of just the non-empty ones. But
the overhead of that to should be minimal in practice, scanning some
in-memory data structures is pretty cheap compared to building an index.
That way you wouldn't need to maintain the lists during the index build,
except for adding each buffer to correct lists in getNodeBuffer().

BTW, please use List for the linked lists. No need to re-implement the
wheel.

Ok.

5) Unloading even last page of node buffer from main memory to the disk.
Imagine that that with levelstep = 1 each inner node has buffer. It
seems to me that keeping one page of each buffer in memory may be memory
consuming.

Open items I see at this moment:
1) I dislike my switching to buffering build method because it's based
on very unproven assumptions. And I didn't more reliable assumptions in
scientific papers while. I would like to replace it with something much
simplier. For example, switching to buffering build when regular build
actually starts to produce a lot of IO. For this approach implementation
I need to somehow detect actual IO (not just buffer read but miss of OS
cache).

Yeah, that's a surprisingly hard problem. I don't much like the method used
in the patch either.

Is it possible to make buffering build a user defined option until we have a
better idea?

2) I'm worrying about possible size of nodeBuffersTab and path items. If
we imagine extremely large tree with levelstep = 1 size of this
datastructures will be singnificant. And it's hard to predict that size
without knowing of tree size.

I'm not very worried about that in practice. If you have a very large
index, you presumably have a fair amount of memory too. Otherwise the
machine is horrendously underpowered to build or do anything useful with the
index anyway. Nevertheless it would nice to have some figures on that. If
you have, say, an index of 1 TB in size, how much memory will building the
index need?

I think with points it would be about 1 million of buffers and about 100-300
megabytes of RAM depending on space utilization. It may be ok, because 1 TB
is really huge index. But if maintenance_work_mem is low we can run out of
it. Though maintenance_work_mem is quite strange for system with 1 TB
indexes.

Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious whitespace
in the patch.
* How about renaming GISTLoadedPartItem to something like

Re: [HACKERS] WIP: Fast GiST index build

2011-08-07 Thread Alexander Korotkov

Hi!

There is last version of patch. There is the list of most significant
changes in comparison with your version of patch:
1) Reference counting of path items was added. It should helps against
possible accumulation of path items.
2) Neighbor relocation was added.
3) Subtree prefetching was added.
4) Final emptying algorithm was reverted to the original one. My experiments
shows that typical number of passes in final emptying in your version of
patch is about 5. It may be significant itself. Also I haven't estimate of
number of passes and haven't guarantees that it will not be high in some
corner cases. I.e. I prefer more predictable single-pass algorithm in spite
of it's a little more complex.
5) Unloading even last page of node buffer from main memory to the disk.
Imagine that that with levelstep = 1 each inner node has buffer. It seems to
me that keeping one page of each buffer in memory may be memory consuming.

Open items I see at this moment:
1) I dislike my switching to buffering build method because it's based on
very unproven assumptions. And I didn't more reliable assumptions in
scientific papers while. I would like to replace it with something much
simplier. For example, switching to buffering build when regular build
actually starts to produce a lot of IO. For this approach implementation I
need to somehow detect actual IO (not just buffer read but miss of OS
cache).
2) I'm worrying about possible size of nodeBuffersTab and path items. If we
imagine extremely large tree with levelstep = 1 size of this datastructures
will be singnificant. And it's hard to predict that size without knowing of
tree size.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.10.0.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-04 Thread Alexander Korotkov

Uhh, my bad, really stupid bug. Many thanks.

--
With best regards,
Alexander Korotkov.

On Wed, Aug 3, 2011 at 8:31 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 03.08.2011 11:18, Alexander Korotkov wrote:

 I found that in previous version of patch I missed PageSetLSN
 and PageSetTLI, but huge amount of WAL is still here. Also I found that
 huge
 amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
 messages FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
 flushed only to 44/9C518750 appears. Seems that there is some totally
 wrong
 use of WAL if even optimization level does matter...


 Try this:

 diff --git a/src/backend/access/gist/**gistbuild.c
 b/src/backend/access/gist/**gistbuild.c
 index 5099330..5a441e0 100644
 --- a/src/backend/access/gist/**gistbuild.c
 +++ b/src/backend/access/gist/**gistbuild.c
 @@ -478,7 +478,7 @@ bufferingbuildinsert(**GISTInsertState *state,
/* Write the WAL record */
if (RelationNeedsWAL(state-r))
{
 -   gistXLogUpdate(state-r-rd_**node, buffer,
 oldoffnum, noldoffnum,
 +   recptr = gistXLogUpdate(state-r-rd_**node,
 buffer, oldoffnum, noldoffnum,

itup, ntup, InvalidBuffer);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);



 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: [HACKERS] WIP: Fast GiST index build

2011-08-03 Thread Alexander Korotkov

I found that in previous version of patch I missed PageSetLSN
and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
messages FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750 appears. Seems that there is some totally wrong
use of WAL if even optimization level does matter...

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.9.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-03 Thread Robert Haas

On Wed, Aug 3, 2011 at 4:18 AM, Alexander Korotkov aekorot...@gmail.com wrote:
 I found that in previous version of patch I missed PageSetLSN
 and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
 amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
 messages FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
 flushed only to 44/9C518750 appears. Seems that there is some totally wrong
 use of WAL if even optimization level does matter...

Try setting wal_debug=true to see what records are getting emitted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-03 Thread Heikki Linnakangas


On 03.08.2011 11:18, Alexander Korotkov wrote:

I found that in previous version of patch I missed PageSetLSN
and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
messages FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750 appears. Seems that there is some totally wrong
use of WAL if even optimization level does matter...


Try this:

diff --git a/src/backend/access/gist/gistbuild.c 
b/src/backend/access/gist/gistbuild.c

index 5099330..5a441e0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
/* Write the WAL record */
if (RelationNeedsWAL(state-r))
{
-   gistXLogUpdate(state-r-rd_node, buffer, oldoffnum, 
noldoffnum,
+			recptr = gistXLogUpdate(state-r-rd_node, buffer, oldoffnum, 
noldoffnum,


itup, ntup, InvalidBuffer);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 01.08.2011 13:44, Heikki Linnakangas wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

Did you want me to write the patch for 9.0?


I'm looking at it now.


So, in 9.0, we currently leave the rightlink and NSN invalid when 
replaying a page split. To set them correctly, we'd need the old 
rightlink and NSN from the page being split, but the WAL record doesn't 
currently include them. I can see two solutions to this:


1. Add them to the page split WAL record. That's straightforward, but 
breaks WAL format compatibility with older minor versions.


2. Read the old page version, and copy the rightlink and NSN from there. 
Since we're restoring what's basically a full-page image of the page 
after the split, in crash recovery the old contents might be a torn 
page, or a newer version of the page. I believe that's harmless, because 
we only care about the NSN and rightlink in hot standby mode, but it's a 
bit ugly.


If we change the WAL record, we have to make it so that the new version 
can still read the old format, which complicates the implementation a 
bit. Neverthelss, I'm leaning towards option 1.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

2011-08-02 Thread Simon Riggs

On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.08.2011 13:44, Heikki Linnakangas wrote:

 On 01.08.2011 13:13, Simon Riggs wrote:

 Did you want me to write the patch for 9.0?

 I'm looking at it now.

 So, in 9.0, we currently leave the rightlink and NSN invalid when replaying
 a page split. To set them correctly, we'd need the old rightlink and NSN
 from the page being split, but the WAL record doesn't currently include
 them. I can see two solutions to this:

 1. Add them to the page split WAL record. That's straightforward, but breaks
 WAL format compatibility with older minor versions.

 2. Read the old page version, and copy the rightlink and NSN from there.
 Since we're restoring what's basically a full-page image of the page after
 the split, in crash recovery the old contents might be a torn page, or a
 newer version of the page. I believe that's harmless, because we only care
 about the NSN and rightlink in hot standby mode, but it's a bit ugly.

 If we change the WAL record, we have to make it so that the new version can
 still read the old format, which complicates the implementation a bit.
 Neverthelss, I'm leaning towards option 1.

We may as well do (1), with two versions of the WAL record.

Hmm, the biggest issue is actually that existing GIST indexes are
corrupted, from the perspective of being unusable during HS.

We can fix the cause but that won't repair the existing damage. So the
requirement is for us to re/create new indexes, which can then use a
new WAL record format. We probably need to store some information in
the metapage saying whether or not the index has been maintained only
with v2 WAL records, or with a mixture of v1 and v2 records. If the
latter, then issue a WARNING to rebuild the index.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 02.08.2011 14:36, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

If we change the WAL record, we have to make it so that the new version can
still read the old format, which complicates the implementation a bit.
Neverthelss, I'm leaning towards option 1.


We may as well do (1), with two versions of the WAL record.


Actually I think we can append the new information to the end of the 
page split record, so that an old version server can read WAL generated 
by new version, too. It just won't set the right link and NSN correctly, 
so hot standby will be broken like it is today.



Hmm, the biggest issue is actually that existing GIST indexes are
corrupted, from the perspective of being unusable during HS.

We can fix the cause but that won't repair the existing damage. So the
requirement is for us to re/create new indexes, which can then use a
new WAL record format.


No-no, it's not that bad. The right-links and NSNs are only needed to 
handle scans concurrent with page splits. The existing indexes are fine, 
you only have a problem if you run queries in hot standby mode, while 
you replay page splits on it. As soon as you upgrade the master and 
standby to new minor version with the fix, that will work too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

2011-08-02 Thread Simon Riggs

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 02.08.2011 14:36, Simon Riggs wrote:

 On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 If we change the WAL record, we have to make it so that the new version
 can
 still read the old format, which complicates the implementation a bit.
 Neverthelss, I'm leaning towards option 1.

 We may as well do (1), with two versions of the WAL record.

 Actually I think we can append the new information to the end of the page
 split record, so that an old version server can read WAL generated by new
 version, too.

Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

2011-08-02 Thread Alexander Korotkov

Hi!

I'm now working on adding features to your version of patch. Current version
is attached. Somehow this version produce huge amount of WAL and that makes
it slow. Though count and avg. length of WAL records is similar to that of
non-buffering build.

test=# create table points as (select point(random(),random()) from
generate_series(1,100));
SELECT 100
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
   pg_xlogfile_name_offset
-
 (000100400073,15005048)
(1 row)

test=# create index points_idx on points using gist (point) with
(buffering=off);CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
   pg_xlogfile_name_offset
-
 (00010040007E,13764024)
 (1 row)

test=# create index points_idx2 on points using gist (point) with
(buffering=on, neighborrelocation=off);
INFO:  Level step = 1, pagesPerBuffer = 406
NOTICE:  final emptying
NOTICE:  final emptying
NOTICE:  final emptying
NOTICE:  final emptying
CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
   pg_xlogfile_name_offset
-
 (0001004000D2,10982288)
(1 row)

May be you have any ideas about it?

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.9.0.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 02.08.2011 15:18, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 02.08.2011 14:36, Simon Riggs wrote:
Actually I think we can append the new information to the end of the page
split record, so that an old version server can read WAL generated by new
version, too.


Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.


Here's a patch to do what I meant. The new fields are stored at the very 
end of the WAL record, and you check the length to see if they're there 
or not. The nice thing about this is that it's compatible in both 
directions.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..60fc173 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -151,7 +151,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)-rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -222,16 +221,28 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	bool		isrootsplit = false;
+	Buffer	   *buffers;
 
+	/*
+	 * If this split inserted a downlink for a child at lower level, we can
+	 * now set the NSN and clear the follow-right flag on that child. It's
+	 * OK to do this before locking the parent page. If a concurrent scan
+	 * reads this parent page after we've already cleared the follow-right
+	 * flag on the child, it'll still follow the rightlink because of the
+	 * NSN.
+	 */
 	if (BlockNumberIsValid(xldata-leftchild))
 		gistRedoClearFollowRight(xldata-node, lsn, xldata-leftchild);
 	decodePageSplitRecord(xlrec, record);
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data-npage * sizeof(Buffer));
 	for (i = 0; i  xlrec.data-npage; i++)
 	{
 		NewPage*newpage = xlrec.page + i;
-		int			flags;
 
 		if (newpage-header-blkno == GIST_ROOT_BLKNO)
 		{
@@ -239,8 +250,19 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			isrootsplit = true;
 		}
 
-		buffer = XLogReadBuffer(xlrec.data-node, newpage-header-blkno, true);
-		Assert(BufferIsValid(buffer));
+		buffers[i] = XLogReadBuffer(xlrec.data-node,
+	newpage-header-blkno,
+	true);
+		Assert(BufferIsValid(buffers[i]));
+	}
+
+	/* Write out all the pages */
+	for (i = 0; i  xlrec.data-npage; i++)
+	{
+		NewPage*newpage = xlrec.page + i;
+		int			flags;
+
+		buffer = buffers[i];
 		page = (Page) BufferGetPage(buffer);
 
 		/* ok, clear buffer */
@@ -277,6 +299,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 		MarkBufferDirty(buffer);
 		UnlockReleaseBuffer(buffer);
 	}
+
+	pfree(buffers);
 }
 
 static void

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 02.08.2011 20:06, Alvaro Herrera wrote:

Excerpts from Heikki Linnakangas's message of mar ago 02 11:59:24 -0400 2011:

On 02.08.2011 15:18, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com   wrote:

On 02.08.2011 14:36, Simon Riggs wrote:
Actually I think we can append the new information to the end of the page
split record, so that an old version server can read WAL generated by new
version, too.


Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.


Here's a patch to do what I meant. The new fields are stored at the very
end of the WAL record, and you check the length to see if they're there
or not. The nice thing about this is that it's compatible in both
directions.


Err, did you attach the wrong patch?


Yes, sorry about that. Here's the right patch.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 82ba726..71c145d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -377,9 +377,18 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			state-ituplen++;
 		}
 
-		/* saves old rightlink */
+		/* save old rightlink and NSN */
 		if (state-stack-blkno != GIST_ROOT_BLKNO)
+		{
 			rrlink = GistPageGetOpaque(dist-page)-rightlink;
+			oldnsn = GistPageGetOpaque(dist-page)-nsn;
+		}
+		else
+		{
+			/* if root split we should put initial value */
+			rrlink = InvalidBlockNumber;
+			oldnsn = PageGetLSN(dist-page);
+		}
 
 		START_CRIT_SECTION();
 
@@ -407,7 +416,8 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			XLogRecData *rdata;
 
 			rdata = formSplitRdata(state-r-rd_node, state-stack-blkno,
-   is_leaf, (state-key), dist);
+   is_leaf, (state-key), dist,
+   rrlink, oldnsn);
 
 			recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_SPLIT, rdata);
 
@@ -425,12 +435,6 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			}
 		}
 
-		/* set up NSN */
-		oldnsn = GistPageGetOpaque(dist-page)-nsn;
-		if (state-stack-blkno == GIST_ROOT_BLKNO)
-			/* if root split we should put initial value */
-			oldnsn = PageGetLSN(dist-page);
-
 		for (ptr = dist; ptr; ptr = ptr-next)
 		{
 			/* only for last set oldnsn */
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 7f5dd99..cdd8aaf 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -39,6 +39,8 @@ typedef struct
 {
 	gistxlogPageSplit *data;
 	NewPage*page;
+	BlockNumber origrlink;
+	XLogRecPtr	orignsn;
 } PageSplitRecord;
 
 /* track for incomplete inserts, idea was taken from nbtxlog.c */
@@ -250,7 +252,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record, bool isnewroot)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)-rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -310,6 +311,26 @@ decodePageSplitRecord(PageSplitRecord *decoded, XLogRecord *record)
 			j++;
 		}
 	}
+
+	/*
+	 * Starting with 9.0.5, the original NSN and rightlink on the split page
+	 * are stored here. It would've been more logical to add them to the
+	 * gistxlogPageSplit struct, but that would've broken compatibility with
+	 * the pre-9.0.5 WAL format.
+	 */
+	if (ptr - begin  record-xl_len)
+	{
+		memcpy(decoded-origrlink, ptr, sizeof(BlockNumber));
+		ptr += sizeof(BlockNumber);
+		memcpy(decoded-orignsn, ptr, sizeof(XLogRecPtr));
+	}
+	else
+	{
+		/* pre-9.0.5 format, no rightlink/NSN information */
+		decoded-origrlink = InvalidBlockNumber;
+		decoded-orignsn.xlogid = 0;
+		decoded-orignsn.xrecoff = 0;
+	}
 }
 
 static void
@@ -320,17 +341,32 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	int			flags;
+	Buffer	   *buffers;
 
 	decodePageSplitRecord(xlrec, record);
 	flags = xlrec.data-origleaf ? F_LEAF : 0;
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data-npage * sizeof(Buffer));
 	for (i = 0; i  xlrec.data-npage; i++)
 	{
 		NewPage*newpage = xlrec.page + i;
 
-		buffer = XLogReadBuffer(xlrec.data-node, newpage-header-blkno, true);
-		Assert(BufferIsValid(buffer));
+		buffers[i] = XLogReadBuffer(xlrec.data-node,
+	newpage-header-blkno,
+	true);
+		page = (Page) BufferGetPage(buffers[i]);
+	}
+
+	/* Write out all the pages */
+	for (i = 0; i  xlrec.data-npage; i++)
+	{
+		NewPage*newpage = xlrec.page + i;
+
+		buffer = buffers[i];
 		page = (Page) BufferGetPage(buffer);
 
 		/* ok, clear buffer */
@@ -339,6 +375,18 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord

Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 27.07.2011 17:43, Alexander Korotkov wrote:

OK, thanks. I also found behaviour of GiST(without patch) with streaming
replication that seems strange for me. On master there are only few
rightlinks are InvalidBlockNumber while on slave there are a lot of them. I
hack gevel for getting index structure on slave (accessing tree without
AccessExclusiveLock).

On master:
# create table test as (select point(random(),random()) from
generate_series(1,10));
# create index test_idx on test using gist(point);
# \copy (select gist_tree('test_idx')) to 'tree1r.txt';

On slave:
# \copy (select gist_tree('test_idx')) to 'tree2r.txt';

In bash:
# cat tree1r.txt | sed 's/\\n/\n/g'  tree1.txt
# cat tree2r.txt | sed 's/\\n/\n/g'  tree2.txt
# diff tree1.txt tree2.txt

2,89c2,89
  1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
  1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
(OK)
  2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
(OK)
  3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
(OK)
  4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
(OK)
  5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
(OK)
  6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
(OK)
  7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
(OK)
.
---

 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295

(InvalidBlockNumber)

 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)

rightlink:4294967295 (InvalidBlockNumber)

 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)

rightlink:4294967295 (InvalidBlockNumber)

 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)

rightlink:4294967295 (InvalidBlockNumber)

 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)

rightlink:4294967295 (InvalidBlockNumber)

 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)

rightlink:4294967295 (InvalidBlockNumber)

 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)

rightlink:4294967295 (InvalidBlockNumber)

 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)

rightlink:4294967295 (InvalidBlockNumber)
.

Isn't it a bug?


Yeah, it sure looks like a bug. I was certain that I had broken this in 
the recent changes to GiST handling of page splits, but in fact it has 
been like that forever.


The rightlinks are not needed after crash recovery, because all the 
downlinks should be there. A scan will find all pages through the 
downlinks, and doesn't need to follow any rightlinks. I'm not sure why 
we explicitly clear them, it's not like the rightlinks would do any harm 
either, but for crash recovery that's harmless.


But a scan during hot standby can see those intermediate states, just 
like concurrent scans can on the master. The locking on replay of page 
split needs to be fixed, too. At the moment, it locks and writes out 
each page separately, so a concurrent scan could overtake the WAL 
replay while following rightlinks, and miss tuples on the right half.


Attached is a patch for that for 9.1/master. The 9.0 GiST replay code 
was quite different, it will require a separate patch.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..60fc173 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -151,7 +151,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)-rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -222,16 +221,28 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	bool		isrootsplit = false;
+	Buffer	   *buffers;
 
+	/*
+	 * If this split inserted a downlink for a child at lower level, we can
+	 * now set the NSN and clear the follow-right flag on that child. It's
+	 * OK to do this before locking the parent page. If a concurrent scan
+	 * reads this parent page after we've already cleared the follow-right
+	 * flag on the child, it'll still follow the rightlink because of the
+	 * NSN.
+	 */
 	if (BlockNumberIsValid(xldata-leftchild))
 		gistRedoClearFollowRight(xldata-node, lsn, xldata-leftchild);
 	decodePageSplitRecord(xlrec, record);
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data-npage * sizeof(Buffer));
 	for (i = 0; i  xlrec.data-npage; i++)
 	{
 		NewPage*newpage = xlrec.page + i;
-		int			flags;
 
 		if (newpage-header-blkno == GIST_ROOT_BLKNO)
 		{
@@ -239,8 +250,19 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			isrootsplit = true;
 		}
 
-		buffer =

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 27.07.2011 17:43, Alexander Korotkov wrote:

 OK, thanks. I also found behaviour of GiST(without patch) with streaming
 replication that seems strange for me. On master there are only few
 rightlinks are InvalidBlockNumber while on slave there are a lot of them.
 I
 hack gevel for getting index structure on slave (accessing tree without
 AccessExclusiveLock).

 On master:
 # create table test as (select point(random(),random()) from
 generate_series(1,10));
 # create index test_idx on test using gist(point);
 # \copy (select gist_tree('test_idx')) to 'tree1r.txt';

 On slave:
 # \copy (select gist_tree('test_idx')) to 'tree2r.txt';

 In bash:
 # cat tree1r.txt | sed 's/\\n/\n/g'  tree1.txt
 # cat tree2r.txt | sed 's/\\n/\n/g'  tree2.txt
 # diff tree1.txt tree2.txt

 2,89c2,89
       1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637
 (OK)
           1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
 (OK)
           2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
 (OK)
           3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
 (OK)
           4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
 (OK)
           5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
 (OK)
           6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
 (OK)
           7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
 (OK)
 .
 ---

     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%)
 rightlink:4294967295

 (InvalidBlockNumber)

         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)

 rightlink:4294967295 (InvalidBlockNumber)

         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)

 rightlink:4294967295 (InvalidBlockNumber)

         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)

 rightlink:4294967295 (InvalidBlockNumber)

         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)

 rightlink:4294967295 (InvalidBlockNumber)

         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)

 rightlink:4294967295 (InvalidBlockNumber)

         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)

 rightlink:4294967295 (InvalidBlockNumber)

         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)

 rightlink:4294967295 (InvalidBlockNumber)
 .

 Isn't it a bug?

 Yeah, it sure looks like a bug. I was certain that I had broken this in the
 recent changes to GiST handling of page splits, but in fact it has been like
 that forever.

 The rightlinks are not needed after crash recovery, because all the
 downlinks should be there. A scan will find all pages through the downlinks,
 and doesn't need to follow any rightlinks. I'm not sure why we explicitly
 clear them, it's not like the rightlinks would do any harm either, but for
 crash recovery that's harmless.

 But a scan during hot standby can see those intermediate states, just like
 concurrent scans can on the master. The locking on replay of page split
 needs to be fixed, too. At the moment, it locks and writes out each page
 separately, so a concurrent scan could overtake the WAL replay while
 following rightlinks, and miss tuples on the right half.

 Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
 quite different, it will require a separate patch.

Hmm, I was assured no changes would be required for Hot Standby for
GIST and GIN. Perhaps we should check GIN code also.

Does the order of locking of the buffers matter? I'm sure it does.

Did you want me to write the patch for 9.0?

And what does NSN stand for? :-)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 01.08.2011 13:13, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
quite different, it will require a separate patch.


Hmm, I was assured no changes would be required for Hot Standby for
GIST and GIN. Perhaps we should check GIN code also.


Yeah, we probably should.


Does the order of locking of the buffers matter? I'm sure it does.


Yep.


Did you want me to write the patch for 9.0?


I'm looking at it now.


And what does NSN stand for? :-)


Hmm, I don't know actually know what NSN is an acronym for :-). But in 
case you want an explanation of what it does:


The NSN is field in the GiST page header, used to detect concurrent page 
splits. Whenever a page is split, its NSN is set to the LSN of the page 
split record. To be precise: the NSN of the resulting left page(s) is 
set, the resulting rightmost half keeps the NSN of the original page.


When you scan a parent page and decide to move down, it's possible that 
the child page is split before you read it, but after you read the 
parent page. So you didn't see the downlink for the right half when you 
scanned the parent page. To reach the right half, you need to follow the 
rightlink from the left page, but first you need to detect that the page 
was split. To do that, when you scan the parent page you remember the 
LSN on the parent. When you scan the child, you compare the parent's LSN 
you saw with the NSN of the child. If the child's NSN  parent's LSN, 
the page was split after you scanned the parent, so you need to follow 
the rightlink.


The B-tree code has similar move-right logic, but it uses the high key 
on each page to decide when it needs to move right. There's no high key 
on GiST pages, so we rely on the NSN for that.


In 9.1, I added the F_FOLLOW_RIGHT flag to handle the case of an 
incomplete split correctly. If the flag is set on a page, its rightlink 
needs to be followed regardless of the NSN.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

2011-08-01 Thread Thom Brown

On 1 August 2011 11:44, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.08.2011 13:13, Simon Riggs wrote:
 And what does NSN stand for? :-)

 Hmm, I don't know actually know what NSN is an acronym for :-).

Node Sequence Number.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown t...@linux.com wrote:
 On 1 August 2011 11:44, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 01.08.2011 13:13, Simon Riggs wrote:
 And what does NSN stand for? :-)

 Hmm, I don't know actually know what NSN is an acronym for :-).

 Node Sequence Number.

Do you have a reference to support that explanation?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

2011-08-01 Thread Thom Brown

On 1 August 2011 12:25, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown t...@linux.com wrote:
 On 1 August 2011 11:44, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 01.08.2011 13:13, Simon Riggs wrote:
 And what does NSN stand for? :-)

 Hmm, I don't know actually know what NSN is an acronym for :-).

 Node Sequence Number.

 Do you have a reference to support that explanation?

Here's one reference to it:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg00294.php

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

2011-08-01 Thread Alexander Korotkov

On Mon, Aug 1, 2011 at 3:25 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown t...@linux.com wrote:
  On 1 August 2011 11:44, Heikki Linnakangas
  heikki.linnakan...@enterprisedb.com wrote:
  On 01.08.2011 13:13, Simon Riggs wrote:
  And what does NSN stand for? :-)
 
  Hmm, I don't know actually know what NSN is an acronym for :-).
 
  Node Sequence Number.

 Do you have a reference to support that explanation?


See Access Methods for Next-Generation Database Systems by Marcel
Kornacker, Chapter 4 Concurrency and Recovery for GiSTs.
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz


--
With best regards,
Alexander Korotkov.

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Does the order of locking of the buffers matter? I'm sure it does.

 Yep.

Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 01.08.2011 14:35, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Does the order of locking of the buffers matter? I'm sure it does.


Yep.


Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?


I just meant that yes, the order of locking of the buffers does matter.

I believe we code acquire the locks in right order already, and the 
patch I posted fixes the premature release of locks at page split.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.08.2011 14:35, Simon Riggs wrote:

 On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Does the order of locking of the buffers matter? I'm sure it does.

 Yep.

 Do you mean that the BlockNumbers are already in correct sequence, or
 that you will be adding this code to redo?

 I just meant that yes, the order of locking of the buffers does matter.

 I believe we code acquire the locks in right order already, and the patch I
 posted fixes the premature release of locks at page split.

Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)


On 01.08.2011 17:26, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

I believe we code acquire the locks in right order already, and the patch I
posted fixes the premature release of locks at page split.


Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.


All but the first page are unused pages, grabbed with either P_NEW or 
from the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard for 
the case that someone else chooses the same victim buffer, and picks 
another page.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Hot standby and GiST page splits (was Re: [HACKERS] WIP: Fast GiST index build)