Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Leonardo Francalanci
Simon Riggs wrote
 On 5 November 2013 14:28, Leonardo Francalanci <m_lists@> wrote:
 
 Either my sql is not correct (likely), or my understanding of the minmax
 index is
 not correct (even more likely), or the minmax index is not usable in a
 random inputs
 scenario.
 
 Please show the real world SQL you intend to run, so we can comment on
 it. Inventing a use case that breaks effectiveness of any optimisation
 is always easy, but the question is whether the use case is likely or
 even desirable.

The use case is pretty simple.
Think of it as the NSA, since that makes it easier to picture.
Collect all calls made/received by any user on a mobile network.
(In fact it's a bit more than calls, so it works out to 2-10 rows per call.)
Keep the data for 20 days.
That's the insert part.

Query:
search calls made/received by a user, using the IMSI (caller id) or the IMEI
(phone id). The date range is usually a few days (the past 4 days, from 10
days ago to 5 days ago...).

The result is just a very small percentage of the rows present in the
table: a single user doesn't call that much!
Searches are made by a human, so not that many requests per second.

It's not just a write-mostly scenario, it's a 99% write / 1% read scenario.

Problem? Having 4 btree indexes on random values (imsi + imei, times two,
since we have both the calling and the called party) kills insertion
performance after a while.
Solution so far? Partition every 15 minutes and create the indexes in bulk.
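To make it concrete, the table and the query look roughly like this (column
names are made up, not our real schema):

create table cdr (
    call_timestamp timestamp not null,
    calling_imsi   bigint,
    called_imsi    bigint,
    calling_imei   bigint,
    called_imei    bigint
    -- ... plus the rest of the call record
);

-- the four random-value btree indexes that kill insertion after a while
create index on cdr (calling_imsi);
create index on cdr (called_imsi);
create index on cdr (calling_imei);
create index on cdr (called_imei);

-- typical search: one user, a range of days
select *
from cdr
where (calling_imsi = 123456789 or called_imsi = 123456789)
  and call_timestamp >= now() - interval '4 days';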


Simon Riggs wrote
 If we have a query to show the most recent calls by a particular caller
 
 SELECT *
 FROM cdr
 WHERE callerid = X
 ORDER BY call_timestamp DESC
 LIMIT 100
 
 then this could potentially be optimised using a minmax index, by
 traversing the data ranges in call_timestamp order. That is not part
 of the code in this initial release, since the main use case is for
 WHERE call_timestamp = X, or WHERE primarykey = Y

I don't understand how an index on call_timestamp would help
with the query above.


Simon Riggs wrote
 I don't believe there is a credible business case for running that
 same query but without the ORDER BY and LIMIT, since it could
 potentially return gazillions of rows


Gazillions of rows??? We're talking about calls made/received by
one user here. How many calls do you make in 10 days???


Simon Riggs wrote
  so it isn't surprising at all
 that it would access a large % of the table. 

In fact, the query I use returns a tiny fraction of the table, and only
a very small number of users ever get searched.

Simon, you keep talking about these minmax indexes, and
I still don't see any reference to performance tests.

And, again, I think that random-value insertion is the worst
use case for minmax indexes.





Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Leonardo Francalanci
Simon Riggs wrote
 So in the use case you describe, the min max index would require a
 scan of only 25% of the table, not the 80% described earlier for
 random inserts. In my experience, people wish to keep data for much
 longer periods and so the percentage of scan required would drop lower
 than 25%, possibly to 5% or less for many applications.
 
 The plan would use sequential I/O so could still work quickly; given
 the low read rate, longer query times could be acceptable.


Quickly??? A seq scan of 25% of a multi-TB table is not quick.



Simon Riggs wrote
 Minmax indexes are simply a way to make this use case happen
 automatically, without the need for manual partitioning of the table.

Your logic assumes that we index nothing but the call_timestamp.
That would lead to huge query response times.
Plus: a btree has no big problem keeping up in a sequential-insertion
scenario (such as the call_timestamp). So I still don't see the gain
in using minmax indexes: again, could you point me to some
performance tests of *any* use case?


Simon Riggs wrote
 They are not the answer to every prayer, but with respect they are
 better than you had claimed they would be. (25% not 80%, in your use
 case). I saw this was likely to be the case and this is why I
 challenged you to describe in more detail. Thank you.

I claimed they would scan 80% of the table because I assumed I
had to use them on the random fields, not on the call_timestamp field.
I don't need a better index on call_timestamp: it's sequential, so
I have no problems there. But on its own it's also useless: I don't want
to seq scan 25% of a multi-TB table.



Simon Riggs wrote
 Performance tests are only possible with a clear use case. 

Well, so I could add my weird_index patch to the postgresql source code
and it would be committed right away??? I assumed you had
to prove somehow that a new index is better than what
is already available, at least for some cases.

Or, in other words: what are you going to write in the minmax index
documentation, "try them and see if they work better for you"?


Simon Riggs wrote
 Please see that Alvaro and I have gone out of our way to provide a new
 facility to help you and others, and that it requires changing how we
 think about the solution. I accept it may not provide for every case
 but it requires careful analysis before deciding that is so.

If I came across too harshly, I apologize. I always appreciate people
taking their time to help someone else for free.

Plus, I'm *very* interested in the minmax index, especially for
call_timestamp (some queries are date-range only, such as "give me
all the calls in this particular 2-second range") or for the id column I have.
But, at the same time, I don't see any evidence that they work
better than btrees (except for the size of the index).
I would like to see some numbers.
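The kind of date-range-only query I mean would be, say (literal values made
up):

select *
from cdr
where call_timestamp >= '2013-11-10 12:00:00'
  and call_timestamp <  '2013-11-10 12:00:02';

On a query like that, a minmax index on call_timestamp could skip every page
range whose [min, max] interval doesn't intersect those 2 seconds.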

I worked a little on the bitmap index implementation, and I stopped because
on the large tables these indexes are supposed to be used on, the heap
lookups took so much time that the (slightly) faster index access didn't
really help: the index access was only a small fraction of the query time...
I'm afraid it would be the same with minmax indexes, and that's why I wanted
to see some numbers...


Simon Riggs wrote
 And, again, I think that random values insertion is the worst
 use case for minmax indexes.
 
 I agree, it is. What we have disagreed on is the extent to which that
 scenario exists for use cases on very large tables, which are
 typically append-mostly with most queries accessing a subset of the
 table, e.g. date range.

Mhh... maybe this is the point where we don't understand each other?

I query the table by userid + date range.
The date range is *not* selective enough (it's, as you said, 25%
of the multi-TB table). The userid is *very* selective.

I'm looking for a better index for the userid column(s).

The new indexes I mentioned in the OP claim to be better
in this scenario (but I don't blindly believe them).

I don't see how indexing only the call_timestamp could be
of any interest, since it would require seq-scanning 25%
of a huge table for every query.






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Leonardo Francalanci
Jeremy Harris wrote
 Surely there's good correlation between IMSI and IMEI, so have a separate
 table to translate one to (a group of) the others, and
 halve the indexes on your main table?

Yes; unfortunately both are not always available; but it's something
we are thinking about (it requires logic in the inserting application
that doesn't exist at the moment, but it is something we'll
have to add sooner or later).
But in the end yes, trying to use fewer indexed fields is a good path.
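A rough sketch of that idea (names invented, not our real schema):

create table handset (
    handset_id bigserial primary key,
    imsi       bigint,
    imei       bigint
);
create index on handset (imsi);
create index on handset (imei);

-- the big call table would then store only calling_handset_id/called_handset_id,
-- so the random-value btree indexes on it drop from four to two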





Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Leonardo Francalanci
Simon Riggs wrote
 From our discussions here, IMHO there is a strong case for avoiding
 btrees completely for larger historical data tables. That isn't
 something I had even considered as desirable before this conversation
 but ISTM now that taking that approach will be more fruitful than
 attempting to implement LSM trees.

Eh? I don't understand this point. How can I avoid btrees and
still search by caller_id? I don't get it...


Simon Riggs wrote
 Alvaro has given me some results for his patch. The figures I have are
 for a 2GB table.
 
 Index Build Time
 MinMax 11 s
 Btree   96s
 
 Index Size
 MinMax 2 pages + metapage
 Btree approx 200,000 pages + metapage
 
 Load time
 MinMax no overhead, same as raw COPY
 BTree - considerably slower

Great!!! This looks very promising. Were the indexed values
sequential?






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Leonardo Francalanci
Simon Riggs wrote
 Can you CLUSTER
 against a minmax index?
 
 Not in this release, at least in my understanding. It's not yet
 possible to do an ordered fetch, so the cluster scan probably won't
 work.

As per the patch I helped write, CLUSTER should use the
sequential heap scan + sort when it makes sense.
So I think that if the index is not able to do an ordered fetch,
CLUSTER should fall back to scan + sort automatically (which is
what you want on a large table anyway).

Obviously, that should be tested.






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Andres Freund-3 wrote
 On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
 On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@> wrote:
  Such a thing would help COPY, so maybe it's worth a look
 
 I have little doubt that a deferred insertion buffer of some kind
 could help performance on some workloads, though I suspect the buffer
 would have to be pretty big to make it worthwhile on a big COPY that
 generates mostly-random insertions.
 
 Even for random data presorting the to-be-inserted data appropriately
 could result in much better io patterns.

Mmh, I'm afraid the buffer would have to be huge to get any real advantage.
You have to buffer enough values to avoid touching entire pages, which is
not that easy. The LSM-tree is much more complicated than a simple memory
buffer + delayed inserts. The other index types that are supposed to help with
the random-insertion workload rely on sequential insertions (at the expense of
more writing, and slower reads) rather than re-using the btree pattern.






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Simon Riggs wrote
 Everybody on this thread is advised to look closely at Min Max indexes
 before starting any further work.
 
 MinMax will give us access to many new kinds of plan, plus they are
 about as close to perfectly efficient, by which I mean almost zero
 overhead, with regard to inserts as it is possible to get.

Simon, I don't understand how minmax indexes would help in a random-inserts
scenario.
While I would love to use minmax for other columns (since we also partition
and search based on a timestamp, which is usually well clustered), I thought
the minmax index would be perfect in a mostly-incremental-values scenario.





Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Simon Riggs wrote
 Minmax indexes seem to surprise many people, so broad generalisations
 aren't likely to be useful.
 
 I think the best thing to do is to publish some SQL requests that
 demonstrate in detail what you are trying to achieve and test them
 against minmax indexes. That way we can discuss what does work and
 what doesn't work well enough yet.

While I do believe in testing (since "In theory there is no difference
between theory and practice. In practice there is"), I would like to know
the properties of the minmax index before trying it.
What is it supposed to be good at? What are the pros/cons? We can't ask all
the users to just try the index and see if it works for them.
As I said, my understanding is that it is very efficient (both in insertion and
in searching) when data is somehow ordered in the table. But maybe I got it
wrong...

Anyway, the SQL scenario is simple: a table with 4 bigint indexes; data in
the fields is a random bigint in the range 1-1000. Insertion is 5-10K
rows/second. One search every 1-5 seconds, by one of the indexed fields
(only equality, no range). There's also an index on a timestamp field, but
that's not random so it doesn't cause that many problems (it's actually the
one where I wanted to try the minmax).






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Simon Riggs wrote
 On 5 November 2013 09:57, Leonardo Francalanci <m_lists@> wrote:
 While I do believe in testing (since In theory there is no difference
 between theory and practice. In practice there is), I would like to know
 the properties of the minmax index before trying it.
 What is it supposed to be good at? What are the pros/cons? We can't ask
 all
 the users to just try the index and see if it works for them.
 
 No, but then all users aren't suggesting we need a new index type are
 they?
 
 I think its reasonable for you to spend time checking whether what you
 require will be met, and if not, what precise query doesn't it help,
 so we can better design any future new-index.

I don't understand the parallel between "We can't ask all the users to just try
the index and see if it works for them" and "all users aren't suggesting we
need a new index type". Anyway:

I'm not suggesting we need a new index type. Please read my first post: I'm
asking for info, fearing that there's just a lot of marketing/hype around those
indexes.

What do you mean by "spend time checking whether what you require will be
met"? Met by what, minmax indexes?


Simon Riggs wrote
 As I said, my understanding is that is very efficient (both in insertion
 and
 in searching) when data is somehow ordered in the table. But maybe I got
 it
 wrong...
 
 Anyway, the sql scenario is simple: a table with 4 bigint indexes; data
 in
 the fields is a random bigint in the range 1-1000. Insertion is 5-10K
 rows/second. One search every 1-5 seconds, by one of the indexed fields
 (only equality, no range). There's also an index on a timestamp field,
 but
 that's not random so it doesn't give that many problems (it's actually
 the
 one where I wanted to try the minmax).
 
 Without meaning to pick on you, imprecise analysis yields unhelpful
 new features. The clearer we are about what we are trying to solve the
 more likely we are to solve it well. 30 seconds analysis on what is
 needed is not sufficient to justify an additional man year of
 development, especially if a man year of work has already been done
 and the testing of the latest feature is now at hand.

I've never said that my analysis justifies a man year of work. As I said,
I'm actually not confident at all that those cool indexes would work in my
scenario even if we had them (marketing aside, there's not that much
data/tests out there on those indexes). I just wanted to know if the matter
had been discussed in the past, and to get more info.

At the same time, I'm reluctant to try a new index hoping it will work in my
case just because it's new and a man year of work has already been done.
Again: what is this minmax index supposed to be good at?
If it's indexing data in mostly-sequential order, it won't help me. From
what I got (maybe I got it wrong?) the index stores min/max values for
sequences of pages. In my case I guess those min/max values would be
close to 1 (min) / 1000 (max), because I insert data in random order. So
any query would scan the entire table anyway. Am I wrong?


Simon Riggs wrote
 The requests from the indexes field, are they ordered? 

Mmmh, I don't think I understand the question... an operator searches for
calls made by a user, so he searches in random order... 


Simon Riggs wrote
 are they
 limited? Or you really want ALL calls? What is the tolerance on those?

I want all the calls made by the user. But there won't be many calls for one
user.
And most users will never be queried (as I said, one "calls by person"
search every 1-5 seconds, so only a very small percentage of calls will ever
be queried/retrieved).





Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Claudio Freire wrote
 Min-max indexes always require a sequential scan of the min-max index
 itself when querying. 

I'm worried about the number of heap pages that will be scanned.
My understanding is that given the random input, the index will
not be selective enough, and will end up requiring a lot of page 
scanning to get the relevant rows.

That is: the selectivity of the min/max values will be very low, given
the high cardinality and randomness of the input; I'm afraid that, in
most page ranges, the min will be lower than the queried ones,
and the max higher, resulting in lots of heap page scans.

Quick test:
a table with 1M rows, with random values from 1 to 1000:

create table t as
  select generate_series(1, 1000000) as i, trunc(random() * 1000) as n;

Suppose a page range contains 100 rows (my understanding is that a minmax
index will actually use a much bigger row count...); let's find how many page
ranges would have to be scanned to find each of a series of 50 values (again,
random in the 1-1000 range):

with cte as
  (select min(n) as minn, max(n) as maxn, i/100 from t group by i/100),
 inp as
  (select generate_series(1, 50) as iinp, trunc(random() * 1000) as s)
select count(*) from inp
  left outer join cte on inp.s between minn and maxn
group by iinp;

I get ~9000 page ranges for 49 values out of 50... which means scanning 90% of
the table.

Either my sql is not correct (likely), or my understanding of the minmax
index is
not correct (even more likely), or the minmax index is not usable in a
random inputs
scenario.







Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Claudio Freire wrote
 real data isn't truly random

Well, let's try normal_rand???

create table t1 as
  select trunc(normal_rand(100, 50, 3)) as n, generate_series(1, 100) as i;

with cte as
  (select min(n) as minn, max(n) as maxn, i/100 from t1 group by i/100),
 inp as
  (select generate_series(1, 100) as iinp, trunc(normal_rand(100, 50, 3)) as s)
select count(*), iinp from inp
  left outer join cte on inp.s between minn and maxn
group by iinp;

Not that much different in my run... 



Claudio Freire wrote
 you haven't really
 analyzed update cost, which is what we were talking about in that last
 post.

I don't care for a better update cost if the cost to query is a table scan.
Otherwise, I'll just claim that no index at all is even better than minmax:
0 update cost, pretty much same query time.

Maybe there's value in minmax indexes for sequential data, but not for
random data, which is the topic of this thread.

BTW I would like to see some performance tests on the minmax indexes
vs btree for the sequential inputs... is the gain worth it? I couldn't find
any mention of performance tests in the minmax threads.






Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Claudio Freire wrote
 Well, of course, they're not magic pixie dust. 

Of course they aren't. I think they can make a difference in a sequential
input scenario. But I'm not the one who said they are fit to
solve the problems that other people and I are talking about in this thread.


Claudio Freire wrote
 But is your data really random? (or normal?)
 That's the thing...

I still don't understand.

Even if the data were normal, it wouldn't work. You can try and
change the 3rd parameter of normal_rand to get a narrower
distribution, but at the same time the query values would be
narrower too... so you'd get the same results.

The problem is: if the range you get between min and max is 
too big in each page range, you'll end up scanning a lot of heap
pages.

To me, the thing is: has anyone run performance tests of these
minmax indexes with input that is not sequential?

(BTW I'd like to see tests for the sequential input scenario too...)







Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-05 Thread Leonardo Francalanci
Jeff Janes wrote
 Some experiments I did a few years ago showed that applying sorts to the
 data to be inserted could be helpful even when the sort batch size was as
 small as one tuple per 5 pages of existing index.  Maybe even less.

Cool!!! Do you have any idea/hint on how I could try and replicate that?
Do you remember how you did it?






Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-31 Thread Leonardo Francalanci
Jeff Janes wrote
 True, but that is also true of indexes created in bulk.  It all has to
 reach disk eventually--
 [...]
 If the checkpoint interval is as long as the partitioning period, then
 hopefully the active index buffers get re-dirtied while protected in
 shared_buffers, and only get written to disk once.  

Honestly, I ran a lot of tests in the past, and I don't remember if I tried
15-minute checkpoints + high shared_buffers. That might work. I'm going to
try it and see what happens.


Jeff Janes wrote
 If the buffers get read, dirtied, and evicted from a small shared_buffers
 over and over again
 then you are almost guaranteed that will get written to disk multiple
 times

(as I understand it, but I might be wrong):
high shared_buffers doesn't help because with such random index writes, lots
and lots of pages get dirtied, even if the change in each page is minimal.
So, over the 15-minute period, you write the same pages over and over again.
Even with high shared_buffers, the same page will get synced to disk
multiple times (at every checkpoint).
The idea of those other index types is to avoid the random writing and
maximize sequential writing, even if that means writing more bytes. In other
words: writing a full 8KB page is no different from writing 20 bytes in a
page, as we'll have to sync the whole page anyway...

I'll try a 15-minute checkpoint interval... and see what happens.
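Something along these lines is what I'd try (a rough postgresql.conf sketch,
not a recommendation; the values are placeholders to be sized to the machine):

# postgresql.conf sketch for the 15-minute-checkpoint test
shared_buffers = 16GB                  # large enough to keep the hot index pages cached
checkpoint_timeout = 15min             # match the partitioning period
checkpoint_completion_target = 0.9     # spread the checkpoint writes out
checkpoint_segments = 1024             # high enough that checkpoints stay time-driven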






Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-31 Thread Leonardo Francalanci
Gavin Flower-2 wrote
 How about being able to mark indexes:
  'MEMORY ONLY' to make them not go to disk
 and
  'PERSISTENT | TRANSIENT' to mark if they should be recreated on 
 machine bootup?

I would love that. But:

1) I'd like to run some tests with a memory drive first, and confirm that this
would in fact help (I'm sure I tried it in the past, but I don't remember the
outcome); a rough sketch of such a test is below.
2) I don't know whether the fact that they are in memory should be handled by
the db or not. I was thinking about something more like RECREATE IF NOT FOUND,
that is: if the files aren't found at postgresql startup, re-create the
index...
3) I don't know how many people would be interested (or how
doable/complicated that would be, considering log replay, replication, etc.)







Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
 Hmm, you realise Alvaro is working on MinMax indexes in this release?
 They are very efficient with regard to index inserts and specially
 designed for use on large tables.
 
 Prior work by Heikki on Grouped Item Tuples was a way of reducing the
 size of indexes, yet still allowing uniqueness checks. That is
 implemented in SQLServer already and is very useful.


Reading the implementation of those features, I don't think they can help in 
the cases handled by the index types I mentioned (insertions of random values 
in big tables).





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
 Presumably the data you are inserting isn't actually random. Please
 describe the use case you are considering in more detail and some view
 on how frequent that is, with some examples. Once we understand the
 use case and agree it is important, we might solve problems.


Collecting call data for mobile network operators (and no, I don't work for
the NSA...)
Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a
problem: always increasing, so no btree issues) and on called #, calling #,
IMSI, IMEI. The last four are obviously random, out of millions of possible
values.
After the first few million records, the disks can't keep up with the
amount of random writing in the indexes. Workaround: the table is partitioned
every 15 minutes, and the indexes are created in bulk after we start the new
15-minute partition. Searches on the current 15 minutes are not allowed (as
they are not indexed), and searches on older data are K*log(N) (where K is the
number of partitions).
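Roughly, the 15-minute rotation looks like this (names made up):

-- every 15 minutes: start a new empty partition, bulk load it with COPY,
-- and only then build its indexes in one pass
create table cdr_20131030_1015 () inherits (cdr);
copy cdr_20131030_1015 from stdin;
create index on cdr_20131030_1015 (calling_imsi);
create index on cdr_20131030_1015 (called_imsi);
create index on cdr_20131030_1015 (calling_imei);
create index on cdr_20131030_1015 (called_imei);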
Yes, I could throw more disks at it, use SSDs, shard more, etc. But I still
think that btrees just aren't fit for this kind of problem. I don't delete
data, I don't update data, and there's not that much concurrency going on. I
would sacrifice search speed (K*log(N) is already much slower than regular
btree usage) for realtime insertion.

I don't think I'm the only one with a big system that has to be indexed by
random values.

In fact, I didn't want to turn this thread into a "help me with this workload"
thread. I just wanted to know if there were other known issues with these
different indexes besides "not enough time to implement them correctly": I
was afraid that someone had already dismissed them as "good in theory, bad in
practice"...





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
 What is the reason for needing such fast access to individual groups
 of records? Sure sounds like the NSA or similar ;-)


Users need to search all calls originated from/to a user or from/to a specific 
mobile phone to answer/analyze customers' probl... ok, I give up: I work for 
the NSA ;)

 In terms of generality, do you think its worth a man year of developer
 effort to replicate what you have already achieved? Who would pay?


1) I haven't achieved what I need: realtime indexing. I can't query the
current 15-minute table efficiently. Plus, K*log(N) is not that great when
K gets large.
2) I'm not suggesting that this is top priority. I'm asking if there's
something else, other than "we don't have time for this", that I don't know.
In fact, I don't even know if those index types would really help in my
(specific) case. That's why my original question was "why aren't there
developments in this area": I didn't mean to imply someone should do it. I
just wanted to know if those indexes had already been discussed (and maybe
dismissed for some reason) in the past...





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
 LSM-trees seem patent free

I'm no expert, and I only gave it a quick look some time ago: it looked very
complicated to get right... and as far as I remember you don't get that much
gain unless you go multi-level, which would complicate things further.

 Please somebody advise patent status of Y-trees otherwise I wouldn't bother.
 
y-trees look much easier to get right... (and to me they also make more
sense, but I'm not skilled enough to judge).

There's also the FD-tree, which looks a lot like the (patented...) fractal tree:
http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf




Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
Jeff Janes wrote
 The index insertions should be fast until the size of the active part of
 the indexes being inserted into exceeds shared_buffers by some amount
 (what
 that amount is would depend on how much dirty data the kernel is willing
 to
 allow in the page cache before it starts suffering anxiety about it).  If
 you have enough shared_buffers to make that last for 15 minutes, then you
 shouldn't have a problem inserting with live indexes.

Sooner or later you'll have to checkpoint those shared_buffers... and we are
talking about GBs of data (my understanding is that basically every btree
page gets changed, resulting in re-writing the whole index).








Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
Jeff Janes wrote
 Are partitions read-only once time has moved on, or can stragglers show up
 that need to be inserted into older partitions?
 
 You could periodically merge older partitions into larger tables, index
 those aggregated tables, then transactionally disinherit the old
 partitions
 and inherit the new aggregated one.  This would keep the value of K down,
 at the expense of re-writing data multiple times (but all method write
 data
 multiple times, some just hide it from you).

Yes, we could merge the partitions: the idea was to merge them during
night hours, when traffic is low (and the NSA people are sleeping ;) )



Jeff Janes wrote
 By the way, what is the transaction structure of your inserts?  Are they
 large batches between commits, or is each row committed?

Of course large batches (using COPY)






Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
 Point being: hardware is marching along pretty fast (after 20+ years
 of stagnation) and it's dangerous (IMO) to make big software
 investments based on the situation on the ground *today*.


Yes, that's a good point.




Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-30 Thread Leonardo Francalanci
Jeff Janes wrote
 You could periodically merge older partitions into larger tables, index
 those aggregated tables, then transactionally disinherit the old
 partitions
 and inherit the new aggregated one.  This would keep the value of K down,
 at the expense of re-writing data multiple times (but all method write
 data
 multiple times, some just hide it from you).

Forgot to add:

I also thought that we could:

- use RAM as the tablespace for indexes, and move the indexes later (but
postgresql doesn't handle a machine crash very well in this case... it would
be cool to be able to create an index as "recreate on crash"...)
- use unlogged tables and turn them into logged ones to speed up the
insertion somehow; I actually started to write a patch for it, but making it
work for replication was too hard (not that I'm using replication, but a
patch that only worked for wal_level = minimal wouldn't be accepted). But
this wouldn't solve the problem anyway.






[HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
Hi,


I don't see much interest in insert-efficient indexes. These are the ones I've 
found:

- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree 
(http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf
 )
- Fractal indexes (TokuDB, patented)

While I understand that b*trees are still the best compromise in
insertion/search speed, disk size, and concurrency, and more generally in OLTP
workloads, they are useless when it comes to inserting random (not ordered)
values into big data tables (50M rows).

I would like to know if the lack of development in this area (not only in 
Postgresql, but in databases in general) is due to:

1) complex implementation
2) poor search performance
3) poor concurrency performance
4) not interesting for most users
5) something else???

I thought this was going to change due to the fast-insertion needs of
social applications, but TokuDB seems to be the only successful player
in the area (and I don't know how much of that is due to good marketing). Most
other DB technologies claim faster insertion speed (MongoDB and the like...)
but in the end they rely on the old b*tree + sharding instead of using
different indexing mechanisms (with the exception of Cassandra).


Thank you in advance

Leonardo





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
 Before getting too excited about some new academic index type, it's worth
 noting the sad state in which hash indexes have languished for years.
 Nobody's bothered to add WAL support, let alone do any other real work
 on them.  The non-btree index types that have been getting love are the
 ones that offer the ability to index queries that btree can't.  I think
 a new index type whose only benefit is the claim to be faster in a narrow
 use-case is likely to end up like hash, not getting used enough to be
 properly maintained.
             regards, tom lane

Aren't hash indexes in a poor state because they are not faster than btrees in
every condition?





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
 Another point to add: I don't really see btree as a barrier to
 performance for most of the problems I face.  The real barriers to
 database performance are storage, contention, and query planning.

Ehm, that's true for regular OLTP stuff, which I understand is what most
(95%?) people use/need. But if you try to insert rows into a 50M-row table
with a couple of indexes, btrees just can't keep up.
Of course, you can't have it all: fast insertion into big tables, good
contention behaviour, good query times...

 Postgres btreee indexes are pretty fast and for stuff like bulk
 insertions there are some optimization techniques available (such as
 sharding or create index concurrently).


At the moment I'm relying on partitioning + creating indexes in bulk on the
latest table (the partitioning is based on time). But that means K*log(N)
search times (where K is the number of partitions).
That's why I took a look at these different indexing mechanisms.




Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
 They should, in theory, be faster than btrees -- O(1) not O(log N) page
 fetches per lookup.  In practice they don't seem to be faster, and
 nobody's bothered to find out exactly why.  Again, this isn't a terribly
 encouraging precedent for implementing some other index type that's
 supposed to (sometimes) be faster than btrees.

Yes, I understand. Which is also why I was curious to know whether the claims
made in those papers (and by the databases using them) are real...

Thank you everybody for your replies.

Leonardo




Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
 I bet you've mis-diagnosed the problem.  Btrees don't have a problem
 keeping up with 50m records; you're problem is that after a certain
 point your page cache can't keep up with the pseudo-random i/o
 patterns and you start seeing faults to storage.
 [...]   This has nothing to do the btree algorithm except to the
 extent it affects i/o patterns.


Of course; that's why those different index types aim to use more sequential 
than random writes.





Re: [HACKERS] Fast insertion indexes: why no developments

2013-10-29 Thread Leonardo Francalanci
 Hmm, you realise Alvaro is working on MinMax indexes in this release?

 They are very efficient with regard to index inserts and specially
 designed for use on large tables.
 
 Prior work by Heikki on Grouped Item Tuples was a way of reducing the
 size of indexes, yet still allowing uniqueness checks. That is
 implemented in SQLServer already and is very useful.

Ah! Didn't know that!

 Your comment about the lack of development in indexes seems counter to
 the literature that I've seen. The main problem is people keep
 patenting things, making it fairly difficult for everyone.

Mmh, maybe I wasn't clear: I meant the lack of development (maybe I should
have said "implementation"?) in postgresql and in the other SQL databases of
the fast-insertion indexes you can find in the literature.





Re: [HACKERS] Page Checksums

2011-12-22 Thread Leonardo Francalanci

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).



My point is that it looks like this is going to take 1-2 years in
postgresql, so it looks like a huge job... but at the same time I
understand we can't just hope the filesystems will catch up!



I guess this feature will be tunable (off/on)?



Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as 
the one available in ZFS and btrfs?





Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

On 21/12/2011 16.19, Stephen Frost wrote:

* Leonardo Francalanci (m_li...@yahoo.it) wrote:

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?


Depends on how much you trust the filesystem. :)



Ehm, I hope that was a joke...


I think what I meant was: isn't this going to be useless in a couple of
years (if, say, btrfs becomes available)? Or does it actually give something
that the FS will never be able to give?




Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?


Yes, it will help you find/address bugs in the filesystem.  These things
are not unheard of...


It sounds to me like a huge job to fix issues that are merely "not unheard
of"...

My point is: if we are trying to work around misbehaving drives/controllers
(something that is more common than one might think), that's already
done by ZFS on Solaris and FreeBSD, and will be done by btrfs on Linux.


I understand not trusting drives/controllers; but not trusting a
filesystem...



What am I missing? (I'm far from being an expert... I just don't 
understand...)








Re: [HACKERS] WIP: Collecting statistics on CSV file data

2011-10-18 Thread Leonardo Francalanci
 New API AnalyzeForeignTable



I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPIs calculated by postgresql. Of course, pipes can only
be read once, so I wouldn't want an automatic analyze of foreign tables...



Re: [HACKERS] Is there any plan to add unsigned integer types?

2011-09-26 Thread Leonardo Francalanci
 

 compression is an interesting topic: the guys over at tokudb are
 making some wild claims...i'm curious if they are real, and what the
 real tradeoffs are.


I don't know how much of the performance they claim comes from
compression and how much from the different indexing technique they
use (see my post here, which nobody answered:

http://archives.postgresql.org/pgsql-general/2011-09/msg00615.php


)



Re: [HACKERS] Your Postgresql 9.2 patch

2011-06-28 Thread Leonardo Francalanci
 Leonardo,
 
 Your patch:
 
 use less space in xl_xact_commit
 
 ...  has been waiting on an updated version from you for 10 days now.  Do
 you  think you're likely to complete it for this CommitFest?


I sent an email on the subject:

http://postgresql.1045698.n5.nabble.com/use-less-space-in-xl-xact-commit-patch-tp4400729p4495125.html


and the thread went on and on but now I'm confused about what to do...

Can someone tell me what we want???



Leonardo



Re: [HACKERS] Your Postgresql 9.2 patch

2011-06-28 Thread Leonardo Francalanci
 I've nearly  finished editing prior to commit, so no worries.


Thank you, let me know if I can help.



Leonardo



Re: [HACKERS] use less space in xl_xact_commit patch

2011-06-16 Thread Leonardo Francalanci
 With regards to the naming, I think it would be better if we  kept
 XLOG_XACT_COMMIT record exactly as it is now, and make the  second
 record an entirely new record called XLOG_XACT_COMMIT_FASTPATH.  That
 way we retain backwards compatibility.
 
 If you'd like to rework  like that please, otherwise I can take it from
 here if you'd  like.


I think I did it; while doing it, I think I've found a bug: I didn't update
recoveryStopsHere. Please double check that, as I really don't 
know what I'm doing there...
Should I also change the struct name from xl_xact_commit to
xl_xact_commit_fast_path?

  How can I test it with weird stuff as subtransactions,  shared
  cache invalidation messages...?
 
 make installcheck should  cover those.


Ok, all tests passed.

commitlog_lessbytes_v3.patch
Description: Binary data



Re: [HACKERS] use less space in xl_xact_commit patch

2011-06-16 Thread Leonardo Francalanci
 The important thing is that we  retain backwards compatibility with
 current XLOG_XACT_COMMIT. I'm not worried  what we call the other one.


Ok, let me see if I got it right:

#define XLOG_XACT_COMMIT            0x00

should become:

#define XLOG_XACT_COMMIT_WITH_INFO  0x00

and I'll add a

#define XLOG_XACT_COMMIT            0x60


Then I'll leave 2 structs, xl_xact_commit_with_info and
xl_xact_commit.




Leonardo



Re: [HACKERS] use less space in xl_xact_commit patch

2011-06-15 Thread Leonardo Francalanci
 We don't need to be in a hurry here. As the reviewer I'm happy to  give
 Leonardo some time, obviously no more than the end of the commit  fest.
 
 If he doesn't respond at all, I'll do it, but I'd like to give him  the
 chance and the experience if possible.


Sorry I couldn't update the patch (in fact, it's more of a total-rewrite than
an update).

How much time do I have?



Leonardo



Re: [HACKERS] use less space in xl_xact_commit patch

2011-06-15 Thread Leonardo Francalanci
 On Tue, Jun 14, 2011 at 2:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Well, we certainly have the option to review and commit the patch  any
 time up until feature freeze.  However, I don't want the  CommitFest
 application to be full of entries for patches that are not  actually
 being worked on, because it makes it hard for reviewers to figure  out
 which patches in a state where they can be usefully looked at.   AIUI,
 this one is currently not, because it was reviewed three weeks ago  and
 hasn't been updated.


Yes it's true: I thought I could find the time to work on it, but I didn't.
Let me know the deadline for it, and I'll see if I can make it (or I'll make
it in the next commit fest).


Leonardo




Re: [HACKERS] use less space in xl_xact_commit patch

2011-06-15 Thread Leonardo Francalanci
   On Wed, May 25, 2011 at 3:05 PM, Simon Riggs si...@2ndquadrant.com  
wrote:
  Leonardo, can you  submit an updated version of this patch today that
  incorporates Simon's  suggestion?  


Mmmh, maybe it was simpler than I thought; I must be
missing something... patch attached


How can I test it with weird stuff like subtransactions and shared
cache invalidation messages...?


Leonardo


commitlog_lessbytes_v2.patch
Description: Binary data



Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-31 Thread Leonardo Francalanci
 I think
 we need a detailed design document for how  this is all going to work.
 We need to not only handle the master properly but  also handle the
 slave properly.  Consider, for example, the case where  the slave
 begins to replay the transaction, reaches a restartpoint  after
 replaying some of the new pages, and then crashes.  If the  subsequent
 restart from the restartpoint blows away the main relation fork,  we're
 hosed.  I fear we're plunging into implementation details  without
 having a good overall design in mind first.


As I said in my first post, I'm basing the patch on the post:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php


So I assumed the design was ok (except for the "stray file around
on a standby" case, which was discussed earlier in this thread).

If there are things to be discussed/analyzed (I guess the restart-point
issue is one of those) we can do it... but I thought the whole
design was more or less in place.


Leonardo



Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-31 Thread Leonardo Francalanci
 Well, I  sort of assumed the design was OK, too, but the more we talk
 about this  WAL-logging stuff, the less convinced I am that I really
 understand the  problem.  :-(


I see. In fact, I think nobody thought about restart points...

To sum up:

1) everything seems ok in the wal_level = minimal
case. In this case, we fsync everything and at transaction commit
we remove the init fork; in case of a crash, we don't replay
anything (as nothing has been written to the log), and we
remove the main fork as we do now.

2) in the wal_level != minimal case things become more
complicated: if the standby reaches a restart point
and then crashes we are in trouble: it would remove the
main fork at startup, and would replay only a portion of
the table.
I guess the same applies to the master too? I mean:
if we log HEAP_XLOG_NEWPAGEs, reach a checkpoint,
and then crash, at server restart the main fork would be
deleted, and the pages written to the log couldn't be
replayed. But the problem on the master can be avoided by
using another type of log record instead of HEAP_XLOG_NEWPAGE
(to be replayed by the standbys only).


Is this analysis correct?


Leonardo



Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-30 Thread Leonardo Francalanci
 Why is it necessary to replay the operation only on the slave?   Can we
 just use XLOG_HEAP_NEWPAGE?


Uh, I don't know why, but I thought I shouldn't log a page on the master,
since all the pages are already there and fsync-ed. But if it does no harm,
I can easily use XLOG_HEAP_NEWPAGE (of course, only in the
wal_level != minimal case).




Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-27 Thread Leonardo Francalanci
 From: Noah Misch n...@leadboat.com
  - the  patch is missing the send all table pages to the
  standby part; is  there some code I can use as base?
 
 Nothing comes to mind as especially  similar.
 
  I guess I have to generate some special log type  that
  is only played by standby servers.
 
 What you described in  your followup mail seemed reasonable.


So, it's ok to have a log item that is replayed only if 

WalRcvInProgress()

is true?

Is that a correct approach? I couldn't find any other way to
find out whether we are on a standby or on a master...

  - on the standby, the commit  part should be played as it
  is on the master (that is, removing the INIT  fork).
  The abort case is different though: it would mean
  doing  nothing on the master, while removing every forks
  but the INIT fork on  the standby.
  Would it be ok to add to xl_xact_abort a new array  of
  RelFileNode(s), where for each one at abort all the forks,
   except the init fork, have to be deleted by the standby
  (while the  master shouldn't do anything with them)?
  I bet there's a cleaner  solution...
 
 Your "use less space in xl_xact_commit" patch seems to be going in a good
 direction here.  It would probably also be okay to do a ResetUnloggedRelations()
 on the standby at every abort of a transaction that had started an
 UNLOGGED -> LOGGED conversion.  That is, just a flag might be enough.
 
ok, but that would mean that a transaction that aborts a conversion
would try to reset all unlogged relations (traversing the whole filesystem)...
I don't know if that's acceptable performance-wise.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
 From: Simon Riggs si...@2ndquadrant.com
 I can't find a clear discussion of what you are trying to do, and how,
 just a URL back to a complex discussion on another topic.


While trying to write a patch to allow changing an unlogged table into
a logged one, I had to add another int field to xl_xact_commit.
Robert Haas said:

 I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.


which is a correct remark.

xl_xact_commit can contain some arrays (relations to drop,
committed sub-transactions, shared invalidation messages). The length of
these arrays is specified using 3 ints in the struct.

So, to avoid adding more ints to the struct, it has been suggested that I
remove all the ints and use xl_xact_commit.xinfo to flag which
arrays are, in fact, present.

So the whole idea is:

- remove nrels, nsubxacts and nmsgs from xl_xact_commit
- use bits in xinfo to signal which arrays are present at the end
of xl_xact_commit
- for each present array, add the length of the array (as an int) at
the end of xl_xact_commit
- add each present array after all the lengths
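
To make the idea concrete, the layout would be more or less this (just a
sketch; the flag names and the simplified fields below are invented for
illustration, they are not the ones in the patch):

/* illustrative flag bits, not existing ones */
#define XACT_COMMIT_HAS_RELFILENODES   0x04
#define XACT_COMMIT_HAS_SUBXACTS       0x08
#define XACT_COMMIT_HAS_INVALMSGS      0x10

typedef struct xl_xact_commit
{
    TimestampTz xact_time;   /* time of commit (other fixed fields omitted) */
    uint32      xinfo;       /* flags: which arrays follow */
    int         counts[1];   /* VARIABLE LENGTH: one count per array
                              * flagged in xinfo, in a fixed order */
    /* then, in the same order, only the arrays whose flag is set:
     *   RelFileNode xnodes[];              -- rels to drop at commit
     *   TransactionId subxacts[];          -- committed subxids
     *   SharedInvalidationMessage msgs[];  -- invalidation messages
     */
} xl_xact_commit;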


 I can't see any analysis that shows  whether this would be effective in
 reducing space of WAL records and what the  impacts might be.


The size of the WAL records should always be <= what it is without the patch:
in the worst-case scenario, all the ints are present (as they would be without
the patch). In the best case, which I guess is the most common, each
xl_xact_commit will be 3 ints shorter.

I don't think the effect will be noticeable: the whole idea is to allow
more variable-length structures in xl_xact_commit in the future,
without increasing the space on disk.

 The patch contains very close to zero  comments.


I tried to add some.
 
 

Leonardo

commitlog_lessbytes01.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci


commitlog_lessbytes02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
Sorry, email sent without body.

Fixed some English mistakes.

commitlog_lessbytes02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-20 Thread Leonardo Francalanci
I'll try to sum up what I understood:

1) the standby keeps the lock, so no problem with
stray files coming from the UNLOGGED -> LOGGED log
replay, as the table can't be read during the operation

2) calling ResetUnloggedRelations before
ProcArrayApplyRecoveryInfo would remove the problem
of the stray files on the standby in case of a master crash
before commit/abort

3) promoting the standby shouldn't be an issue,
since ResetUnloggedRelations is already called in
ShutdownRecoveryTransactionEnvironment


Now, to move forward, some questions:

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?
I guess I have to generate some special log record type that
is only replayed by standby servers.

- on the standby, the commit part should be replayed as it
is on the master (that is, removing the INIT fork).
The abort case is different though: it would mean
doing nothing on the master, while removing every fork
but the INIT fork on the standby.
Would it be ok to add to xl_xact_abort a new array of
RelFileNode(s), where for each one at abort all the forks,
except the init fork, have to be deleted by the standby
(while the master shouldn't do anything with them)?
I bet there's a cleaner solution...



Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-20 Thread Leonardo Francalanci
 - the patch is missing the send all table pages to  the
 standby part; is there some code I can use as base?
 I guess I have to  generate some special log type that
 is only played by standby  servers.


Maybe I could use log_newpage, but instead of
XLOG_HEAP_NEWPAGE I could use something like
XLOG_HEAP_COPYPAGE; and in heap_redo, in the
XLOG_HEAP_COPYPAGE case, call heap_xlog_newpage
only if we're on a standby...
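
In code, the redo branch would be something like this (untested sketch;
XLOG_HEAP_COPYPAGE is the invented record type I'm talking about, and one
way to test for "we're on a standby" could be the WalRcvInProgress() check
discussed elsewhere in this thread):

case XLOG_HEAP_COPYPAGE:
    /*
     * The master wrote and fsync-ed these pages directly, so the page
     * image only has to be applied when we are replaying on a standby.
     */
    if (WalRcvInProgress())
        heap_xlog_newpage(lsn, record);
    break;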

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-19 Thread Leonardo Francalanci
 On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:
   By the time the startup process
   releases the AccessExclusiveLock acquired by the proposed
   UNLOGGED -> normal conversion process, that relfilenode
   needs to be either fully copied or unlinked all over again.
   (Alternately, find some other way to make sure queries don't
   read the half-copied file.)
  
  About this issue: how are AccessExclusiveLocks released on
  the standby when the master crashes?
 
 I assume those locks remain.  It wouldn't be safe to release them; a master
 crash is just one kind of WAL receipt latency.


But somehow when the master restarts, the standby gets notified that it
has to release the lock???

 When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
 releases the locks.


Ok; then the problem in the UNLOGGED -> normal conversion is that:

1) the master and the standby acquire a lock on the table
2) the master starts sending the pages to the standby
3) the master crashes before committing

up until here no problems, as the standby still has the lock on the
table.

4) when the master restarts, it removes all the forks for rels with INIT forks;
are those deletes sent to the standby? And, if yes,
would they be replayed by the standby *before* releasing the lock?
If the answer is yes, then I don't think we have a problem... but I think
that at the moment those unlogged-table-fork deletes aren't sent at all.

(When promoting the standby, we could have
ShutdownRecoveryTransactionEnvironment() remove all the forks for rels
with INIT forks before releasing the locks)



Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-18 Thread Leonardo Francalanci
 int  counts[1];  /* variable-length array of counts, xinfo flags  define
 length of array and meaning of counts */



Damn, that's much cleaner than what I did. I don't know why
I stuck with the idea that it had to be:

int
array
int
array
...

instead of: 
int
int
...
array
array
...

which makes much more sense.


 Then, I'd make macros like this:
 
 #define XactCommitNumberOfDroppedRelFileNodes(xlrec) \
     ((xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? xlrec->counts[0] : 0)
 #define XactCommitNumberOfCommittedSubXids(xlrec) \
     ((xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ? \
      xlrec->counts[(xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? 1 : 0] : 0)
 ...etc...


ehm, I don't know if macros would be enough; that's OK
for the first 2, but then I think it would become a mess...
Maybe I'll use a simple function that gets all the ints at once.
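
Something along these lines (untested sketch; the first two flag names are
the ones from the macros above, the third one is made up):

/*
 * Sketch only: walk counts[] in the order the arrays are stored and
 * return each length, or 0 when the corresponding array is absent.
 */
static void
XactCommitGetCounts(xl_xact_commit *xlrec,
                    int *nrels, int *nsubxacts, int *nmsgs)
{
    int i = 0;

    *nrels = (xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ?
        xlrec->counts[i++] : 0;
    *nsubxacts = (xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ?
        xlrec->counts[i++] : 0;
    *nmsgs = (xlrec->xinfo & XACT_COMMIT_INVAL_MSGS) ?
        xlrec->counts[i++] : 0;
}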


 ...and a similar set of macros that will return a pointer to the
 beginning of the corresponding array, if it's present.  I'd lay out
 the record like this:
 
 - main record
 - array of counts (might be zero-length)
 - array of dropped relfilenodes (if any)
 - array of committed subxids (if any)
 - array of sinval messages (if any)


ok

 Also, it's important not to confuse xact completion with xact  commit,
 as I think some of your naming does.  Completion could perhaps  be
 thought to include abort.


mmh... I don't know if I got it... but I'll take a look, and ask questions...



Thank you very much for the help


Leonardo


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-18 Thread Leonardo Francalanci
this is a second version: now using

 int counts[1];  /* variable-length array of counts */

in xl_xact_commit to keep track of the number of elements in the
different arrays at the end of the struct.

Waiting for feedback...




Leonardo 

commitlog_lessbytes00.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-18 Thread Leonardo Francalanci
 By the time the startup process
 releases the AccessExclusiveLock acquired by the proposed
 UNLOGGED -> normal conversion process, that relfilenode
 needs to be either fully copied or unlinked all over again.
 (Alternately, find some other way to make sure queries don't
 read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on
the standby when the master crashes?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] use less space in xl_xact_commit patch

2011-05-16 Thread Leonardo Francalanci
Hi,


following the conversation at 

http://postgresql.1045698.n5.nabble.com/switch-UNLOGGED-to-LOGGED-tp4290461p4382333.html


I tried to remove some bytes from xl_xact_commit. 

The way I did it needs palloc+memcpy. I guess it could be done
reusing the memory for smgrGetPendingDeletes. But I don't
think it's that important.

I guess there are other ways of doing it; let me know what
you think.



Leonardo


commitlog_firstv.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-10 Thread Leonardo Francalanci
 Yes, that seems like a very appealing approach.   There is plenty of
 bit-space available in xinfo, and we could reserve a bit  each for
 nrels, nsubxacts, and nmsgs, with set meaning that an integer count  of
 that item is present and clear meaning that the count is omitted  from
 the structure (and zero).  This will probably require a bit of  tricky
 code reorganization so I think it should be done separately from  the
 main patch.  

Ok, I'll try and send a patch with this change only.
BTW, xinfo is 32 bits long, but I think only 2 bits are used right now?
I think I can make it 8 bits, and add another 8 bits for nrels,
nsubxacts, nmsgs and the new flag. That should save
another 2 bytes, while leaving space for extension. Or we can make
it 8 bits only, but then only 2 bits would be left empty for future
extensions; I don't know if we care about that...


Leonardo


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-10 Thread Leonardo Francalanci
 I don't  think making xinfo shorter will save anything, because
 whatever follows it is  going to be a 4-byte quantity and therefore
 4-byte aligned.


ups, I didn't notice that.

I'll split xinfo into:

uint16   xinfo;
uint16   presentFlags;


I guess it helps readability? I mean, instead
of having a single uint32?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-05-08 Thread Leonardo Francalanci
 On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
  Maybe you should change xl_xact_commit to have a separate list of rels to
  drop the init fork for (instead of mixing those with the list of files to
  drop as a whole).
 
  I tried to follow your suggestion, thank you very much.
 
 I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.


I understand.

What if, in xl_xact_commit, instead of

RelFileNode xnodes

I use another struct for xnodes, something like:

{
 RelFileNode xnode;
 bool        onlyInitFork;
}


That would increase the commit record size only when there are
RelFileNode(s) to drop at commit. So, instead of 4 extra bytes in
every commit record, bytes are added only when the commit record
contains dropped permanent relations (which should happen much
less often). I'm open to suggestions here...


  3) Should we have a cascade option? I don't know  if I have to handle
  inherited tables and other dependent  objects
 
 Look at the way ALTER TABLE [ONLY] works for other action types,  and copy it.


Ok


Thank you very much



Leonardo 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-26 Thread Leonardo Francalanci
  If that 1% is random (not time/transaction related), usually you'd rather
  have an empty table.
 
 Why do you think it would be random?

Heap blocks would be zeroed if they were found to be damaged, following a
crash.

If you erase full blocks, you have no idea what data you erased; it could be
something changed 1 hour ago, 1 month ago, 1 year ago. This is very different
from, say, synchronous_commit=off: in that case, the most recent transactions
may be lost if the database crashes. In your case, some data (who knows
which???) is lost. So, to me that sounds like random loss. I don't think that
is different from a corrupted table. You're not deleting rows recently
changed; you're deleting everything that is physically close to them.

  In other words: is a table that is not consistent with anything else in
  the db useful?
 
 That's too big a leap. Why would it suddenly be inconsistent with the
 rest of the database?


If you delete some data, and you have no idea what data you lost, I don't
think you have a consistent db. Unless, of course, your table has no relation
with any other table in the db.

Of course, all these thoughts are based on the assumption that I know what
happens when a block is erased; but my knowledge of postgresql internals is
not so good, so I might be *very* wrong

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Leonardo Francalanci
The only data we can't rebuild is the heap. So what about an option for
UNlogged indexes on a LOGged table? It would always preserve data, and it
would 'only' cost rebuilding the indexes in case of an unclean shutdown. I
think it would give a boost in performance for all those cases where the IO
(especially random IO) is caused by the indexes, and it doesn't look too
complicated (but maybe I'm missing something).

I proposed the unlogged-to-logged patch (BTW, has anyone taken a look at it?)
because we partition data based on a timestamp, and we can risk losing the
last N minutes of data, but after N minutes we want to know the data will
always be there, so we would like to set a partition table to 'logged'.

Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Leonardo Francalanci
 The amount of data loss on a big
 table will be 1% of the data loss
 caused by truncating the whole table.

If that 1% is random (not time/transaction related), usually you'd rather have
an empty table. In other words: is a table that is not consistent with anything
else in the db useful?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-04-22 Thread Leonardo Francalanci
 Maybe you should change xl_xact_commit to have a separate list of rels to
 drop the init fork for (instead of mixing those with the list of files to
 drop as a whole).


I tried to follow your suggestion, thank you very much.

Here's a first attempt at the patch.  

I tested it with:


create table forei (v integer primary key);
insert into forei select * from generate_series(1,1);
create unlogged table pun (c integer primary key, constraint 
con foreign key (c) references forei(v));
insert into pun select * from generate_series(1,1);
alter table pun  set logged;


then shutdown the master with immediate:

bin/pg_ctl -D data -m immediate stop
bin/pg_ctl -D data start


and pun still has data:

select * from pun where c=100;


Question/comments:

1) it's a very-first-stage patch; I would need to know if something is
*very* wrong before cleaning it up.
2) there are some things I implemented with a "let's see how it's done
10 lines above, and do the same" logic. For example, the 2PC stuff
is totally copied from the other places; I have no idea if the code makes
sense at all (how can I test it?)
3) Should we have a cascade option? I don't know if I have to handle
inherited tables and other dependent objects.
4) During the check for dependency problems, I stop as soon as I find an
error; would that be enough?



Leonardo

unl2log_20110422.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What Index Access Method Functions are really needed?

2011-04-22 Thread Leonardo Francalanci
  another question regarding indexes. Sadly I can't find enough  info in
  the documentation. Which of the functions are needed in order  for a
  index to work?
 
 All of them.


Maybe I completely misunderstood the question, but some functions
are optional, such as amgetbitmap, right?

http://www.postgresql.org/docs/9.0/static/index-functions.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-04-18 Thread Leonardo Francalanci
I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).

To remove the INIT fork, I changed PendingRelDelete somehow to have
a bool onlyInitFork flag so that the delete would remove only the INIT
fork at commit.

Everything "works" (note the quotes...) except in the case of a not-clean
shutdown (-m immediate to pg_ctl stop). The reason is that the replay
code doesn't have any idea that it has to remove only the INIT fork: it will
remove ALL forks; so at the end of the redo procedure (at startup, after
a pg_ctl -m immediate stop) the table doesn't have any forks at all.

See xact_redo_commit:

/* Make sure files supposed to be dropped are dropped */
for (i = 0; i < xlrec->nrels; i++)
{
    [...]
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i], fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}
[...]


Should I change xl_xact_commit in order to have, instead of:


/* Array of RelFileNode(s) to drop at commit */
RelFileNode xnodes[1];  /* VARIABLE LENGTH ARRAY */



an array of structures such as:

{
RelFileNode   relfilenode;
bool   onlyInitFork;
}

???

Otherwise I don't know how to tell the redo commit code to delete only
the INIT fork, instead of all the relation forks...
(maybe I'm doing it all wrong: I'm open to any kind of suggestion here...)
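
Just to make it concrete: with the struct above, xact_redo_commit could
branch per entry, something like (untested sketch, using the field names
from the struct above):

for (i = 0; i < xlrec->nrels; i++)
{
    [...]   /* open srel exactly as the existing code does */
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        /* UNLOGGED -> LOGGED case: only the INIT fork has to go away */
        if (xlrec->xnodes[i].onlyInitFork && fork != INIT_FORKNUM)
            continue;

        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i].relfilenode, fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}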


Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-04-16 Thread Leonardo Francalanci
  If the master crashes while a transaction that used CREATE TABLE is unfinished,
  both the master and the standby will indefinitely retain identical, stray (not
  referenced by pg_class) files.  The catalogs do reference the relfilenode of
  each unlogged relation; currently, that relfilenode never exists on a standby
  while that standby is accepting connections.  By the time the startup process
  releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
  conversion process, that relfilenode needs to be either fully copied or unlinked
  all over again.  (Alternately, find some other way to make sure queries don't
  read the half-copied file.)  In effect, the problem is that the relfilenode is
  *not* stray, so its final state does need to be well-defined.
 
 Oh, right.
 
 Maybe we should just put in a rule that a server in Hot Standby mode
 won't ever try to read from an unlogged table (right now we count on
 the fact that there will be nothing to read).  If we crash before
 copying the whole file, it won't matter, because the catalogs won't
 have been updated, so we'll refuse to look at it anyway.  And we have
 to reinitialize on entering normal running anyway, so we can clean it
 up then.


Ok then... I'll try to code the easy version first (the wal_level=minimal 
case)
and then we'll see

Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-04-11 Thread Leonardo Francalanci
  But re-reading  it, I don't understand: what's the difference in creating
  a new  regular table and crashing before emitting the abort record,
  and  converting an unlogged table to logged and crashing before
  emitting the  abort record? How do the standby servers handle a
  CREATE TABLE  followed by a ROLLBACK if the master crashes
  before writing the abort  record? I thought that too would leave a
  stray file around on a  standby.
 
 I've been thinking about the same thing.  And AFAICS, your  analysis is
 correct, though there may be some angle to it I'm not  seeing.


Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with the wal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...



Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] switch UNLOGGED to LOGGED

2011-04-09 Thread Leonardo Francalanci
 I'm pretty sure we wouldn't accept a patch for a  feature that would
 only work with wal_level=minimal, but it might be a useful  starting
 point for someone else to keep hacking on.


I understand.

Reading your post at 
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought I got the part:

what happens if we *crash* without writing an abort record?  It
seems  like that could leave a stray file around on a standby, 
because the  current code only cleans things up on the standby
at the start of  recovery


But re-reading it, I don't understand: what's the difference in creating
a new regular table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a 
CREATE TABLE followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would leave a
stray file around on a standby.



Leonardo


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] switch UNLOGGED to LOGGED

2011-04-08 Thread Leonardo Francalanci
Hi,

I read the discussion at 


http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php 


From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated. 

Suppose I try to write a patch that allows 

ALTER TABLE tablename SET LOGGED (or UNLOGGED)
(proper sql wording to be discussed...)

only in the wal_level == minimal case: would it be accepted as a
first step? Or rejected because it doesn't allow it in the other
cases?

From what I got in the discussion, handling the other wal_level cases
can be very complicated (example: the issues in case we *crash*
without writing an abort record).




Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-21 Thread Leonardo Francalanci
 Here's my single biggest tip for newcomers to the Postgres  source:
 if you don't use ctags, glimpse, or some other tool that can  quickly
 show you all references to a given identifier, go out and get  one.
 It's one of the easiest ways to learn about  things.


I guess Eclipse is the best tool for a newcomer, as it gives
you all the references very easily and shows a function's/struct's
declaration when you hover the mouse over one...

http://wiki.postgresql.org/wiki/Working_with_Eclipse




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 On the other hand, if I use a similar test case to my original  one
 (i.e. the tables are much wider) then the query planning takes
 1.42  seconds in 9.1 with this patch instead of about 4.7 seconds
 as we observed it  using PostgreSQL 9.0.0. The beginning of the gprof
 output now looks like  this:

Hi,

I'm really interested in this patch.

I tried a simple test case:

create table t (a integer, b text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..9000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 || '
and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms.

Tried with 9000 partitions, time is still 2 secs.


Maybe I've missed completely the patch purpose?


(I tried the test case at

http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at

and that, in fact, gets a boost with this patch).



Leonardo




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 but I don't get any gain from the patch... explain time is still  around 250 
ms.
 Tried with 9000 partitions, time is still 2  secs.


Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250ms with both versions, with 9000 partitions 2 secs (again
no gain from the patch)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 This is going to be dominated by constraint exclusion  checking.  There's
 basically no fix for that except a more explicit  representation of the
 partitioning rules.  

Damn, I knew that was going to be more complicated :)

So in which case does this patch help? I guess in a multi-index
scenario? childtables.sql is kind of hard to read (I think a FOR loop
would have been more self-explanatory?).




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 Hmm, maybe I spoke too soon.  With 9000 child tables I get  a profile
 like this:


Well, the 9000-table test case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.

BTW, can someone explain to me in which cases the
patch actually helps?


 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 Cases with lots of irrelevant indexes.  Zoltan's example had  4 indexes
 per child table, only one of which was relevant to the query.   In your
 test case there are no irrelevant indexes, which is why the  runtime
 didn't change.



Mmh... I must be doing something wrong. It looks to me like it's not just
the irrelevant indexes: it's the order by that counts. If I remove that,
times are the same with and without the patch:

using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with and without the patch; but adding:


order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..2000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 || '
and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
END LOOP;
END$$;


explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

 order by b 

the non-patched version runs in 10 seconds.

Am I getting it wrong?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Custom aggragation function that creates an array

2010-10-23 Thread Leonardo Francalanci
 Before you start coding, have you looked over what's  in
 contrib/intarray ?


There's nothing that fulfills my needs there, but I guess it
would be the perfect place to look for code examples!

Thank you

(I think I made my custom aggregation function work,
but I'll look into the intarray code to see if I made any mistakes...)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Custom aggragation function that creates an array

2010-10-22 Thread Leonardo Francalanci
Hi,

I want to write a custom aggregate function that, given an int4 index,
increments the element at that index of an array and, at the end, returns
the array.  The array will always be int4[].

I need it in C, since plpgsql is way slower (and I need to use it on
5M+ rows). I did it, but I also need to create the array on demand,
not as initcond (the array size will be a parameter).

I suppose that, since I'll be using construct_array (if it makes
sense), I'll need a new MemoryContext?
I've looked at array_agg_transfn, but I think that's a lot more
complicated because it repallocs, so it has to keep
the memory context around...

Since I'm not familiar with these things, would something like this
work?

if (PG_ARGISNULL(0))
{
    /* first time: allocate the array */
    arr_context = AllocSetContextCreate(rcontext,
                                        "myCustomArrayResult",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    transarray = construct_array([...]);
    MemoryContextSwitchTo(oldcontext);
}
else
{
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}



This is how the C function is written right now:

ArrayType  *transarray;
int32       index = PG_GETARG_INT32(1);
int32      *array;

if (AggCheckCallContext(fcinfo, NULL))
{
    /* the code above would go in place of this line: */
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}
else
    transarray = PG_GETARG_ARRAYTYPE_P_COPY(0);

array = (int32 *) ARR_DATA_PTR(transarray);
array[index]++;

PG_RETURN_ARRAYTYPE_P(transarray);

(checks on boundaries etc. omitted)


The problem is that it expects the array to be already available, so I had to 
do:


CREATE AGGREGATE cust_agg_func (int4)
(
sfunc = cust_agg_funct_add,
stype = int4[],
initcond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);


I would like to have:

CREATE AGGREGATE cust_agg_func (int4 index, int4 
array_size_to_be_created)
(
sfunc = cust_agg_funct_with_array_creation_add,
stype = int4[]
);
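
Putting the pieces together, I imagine the transition function would end up
looking more or less like this (completely untested sketch: it allocates the
state array in the aggregate context returned by AggCheckCallContext instead
of creating a private context, and the function name matches the CREATE
AGGREGATE I'd like to have above):

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"
#include "utils/array.h"

PG_FUNCTION_INFO_V1(cust_agg_funct_with_array_creation_add);

Datum
cust_agg_funct_with_array_creation_add(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    ArrayType  *transarray;
    int32       index = PG_GETARG_INT32(1);
    int32       arrsize = PG_GETARG_INT32(2);
    int32      *array;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "must be called as an aggregate transition function");

    if (PG_ARGISNULL(0))
    {
        /* first call: build a zeroed int4[] of the requested size in the
         * aggregate's own context, so it survives between calls */
        MemoryContext oldcontext = MemoryContextSwitchTo(aggcontext);
        Datum      *zeros = (Datum *) palloc0(arrsize * sizeof(Datum));

        transarray = construct_array(zeros, arrsize,
                                     INT4OID, sizeof(int32), true, 'i');
        MemoryContextSwitchTo(oldcontext);
    }
    else
        transarray = PG_GETARG_ARRAYTYPE_P(0);

    array = (int32 *) ARR_DATA_PTR(transarray);
    if (index >= 0 && index < arrsize)
        array[index]++;

    PG_RETURN_ARRAYTYPE_P(transarray);
}

(This assumes the sfunc is declared non-STRICT, so that the transition
function itself gets called with the NULL initial state on the first row
and can build the array.)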




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] On the usefulness of hint bits

2010-10-11 Thread Leonardo Francalanci
Hi,


I was wondering what the advantage of having hint bits is for
OLAP-style workloads, that is, when the number of transactions is not
that high.

If I got it right, in 10 pg_clog pages (2 bits per transaction, 8KB per
page) we can store the status of more than 300,000 transactions. That's
a lot, in a very small space (80KB).

So I was wondering what gain we get from hint bits in cases
where pg_clog is small (that is, it will be cached by postgresql/the
OS).

does somebody have some numbers on the effect of hint bits on
first/second reads?

I mean:

create table mytable as ...

select * from mytable  -- this one will update the hint bits
select * from mytable  -- this one will use them

to test it I guess we should change the code to have:

a version where the hint bits are never updated (that is, always look at
pg_clog)
(so that you don't have to write them the first time, and you'll get the
true reading time + pg_clog reading time)


a version that always sets them to COMMITTED
(so that you don't have to write them the first time, and you'll get the
true reading time for the second read that would use them,
regardless of any pg/OS cache)


I'm asking because I don't like having all those writes on the
first scan... and I would like to know what the real benefit is for the 
reads that come after the first one in case there are few
transactions per second (for example, less than 1 transaction per
second) 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On the usefulness of hint bits

2010-10-11 Thread Leonardo Francalanci
 Reduction of contention for pg_clog  access, for one thing.  If you read
 the archives, you'll find that  pg_clog access contention has been shown
 to be one cause of context swap  storms.  Having to go to clog for
 every single tuple access would make  that orders of magnitude worse.


Ok; is it the

Wierd context-switching issue on Xeon

thread? Or does that have nothing to do with it? I tried searching for
"context swap storms pg_clog" but I didn't get anything...

 More generally, we're not going to give  up hint bits even if there are
 identifiable workloads where they don't buy  much --- because there are
 many others where they  do.


Sure: I wasn't suggesting to give up them, just to make their usage
user selectable (and on by default)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On the usefulness of hint bits

2010-10-11 Thread Leonardo Francalanci
 I wonder if we could improve  this with some sort of process-local
 cache - not to get rid of hint bits,  just to reduce pg_clog
 contention.  We might easily end up testing the  same XID many times
 during the same table scan.

 
I guess that's my scenario... not that many transactions, so even
copying the whole pg_clog table into per-process memory would
be doable...




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-08 Thread Leonardo Francalanci
 Applied with some  significant editorialization.  The biggest problem I
 found was that the  code for expression indexes didn't really work, and
 would leak memory like  there's no tomorrow even when it did work.


Sorry I couldn't write it the way it was supposed to be written... I'll look
at the differences to see what I did wrong... (so maybe next time I'll do
better!)


Thank you

Leonardo




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-04 Thread Leonardo Francalanci
 It sounds like the costing model might need a bit more work before  we commit 
this.


I tried again the simple SQL tests I posted a while ago, and I still get the
same ratios.
I've tested the applied patch on a dual Opteron + disk array Solaris machine.

I really don't get how a laptop hard drive can be faster at reading data
using random seeks (required by the original cluster method) than seq scan +
sort for the 5M rows test case.
Same thing for the cluster vs bloat test: the seq scan + sort is faster on my
machine.

I've just noticed that Josh used shared_buffers = 16MB for the cluster vs
bloat test: I'm using a much higher shared_buffers (I think something like
200MB), since if you're working with tables this big I thought it could be a
more appropriate value.
Maybe that's the thing that makes the difference???

Can someone else test the patch?

And: I don't have that deep a knowledge of how postgresql deletes rows; but
I thought that something like:

DELETE FROM mybloat WHERE RANDOM() < 0.9;

would only delete data, not indexes; so the patch should perform even better
in this case (as it does, in fact, on my test machine), as:

- the original cluster method would read the whole index, and fetch only the
still-alive rows
- the new method would read the table using a seq scan, and sort in memory
the few rows still alive

But, as I said, maybe I'm getting this part wrong...





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-10-01 Thread Leonardo Francalanci
 I ran a few more performance tests on this patch. Here's what  I got
 for the tests Leonardo posted originally:
* 2M  rows:  22 seconds for seq. scan, 24 seconds for index scan
* 5M  rows:  139 seconds for seq. scan, 97 seconds for index scan
*  10M rows: 256 seconds seq. scan, 611 seconds for index scan

I don't have time right now to run more tests, I'll try to make some by
next week.

Would it mean that doing:

create table p as select * from atable order by akey

(where akey is randomly distributed)
with 5M rows is faster with enable_seqscan=0 and
enable_indexscan=1??? That would be weird, especially on a
laptop hard drive! (assuming there's a reasonable amount of
memory set in work_mem/maintenance_work_mem)
 
 I tried a few more tests of creating a table with  either 10M or 50M
 rows, then deleting 90% of the rows and running a cluster.  The patch
 didn't fare so well here:

  * 10M rows: 84 seconds for seq.  scan, 44 seconds for index scan

[...]
 So I think there are  definitely cases where this patch helps, but it
 looks like a seq. scan is  being chosen in some cases where it doesn't
 help.
 
 Test machine:  MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.


Again: what would the planner choose in that case for a:

create table p as select * from mybloat order by myid

???

I guess that if the planner makes a wrong choice in this case (that is,
seq scan + sort instead of index scan) there's no way for cluster to
behave in a different way. If, on the contrary, the create table... uses
the right plan, and cluster doesn't, we have a problem in the patch.  
Am I right?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Leonardo Francalanci
  10% is nothing.  I was expecting this  patch would give an order of
  magnitude of improvement or somethine like  that in the worst cases of
  the current code (highly unsorted  input)
 
 Yes. It should be x10 faster than ordinary method in the worst  cases.


Here's my post with a (very simple) performance test:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php


The test I used wasn't a worst-case scenario, since it is based on
random data, not wrongly-ordered data. Obviously, the real difference
can be seen on large tables (5M+ rows), and/or slow disks.


Leonardo




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-09-29 Thread Leonardo Francalanci
  Here's my post with a (very simple) performance test:
  http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
 I  think the 10M rows test is more in line with what we want (83s vs.  646).


Can someone else test the patch to see if what I found is still valid?
I don't think it makes much sense if I'm the only one that says
this is faster :)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: I: [HACKERS] About Our CLUSTER implementation is pessimal patch

2010-07-21 Thread Leonardo Francalanci
 I think writetup_rawheap() and readtup_rawheap() are a little complex,
 but should work as long as there is no padding between t_len and t_self
 in the HeapTupleData struct.
 
 - It might be cleaner if you write the total item length
   and tuple data separately.
 - (char *) tuple + sizeof(tuplen) might be more robust
   than tuple->t_self.


- I used your functions
- changed the docs for CLUSTER (I don't know if they make sense/are enough)
- added a minor comment


2 questions:

1) about the "copy/paste from FormIndexDatum" comment: how can I improve it?
The idea is that we could have a faster call, but it would mean copying and
pasting a lot of code from FormIndexDatum.

2) which other areas should I comment more?


  

sorted_cluster-20100721.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers