Simon Riggs wrote
On 5 November 2013 14:28, Leonardo Francalanci <m_lists@> wrote:
Either my SQL is not correct (likely), or my understanding of the minmax
index is not correct (even more likely), or the minmax index is not
usable in a random-inputs scenario.
Please show
Simon Riggs wrote
So in the use case you describe, the min max index would require a
scan of only 25% of the table, not the 80% described earlier for
random inserts. In my experience, people wish to keep data for much
longer periods and so the percentage of scan required would drop lower
than
Jeremy Harris wrote
Surely there's good correlation between IMSI and IMEI, so have a separate
table to translate one to (a group of) the other, and
halve the indexes on your main table?
Yes; unfortunately both are not always available; but it's something
we are thinking about (it requires logic
Simon Riggs wrote
From our discussions here, IMHO there is a strong case for avoiding
btrees completely for larger historical data tables. That isn't
something I had even considered as desirable before this conversation
but ISTM now that taking that approach will be more fruitful than
Simon Riggs wrote
Can you CLUSTER
against a minmax index?
Not in this release, at least in my understanding. It's not yet
possible to do an ordered fetch, so the cluster scan probably won't
work.
As per the patch I helped write, CLUSTER should use the
sequential heap scan + sort when it
Andres Freund-3 wrote
On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@> wrote:
Such a thing would help COPY, so maybe it's worth a look
I have little doubt that a deferred insertion buffer of some kind
could help
Simon Riggs wrote
Everybody on this thread is advised to look closely at Min Max indexes
before starting any further work.
MinMax will give us access to many new kinds of plan, plus they are
about as close to perfectly efficient, by which I mean almost zero
overhead, with regard to inserts
Simon Riggs wrote
Minmax indexes seem to surprise many people, so broad generalisations
aren't likely to be useful.
I think the best thing to do is to publish some SQL requests that
demonstrate in detail what you are trying to achieve and test them
against minmax indexes. That way we can
Simon Riggs wrote
On 5 November 2013 09:57, Leonardo Francalanci <m_lists@> wrote:
While I do believe in testing (since "in theory there is no difference
between theory and practice; in practice there is"), I would like to know
the properties of the minmax index before trying it.
What
Claudio Freire wrote
Min-max indexes always require a sequential scan of the min-max index
itself when querying.
I'm worried about the number of heap pages that will be scanned.
My understanding is that given the random input, the index will
not be selective enough, and will end up requiring a
Claudio Freire wrote
real data isn't truly random
Well, let's try normal_rand?
create table t1 as select trunc(normal_rand(100, 50, 3)) as n,
generate_series(1, 100) as i;
with cte as
(select min(n) as minn, max(n) as maxn, i/100 from t1 group by i/100),
Claudio Freire wrote
Well, of course, they're not magic pixie dust.
Of course they aren't. I think they can make a difference in a sequential
input scenario. But I'm not the one who said that they are fit to
solve the problems that other people and I are talking about in this thread.
Claudio
Jeff Janes wrote
Some experiments I did a few years ago showed that applying sorts to the
data to be inserted could be helpful even when the sort batch size was as
small as one tuple per 5 pages of existing index. Maybe even less.
Cool!!! Do you have any idea/hint on how I could try and
Jeff Janes wrote
True, but that is also true of indexes created in bulk. It all has to
reach disk eventually--
[...]
If the checkpoint interval is as long as the partitioning period, then
hopefully the active index buffers get re-dirtied while protected in
shared_buffers, and only get
Gavin Flower-2 wrote
How about being able to mark indexes:
'MEMORY ONLY' to make them not go to disk
and
'PERSISTENT | TRANSIENT' to mark if they should be recreated on
machine bootup?
I would love that. But:
1) I'd like to make some tests with a memory drive, and confirm that in
Hmm, you realise Alvaro is working on MinMax indexes in this release?
They are very efficient with regard to index inserts and specially
designed for use on large tables.
Prior work by Heikki on Grouped Item Tuples was a way of reducing the
size of indexes, yet still allowing uniqueness
Presumably the data you are inserting isn't actually random. Please
describe the use case you are considering in more detail and some view
on how frequent that is, with some examples. Once we understand the
use case and agree it is important, we might solve problems.
Collecting calls data
What is the reason for needing such fast access to individual groups
of records? Sure sounds like the NSA or similar ;-)
Users need to search all calls originated from/to a user or from/to a specific
mobile phone to answer/analyze customers' probl... ok, I give up: I work for
the NSA ;)
In
LSM-trees seem patent free
I'm no expert, and I gave it just a look some time ago: it looked to me very
complicated to get right... and as far as I remember you don't get that much
gain, unless you go multi-level which would complicate things further
Please somebody advise patent status of
Jeff Janes wrote
The index insertions should be fast until the size of the active part of
the indexes being inserted into exceeds shared_buffers by some amount
(what that amount is would depend on how much dirty data the kernel is
willing to allow in the page cache before it starts suffering
Jeff Janes wrote
Are partitions read-only once time has moved on, or can stragglers show up
that need to be inserted into older partitions?
You could periodically merge older partitions into larger tables, index
those aggregated tables, then transactionally disinherit the old
partitions
Point being: hardware is marching along pretty fast (after 20+ years
of stagnation) and it's dangerous (IMO) to make big software
investments based on the situation on the ground *today*.
Yes, that's a good point.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jeff Janes wrote
You could periodically merge older partitions into larger tables, index
those aggregated tables, then transactionally disinherit the old
partitions
and inherit the new aggregated one. This would keep the value of K down,
at the expense of re-writing data multiple times (but
Hi,
I don't see much interest in insert-efficient indexes. These are the ones I've
found:
- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree
(http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf)
- Fractal indexes
Before getting too excited about some new academic index type, it's worth
noting the sad state in which hash indexes have languished for years.
Nobody's bothered to add WAL support, let alone do any other real work
on them. The non-btree index types that have been getting love are the ones
Another point to add: I don't really see btree as a barrier to
performance for most of the problems I face. The real barriers to
database performance are storage, contention, and query planning.
Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?)
of people
They should, in theory, be faster than btrees -- O(1) not O(log N) page
fetches per lookup. In practice they don't seem to be faster, and
nobody's bothered to find out exactly why. Again, this isn't a terribly
encouraging precedent for implementing some other index type that's
supposed to
I bet you've mis-diagnosed the problem. Btrees don't have a problem
keeping up with 50m records; your problem is that after a certain
point your page cache can't keep up with the pseudo-random I/O
patterns and you start seeing faults to storage.
[...] This has nothing to do with the btree
Agreed.
I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).
My point is that it looks like this is going to take 1-2 years in
postgresql, so it looks
I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as
the one available in ZFS and btrfs?
On 21/12/2011 16.19, Stephen Frost wrote:
* Leonardo Francalanci (m_li...@yahoo.it) wrote:
I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?
Depends on how much you trust
I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?
Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...
It sounds to me
New API AnalyzeForeignTable
I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPIs calculated by PostgreSQL. Of course, pipes can be
read just once, so I wouldn't want an automatic analyze of foreign tables...
compression is an interesting topic: the guys over at TokuDB are
making some wild claims... I'm curious if they are real, and what the
real tradeoffs are.
I don't know how much of the performance they claim comes from
compression and how much from the different indexing technique they
use
Leonardo,
Your patch:
use less space in xl_xact_commit
... has been waiting on an updated version from you for 10 days now. Do
you think you're likely to complete it for this CommitFest?
I sent an email on the subject:
I've nearly finished editing prior to commit, so no worries.
Thank you, let me know if I can help.
Leonardo
With regards to the naming, I think it would be better if we kept
XLOG_XACT_COMMIT record exactly as it is now, and make the second
record an entirely new record called XLOG_XACT_COMMIT_FASTPATH. That
way we retain backwards compatibility.
If you'd like to rework like that please,
The important thing is that we retain backwards compatibility with
current XLOG_XACT_COMMIT. I'm not worried what we call the other one.
Ok, let me see if I got it right:
#define XLOG_XACT_COMMIT 0x00
should become:
#define XLOG_XACT_COMMIT_WITH_INFO 0x00
and I'll add a
We don't need to be in a hurry here. As the reviewer I'm happy to give
Leonardo some time, obviously no more than the end of the commit fest.
If he doesn't respond at all, I'll do it, but I'd like to give him the
chance and the experience if possible.
Sorry I couldn't update the patch
On Tue, Jun 14, 2011 at 2:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
Well, we certainly have the option to review and commit the patch any
time up until feature freeze. However, I don't want the CommitFest
application to be full of entries for patches that are not actually
being
On Wed, May 25, 2011 at 3:05 PM, Simon Riggs si...@2ndquadrant.com wrote:
Leonardo, can you submit an updated version of this patch today that
incorporates Simon's suggestion?
Mmmh, maybe it was simpler than I thought; I must be
missing something... patch attached
How can I test
I think
we need a detailed design document for how this is all going to work.
We need to not only handle the master properly but also handle the
slave properly. Consider, for example, the case where the slave
begins to replay the transaction, reaches a restartpoint after
replaying some
Well, I sort of assumed the design was OK, too, but the more we talk
about this WAL-logging stuff, the less convinced I am that I really
understand the problem. :-(
I see. In fact, I think nobody thought about restart points...
To sum up:
1) everything seems ok when in the wal_level =
Why is it necessary to replay the operation only on the slave? Can we
just use XLOG_HEAP_NEWPAGE?
Uh, I don't know why but I thought I shouldn't log a page on the master,
since all the pages are already there and fsync-ed. But if it makes no harm,
I can easily use XLOG_HEAP_NEWPAGE (of
From: Noah Misch n...@leadboat.com
- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?
Nothing comes to mind as especially similar.
I guess I have to generate some special log type that
is only played by standby servers.
From: Simon Riggs si...@2ndquadrant.com
I can't find a clear discussion of what you are trying to do, and how,
just a URL back to a complex discussion on another topic.
While trying to write a patch to allow changing an unlogged table into
a logged one, I had to add another int field to
commitlog_lessbytes02.patch
Description: Binary data
Sorry, email sent without body.
Fixed some English mistakes.
commitlog_lessbytes02.patch
Description: Binary data
I'll try to sum up what I understood:
1) the standby keeps the lock, so no problem with
stray files coming from the unlogged->logged log
replay, as the table can't be read during the operation
2) calling ResetUnloggedRelations before
ProcArrayApplyRecoveryInfo would remove the problem
of the
- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?
I guess I have to generate some special log type that
is only played by standby servers.
Maybe I could use log_newpage, but instead of
XLOG_HEAP_NEWPAGE I could use something like
On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:
By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again
int counts[1]; /* variable-length array of counts, xinfo flags define
length of array and meaning of counts */
Damn, that's much cleaner than what I did. I don't know why
I stuck with the idea that it had to be:
int, array, int, array, ...
instead of:
int, int, ..., array, array, ...
which
this is a second version: now using
int counts[1]; /* variable-length array of counts */
in xl_xact_commit to keep track of number of
different arrays at the end of the struct.
Waiting for feedback...
Leonardo
commitlog_lessbytes00.patch
Description: Binary data
By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again.
(Alternately, find some other way to make sure queries don't
read the half-copied
Hi,
following the conversation at
http://postgresql.1045698.n5.nabble.com/switch-UNLOGGED-to-LOGGED-tp4290461p4382333.html
I tried to remove some bytes from xl_xact_commit.
The way I did it needs palloc+memcpy. I guess it could be done
reusing the memory for smgrGetPendingDeletes. But I
Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure
I don't think making xinfo shorter will save anything, because
whatever follows it is going to be a 4-byte quantity and therefore
4-byte aligned.
Oops, didn't notice it.
I'll split xinfo into:
uint16 xinfo;
uint16 presentFlags;
I guess it helps with reading? I mean, instead
On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci m_li...@yahoo.it
wrote:
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files
to
drop as a whole).
I tried to follow your suggestion
If that 1% is random (not time/transaction related), usually you'd rather
have an empty table.
Why do you think it would be random?
Heap blocks would be zeroed if they were found to be damaged, following a
crash.
If you erase full blocks, you have no idea what data you erased; it could
The only data we can't rebuild is the heap. So what about an option for
UNlogged indexes on a LOGged table? It would always preserve data, and it would
'only' cost a rebuilding of the indexes in case of an unclean shutdown. I think
it would give a boost in performance for all those cases
The amount of data loss on a big
table will be 1% of the data loss
caused by truncating the whole table.
If that 1% is random (not time/transaction related), usually you'd rather have
an empty table. In other words: is a table that is not consistent with anything
else in the db useful?
--
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).
I tried to follow your suggestion, thank you very much.
Here's a first attempt at the patch.
I tested it with:
create
another question regarding indexes. Sadly I can't find enough info in
the documentation. Which of the functions are needed in order for an
index to work?
All of them.
Maybe I completely misunderstood the question, but some functions
are optional, such as amgetbitmap, right?
I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only the wal_level=minimal case for the moment).
To remove the INIT fork, I changed PendingRelDelete to have a
bool onlyInitFork flag so that the delete would remove only the INIT
fork at commit.
Everything works (note the
If the master crashes while a transaction that used CREATE TABLE is
unfinished,
both the master and the standby will indefinitely retain identical, stray
(not
referenced by pg_class) files. The catalogs do reference the relfilenode
of
each unlogged relation; currently, that
But re-reading it, I don't understand: what's the difference in creating
a new regular table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
CREATE TABLE
I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.
I understand.
Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought
Hi,
I read the discussion at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.
Suppose I try to write a patch that allows
ALTER TABLE tablename SET
Here's my single biggest tip for newcomers to the Postgres source:
if you don't use ctags, glimpse, or some other tool that can quickly
show you all references to a given identifier, go out and get one.
It's one of the easiest ways to learn about things.
I guess Eclipse is the best tool
On the other hand, if I use a similar test case to my original one
(i.e. the tables are much wider) then the query planning takes
1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
output now looks like
but I don't get any gain from the patch... explain time is still around 250
ms.
Tried with 9000 partitions, time is still 2 secs.
Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250ms with both versions, with 9000 partitions 2 secs (again
no gain from the
This is going to be dominated by constraint exclusion checking. There's
basically no fix for that except a more explicit representation of the
partitioning rules.
Damn, I knew that was going to be more complicated :)
So in which case does this patch help? I guess in a multi-index
Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
like this:
Well, the 9000-table-test-case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.
BTW can someone explain to me
Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
per child table, only one of which was relevant to the query. In your
test case there are no irrelevant indexes, which is why the runtime
didn't change.
Mmh... I must be doing something wrong. It looks to me it's not
Before you start coding, have you looked over what's in
contrib/intarray ?
There's nothing that fulfills my needs there, but I guess it
would be the perfect place to watch for code examples!
Thank you
(I think I made my custom aggregation function work,
but I'll look into intarray code to
Hi,
I want to write a custom agg function that, given an int4 index,
increments the element at index of an array and, at the end, returns
the array. The array will always be int4[].
I need it in C, since plpgsql is way slower (and I need to use it in
5M+ rows). I did it, but I also need to
Hi,
I was wondering what is the advantage of having hint bits for OLAP
-style workloads, that is when the number of transactions is not
that high.
If I got it right, in 10 pg_clog pages we can store the status for more
than 320,000 transactions. That's a lot, in a very small space
(80KB?).
So I
Reduction of contention for pg_clog access, for one thing. If you read
the archives, you'll find that pg_clog access contention has been shown
to be one cause of context swap storms. Having to go to clog for
every single tuple access would make that orders of magnitude worse.
Ok; is it
I wonder if we could improve this with some sort of process-local
cache - not to get rid of hint bits, just to reduce pg_clog
contention. We might easily end up testing the same XID many times
during the same table scan.
I guess that's my scenario... not that many transactions, so even
Applied with some significant editorialization. The biggest problem I
found was that the code for expression indexes didn't really work, and
would leak memory like there's no tomorrow even when it did work.
Sorry I couldn't write it the way it was supposed to be... I'll look at the
differences
It sounds like the costing model might need a bit more work before we commit
this.
I tried again the simple sql tests I posted a while ago, and I still get the
same ratios.
I've tested the applied patch on a dual opteron + disk array Solaris machine.
I really don't get how a laptop hard
I ran a few more performance tests on this patch. Here's what I got
for the tests Leonardo posted originally:
* 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
* 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
* 10M rows: 256 seconds seq. scan, 611
10% is nothing. I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input)
Yes. It should be x10 faster than ordinary method in the worst cases.
Here's my post with a (very simple) performance test:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
I think the 10M rows test is more in line with what we want (83s vs. 646).
Can someone else test the patch to see if what I found is still valid?
I don't think it makes
I think writetup_rawheap() and readtup_rawheap() are a little complex,
but should work as long as there is no padding between t_len and t_self
in the HeapTupleData struct.
- It might be cleaner if you write the total item length
and tuple data separately.
- (char *) tuple +