Re: [HACKERS] Improve lseek scalability v3
2011/9/19 Matthew Wilcox :
> On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
>> * Benjamin LaHaise (b...@kvack.org) wrote:
>> > For such tables, can't Postgres track the size of the file internally? I'm
>> > assuming it's keeping file descriptors open on the tables it manages, in
>> > which case when it writes to a file to extend it, the internally stored size
>> > could be updated. Not making a syscall at all would scale far better than
>> > even a modified lseek() will perform.
>>
>> We'd have to have it in shared memory and have a lock around it, it
>> wouldn't be cheap at all.
>
> Yep, that makes perfect sense. After all, the kernel does basically the
> same thing to maintain this information; why should we have userspace
> duplicating the same infrastructure?
>
> I must admit, I'd never heard of this usage of lseek to get the current
> size of a file before; I'd assumed everybody used fstat. Given this
> legitimate reason for a high-frequency calling of lseek, I withdraw my
> earlier objection to the patch series.

I really don't understand the approach here. An improvement is an
improvement; do we need a use case to add an improvement to the kernel?
We are not talking about adding a new syscall or making an ABI change in
this case.

So my absolute ack to these patches.

Marco

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve lseek scalability v3
On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
> * Benjamin LaHaise (b...@kvack.org) wrote:
> > For such tables, can't Postgres track the size of the file internally? I'm
> > assuming it's keeping file descriptors open on the tables it manages, in
> > which case when it writes to a file to extend it, the internally stored size
> > could be updated. Not making a syscall at all would scale far better than
> > even a modified lseek() will perform.
>
> We'd have to have it in shared memory and have a lock around it, it
> wouldn't be cheap at all.

Yep, that makes perfect sense. After all, the kernel does basically the
same thing to maintain this information; why should we have userspace
duplicating the same infrastructure?

I must admit, I'd never heard of this usage of lseek to get the current
size of a file before; I'd assumed everybody used fstat. Given this
legitimate reason for a high-frequency calling of lseek, I withdraw my
earlier objection to the patch series.

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: [HACKERS] Improve lseek scalability v3
On Mon, Sep 19, 2011 at 8:31 AM, Stephen Frost wrote:
> * Benjamin LaHaise (b...@kvack.org) wrote:
>> For such tables, can't Postgres track the size of the file internally? I'm
>> assuming it's keeping file descriptors open on the tables it manages, in
>> which case when it writes to a file to extend it, the internally stored size
>> could be updated. Not making a syscall at all would scale far better than
>> even a modified lseek() will perform.
>
> We'd have to have it in shared memory and have a lock around it, it
> wouldn't be cheap at all.

In theory, we could implement a lock-free cache. But I still think it
would be better to see this fixed on the kernel side. If we had some
evidence that all of those lseek() calls were a performance problem even
when the i_mutex is not seriously contended, then that would be a good
argument for doing this in user-space, but I haven't seen any such
evidence. On the other hand, the numbers I posted show that when i_mutex
IS contended, it can cause a throughput regression of up to 90%. That
seems worth fixing.

If it turns out that lseek() is too expensive even in the uncontended
case or with the i_mutex contention removed (or if the Linux community
is unwilling to accept the proposed fix), then we can (and should) look
at further optimizing it within PostgreSQL. My guess, though, is that an
unlocked lseek will be fast enough that we won't need to worry about
installing our own caching infrastructure (or at least, there will be
plenty of more significant performance problems to hunt down first).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Improve lseek scalability v3
* Benjamin LaHaise (b...@kvack.org) wrote:
> For such tables, can't Postgres track the size of the file internally? I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored size
> could be updated. Not making a syscall at all would scale far better than
> even a modified lseek() will perform.

We'd have to have it in shared memory and have a lock around it, it
wouldn't be cheap at all.

	Thanks,

		Stephen
Re: [HACKERS] Improve lseek scalability v3
On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
> many tuples does the table have. Those statistics are only updated every now
> and then though.
> So it uses those old stats to check how many tuples are normally stored on a
> page and then uses that to extrapolate the number of tuples from the current
> number of pages (which is computed by lseek(SEEK_END) over the 1GB segments
> of a table).
>
> I am not sure how interested you are in the relevant postgres internals?

For such tables, can't Postgres track the size of the file internally? I'm
assuming it's keeping file descriptors open on the tables it manages, in
which case when it writes to a file to extend it, the internally stored size
could be updated. Not making a syscall at all would scale far better than
even a modified lseek() will perform. Granted, I've not looked at the
Postgres code at all.

		-ben
Re: [HACKERS] Improve lseek scalability v3
On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
> I sent an email containing benchmarks from Robert Haas regarding the Subject.
> Looking at lkml.org I can't see it right now, Will recheck when I am at home.
>
> He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times the
> lseek performance. The workload was 64 clients hammering postgres with a
> simple readonly workload (pgbench -S).

Yay! Data!

> For reference see the thread in the postgres archives which also links to
> performance data: http://archives.postgresql.org/message-
> id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=vkjpknrxk2ktdr+hsa...@mail.gmail.com

So both fstat and lseek do more work than postgres wants. lseek modifies
the file pointer while fstat copies all kinds of unnecessary information
into userspace. I imagine this is the source of the slowdown seen in the
1-client case. There have been various proposals to make the amount of
information returned by fstat limited to the 'cheap' (for various
definitions of 'cheap') fields.

I'd like to dig into the requirement for knowing the file size a little
better. According to the blog entry it's used for "the query planner".
Does the query planner need to know the exact number of bytes in the
file, or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: [HACKERS] Improve lseek scalability v3
> One other thing we're interested in is portability. I mean, even if
> Linux were to introduce a new hypothetical syscall that was able to
> return the file size at a ridiculously low cost, we probably wouldn't
> use it because it'd be Linux-specific. So an improvement of lseek()
> seems to be the best option.

Fully agreed. It doesn't make any sense at all to implement special
syscalls just to work around a basic system call not scaling.

-Andi
Re: [HACKERS] Improve lseek scalability v3
On Fri, Sep 16, 2011 at 9:08 PM, Benjamin LaHaise wrote:
> For such tables, can't Postgres track the size of the file internally? I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored size
> could be updated. Not making a syscall at all would scale far better than
> even a modified lseek() will perform.

There's no hardwired limit on how many tables you can have in a
database, and it's not limited by the number of file descriptors.
Postgres would have to keep some kind of LRU for recently opened files
and their sizes, or something like that. There would probably still be a
lot of lseeks/fstats going on.

Keeping a cached value for the size inside Postgres would also raise a
reliability issue. It's much safer to have a single authoritative value
-- the actual length of the file -- than to have the same value stored
in two locations and then need to worry about them getting out of sync.
If a write fails when extending the file because the filesystem ran out
of space, for example, Postgres might not know how to update its
internal cached state accurately. There's no question it could be done,
but it's not clear it would necessarily be much faster than a lock-free
lseek/fstat.

On Fri, Sep 16, 2011 at 6:27 PM, Andres Freund wrote:
> It depends on where the information is used. For some of the uses it needs to
> be exact (the assumed size is rechecked after acquiring a lock preventing
> extension)

Fwiw this might give the wrong impression. I don't believe scans acquire
a lock preventing extension -- that is, another process can be
concurrently extending the file at the same time as the scan is
proceeding. The scan only locks out truncation (vacuum). Any blocks
added by another process are ignored by the scan because they can only
contain records invisible to that transaction.

This does depend on the lseek/fstat being done after the transaction
snapshot is taken, which is possibly "rechecking" the value taken by the
query planner, but they're really two independent things.

--
greg
Re: [HACKERS] Improve lseek scalability v3
On Friday, September 16, 2011 11:02:38 PM Andres Freund wrote:
> Also with fstat() instead of lseek() there was no bottleneck anymore, so I
> don't think the benefits would warrant that.

At least that's what I observed on a 4 x 6 machine without the patch
applied (can't reboot it). That shouldn't be concurrency relevant so...

Andres
Re: [HACKERS] Improve lseek scalability v3
On Friday, September 16, 2011 10:08:17 PM Benjamin LaHaise wrote:
> On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
> > many tuples does the table have. Those statistics are only updated every
> > now and then though.
> > So it uses those old stats to check how many tuples are normally stored
> > on a page and then uses that to extrapolate the number of tuples from
> > the current number of pages (which is computed by lseek(SEEK_END) over
> > the 1GB segments of a table).
> >
> > I am not sure how interested you are in the relevant postgres internals?
>
> For such tables, can't Postgres track the size of the file internally? I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored
> size could be updated. Not making a syscall at all would scale far better
> than even a modified lseek() will perform.

Yes, it tracks the fds internally. The problem is that postgres is
process based, so those tables are not reachable by all processes. It
could start tracking them in shared memory, but the synchronization
overhead for that would likely be more expensive than the syscall
overhead. Given that the fd sets are possibly (and realistically)
disjoint between the individual backends, you would have to reserve
enough shared memory for fully separate fd sets for each process, which
would complicate efficient lookup.

Also, with fstat() instead of lseek() there was no bottleneck anymore,
so I don't think the benefits would warrant that.

Greetings,

Andres
Re: [HACKERS] Improve lseek scalability v3
Excerpts from Andres Freund's message of vie sep 16 14:27:33 -0300 2011:
> Hi,
> On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:
> > Does the query planner need to know the exact number of bytes in the file,
> > or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
> It depends on where the information is used. For some of the uses it needs to
> be exact (the assumed size is rechecked after acquiring a lock preventing
> extension) at other places I guess it would be ok if the accuracy got lower
> with bigger files (those files won't ever get bigger than 1GB).

One other thing we're interested in is portability. I mean, even if
Linux were to introduce a new hypothetical syscall that was able to
return the file size at a ridiculously low cost, we probably wouldn't
use it because it'd be Linux-specific. So an improvement of lseek()
seems to be the best option.

--
Álvaro Herrera
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Improve lseek scalability v3
Hi,

On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:
> On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
> > I sent an email containing benchmarks from Robert Haas regarding the
> > Subject. Looking at lkml.org I can't see it right now, Will recheck when
> > I am at home.
> >
> > He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times
> > the lseek performance.
> > The workload was 64 clients hammering postgres with a simple readonly
> > workload (pgbench -S).
> Yay! Data!
> > For reference see the thread in the postgres archives which also links to
> > performance data: http://archives.postgresql.org/message-
> > id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=vkjpknrxk2ktdr+hsa...@mail.gmail.com
> So both fstat and lseek do more work than postgres wants. lseek modifies
> the file pointer while fstat copies all kinds of unnecessary information
> into userspace. I imagine this is the source of the slowdown seen in
> the 1-client case.

Yes, that was my theory as well.

> I'd like to dig into the requirement for knowing the file size a little
> better. According to the blog entry it's used for "the query planner".

It's used for multiple things, one of which is the query planner. The
query planner needs to know how many tuples a table has to produce a
sensible plan. For that it has stats which tell 1. how big the table is
and 2. how many tuples the table has. Those statistics are only updated
every now and then though. So it uses those old stats to check how many
tuples are normally stored on a page and then uses that to extrapolate
the number of tuples from the current number of pages (which is computed
by lseek(SEEK_END) over the 1GB segments of a table).

I am not sure how interested you are in the relevant postgres internals?

> Does the query planner need to know the exact number of bytes in the file,
> or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?

It depends on where the information is used. For some of the uses it
needs to be exact (the assumed size is rechecked after acquiring a lock
preventing extension); at other places I guess it would be ok if the
accuracy got lower with bigger files (those files won't ever get bigger
than 1GB). But I have a hard time seeing an implementation where an
approximate size would be faster to get than just the file size?

Andres