Re: [HACKERS] Multi CPU Queries - Feedback and/or suggestions wanted!
I can confirm that bringing the Postgres code to a multi-threaded implementation requires quite a bit of groundwork. I have been working for a long while with a Postgres 7.* fork that uses pthreads rather than processes. The effort to make all the subsystems thread-safe took some time and touched almost every section of the codebase.

I recently spent some time trying to optimize for Chip Multi-Threading systems, but focused more on total throughput than on single-query performance. The biggest wins came from changing some coarse-grained locks in the page buffering system to a finer-grained implementation. I also tried to improve single-query performance by splitting index and sequential scans into two threads, one to fault in pages and check tuple visibility and the other for everything else. My success was limited, and it was hard for me to work the proper costing into the query optimizer so that it fired at the right times. One place where multiple threads really helped was index building.

My code is poorly commented and the build system is a mess (I am only building 64-bit SPARC for embedding into another app). However, I am using it in production, and the source is available if it's of any help.

http://weaver2.dev.java.net

Myron Scott

On Oct 20, 2008, at 11:28 PM, Chuck McDevitt wrote:

> There is a problem trying to make Postgres do these things in parallel. The backend code isn't thread-safe, so doing a multi-threaded implementation requires quite a bit of work.
>
> Using multiple processes has its own problems: the whole way locking works equates one process with one transaction (the proc table is one entry per process). Processes would conflict on locks, deadlocking themselves, as well as many other problems.
>
> It's all a good idea, but the work is probably far more than you expect. Async I/O might be easier if you used pthreads, which are mostly portable, but not to all platforms. (Yes, they do work on Windows.)
>
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeffrey Baker
> Sent: 2008-10-20 22:25
> To: Julius Stroffek
> Cc: pgsql-hackers@postgresql.org; Dano Vojtek
> Subject: Re: [HACKERS] Multi CPU Queries - Feedback and/or suggestions wanted!
>
> On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek <[EMAIL PROTECTED]> wrote:
>> Topics that seem to be of interest, most of which were already discussed at the developers meeting in Ottawa, are
>> 1.) parallel sorts
>> 2.) parallel query execution
>> 3.) asynchronous I/O
>> 4.) parallel COPY
>> 5.) parallel pg_dump
>> 6.) using threads for parallel processing
>> [...]
>> 2.) Different subtrees (or nodes) of the plan could be executed in parallel on different CPUs, and the results of these subtrees could be requested either synchronously or asynchronously.
>
> I don't see why multiple CPUs can't work on the same node of a plan. For instance, consider a node involving a scan with an expensive condition, like UTF-8 string length. If you have four CPUs you can bring to bear, each CPU could take every fourth page, computing the expensive condition for each tuple in that page. The results of the scan can be retired asynchronously to the next node above.
>
> -jwb
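[Editorial sketch] To make the "every fourth page" idea above concrete, here is a minimal, self-contained sketch (not PostgreSQL code): NWORKERS threads stripe over the pages of a table, each evaluating a stand-in for the expensive per-tuple condition on its own pages only. Everything here (expensive_condition, the page and tuple counts) is hypothetical.

/*
 * Each of NWORKERS threads takes every NWORKERS-th page of a scan and
 * evaluates an expensive per-tuple condition for the tuples on that page.
 * expensive_condition() is a stand-in for something like UTF-8 length.
 */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS        4
#define NPAGES          1024
#define TUPLES_PER_PAGE 100

/* Stand-in for an expensive per-tuple condition. */
static int expensive_condition(int pageno, int tup)
{
    return ((pageno * TUPLES_PER_PAGE + tup) % 7) == 0;
}

typedef struct { int id; long matches; } WorkerArg;

static void *scan_worker(void *varg)
{
    WorkerArg *arg = varg;

    /* Worker i visits pages i, i+NWORKERS, i+2*NWORKERS, ... */
    for (int pageno = arg->id; pageno < NPAGES; pageno += NWORKERS)
        for (int tup = 0; tup < TUPLES_PER_PAGE; tup++)
            if (expensive_condition(pageno, tup))
                arg->matches++;
    return NULL;
}

int main(void)
{
    pthread_t th[NWORKERS];
    WorkerArg args[NWORKERS];
    long total = 0;

    for (int i = 0; i < NWORKERS; i++)
    {
        args[i] = (WorkerArg){ .id = i, .matches = 0 };
        pthread_create(&th[i], NULL, scan_worker, &args[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
    {
        pthread_join(th[i], NULL);
        total += args[i].matches;
    }
    printf("matching tuples: %ld\n", total);
    return 0;
}

The results from each worker's stripe would still have to be retired to the parent node in some agreed order; the striping itself is the easy part.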
Re: Discontent with development process (was:Re: [HACKERS] pgaccess - the discussion is over)
Tom Lane wrote:

> Hannu Krosing <[EMAIL PROTECTED]> writes:
>
>> What would your opinion be of some hack with macros, like
>>
>> #if (Win32 or THREADED)
>> #define GLOBAL_ pg_globals.
>> #else
>> #define GLOBAL_
>> #endif
>>
>> and then use global variables as
>>
>> GLOBAL_globvar
>>
>> At least in my opinion that would increase both readability and
>> maintainability.
>
> From a code readability viewpoint this is not at all better than just
> moving everything to pg_globals.  You're only spelling "pg_globals."
> a little differently.  And it introduces twin possibilities for error:
> omitting GLOBAL_ (if you're a Unix developer) or writing pg_globals.
> explicitly (if you're a Win32 guy).  I suppose these errors would be
> caught as soon as someone tried to compile on the other platform, but
> it still seems like a mess with little redeeming value.

Another suggestion might be to create a global hashtable that stores the size of and a pointer to a globals structure for each subsection. Each subsection could define its own globals structure and register it with the hashtable. This would not impact readability and would make the global environment easy to copy. IMHO, this is possible with minimal performance impact.

Myron Scott
[EMAIL PROTECTED]
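[Editorial sketch] A rough sketch of what that per-subsection registration could look like, assuming hypothetical names (GlobalEntry, register_globals, snapshot_globals) and a fixed array in place of a real hashtable; the point is only that each subsection hands a manager an address and a size, which makes the whole set trivially copyable.

/*
 * Each subsection registers its globals struct once; snapshot_globals()
 * can then copy every registered struct without knowing their layouts.
 * All names are invented for illustration.
 */
#include <stdio.h>
#include <string.h>

typedef struct GlobalEntry
{
    const char *name;   /* subsection name, e.g. "bufmgr" */
    void       *addr;   /* address of that subsection's globals struct */
    size_t      size;   /* size of the struct, so it can be copied */
} GlobalEntry;

#define MAX_SUBSECTIONS 64
static GlobalEntry registry[MAX_SUBSECTIONS];
static int         nregistered = 0;

/* Called once by each subsection, e.g. from its init routine. */
static int register_globals(const char *name, void *addr, size_t size)
{
    if (nregistered >= MAX_SUBSECTIONS)
        return -1;
    registry[nregistered++] = (GlobalEntry){ name, addr, size };
    return 0;
}

/* Copy every registered struct into a flat buffer. */
static size_t snapshot_globals(char *dest, size_t destlen)
{
    size_t off = 0;

    for (int i = 0; i < nregistered; i++)
    {
        if (off + registry[i].size > destlen)
            break;
        memcpy(dest + off, registry[i].addr, registry[i].size);
        off += registry[i].size;
    }
    return off;
}

/* Example subsection-private globals struct. */
static struct { int NBuffers; int FlushInterval; } bufmgr_globals = { 64, 5 };

int main(void)
{
    char buf[1024];

    register_globals("bufmgr", &bufmgr_globals, sizeof(bufmgr_globals));
    printf("copied %zu bytes of registered globals\n",
           snapshot_globals(buf, sizeof(buf)));
    return 0;
}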
Re: Discontent with development process (was:Re: [HACKERS] pgaccess - the discussion is over)
Tom Lane wrote:

> With a little more intelligence in the manager of this table, this could
> also solve my concern about pointer variables.  Perhaps the entries
> could include not just address/size but some type information.  If the
> manager knows "this variable is a pointer to a palloc'd string" then it
> could do the Right Thing during fork.  Not sure offhand what the
> categories would need to be, but we could derive those if anyone has
> cataloged the variables that get passed down from postmaster to children.
>
> I don't think it needs to be a hashtable --- you wouldn't ever be doing
> lookups in it, would you?  Just a simple list of things-to-copy ought to
> do fine.

I'm thinking in a threaded context, where a method may need to look up a global that is not passed in. But for copying, I suppose no lookups would be necessary.

Myron Scott
[EMAIL PROTECTED]
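[Editorial sketch] A minimal sketch of that typed things-to-copy list, assuming just two hypothetical categories (raw memory vs. a pointer to an allocated string); strdup() stands in for whatever allocator the real backend would use when handing state to a child.

/*
 * Each entry says how it should be copied: COPY_RAW is a byte-for-byte
 * copy, COPY_STRING_POINTER duplicates the string so the child gets its
 * own allocation.  The variables and categories are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { COPY_RAW, COPY_STRING_POINTER } CopyKind;

typedef struct CopyEntry
{
    CopyKind    kind;
    void       *addr;   /* address of the variable itself */
    size_t      size;   /* used only for COPY_RAW */
} CopyEntry;

/* Do the "Right Thing" for one entry when handing state to a child. */
static void copy_entry(const CopyEntry *e, void *dest)
{
    if (e->kind == COPY_RAW)
        memcpy(dest, e->addr, e->size);
    else
    {
        /* e->addr is a char **; duplicate the string it points to. */
        char **src = e->addr;
        *(char **) dest = *src ? strdup(*src) : NULL;
    }
}

static int   MaxBackends = 32;
static char *DataDir     = NULL;

int main(void)
{
    DataDir = strdup("/usr/local/pgsql/data");

    CopyEntry list[] = {
        { COPY_RAW,            &MaxBackends, sizeof(MaxBackends) },
        { COPY_STRING_POINTER, &DataDir,     0                   },
    };

    int   child_max_backends;
    char *child_datadir;

    copy_entry(&list[0], &child_max_backends);
    copy_entry(&list[1], &child_datadir);

    printf("child copy: MaxBackends=%d DataDir=%s\n",
           child_max_backends, child_datadir);
    free(child_datadir);
    free(DataDir);
    return 0;
}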
Re: [HACKERS] Spinlock performance improvement proposal
On Wed, 26 Sep 2001, mlw wrote:

> I can only think of two objectives for threading. (1) running the various
> connections in their own thread instead of their own process. (2) running
> complex queries across multiple threads.

I did a multi-threaded version of 7.0.2 using Solaris threads about a year ago, in order to try and get multiple backend connections working under one Java process using JNI. I used the thread-per-connection model. I eventually got it working, but it was/is very messy (there were global variables everywhere!). Anyway, I was able to get a pretty good speedup on inserts by scheduling buffer writes from multiple connections on one common writing thread.

I also got some other features that were important to me at the time:

1. True prepared statements under Java with bound input and output variables
2. Better system utilization
   a. fewer Solaris lightweight processes mapped to threads
   b. fewer open files per Postgres installation
3. Automatic vacuums by a daemon thread when system activity is low

but there were some drawbacks... One rogue thread or bad user function could take down all connections for that process. This was, and seems to still be, the major drawback to using threads.

Myron Scott
[EMAIL PROTECTED]
Re: [HACKERS] Spinlock performance improvement proposal
> But note that Myron did a number of things that are (IMHO) orthogonal

yes, I did :)

> to process-to-thread conversion, such as adding prepared statements,
> a separate thread/process/whateveryoucallit for buffer writing, ditto
> for vacuuming, etc.  I think his results cannot be taken as indicative
> of the benefits of threads per se --- these other things could be
> implemented in a pure process model too, and we have no data with which
> to estimate which change bought how much.

If you are comparing just process vs. thread, I really don't think I gained much in performance, and I ended up with some pretty unmanageable code. The one thing that led to most of the gains was scheduling all the writes to one thread, which, as noted by Tom, you could do in the process model as well. Besides, most of the advantage in doing this was taken away by the addition of WAL in 7.1.

The other real gain that I saw with threading was limiting the number of open files, but that led me to alter much of the file manager in order to synchronize access to the files, which probably slowed things a bit.

To be honest, I don't think I, personally, would try this again. I went pretty far off the beaten path with this thing. It works well for what I am doing (a limited number of SQL statements run many times over), but there probably was a better way. I'm thinking now that I should have tried to add a CORBA interface for connections. I would have been able to accomplish my original goals without creating a dead end for myself.

Thanks all for a great project,

Myron
[EMAIL PROTECTED]
Re: [HACKERS] Using Threads?
I may be wrong, but I think that PGSQL is not threaded mostly for historical reasons. It looks to me like the source has developed over time such that much of it is not reentrant, with many global variables throughout. In addition, the parser is generated by flex, which can be made to generate reentrant code but is still not thread-safe because global variables are used.

That being said, I experimented with the 7.0.2 source and came up with a multithreaded backend for PGSQL which uses Solaris threads. It seems to work, but I drifted very far from the original source. I had to hack flex to generate thread-safe code as well. I use it as a linked library with my own fe<->be protocol. This ended up being much, much more than I bargained for, and looking back, I probably would not have tried had I known any better.

Myron Scott

On Mon, 27 Nov 2000, Junfeng Zhang wrote:

> Hello all,
>
> I am new to postgreSQL. When I read the documents, I find out the Postmaster
> daemon actually spawns a new backend server process to serve a new client
> request. Why not use threads instead? Is that just for a historical reason,
> or some performance/implementation concern?
>
> Thank you very much.
> Junfeng
Re: [HACKERS] Using Threads?
I would love to distribute this code to anybody who wants it. Any suggestions for a good place?

However, calling the work a code redesign is a bit generous. This was more like a brute-force hack. I just moved all the connection-related global variables to a thread-local "environment variable" and bypassed much of the postmaster code. I did this so I could port my app, which was originally designed for Oracle OCI and Java. My app uses very few SQL statements but uses them over and over. I wanted true prepared statements linked to Java with JNI. I got both, as well as batched transaction writes (which was more relevant before WAL). In my situation, threads seemed much more flexible to implement, and I probably could not have done the port without them.

Myron

On Mon, 4 Dec 2000, Ross J. Reedstrom wrote:

> Myron -
> Putting aside the fork/threads discussion for a moment (the reasons,
> both historical and other, such as inter-backend protection, are well
> covered in the archives), the work you did sounds like an interesting
> experiment in code redesign. Would you be willing to release the hacked
> code somewhere for others to learn from? Hacking flex to generate
> thread-safe code is of itself interesting, and the question about PG and
> threads comes up so often, that an example of why it's not a simple task
> would be useful.
>
> Ross
Re: [HACKERS] Using Threads?
For anyone interested,

I have posted my multi-threaded version of PostgreSQL here:

http://www.sacadia.com/mtpg.html

It is based on 7.0.2 and the TAO CORBA ORB, which is here:

http://www.cs.wustl.edu/~schmidt/TAO.html

Myron Scott
[EMAIL PROTECTED]
Re: [HACKERS] Using Threads?
spinlocks rewritten to mutex_
locktable uses sema_
some cond_ in bufmgr.c

Myron

Karel Zak wrote:

> On Mon, 1 Jan 2001, Myron Scott wrote:
>
>> For anyone interested,
>>
>> I have posted my multi-threaded version of PostgreSQL here.
>>
>> http://www.sacadia.com/mtpg.html
>
> How do you solve locks? Via the original IPC, or did you rewrite them to use mutexes (etc.)?
>
> Karel
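[Editorial sketch] For readers unfamiliar with the Solaris threads API, a rough sketch of the shape of those replacements (mutex_ for spinlocks, sema_ for the lock table, cond_ for bufmgr waits). This is Solaris-specific, the variable names are invented, and it only illustrates the mapping, not the actual patch.

/*
 * Solaris threads (<synch.h>) versions of the three primitives mentioned
 * above: a mutex where a TAS spinlock used to be, a counting semaphore
 * for a lock-table slot, and a condition variable for a bufmgr-style
 * "wait until the buffer is valid".  Compile on Solaris with -lthread.
 */
#include <thread.h>
#include <synch.h>
#include <stdio.h>

static mutex_t buf_mgr_lock;        /* replaces a spinlock */
static sema_t  lock_slot;           /* replaces a lock-table semaphore slot */
static mutex_t buf_state_lock;      /* guards io_in_progress */
static cond_t  buf_valid;           /* "buffer is ready" wait */
static int     io_in_progress = 1;

/* Pretend another backend thread finishes the I/O and signals waiters. */
static void *completer(void *arg)
{
    mutex_lock(&buf_state_lock);
    io_in_progress = 0;
    cond_signal(&buf_valid);
    mutex_unlock(&buf_state_lock);
    return NULL;
}

int main(void)
{
    thread_t tid;

    mutex_init(&buf_mgr_lock, USYNC_THREAD, NULL);
    mutex_init(&buf_state_lock, USYNC_THREAD, NULL);
    cond_init(&buf_valid, USYNC_THREAD, NULL);
    sema_init(&lock_slot, 1, USYNC_THREAD, NULL);

    /* Critical section that used to be guarded by a spinlock. */
    mutex_lock(&buf_mgr_lock);
    printf("inside former spinlock section\n");
    mutex_unlock(&buf_mgr_lock);

    /* Lock-table style acquire/release via a counting semaphore. */
    sema_wait(&lock_slot);
    printf("holding lock-table slot\n");
    sema_post(&lock_slot);

    /* bufmgr-style wait for a buffer to become valid. */
    thr_create(NULL, 0, completer, NULL, 0, &tid);
    mutex_lock(&buf_state_lock);
    while (io_in_progress)
        cond_wait(&buf_valid, &buf_state_lock);
    mutex_unlock(&buf_state_lock);
    thr_join(tid, NULL, NULL);

    printf("buffer now valid\n");
    return 0;
}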
Re: [HACKERS] Using Threads?
Karel Zak wrote:

> On Tue, 2 Jan 2001, Myron Scott wrote:
>
>> spinlocks rewritten to mutex_
>> locktable uses sema_
>> some cond_ in bufmgr.c
>
> Interesting. Do you have some comparison between IPC PostgreSQL and
> your thread-based PostgreSQL?
>
> Karel

Yes, I did some comparisons, but it is hard to make accurate evaluations from the data. I basically did 1000 inserts from 7.0.2 and from the modified version with 8 simultaneous clients. The original 7.0.2 was faster by an order of magnitude. This needs to be looked into more, though. It was just a rough test, because the clients and the server were all running on the same machine (Ultra 10 w/ 512MB RAM).

I don't really know what the impact of changing some of the locking mechanisms is. On the one hand, there is a lot of overhead associated with using the TAO ORB as the fe<->be protocol. The 7.0.2 fe<->be protocol is pretty efficient; TAO with IIOP is not as much so. At the same time, using prepared statements when doing the same insert with different variables over and over cuts out re-parsing and planning the statement on every execute.

Lastly, I really didn't optimize my code at all. There are some places where GetEnv() is called over and over to get the thread-local variable, where it should only be called once in the method and reused. Speed wasn't the motivation; I just wanted to see if threads and PostgreSQL could be done.

Myron
Re: [HACKERS] Using Threads?
Alfred Perlstein wrote:

> It's possible what you're seeing is the entire process
> wait for a disk IO to complete.
>
> I'm wondering, how many lwps does your system use? Are all
> the threads bound to a single lwp or do you let the threads
> manager handle this all for you?

Yeah, I looked at this. I have one thread per process that does all flushing of buffer pages at transaction commit. The client threads register buffer writes with this thread and wait for it to complete the writes at transaction end. Unfortunately, selects also wait, which really isn't necessary. I hoped this would speed up simultaneous connections.

I created this both as a bound thread with its own lwp and as a threads-manager-managed thread. I eventually settled on a threads-manager-managed thread, thinking that I wanted to set the priority of this thread low and commit as many transactions as possible simultaneously. Maybe I should rethink this.

As for client threads, that is managed by TAO, and I haven't really played with that configuration.

Myron Scott
[EMAIL PROTECTED]
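[Editorial sketch] A simplified sketch of that writer-thread arrangement, written with POSIX threads rather than the Solaris threads the original used. Client threads register how many pages they dirtied and wait; the writer flushes each group with a single stand-in flush_pages() call and then wakes every waiter in the group. All names are hypothetical.

/*
 * Group-commit style writer thread: clients enqueue work and block until
 * the group containing their writes has been flushed.  flush_pages() is
 * a stand-in for the real buffer flush + fsync.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t commit_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  group_done  = PTHREAD_COND_INITIALIZER;

static int  pending_pages = 0;    /* pages registered for the upcoming flush */
static long next_group    = 0;    /* group currently accepting registrations */
static long flushed_group = -1;   /* last group whose flush has completed */
static int  shutting_down = 0;

static void flush_pages(int n) { usleep(1000 * n); }   /* pretend fsync */

static void *writer_main(void *arg)
{
    pthread_mutex_lock(&commit_lock);
    for (;;)
    {
        while (pending_pages == 0 && !shutting_down)
            pthread_cond_wait(&work_ready, &commit_lock);
        if (pending_pages == 0 && shutting_down)
            break;

        int  npages = pending_pages;
        long group  = next_group;
        pending_pages = 0;
        next_group++;
        pthread_mutex_unlock(&commit_lock);

        flush_pages(npages);            /* one flush covers the whole group */

        pthread_mutex_lock(&commit_lock);
        flushed_group = group;
        pthread_cond_broadcast(&group_done);
    }
    pthread_mutex_unlock(&commit_lock);
    return NULL;
}

/* Called by a client thread at transaction commit. */
static void commit_transaction(int dirtied_pages)
{
    pthread_mutex_lock(&commit_lock);
    pending_pages += dirtied_pages;
    long my_group = next_group;         /* my writes belong to this group */
    pthread_cond_signal(&work_ready);
    while (flushed_group < my_group)    /* wait until my group is flushed */
        pthread_cond_wait(&group_done, &commit_lock);
    pthread_mutex_unlock(&commit_lock);
}

static void *client_main(void *arg)
{
    commit_transaction(3);
    printf("transaction committed by client %ld\n", (long) (size_t) arg);
    return NULL;
}

int main(void)
{
    pthread_t writer, clients[4];

    pthread_create(&writer, NULL, writer_main, NULL);
    for (long i = 0; i < 4; i++)
        pthread_create(&clients[i], NULL, client_main, (void *) (size_t) i);
    for (int i = 0; i < 4; i++)
        pthread_join(clients[i], NULL);

    pthread_mutex_lock(&commit_lock);
    shutting_down = 1;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&commit_lock);
    pthread_join(writer, NULL);
    return 0;
}

The selects-also-wait problem mentioned above corresponds to read-only transactions calling commit_transaction() here even though they dirtied nothing; they could simply skip the wait.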
Re: [HACKERS] Using Threads?
I have put a new version of my multi-threaded PostgreSQL experiment at

http://www.sacadia.com/mtpg.html

This one actually works. I have added a server based on omniORB, a CORBA 2.3 ORB from AT&T. It is much smaller than TAO and uses the thread-per-connection model. I haven't added the Java side of the JNI interface yet, but the C++ side is there. It's still not stable, but it is much better than the last one.

Myron Scott
[EMAIL PROTECTED]
Re: [HACKERS] Using Threads
> Sorry I haven't had time to see and test your experiment,
> but I have a question. How do you solve memory management?
> The current mmgr is based on the global variable
> CurrentMemoryContext, which is very often changed and used.
> Do you use locks for this? If yes, it is probably a problematic
> point for performance.
>
> Karel

There are many, many globals I had to work around, including all the memory management stuff. I basically threw everything into an "environment" variable which I stored in a thread-specific using thr_setspecific.

Performance is actually very good for what I am doing. I was able to batch-commit transactions, which cuts down on fsync calls, and to use prepared statements from my client using CORBA, and the various locking calls for the threads (cond_wait, mutex_lock, and sema_wait) seem pretty fast.

I did some performance tests for inserts: 20 clients, 900 inserts per client, 1 insert per transaction, 4 different tables.

7.0.2            about 10:52 average completion
multi-threaded         2:42 average completion
7.1beta3               1:13 average completion

If I increased the number of inserts per transaction, multi-threaded got closer to 7.1 for inserts. I haven't tested other types of commands yet.

Myron Scott
[EMAIL PROTECTED]
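[Editorial sketch] A minimal sketch of that per-thread "environment", using the Solaris thread-specific-data calls mentioned above. The Env struct and GetEnv() are hypothetical stand-ins for whatever the real fork used, but they show how a global like CurrentMemoryContext can become a per-thread field that is read without any locking.

/*
 * Globals such as CurrentMemoryContext become fields of an Env struct
 * hung off a thread-specific key.  Compile on Solaris with -lthread.
 */
#include <thread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Env
{
    void *CurrentMemoryContext;   /* formerly a true global */
    int   XactIsoLevel;           /* ...and so on for other globals */
} Env;

static thread_key_t env_key;

static void env_destructor(void *value)
{
    free(value);
}

/* Called once at startup. */
static void InitEnvKey(void)
{
    thr_keycreate(&env_key, env_destructor);
}

/* Called at the start of each connection thread. */
static Env *CreateEnv(void)
{
    Env *env = calloc(1, sizeof(Env));

    thr_setspecific(env_key, env);
    return env;
}

/* Replacement for touching the old global directly. */
static Env *GetEnv(void)
{
    void *value = NULL;

    thr_getspecific(env_key, &value);
    return (Env *) value;
}

static void *connection_main(void *arg)
{
    Env *env = CreateEnv();

    env->CurrentMemoryContext = arg;    /* pretend this is a real context */
    printf("thread %d sees context %p\n", (int) thr_self(),
           GetEnv()->CurrentMemoryContext);
    return NULL;
}

int main(void)
{
    thread_t t1, t2;
    int a, b;

    InitEnvKey();
    thr_create(NULL, 0, connection_main, &a, 0, &t1);
    thr_create(NULL, 0, connection_main, &b, 0, &t2);
    thr_join(t1, NULL, NULL);
    thr_join(t2, NULL, NULL);
    return 0;
}

The GetEnv() calls mentioned in the earlier post correspond to the lookup here; calling it once per function and caching the result is the obvious optimization.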
Re: [HACKERS] Threads vs Processes
On Thursday, September 25, 2003, at 10:03 AM, Tom Lane wrote:

> Shridhar Daithankar <[EMAIL PROTECTED]> writes:
>> One thing that can be done is to arrange all globals/statics in a
>> structure and make that structure thread local.
>
> That's about as far from "non-invasive" as I can imagine :-(
>
> I really, really want to avoid doing anything like the above, because
> it would force us to expose to the whole backend many data structures
> and state variables that are currently local to individual .c files.
> That complicates understanding and debugging tremendously, not to
> mention slowing the edit/compile/debug cycle when you are changing
> such structures.

Another option would be to create a thread-local hashtable or other lookup structure with which you would register a structure for a particular .c file or group of files. You could then define the structures you need locally without affecting other parts of the codebase.

Myron Scott
Re: [HACKERS] Support Parallel Query Execution in Executor
Gregory Maxwell wrote:

> We should consider true parallel execution and overlapping execution
> with I/O as distinct cases. For example, one case made in this thread
> involved bursty performance with seqscans, presumably because the I/O
> was stalling while processing was being performed. In general this can
> be avoided without parallel execution through the use of non-blocking
> I/O and making an effort to keep the request pipeline full.
>
> There are other cases where it is useful to perform parallel I/O
> without parallel processing.. for example: a query that will perform an
> index lookup per row can benefit from running some number of those
> lookups in parallel in order to hide the lookup latency and give the OS
> and disk elevators a chance to make the random accesses a little more
> orderly. This can be accomplished without true parallel processing.
> (Perhaps PG does this already?)

I have done some testing more along these lines with an old fork of Postgres code (2001). In my tests, I used a thread to delegate out the actual heap scan of the SeqScan. The job of the "slave" thread was to fault in buffer pages and determine the time validity of the tuples. ItemPointers are passed back to the "master" thread via a common memory area guarded by mutex locking. The master thread is then responsible for converting the ItemPointers to HeapTuples and finishing the execution run. I added a little hack to the buffer code to force pages read into the buffer to stay at the back of the free buffer list until the master thread has had a chance to use them.

These are the parameters of my test table:

Pages 9459; Tup 961187: Live 673029, Dead 288158
Average tuple size is 70 bytes
create table test (rand int, message varchar(256))

So far I've done a couple of runs with a single query on a 2-processor machine, with the following results via dtrace.

select * from test;

CPU     ID                      FUNCTION:NAME
  1  46218               ExecEndSeqScan:return  Inline scan time 81729
  0  46216      ExecEndDelegatedSeqScan:return  Delegated scan time 59903
  1  46218               ExecEndSeqScan:return  Inline scan time 95708
  0  46216      ExecEndDelegatedSeqScan:return  Delegated scan time 58255
  1  46218               ExecEndSeqScan:return  Inline scan time 79028
  0  46216      ExecEndDelegatedSeqScan:return  Delegated scan time 50500

That is an average 34% decrease in total time using the delegated scan. A very crude, simple test, but I think it shows some promise. I know I used threads, but you could probably just as easily use a slave process and pass the ItemPointers via pipes or shared memory.

Thanks,

Myron Scott
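[Editorial sketch] A self-contained sketch of that master/slave hand-off, using POSIX threads (the original fork used Solaris threads): the slave checks visibility and pushes simplified "item pointers" into a mutex-guarded ring, and the master pops them and finishes the per-tuple work. The ItemPtr type and the page/visibility helpers are stand-ins, not PostgreSQL's real definitions.

/*
 * Single-producer / single-consumer ring of item pointers.  The slave
 * fills the ring; the master drains it and "processes" each tuple.
 */
#include <pthread.h>
#include <stdio.h>

#define NPAGES    1000
#define RING_SIZE 4096

typedef struct { int blkno; int offset; } ItemPtr;

static ItemPtr ring[RING_SIZE];
static int head = 0, tail = 0, done = 0;
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static int  page_tuple_count(int blkno)          { return 50; }
static int  tuple_is_visible(int blkno, int off) { return (blkno + off) % 3 != 0; }
static void process_tuple(ItemPtr ip)            { /* fetch tuple, run quals... */ }

/* Slave: fault in pages, check visibility, hand item pointers over. */
static void *slave_scan(void *arg)
{
    for (int blkno = 0; blkno < NPAGES; blkno++)
        for (int off = 1; off <= page_tuple_count(blkno); off++)
        {
            if (!tuple_is_visible(blkno, off))
                continue;
            pthread_mutex_lock(&ring_lock);
            while ((head + 1) % RING_SIZE == tail)   /* ring full: wait */
                pthread_cond_wait(&not_full, &ring_lock);
            ring[head] = (ItemPtr){ blkno, off };
            head = (head + 1) % RING_SIZE;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&ring_lock);
        }
    pthread_mutex_lock(&ring_lock);
    done = 1;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&ring_lock);
    return NULL;
}

int main(void)
{
    pthread_t slave;
    long nreturned = 0;

    pthread_create(&slave, NULL, slave_scan, NULL);
    for (;;)
    {
        pthread_mutex_lock(&ring_lock);
        while (tail == head && !done)
            pthread_cond_wait(&not_empty, &ring_lock);
        if (tail == head && done)
        {
            pthread_mutex_unlock(&ring_lock);
            break;
        }
        ItemPtr ip = ring[tail];
        tail = (tail + 1) % RING_SIZE;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&ring_lock);

        process_tuple(ip);                           /* master does the rest */
        nreturned++;
    }
    pthread_join(slave, NULL);
    printf("tuples returned: %ld\n", nreturned);
    return 0;
}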
Re: [HACKERS] Support Parallel Query Execution in Executor
On Apr 8, 2006, at 10:29 PM, Luke Lonergan wrote:

> Myron,
>
> First, this sounds really good!
>
> On 4/8/06 9:54 PM, "Myron Scott" <[EMAIL PROTECTED]> wrote:
>
>> I added a little hack to the buffer code to force pages read into the
>> buffer to stay at the back of the free buffer list until the master
>> thread has had a chance to use it.
>
> This is the part I'm curious about - is this using the shared_buffers
> region in a circular buffer fashion to store pre-fetched pages?

Yes. That is basically what the slave thread is trying to do, as well as weed out any tuples/pages that don't need to be looked at due to dead tuples.

I did several things to try and ensure that a buffer needed by the master thread would not be pulled out of the buffer pool before it was seen by the master. I wanted to do this without holding the buffer pinned, so I made this change to the buffer free list:

static void
AddBufferToFreelist(BufferDesc *bf)
{
    S_LOCK(&SLockArray[FreeBufMgrLock]);
    int movebehind = SharedFreeList->freePrev;
    /* find the right spot with bias */
    while (BufferDescriptors[movebehind].bias > bf->bias)
    {
        movebehind = BufferDescriptors[movebehind].freePrev;
    }
...

The bias number is removed the next time the buffer is pulled out of the free list. Also, I force an ItemPointer transfer when the ItemPointer transfer list is full (currently 4096) or when 10% of the buffer pool has been affected by the slave thread. Lastly, if the slave thread gets too far ahead of the master thread, it waits for the master to catch up. To my knowledge, this hasn't happened yet.

> One thing I've wondered about is: how much memory is required to get
> efficient overlap?  Did you find that you had to tune the amount of
> buffer memory to get the performance to work out?

I haven't done much tuning yet. I think there is an optimal balance that I most likely haven't found yet.

Myron Scott
Re: [HACKERS] Support Parallel Query Execution in Executor
On Apr 9, 2006, at 9:26 AM, Martijn van Oosterhout wrote:

> On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
>>> This is the part I'm curious about - is this using the shared_buffers
>>> region in a circular buffer fashion to store pre-fetched pages?
>>
>> Yes. That is basically what the slave thread is trying to do. As well
>> as weed out any tuples/pages that don't need to be looked at due to
>> dead tuples.
>>
>> I did several things to try and insure that a buffer needed by the
>> master thread would not be pulled out of the buffer pool before it was
>> seen by the master. I wanted to do this without holding the buffer
>> pinned, so I did the change to the buffer free list to do this.
>
> Is this necessary? I mean, what's the chance that a page might get
> thrown out early? And if so, what's the chance that page will still be
> in the OS cache? The cost of fetching a page from the OS is not really
> much of an overhead, so I'd like to know how much benefit these buffer
> cache hacks actually produce.

You may be right on this one. I wanted to ensure that I didn't lose pages I needed, but I may have just added a belt to my suspenders. I'll add a switch to turn it off and on and try to devise a test to see what the costs are either way.

Myron Scott
Re: [HACKERS] Support Parallel Query Execution in Executor
On Mon, 2006-04-10 at 02:16, Martijn van Oosterhout wrote:

> There appear to be two separate cases here though: one is to just farm
> out the read request to another process (basically aio), the other is
> to do actual processing there. The latter is obviously more useful
> but requires a fair bit more infrastructure.

I ran some tests to see where time is spent during SeqScans. I did the following.

tester=# vacuum analyze verbose test;
INFO:  vacuuming "public.test"
INFO:  "test": found 0 removable, 727960 nonremovable row versions in 5353 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.18s/0.27u sec elapsed 0.91 sec.
INFO:  analyzing "public.test"
INFO:  "test": scanned 3000 of 5353 pages, containing 407952 live rows and 0 dead rows; 3000 rows in sample, 727922 estimated total rows
VACUUM

tester=# select version();
                                    version
--------------------------------------------------------------------------------
 PostgreSQL 8.2devel on sparc-sun-solaris2.11, compiled by GCC gcc (GCC) 3.3.2
(1 row)

tester=# select count(random) from test;
 count
--------
 727960
(1 row)

With the following dtrace results:

# ./probediff2.d 514607
dtrace: script './probediff2.d' matched 10 probes
CPU     ID                      FUNCTION:NAME
  0  46811               ExecEndSeqScan:return  scan time 20406
^C

smgrread                                          641566800
Virtualized - smgrread                            439798800
smgread - Call Count                                    5353
HeapTupleSatisfiesSnapshot                        6735471000
Virtualized - HeapTupleSatisfiesSnapshot          3516556800
HeapTupleSatisfiesSnapshot - Call Count               727960
Virtualized - ReadBuffer                           558230600
ReadBuffer                                         864931000
Virtualized - ExecutePlan                         7331181400
Virtualized - ExecSeqScan                         7331349600
ExecutePlan                                      20405943000
ExecSeqScan                                      20406161000

The virtualized times are supposed to be the actual time spent on the CPU, with the time spent in the probe factored out. It seems here that half the time in the SeqScan is spent validating the tuples (about 3.5 s of the 7.3 s of on-CPU ExecSeqScan time is in HeapTupleSatisfiesSnapshot), as opposed to roughly a tenth doing I/O (about 0.4 s in smgrread). I'm not sure that just farming out read I/O is going to be all that helpful in this situation. That's why I think it's a good idea to create a slave process that prefetches pages and transfers valid ItemPointers to the master.

There may not be much to be gained on simple SeqScans; however, in complex queries that include a SeqScan, you may gain a lot by offloading this work onto a slave thread. A table with TOAST'ed attributes comes to mind. The slave thread could be working away on the rest of the table while the master is PG_DETOAST_DATUM'ing the attributes for transmission back to the client or additional processing. Am I missing something in this analysis?

I've attached my dtrace script.

Myron Scott

#!/usr/sbin/dtrace -s

pid$1::ExecInitSeqScan:entry
{
    ts = timestamp;
    vts = vtimestamp;
    timeon = 1;
}

pid$1::ExecEndSeqScan:return
/ts/
{
    printf("scan time %d", (timestamp - ts) / 1000000);
    @val["ExecSeqScan"] = sum(timestamp - ts);
    @val["Virtualized - ExecSeqScan"] = sum(vtimestamp - vts);
    ts = 0;
    vts = 0;
    timeon = 0;
}

pid$1::HeapTupleSatisfiesSnapshot:entry
/timeon/
{
    validity = timestamp;
    vvalidity = vtimestamp;
}

pid$1::HeapTupleSatisfiesSnapshot:return
/validity/
{
    @val["HeapTupleSatisfiesSnapshot"] = sum(timestamp - validity);
    @val["Virtualized - HeapTupleSatisfiesSnapshot"] = sum(vtimestamp - vvalidity);
    @val["HeapTupleSatisfiesSnapshot - Call Count"] = sum(1);
    validity = 0;
    vvalidity = 0;
}

pid$1::smgrread:entry
/timeon/
{
    rt = timestamp;
    vrt = vtimestamp;
}

pid$1::smgrread:return
/rt/
{
    @val["smgrread"] = sum(timestamp - rt);
    @val["Virtualized - smgrread"] = sum(vtimestamp - vrt);
    @val["smgread - Call Count"] = sum(1);
    rt = 0;
    vrt = 0;
}
Re: [HACKERS] Support Parallel Query Execution in Executor
On Tue, 2006-04-11 at 07:47, Myron Scott wrote:

> ... client or additional processing. Am I missing something in this
> analysis?
>
> I've attached my dtrace script.

To answer my own question, I suppose my processors are relatively slow compared to most setups.

Myron Scott