[HACKERS] FDW: should GetFdwRoutine be called when drop table?
Hi, Hackers, I have an FDW where each foreign table acquires some persisted resources; in my case, some files in the file system. To drop a table cleanly, I have written an object_access_hook that removes those files. The hook is installed in _PG_init. It all worked well except in one case. Suppose a user logs in and the very first command is DROP FOREIGN TABLE. Dropping a foreign table will not load the module, so the hook is not installed and the files are not properly cleaned up. Should DROP FOREIGN TABLE call GetFdwRoutine? _PG_init is the only entry point I know of for registering hooks, so I feel we need to trigger a load for all DML/DDL on an FDW, including DROP. Does this make sense? Thanks, Feng
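The failure mode described above can be sketched outside the server. This is a toy Python model (all names are illustrative, none of this is the PostgreSQL C API): hooks registered on module load never fire if the DROP path never triggers a load.

```python
# Toy model of the reported problem: a cleanup hook that an FDW installs
# in its _PG_init() analogue only runs if something already loaded the
# module. A fresh session whose first command is DROP never loads it.

drop_hooks = []          # stands in for the object_access_hook slot
module_loaded = False
files = {"tbl_a": ["tbl_a.dat"]}   # persisted per-table resources

def pg_init():           # runs on module load (_PG_init analogue)
    global module_loaded
    module_loaded = True
    drop_hooks.append(lambda tbl: files.pop(tbl, None))

def get_fdw_routine():   # any FDW call path would load the module
    if not module_loaded:
        pg_init()

def drop_foreign_table(tbl, *, calls_fdw_routine):
    if calls_fdw_routine:
        get_fdw_routine()        # the proposal: force a load on DROP
    for hook in drop_hooks:      # run whatever hooks are installed
        hook(tbl)

# First command of a fresh session is DROP: the hook was never installed.
drop_foreign_table("tbl_a", calls_fdw_routine=False)
print(files)   # {'tbl_a': ['tbl_a.dat']} -- the files leak
```

If DROP forced a load first (`calls_fdw_routine=True`), the hook would be in place and the files would be removed, which is the behavior the mail asks for.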
Re: [HACKERS] Bug with int2
Ah, thanks! On Tue, Feb 16, 2016 at 7:54 PM, Peter Geoghegan <p...@heroku.com> wrote: > On Tue, Feb 16, 2016 at 7:27 PM, Feng Tian <ft...@vitessedata.com> wrote: > > I ran into the following. Seems this is a bug for -32768, which should be > > a valid smallint value. > > This isn't a bug. You see the error only due to operator precedence: > > postgres=# select (-32768)::int2; > int2 > -------- > -32768 > (1 row) > > -- > Peter Geoghegan >
[HACKERS] Bug with int2
I ran into the following. It seems this is a bug for -32768, which should be a valid smallint value. The test was run on 9.4.5. Thanks, Feng ftian=# select 32767::int2; int2 ------- 32767 (1 row) ftian=# select -32767::int2; ?column? ---------- -32767 (1 row) ftian=# select 32768::int2; ERROR: smallint out of range ftian=# select -32768::int2; ERROR: smallint out of range ftian=#
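The behavior reported here comes from precedence, as the reply explains: in `-32768::int2` the cast binds tighter than the unary minus, so the server evaluates `-(32768::int2)`, and 32768 already overflows smallint. A checked cast in Python (a hypothetical helper, not a PostgreSQL function) makes the evaluation order visible:

```python
# int2() mimics a range-checked smallint cast. In -32768::int2 the cast
# runs BEFORE the unary minus, so the overflow happens on +32768.

def int2(v: int) -> int:
    if not -32768 <= v <= 32767:
        raise OverflowError("smallint out of range")
    return v

assert int2(-32768) == -32768    # (-32768)::int2 -- parenthesized, fine
try:
    -int2(32768)                 # -32768::int2 -- cast first, overflows
except OverflowError as e:
    print(e)                     # smallint out of range
```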
Re: [HACKERS] Remove Windows crash dump support?
On Wed, Dec 23, 2015 at 6:14 PM, Craig Ringer wrote: > On 22 December 2015 at 23:48, Alex Ignatov wrote: > >> I think that you can debug crash dumps since windbg exists. >> > > Nobody in their right mind uses windbg though. Visual Studio is really > where it's at and the Express versions make it much more practical. > Just FYI: most developers on the MS server (OS/SQL Server) teams use windbg. Windbg was my only debugger when I worked there, and I never used Visual Studio. > > You can't even install Debugging Tools for Windows and Windbg standalone > anymore. > > >> Also I think that the number of Postgres installations on Windows is so tiny >> because people even today think that it is not as solid as the unix version; >> that's why you think that nobody uses your code ;). >> > > I disagree. Windows Pg users are often at bigger companies and don't talk > about PostgreSQL as much. Often for fear of reprisals from other database > vendors they have ongoing relationships with. At least that's been my > experience and I'm sure EDB folks will concur. > > >> Today, if my memory serves me right, this code cannot deal with buffer >> overflow. Am I right? >> > > Stack overflow? > > No, it can't. The stack has to be somewhat sane. > > >> Maybe we need to add this functionality instead of dropping support for it >> entirely >> > > Why? I've never seen any sign anybody has used it, ever. > > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services >
Re: [HACKERS] POC, WIP: OR-clause support for indexes
Hi, Teodor, This is great. I have a question: is it possible to make the btree index support OR as well? Is btree support more invasive, in the sense that we would need to enhance ScanKey to support an array of values? Thanks, Feng On Sat, Dec 26, 2015 at 10:04 AM, Teodor Sigaev wrote: > I'd like to present OR-clause support for indexes. Although OR-clauses > could be supported by a bitmapOR index scan, it isn't very effective and such a > scan loses any order existing in the index. We (with Alexander Korotkov) > presented results at Vienna's conference this year. In short, it provides a > performance improvement:
>
> EXPLAIN ANALYZE
> SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000;
> ... (actual time=0.080..0.267 rows=173 loops=1)
>   Recheck Cond: ((id = 5) OR (id = 500) OR (id = 5000))
>   Heap Blocks: exact=172
>   -> Bitmap Index Scan on idx_gin (cost=0.00..57.50 rows=15000 width=0) (actual time=0.059..0.059 rows=147 loops=1)
>      Index Cond: ((id = 5) OR (id = 500) OR (id = 5000))
> Planning time: 0.077 ms
> Execution time: 0.308 ms <---
>
>                              QUERY PLAN
> ---------------------------------------------------------------------
> Aggregate (cost=51180.53..51180.54 rows=1 width=0) (actual time=796.766..796.766 rows=1 loops=1)
>   -> Index Only Scan using idx_btree on tst (cost=0.42..51180.40 rows=55 width=0) (actual time=0.444..796.736 rows=173 loops=1)
>        Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>        Rows Removed by Filter: 999829
>        Heap Fetches: 102
> Planning time: 0.087 ms
> Execution time: 796.798 ms <--
>
>                              QUERY PLAN
> ---------------------------------------------------------------------
> Aggregate (cost=21925.63..21925.64 rows=1 width=0) (actual time=160.412..160.412 rows=1 loops=1)
>   -> Seq Scan on tst (cost=0.00..21925.03 rows=237 width=0) (actual time=0.535..160.362 rows=175 loops=1)
>        Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>        Rows Removed by Filter: 999827
> Planning time: 0.459 ms
> Execution time: 160.451 ms
>
> It could also work together with the KNN feature of GiST, and in this case the > performance improvement could be up to several orders of magnitude; in an > artificial example it was 37000
times faster. > > Not all indexes can support OR-clauses; the patch adds support to the GIN, GiST > and BRIN indexes. The pg_am table is extended with an amcanorclause column > which indicates the possibility of executing an OR-clause by the index. > > indexqual and indexqualorig no longer contain an implicitly-ANDed list of > index qual expressions; now those lists can contain OR RestrictionInfo. > Actually, the patch just tries to convert a BitmapOr node to an IndexScan or > IndexOnlyScan. That significantly simplifies the logic to find a possible > clause list for the index. > The index always gets an array of ScanKey, but for indexes which support > OR-clauses > the array of ScanKey is actually an execution tree in reverse Polish notation > form. The transformation is done in ExecInitIndexScan(). > > The problems on the way which I see for now: > 1 Calculating cost. Right now it's just a simple transformation of the costs > computed for the BitmapOr path. I'd like to hope that's possible, so the index's > estimation functions could be left untouched and could believe that all > clauses are implicitly-ANDed. > 2 I'd like to add such support to btree, but it seems that it should be a > separate patch. The btree search algorithm doesn't use any kind of stack of > pages, and the algorithm to walk over the btree isn't clear to me for now. > 3 I could have missed some places which still assume an implicitly-ANDed list of > clauses, although the regression tests pass fine. > > I hope hackers will not have strong objections to doing this. But obviously the > patch > requires further work, and I'd like to see comments, suggestions and > recommendations. Thank you. > > > -- > Teodor Sigaev E-mail: teo...@sigaev.ru >WWW: > http://www.sigaev.ru/ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > >
Re: [HACKERS] Fdw cleanup
Hi, Albe, Thank you. The resource is not memory; I have something like a file descriptor or a network connection, which I believe is the common case for foreign tables. Using RegisterXactCallback exposes an API at a much deeper level to the foreign table writer, and I need to find a place to anchor the context (fdw_state). I took a look at postgres_fdw and found that an xact callback is exactly what is done. So I will take this approach. Again, thank you Albe, for pointing me to it. Feng On Mon, Dec 14, 2015 at 12:30 AM, Albe Laurenz <laurenz.a...@wien.gv.at> wrote: > Feng Tian wrote: > > I need some help to understand foreign table error handling. > > > > For a query on a foreign table, ExecInitForeignScan is called, which in > turn calls the BeginForeignScan > > hook. Inside this hook, I allocated some resource, > > > > > > node->fdw_state = allocate_resource(...); > > > > If everything goes well, ExecEndForeignScan will call my > EndForeignScan hook, and inside the hook, I > > free the resource. > > > > free_resource(node->fdw_state); > > > > However, if an error happens during execution, it seems to me that > EndForeignScan will not be called > > (traced using gdb). So my question is, is Begin/End the right place > to allocate/free resources? > > If it is not, what is the right way to do this? > > If the resource is memory, use "palloc" to allocate it, and PostgreSQL > will take care of it automatically. > > Other than that, you could call RegisterXactCallback() and register a > callback > that will be called upon transaction end. In the callback function you can > free the resource if it was not already freed by EndForeignScan. > > Yours, > Laurenz Albe >
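The cleanup pattern Albe suggests can be sketched in miniature. This is a Python toy model (names are illustrative stand-ins, not the PostgreSQL C API): the transaction-end callback frees the resource only if the normal end-of-scan path never ran, and freeing is idempotent so the normal path and the callback cannot double-free.

```python
# Toy model of RegisterXactCallback-style cleanup for an FDW resource.
# BeginForeignScan allocates; EndForeignScan frees on the happy path;
# a transaction-end callback frees whatever is still open after an abort.

open_resources = set()
xact_callbacks = []

def register_xact_callback(cb):          # RegisterXactCallback analogue
    xact_callbacks.append(cb)

def begin_foreign_scan(name):
    open_resources.add(name)             # allocate_resource(...)
    register_xact_callback(lambda event: open_resources.discard(name))
    return name

def end_foreign_scan(name):
    open_resources.discard(name)         # normal free_resource(...)

def finish_transaction(event):           # COMMIT or ABORT
    for cb in xact_callbacks:
        cb(event)                        # frees anything still open
    xact_callbacks.clear()

h = begin_foreign_scan("conn-1")
# error path: end_foreign_scan() is never reached
finish_transaction("ABORT")
assert not open_resources                # the callback cleaned up anyway
```

Using `discard` (a no-op if already freed) mirrors the "free it if it was not already freed by EndForeignScan" wording in the reply.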
[HACKERS] Fdw cleanup
Hi, Hackers, I need some help to understand foreign table error handling. For a query on a foreign table, ExecInitForeignScan is called, which in turn calls the BeginForeignScan hook. Inside this hook, I allocated some resource, node->fdw_state = allocate_resource(...); If everything goes well, ExecEndForeignScan will call my EndForeignScan hook, and inside the hook, I free the resource. free_resource(node->fdw_state); However, if an error happens during execution, it seems to me that EndForeignScan will not be called (traced using gdb). So my question is, is Begin/End the right place to allocate/free resources? If it is not, what is the right way to do this? Thank you very much, Feng
[HACKERS] Decimal64 and Decimal128
Hi, Here is an extension for 64 and 128 bit decimal types using IEEE decimal floating point. The original idea/implementation is from http://pgxn.org/dist/pgdecimal/1.0.0/ The original thread for discussion is at http://www.postgresql.org/message-id/CAFj8pRApakE6s-H2yJcXD=ubpukwa6i7rx4vuvtb4puhga5...@mail.gmail.com I reimplemented it as 64/128 bits instead of 32/64 bits. The code uses the decNumber library instead of GCC's _Decimal64/128. I also added more operators. Compared to the numeric type, decimal64 arithmetic is about 2x faster, and decimal128 is about 1.5x faster. However, the cast between decimal and float4/8 is implemented rather naively and is slow. As always, it depends on the workload: decimal may take more or less space, and may be slower if casts are performed frequently. Again, thanks to the original author okbob (Pavel). Enjoy. Thanks, Feng
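For readers who want to experiment with the semantics before building the extension: decimal64/decimal128 are the IEEE 754-2008 decimal formats, and Python's `decimal` module can emulate decimal64 arithmetic (16 significant digits, exponent range -383..384). This is only a sketch of the arithmetic behavior, not of the extension's storage or performance:

```python
# Emulate IEEE 754-2008 decimal64 arithmetic with Python's decimal module.
# prec/Emin/Emax below are the decimal64 format parameters.
from decimal import Decimal, Context

dec64 = Context(prec=16, Emin=-383, Emax=384)

a = Decimal("0.1")
b = Decimal("0.2")
print(dec64.add(a, b))        # 0.3 exactly -- unlike binary float
print(0.1 + 0.2)              # 0.30000000000000004
```

A decimal128 context would be `Context(prec=34, Emin=-6143, Emax=6144)`.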
Re: [HACKERS] Decimal64 and Decimal128
On Thu, Sep 24, 2015 at 1:55 PM, Peter Geoghegan wrote: > On Thu, Sep 24, 2015 at 1:53 PM, Tom Lane wrote: > > Please include the actual patch as an attachment. We do not consider > mere > > URLs to be an acceptable patch submission format, because that provides no > > permanent record in our archives of what was submitted. > > I was under the impression that this was not intended as a patch > submission. > > > -- > Peter Geoghegan > If there is enough interest, it would be great for it to go into the official contrib dir. Thanks,
Re: [HACKERS] Decimal64 and Decimal128
On Thu, Sep 24, 2015 at 2:17 PM, Feng Tian <ft...@vitessedata.com> wrote: > > > On Thu, Sep 24, 2015 at 1:55 PM, Peter Geoghegan <p...@heroku.com> wrote: > >> On Thu, Sep 24, 2015 at 1:53 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> > Please include the actual patch as an attachment. We do not consider >> mere >> > URLs to be an acceptable patch submission format, because that provides no >> > permanent record in our archives of what was submitted. >> >> I was under the impression that this was not intended as a patch >> submission. >> >> >> -- >> Peter Geoghegan >> > > If there is enough interest, it would be great for it to go into the official > contrib dir. > Thanks, > > > On second thought, the extension depends on decNumber, which is under either the GPL or the ICU license. Maybe this is trouble.
Re: [HACKERS] Decimal64 and Decimal128
On Thu, Sep 24, 2015 at 1:29 PM, Feng Tian <ft...@vitessedata.com> wrote: > Hi, > > Here is an extension for 64 and 128 bit decimal types using IEEE decimal > floating point. The original idea/implementation is from > http://pgxn.org/dist/pgdecimal/1.0.0/ The original thread for discussion is > at > > > http://www.postgresql.org/message-id/CAFj8pRApakE6s-H2yJcXD=ubpukwa6i7rx4vuvtb4puhga5...@mail.gmail.com > > I reimplemented it as 64/128 bits instead of 32/64 bits. The code uses the decNumber > library instead of GCC's _Decimal64/128. I also added more operators. > > Compared to the numeric type, decimal64 arithmetic is about 2x faster, and > decimal128 is about 1.5x faster. However, the cast between decimal and > float4/8 is implemented rather naively and is slow. As always, it depends on the > workload: decimal may take more or less space, and may be slower if casts are > performed frequently. > > Again, thanks to the original author okbob (Pavel). Enjoy. > > Thanks, > Feng > > Ah, the link: https://github.com/vitesse-ftian/pgdecimal Thanks, Feng
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Aug 20, 2015 at 1:16 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian ft...@vitessedata.com wrote: Just some quick anecdotal evidence. I did a similar experiment about three years ago. The conclusion was that if you have an SSD, just quicksort and forget the longer runs, but if you are using hard drives, longer runs are the winner (and safer, to avoid cliffs). I did not experiment with RAID0/5 on many spindles, though. Not limited to sorting: more generally, SSD is different enough from HDD that it may be worth the effort for the backend to guess what storage device it has, then choose the right thing to do. The devil is in the details. I cannot really comment on such a general statement. I would be willing to believe that that's true under unrealistic/unrepresentative conditions. Specifically, when multiple passes are required with a sort-merge strategy where that isn't the case with replacement selection. This could happen with a tiny work_mem setting (tiny in an absolute sense more than a relative sense). With an HDD, where sequential I/O is so much faster, this could be enough to make replacement selection win, just as it would have in the 1970s with magnetic tapes. As I've said, the solution is to simply avoid multiple passes, which should be possible in virtually all cases because of the quadratic growth in a classic hybrid sort-merge strategy's capacity to avoid multiple passes (growth relative to work_mem's growth). Once you ensure that, then you probably have a mostly I/O bound workload, which can be made faster by adding sequential I/O capacity (or, on the Postgres internals side, adding asynchronous I/O, or with memory prefetching). You cannot really buy a faster CPU to make a degenerate heapsort faster. -- Peter Geoghegan I agree with everything in principle, except one thing -- no, random IO on HDD in the 2010s (relative to CPU/memory/SSD) is not any faster than tape was in the 1970s. :-)
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Aug 20, 2015 at 10:41 AM, Peter Geoghegan p...@heroku.com wrote: On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs si...@2ndquadrant.com wrote: On 20 August 2015 at 03:24, Peter Geoghegan p...@heroku.com wrote: The patch is ~3.25x faster than master I've tried to read this post twice and both times my work_mem overflowed. ;-) Can you summarize what this patch does? I understand clearly what it doesn't do... The most important thing that it does is always quicksort runs, that are formed by simply filling work_mem with tuples in no particular order, rather than trying to make runs that are twice as large as work_mem on average. That's what the ~3.25x improvement concerned. That's actually a significantly simpler algorithm than replacement selection, and appears to be much faster. You might even say that it's a dumb algorithm, because it is less sophisticated than replacement selection. However, replacement selection tends to use CPU caches very poorly, while its traditional advantages have become dramatically less important due to large main memory sizes in particular. Also, it hurts that we don't currently dump tuples in batches, for several reasons. Better to do memory intense operations in batch, rather than having a huge inner loop, in order to minimize or prevent instruction cache misses. And we can better take advantage of asynchronous I/O. The complicated aspect of considering the patch is whether or not it's okay to not use replacement selection anymore -- is that an appropriate trade-off? The reason that the code has not actually been simplified by this patch is that I still want to use replacement selection for one specific case: when it is anticipated that a quicksort with spillover can occur, which is only possible with incremental spilling. That may avoid most I/O, by spilling just a few tuples using a heap/priority queue, and quicksorting everything else. 
That's compelling when you can manage it, but no reason to always use replacement selection for the first run in the common case where there will be several runs in total. Is that any clearer? To borrow a phrase from the processor architecture community, from a high level this is a Brainiac versus Speed Demon [1] trade-off. (I wish that there was a widely accepted name for this trade-off.) [1] http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate -- Peter Geoghegan Hi, Peter, Just some quick anecdotal evidence. I did a similar experiment about three years ago. The conclusion was that if you have an SSD, just quicksort and forget the longer runs, but if you are using hard drives, longer runs are the winner (and safer, to avoid cliffs). I did not experiment with RAID0/5 on many spindles, though. Not limited to sorting: more generally, SSD is different enough from HDD that it may be worth the effort for the backend to guess what storage device it has, then choose the right thing to do. Cheers.
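The two run-formation strategies under discussion can be sketched in a few lines. This is a toy Python model with a `cap`-tuple stand-in for work_mem: replacement selection keeps a heap and, on random input, emits runs averaging about 2×cap (the classic result the thread alludes to), while the patch's approach simply fills memory and quicksorts, emitting runs of exactly cap.

```python
# Compare run formation: replacement selection (heap) vs. fill-and-quicksort.
import heapq, random

def replacement_selection_runs(items, cap):
    it = iter(items)
    heap = [(0, next(it)) for _ in range(cap)]   # fill "memory"
    heapq.heapify(heap)
    runs = []
    while heap:
        run, v = heapq.heappop(heap)
        if run == len(runs):
            runs.append([])
        runs[run].append(v)                      # emit to current run
        nxt = next(it, None)
        if nxt is not None:
            # a value smaller than the one just emitted must wait
            # for the next run, or the output would not stay sorted
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
    return runs

def quicksort_runs(items, cap):
    # the patch's strategy: fill memory, sort, dump, repeat
    return [sorted(items[i:i + cap]) for i in range(0, len(items), cap)]

random.seed(1)
data = [random.random() for _ in range(10_000)]
rs = replacement_selection_runs(data, 1000)
qs = quicksort_runs(data, 1000)
print(len(rs), len(qs))   # RS yields fewer, longer runs (~2*cap each)
```

What the toy model cannot show is Peter's actual point: the heap's poor cache behavior makes the simpler strategy faster per tuple, and fewer runs only matter when they avoid an extra merge pass.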
Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Hi, I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow illustrates the importance of the memory-bounded hashagg patch Jeff Davis is working on). The problem is Q18, particularly this simple subquery:

select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 313;

which is planned like this:

                             QUERY PLAN
---------------------------------------------------------------------
 HashAggregate (cost=598510163.92..598515393.93 rows=418401 width=12)
   Group Key: l_orderkey
   Filter: (sum(l_quantity) > '313'::double precision)
   -> Seq Scan on lineitem (cost=0.00..508509923.28 rows=18481280000 width=12)
(4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:

tpch=# select count(distinct l_orderkey) from lineitem;
   count
------------
 4500000000
(1 row)

That's a helluva difference - not the usual one or two orders of magnitude, but a 10000x underestimate. The usual thing to do in this case is increasing the statistics target, and while this improves the estimate, the improvement is rather small:

statistics target   estimate    difference
------------------------------------------
  100               429491      10000x
  1000              4240418     1000x
  10000             42913759    100x

I find the pattern rather strange - every time the statistics target increases 10x, the difference decreases 10x - maybe that's natural, but the perfect proportionality is suspicious IMHO. Also, this is a quite large dataset - the table has ~18 billion rows, and even with target=10000 we're sampling only 3M rows, which is ~0.02%. That's a tiny sample, so inaccuracy is naturally expected, but OTOH the TPC-H dataset is damn uniform - there's pretty much no skew in the distributions AFAIK. So I'd expect a slightly better result. With target=10000 the plan switches to GroupAggregate, because the estimate gets sufficient to exceed work_mem (2GB). But it's still way off, and it's mostly just a lucky coincidence.
So I'm wondering if there's some bug because of the dataset size (an integer overflow or something like that), so I added a bunch of logging into the estimator, logging all the parameters computed:

target=100 (samplerows=30000)
-----------------------------
WARNING: attnum=1 attname=l_orderkey f1=27976 ndistinct=28977 nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.00 denom=2024.046627 stadistinct=429491.094029
WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=429491.09 adaptive=443730.00

target=1000 (samplerows=300000)
-------------------------------
WARNING: attnum=1 attname=l_orderkey f1=279513 ndistinct=289644 nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.00 denom=20491.658538 stadistinct=4240418.111618
WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=4240418.11 adaptive=4375171.00

target=10000 (samplerows=3000000)
---------------------------------
WARNING: attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799 nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.00 denom=202578.313396 stadistinct=42913759.396282
WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=42913759.40 adaptive=9882.00

It's totalrows=1849031 in all cases. The logs also show the estimate produced by the adaptive estimator (discussed in a separate thread), but apparently that does not change the estimates much :-( Any ideas? -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services While a better sample/stats is important for choosing a good plan, in this query hash agg is really the right plan. If a sort agg is chosen, the performance will be really, really bad. The patch that Jeff is working on is critical for a decent TPCH number (unless you have an unlimited amount of memory). Thanks,
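The `stadistinct` values in those logs follow the Haas-Stokes (Duj1) estimator that ANALYZE uses. As a sketch, the formula can be checked against the logged f1/d values; the sample size of 300 × statistics_target and a total table size of ~18 billion rows are assumptions reconstructed from the thread, not values stated in the log:

```python
# Haas-Stokes (Duj1) n_distinct estimator used by ANALYZE:
#   stadistinct = n*d / (n - f1 + f1*n/N)
# n  = rows sampled (assumed 300 * statistics_target)
# d  = distinct values seen in the sample
# f1 = values seen exactly once
# N  = total rows in the table (assumed ~18e9 here)

def duj1(n, d, f1, N):
    return n * d / (n - f1 + f1 * n / N)

# f1/d from the target=100 and target=1000 log lines
print(round(duj1(30000, 28977, 27976, 18_000_000_000)))      # ~429491
print(round(duj1(300000, 289644, 279513, 18_000_000_000)))   # ~4240418
```

With almost every sampled value seen exactly once (f1 ≈ n), the denominator collapses to roughly `n - f1`, which is why the estimate scales linearly with the sample size, exactly the suspicious 10x-per-10x pattern Tomas observed.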
Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Hi, On 06/20/2015 08:54 AM, Feng Tian wrote: While a better sample/stats is important for choosing a good plan, in this query hash agg is really the right plan. If a sort agg is chosen, the performance will be really, really bad. The patch that Jeff is working on is critical for a decent TPCH number (unless you have an unlimited amount of memory). I do agree that Jeff's memory-bounded hashagg patch is a very important feature, and in fact we spent a fair amount of time discussing it in Ottawa. So I'm looking forward to getting that soon ;-) But I don't think hashagg is going to be very good in this particular case. With a 3TB dataset, the query runs out of memory on a machine with 256GB of RAM. So let's assume a complete hash table has ~256GB. With work_mem=1GB that means only ~1/256 of the table can be processed in one batch, so we'll process the first 1/256 of the table, and write out the remaining 99% into batches. Then we'll read the batches one by one, and process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB into batches, and then read those ~2.49TB again. At least that's how I understand Jeff's memory-bounded hashagg proposal. The sort may perform worse in the general case, but in this case there's an index on the column, and the table is almost perfectly correlated by that column (due to generating the orders one by one, but it seems plausible it'd be the same in reality, assuming the orders are numbered using a sequence). So doing the sort by an indexscan seems rather cheap, and you only need to scan the table once. regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services I have not read Jeff's patch, but here is how I think hash agg should work. Hash agg scans the lineitem table, performing aggregation in memory. Once work_mem is exhausted, it writes intermediate state to disk, bucket by bucket.
When the lineitem table is finished, it reads the tuples of one bucket back, combines the intermediate states, and finalizes the aggregation. I saw a quite extensive discussion on combining aggregation on the dev list, so I assume it will be added. Assuming that (modulo an efficient I/O size) work_mem is bigger than the square root of the data size after aggregation, the above algorithm can finish by writing out and reading back only once. For TPCH 3T, the lineitem table has about 20 billion rows and 4 or 5 billion orders. For the simple subquery, one needs to: 1. scan the table, 3TB of I/O; 2. write out intermediate state, 4 billion * sizeof(key column + intermediate state, ~20 bytes) = 80GB; 3. read back 80GB. If sort is used, also assuming work_mem is bigger than the square root of the data size, you need to scan the table, then write out about 20B * 20 bytes ~ 400GB and read back 400GB. Sort may have to do extra rounds of merging, but let's ignore that. Hash agg has better performance because: 1. less I/O; 2. hashing is a linear algorithm, compared to sort at O(n log n).
[HACKERS] Float/Double cast to int
Hi, Hackers, Here is a query; the server was built with GCC on Linux, AMD64. ftian=# select 1.5::int, 1.5::double precision::int, 314.5::int, 314.5::double precision::int; int4 | int4 | int4 | int4 ------+------+------+------ 2 | 2 | 315 | 314 (1 row) I believe this is because rint is broken -- can some expert on IEEE754 please help confirm whether this is a bug? Thanks, Feng
Re: [HACKERS] Float/Double cast to int
Ah, thanks! I did not realize numeric comes into play. But this is even more interesting -- I would expect numeric to be more consistent than float/double when dealing with stuff like rounding. I missed the not-too-long-ago discussion, :-) Regardless of the mechanisms underneath, it would be quite hard to explain this behavior to a customer. Maybe it is time to be brave, and be compatible with reality instead of backward compatible? Best, Feng On Thu, May 21, 2015 at 12:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Feng Tian ft...@vitessedata.com writes: Here is a query; the server was built with GCC on Linux, AMD64. ftian=# select 1.5::int, 1.5::double precision::int, 314.5::int, 314.5::double precision::int; int4 | int4 | int4 | int4 ------+------+------+------ 2 | 2 | 315 | 314 (1 row) I believe this is because rint is broken -- can some expert on IEEE754 please help confirm that this is a bug? rint() is doing what the IEEE spec says, ie round to nearest even. Your third expression is doing numeric-to-int, and that code doesn't obey the IEEE spec. We've had discussions (not too long ago) about making these behaviors more consistent, but people seem to be too afraid of backwards-compatibility problems if we change it. regards, tom lane
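Tom's explanation is easy to check outside the server, since Python's built-in `round()` also implements IEEE round-half-to-even ("banker's rounding"), the same rule as `rint()`; it reproduces the double-precision column of the query exactly:

```python
# round() in Python 3, like rint(), rounds halves to the nearest EVEN
# integer -- so 314.5 goes DOWN to 314, while 1.5 goes UP to 2.
print(round(1.5))     # 2
print(round(2.5))     # 2   (half to even, not 3)
print(round(314.5))   # 314 (matches 314.5::double precision::int)
```

The numeric-to-int path in the query rounds halves away from zero instead, which is why `314.5::int` gives 315; that difference is exactly the inconsistency the thread discusses.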
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Hi, This is Feng from Vitesse. The performance difference between money and numeric is *HUGE*. For TPCH Q1, the performance difference is 5x for stock postgres, and ~20x for Vitesse. For stock postgres on my laptop, TPCH 1G Q1 takes ~9s with the money type and ~53s with numeric(15,2). Kevin, test=# do $$ begin perform sum('1.01'::numeric) from generate_series(1,1000); end; $$; This may not reflect the difference between the two data types. A single aggregate is not where most of the time is spent; TPCH Q1 does much more computation. On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck michael.ba...@credativ.de wrote: On Sunday, 02.11.2014 at 12:41 -0500, Tom Lane wrote: BTW, after reflecting a bit more I'm less than convinced that this datatype is completely useless. Even if you prefer to store currency values in numeric columns, casting to or from money provides a way to accept or emit values in whatever monetary format the LC_MONETARY locale setting specifies. That seems like a useful feature, and it's one you could not easily duplicate using to_char/to_number (not to mention that those functions aren't without major shortcomings of their own). As an additional datapoint, Vitesse Data changed the DB schema from NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The modification to data types is easy to understand -- money and double types are faster than Numeric (and no one on this planet has a bank account that overflows the money type, not any time soon)."[1] And "Replaced NUMERIC fields representing currency with MONEY"[2]. Not sure whether they modified/optimized PostgreSQL with respect to the MONEY data type and/or how much performance that gained, so CCing CK Tan as well.
Michael [1] http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html [2] http://vitessedata.com/benchmark/ -- Michael Banck Projektleiter / Berater Tel.: +49 (2161) 4643-171 Fax: +49 (2161) 4643-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Hohenzollernstr. 133, 41061 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Fwd: [HACKERS] Vitesse DB call for testing
Hi, Tom, Sorry for the double post to you. Feng -- Forwarded message -- From: Feng Tian ft...@vitessedata.com Date: Fri, Oct 17, 2014 at 10:29 AM Subject: Re: [HACKERS] Vitesse DB call for testing To: Tom Lane t...@sss.pgh.pa.us Hi, Tom, I agree that using int128 in stock postgres would speed things up too. On the other hand, that is only one part of the equation. For example, if you look at TPCH Q1, the int128 cheating does not kick in at all, but we are 8x faster. I am not sure what you mean by "actual data access". The data is still in stock postgres format on disk. We indeed JIT-ed all data field access (tuple deforming). To put things in perspective, I just timed select count(*) and select count(l_orderkey) on tpch1.lineitem. Our code is bottlenecked by memory bandwidth and the difference is pretty much invisible. Thanks, Feng

ftian=# set vdb_jit = 0;
SET
Time: 0.155 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)
Time: 688.658 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)
Time: 690.753 ms
ftian=# select count(l_orderkey) from tpch1.lineitem;
  count
---------
 6001215
(1 row)
Time: 819.452 ms
ftian=# set vdb_jit = 1;
SET
Time: 0.167 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)
Time: 203.543 ms
ftian=# select count(l_orderkey) from tpch1.lineitem;
  count
---------
 6001215
(1 row)
Time: 202.253 ms

On Fri, Oct 17, 2014 at 10:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: CK Tan ck...@vitessedata.com writes: The bigint sum,avg,count case in the example you tried has some optimization. We use int128 to accumulate the bigint instead of numeric in pg. Hence the big speed up. Try the same query on int4 for the improvement where both pg and vitessedb are using int4 in the execution. Well, that's pretty much cheating: it's too hard to disentangle what's coming from JIT vs what's coming from using a different accumulator datatype.
If we wanted to depend on having int128 available we could get that speedup with a couple hours' work. But what exactly are you compiling here? I trust not the actual data accesses; that seems far too complicated to try to inline. regards, tom lane
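A quick arithmetic check of why a 128-bit accumulator is sufficient (the point behind both CK's "cheat" and Tom's "couple hours' work" remark): even if every input row held the maximum bigint value, an absurd number of rows would be needed before the running sum could overflow 128 bits, so per-row work becomes a plain integer add with no numeric datum handling.

```python
# Worst case for a sum(bigint) accumulator: every row is INT64_MAX.
# How many such rows fit in a signed 128-bit accumulator?
INT64_MAX = 2**63 - 1
INT128_MAX = 2**127 - 1

rows = 10**12                            # a trillion worst-case rows
assert rows * INT64_MAX < INT128_MAX     # still nowhere near overflow

print(INT128_MAX // INT64_MAX)           # max worst-case rows, ~1.8e19
```

So overflow of the accumulator itself is unreachable in practice; only the final conversion of the 128-bit total back to numeric needs care.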
[HACKERS] Optimizer on sort aggregate
Hi, Consider the following queries. create table t(i int, j int, k int, t text); insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                            QUERY PLAN
-------------------------------------------------------------------
 GroupAggregate (cost=158029.84..178029.84 rows=1000000 width=22)
   -> Sort (cost=158029.84..160529.84 rows=1000000 width=22)
        Sort Key: t, i
        -> Seq Scan on t (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)

The query, select count(distinct j) from t group by t, i; runs for 35 seconds. However, if I change the query to select count(distinct j) from t group by i, t; -- note the switched ordering -- or select count(distinct j) from t group by decode(t, 'escape'), i; -- converting t to bytea -- the run times are just about 5 and 6.5 seconds. The reason is clear: comparing strings with a collation is slow, which is well understood by pg hackers. However, here the sort order is forced by the planner, not the user. The planner can do the following optimizations: 1. for the sort we generate for the sort agg, the planner can switch the column ordering, putting the int before the string; 2. for the sort we generate for the sort agg, use bytea comparison instead of string comparison. They would bring a big improvement to this common query. Is this something reasonable? Thanks,
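The key observation above is that the planner-chosen sort exists only to bring equal (t, i) groups together, so any total order works: swapping the key order or comparing the text column byte-wise (bytea-style, ignoring the collation) produces exactly the same groups. A small Python sketch of that equivalence (toy data shaped like the example; `groupby` stands in for the sort-agg):

```python
# Grouping is invariant under the sort-key permutation and under
# replacing collation-aware text comparison with byte comparison.
from itertools import groupby

# (t, i, j) rows shaped like the example: 10 (t, i) groups, 3 rows each
rows = [(f"AABBCCDD{n % 10}", n % 5, n % 3) for n in range(30)]

def distinct_j_per_group(rows, key):
    out = {}
    for _, grp in groupby(sorted(rows, key=key), key=key):
        g = list(grp)
        out[(g[0][0], g[0][1])] = len({j for _, _, j in g})
    return out

by_t_i   = distinct_j_per_group(rows, key=lambda r: (r[0], r[1]))
by_i_t   = distinct_j_per_group(rows, key=lambda r: (r[1], r[0]))           # int first
by_bytes = distinct_j_per_group(rows, key=lambda r: (r[0].encode(), r[1]))  # bytea-style
assert by_t_i == by_i_t == by_bytes
print(len(by_t_i), "identical groups under all three sort orders")
```

In the real server the cheap orders differ only in comparison cost, which is the whole basis of the proposed optimizations.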
Re: [HACKERS] Optimizer on sort aggregate
Hi, David,

Yes, switching the sort order would lose an interesting ordering, so if the user dictates "order by t, i" the planner needs to fall back to its cost model. Estimating the cardinality of a group by is a much bigger topic than this thread.

I feel sorting strings as if they were bytea is particularly interesting. I am aware of Peter G's patch and I think it is great, but for this sort agg case, first, I believe it is still slower than sorting bytea, and second, Peter G's patch depends on the data. A common long prefix will make the patch less effective, which is probably not so uncommon (for example, URLs sharing a domain prefix). I don't see any downside to sorting as bytea, other than losing the interesting ordering.

Thanks,
Feng

On Fri, Oct 17, 2014 at 4:36 PM, David Rowley dgrowle...@gmail.com wrote:
> On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian feng...@gmail.com wrote:
>> Hi, Consider the following queries.
>>
>> create table t(i int, j int, k int, t text);
>> insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;
>>
>> ftian=# explain select count(distinct j) from t group by t, i;
>>                               QUERY PLAN
>> ------------------------------------------------------------------------
>>  GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
>>    ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
>>          Sort Key: t, i
>>          ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
>> (4 rows)
>>
>> The query, select count(distinct j) from t group by t, i; runs for 35 seconds. However, if I change the query to
>>
>> select count(distinct j) from t group by i, t; -- note switching the ordering
>> select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea
>>
>> run times are just about 5 and 6.5 seconds. The reason is clear: comparing strings with collation is slow, which is well understood by pg hackers. However, here, the sorting order is forced by the planner, not the user. The planner can do the following optimizations:
>>
>> 1. for the sort we generated for sort agg, planner can switch column ordering, put int before string,
>> 2.
>> for the sort we generated for sort agg, use bytea compare instead of string compare.
>>
>> They will bring big improvement to this common query. Is this something reasonable?
>
> It seems like it might be worth looking into, but I think it's likely more complex than sorting the group by order into narrowest-column-first. If the query was:
>
> select count(distinct j) from t group by t, i order by t, i;
>
> then if that was rewritten to make the group by i, t, the planner would need to perform an additional sort on t, i to get the correct ordering for the final result, which may or may not be faster; it would depend on how many groups there were to sort. Though I guess you could possibly just skip the optimisation if the next node up didn't need any specific ordering.
>
> I also wonder if taking into account the column's width alone is not quite enough. For example, if the 'i' column only had 1 distinct value, then performing the group by on 'i' first wouldn't help at all. Keep in mind that the columns could be much closer in width than in your example, e.g. int and bigint. Perhaps you'd need some sort of heuristic that looks at both the number of distinct values and the width, and forms a weighted estimate of which column to put first.
>
> Regards
> David Rowley
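The weighted estimate David suggests could look something like the following toy Python sketch. The scoring formula, column stats, and names are purely illustrative assumptions, not anything in Postgres: the idea is just that a column is a good leading sort key when it is both cheap to compare (narrow) and likely to resolve the comparison (many distinct values).

```python
# Toy heuristic: order group-by columns so that narrow, high-cardinality
# columns are compared first. Each column is (name, width_bytes, ndistinct);
# in a real planner, width and ndistinct would come from statistics.
def order_sort_keys(columns, total_rows):
    def score(col):
        name, width, ndistinct = col
        # Chance that a comparison is resolved by this column alone.
        selectivity = ndistinct / total_rows
        # Wide columns cost more per comparison; unselective columns
        # rarely resolve anything and push the cost onto later columns.
        return width / max(selectivity, 1e-9)
    return sorted(columns, key=score)

cols = [('t', 22, 1_000_000),   # wide text, all values distinct
        ('i', 4, 1_000_000),    # narrow int, all values distinct
        ('k', 4, 1_000)]        # narrow int, only 1000 distinct values
ordered = [name for name, _, _ in order_sort_keys(cols, 1_000_000)]

# The narrow distinct int leads; the low-cardinality int goes last even
# though it is narrow, matching David's 1-distinct-value observation.
assert ordered == ['i', 't', 'k']
```

Note that even this toy version puts the wide text column ahead of the low-cardinality int, so width alone is indeed not enough, as the reply argues.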
Re: [HACKERS] libpq connection redirect
While these can be handled at a higher level, for example by setting up LDAP or, as Heikki suggested, by tricking DNS, the problem is that I don't have control over how users connect to the server. They may not use LDAP. A solution like pgbouncer has the advantage that the user just gets one ip/port and everything else happens automatically.

Thanks,

Subject: Re: [HACKERS] libpq connection redirect
From: li...@jwp.name
Date: Wed, 21 Apr 2010 15:52:39 -0700
CC: pgsql-hackers@postgresql.org
To: ft...@hotmail.com

On Apr 20, 2010, at 10:03 PM, feng tian wrote:
> Another way to do this is to send the client a redirect message. When the client connects to 127.0.0.10, instead of accepting the connection, it can reply telling the client to reconnect to one of the servers on 127.0.0.11-14.

ISTM that this would be better handled at a higher level. That is, given a server (127.0.0.10) that holds 127.0.0.11-14, connect to that server and query for the correct target host.
Re: [HACKERS] libpq connection redirect
Hi, John,

The change would be on the libpq client side. I am not saying this is a general solution for distributed transactions or scale-out. However, in many cases it is very useful. For example, in my case I have about 100 departments, each with its own database. The balancer machine can just redirect to the right box according to database/user. The 4 boxes I have may not even have domain names or static IPs. As another scenario, if I have some kind of replication set up, I can send transaction processing to the master and analytic reporting queries to the slaves.

Thanks,
Feng

> feng tian wrote:
>> Hi, I want to load balance a postgres server on 4 physical machines, say 127.0.0.11-14. I can set up a pgbouncer on 127.0.0.10 and connection pool to my four boxes. However, the traffic from/to clients will go through an extra hop. Another way to do this is to send the client a redirect message. When the client connects to 127.0.0.10, instead of accepting the connection, it can reply telling the client to reconnect to one of the servers on 127.0.0.11-14. I am planning to write/submit a patch to do that. I wonder if there is a similar effort in extending the libpq protocol, or if you have better ideas on how to achieve this.
>
> How do you plan on maintaining consistency, transactional integrity and atomicity of updates across these 4 machines?
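The "redirect to the right box according to database/user" rule above could be as simple as a deterministic hash of the database name on the balancer. The sketch below is a hypothetical illustration in Python of that routing decision only; the host list, function name, and hashing scheme are assumptions for the example and are not part of libpq or any proposed protocol.

```python
import hashlib

# Hypothetical backend pool from the thread's scenario.
BACKENDS = ['127.0.0.11', '127.0.0.12', '127.0.0.13', '127.0.0.14']

def redirect_target(dbname: str) -> str:
    """Pick a backend deterministically from the database name, so every
    client asking for the same department database is sent to the same box."""
    digest = hashlib.sha256(dbname.encode('utf-8')).digest()
    return BACKENDS[int.from_bytes(digest[:4], 'big') % len(BACKENDS)]

# Deterministic: repeated connections for one database hit one backend,
# and the target is always a member of the pool.
assert redirect_target('dept_42') == redirect_target('dept_42')
assert all(redirect_target('dept_%d' % d) in BACKENDS for d in range(100))
```

With a redirect message, this lookup runs once per connection on the balancer and the client then talks to its backend directly, which is exactly the extra hop a pgbouncer-in-the-middle cannot avoid.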
[HACKERS] libpq connection redirect
Hi,

I want to load balance a postgres server on 4 physical machines, say 127.0.0.11-14. I can set up a pgbouncer on 127.0.0.10 and connection pool to my four boxes. However, the traffic from/to clients will go through an extra hop.

Another way to do this is to send the client a redirect message. When the client connects to 127.0.0.10, instead of accepting the connection, it can reply telling the client to reconnect to one of the servers on 127.0.0.11-14. I am planning to write/submit a patch to do that. I wonder if there is a similar effort in extending the libpq protocol, or if you have better ideas on how to achieve this.

Thank you,
Feng
[HACKERS] Adding some new operators
Hi,

I am trying to use postgresql in a research project. I need to add some new operators; some will probably take more than 2 input tuple streams. The new operators should be considered by the optimizer when estimating costs and choosing a plan. Can anyone tell me how hard this will be? And where is a good place to start?

Thanks,
Feng