[HACKERS] FDW: should GetFdwRoutine be called when drop table?

2016-02-19 Thread Feng Tian
Hi, Hackers,

I have an FDW where each foreign table acquires some persistent resource;
in my case, some files in the file system.  To drop the table cleanly, I have
written an object_access_hook that removes those files.  The hook is installed
in _PG_init.

It all works well except in one case.  Suppose a user logs in and the very
first command is DROP FOREIGN TABLE.  Dropping a foreign table does not load
the module, so the hook is not installed and the files are not cleaned up
properly.

Should DROP FOREIGN TABLE call GetFdwRoutine?   _PG_init is the only entry
point I know of for registering hooks, so I feel we need to trigger a load of
the FDW module for all DML/DDL on foreign tables, including DROP.   Does this
make sense?

Thanks,
Feng


Re: [HACKERS] Bug with int2

2016-02-16 Thread Feng Tian
Ah, thanks!

On Tue, Feb 16, 2016 at 7:54 PM, Peter Geoghegan <p...@heroku.com> wrote:

> On Tue, Feb 16, 2016 at 7:27 PM, Feng Tian <ft...@vitessedata.com> wrote:
> > I run into the following.   Seems this is a bug for -32768, which should
> be
> > a valid smallint value.
>
> This isn't a bug. You see the error only due to operator precedence:
>
> postgres=# select (-32768)::int2;
>   int2
> ─────────
>  -32,768
> (1 row)
>
> --
> Peter Geoghegan
>


[HACKERS] Bug with int2

2016-02-16 Thread Feng Tian
I ran into the following.   This seems to be a bug for -32768, which should be
a valid smallint value.

Test was run on 9.4.5.

Thanks,
Feng

ftian=# select 32767::int2;
 int2
-------
 32767
(1 row)

ftian=# select -32767::int2;
 ?column?
----------
   -32767
(1 row)

ftian=# select 32768::int2;
ERROR:  smallint out of range
ftian=# select -32768::int2;
ERROR:  smallint out of range
ftian=#
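
As Peter explains in the reply above, this is operator precedence rather than
a range problem: the cast binds more tightly than the unary minus, so
-32768::int2 is parsed as -(32768::int2), and 32768 by itself is out of range
for int2.  A quick check (any reasonably recent release behaves the same way):

select (-32768)::int2;        -- ok, returns -32768
select cast(-32768 as int2);  -- equivalent, also ok
select -(32768::int2);        -- what -32768::int2 really means: smallint out of range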


Re: [HACKERS] Remove Windows crash dump support?

2015-12-27 Thread Feng Tian
On Wed, Dec 23, 2015 at 6:14 PM, Craig Ringer  wrote:

> On 22 December 2015 at 23:48, Alex Ignatov wrote:
>
>
>> I think that you can debug crash dump since windbg exists.
>>
>
> Nobody in their right mind uses windbg though. Visual Studio is really
> where it's at and the Express versions make it much more practical.
>

Just FYI: most developers on the MS server teams (OS / SQL Server) use windbg.
Windbg was my only debugger when I worked there, and I never used Visual
Studio.




>
> You can't even install Debugging Tools for Windows and Windbg standalone
> anymore.
>
>
>> Also I think that the number of Postgres installations on Windows is so tiny
>> because people even today think that it is not as solid as the Unix version;
>> that's why you think that nobody uses your code ;).
>>
>
> I disagree. Windows Pg users are often at bigger companies and don't talk
> about PostgreSQL as much. Often for fear of reprisals from other database
> vendors they have ongoing relationships with.  At least that's been my
> experience and I'm sure EDB folks will concur.
>
>
>> Today, if my memory serves me right, this code cannot deal with buffer
>> overflow. Am I right?
>>
>
> Stack overflow?
>
> No, it can't. The stack has to be somewhat sane.
>
>
>> May be we need to add this functionality instead of drop support of it
>> entirely
>>
>
> Why? I've never seen any sign anybody has used it, ever.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] POC, WIP: OR-clause support for indexes

2015-12-26 Thread Feng Tian
Hi, Teodor,

This is great.   I have a question: is it possible to make btree indexes
support OR as well?  Is btree support more invasive, in the sense that we
would need to enhance ScanKey to support an array of values?

Thanks,
Feng

On Sat, Dec 26, 2015 at 10:04 AM, Teodor Sigaev  wrote:

> I'd like to present OR-clause support for indexes. Although OR-clauses
> can be handled by a BitmapOr index scan, that isn't very efficient, and such
> a scan loses any ordering present in the index. We (with Alexander Korotkov)
> presented results at this year's conference in Vienna. In short, it provides
> a performance improvement:
>
> EXPLAIN ANALYZE
> SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000;
> me=0.080..0.267 rows=173 loops=1)
>  Recheck Cond: ((id = 5) OR (id = 500) OR (id = 5000))
>  Heap Blocks: exact=172
>  ->  Bitmap Index Scan on idx_gin  (cost=0.00..57.50 rows=15000
> width=0) (actual time=0.059..0.059 rows=147 loops=1)
>Index Cond: ((id = 5) OR (id = 500) OR (id = 5000))
>  Planning time: 0.077 ms
>  Execution time: 0.308 ms   <---
> QUERY PLAN
>
> ---
>  Aggregate  (cost=51180.53..51180.54 rows=1 width=0) (actual
> time=796.766..796.766 rows=1 loops=1)
>->  Index Only Scan using idx_btree on tst  (cost=0.42..51180.40
> rows=55 width=0) (actual time=0.444..796.736 rows=173 loops=1)
>  Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>  Rows Removed by Filter: 999829
>  Heap Fetches: 102
>  Planning time: 0.087 ms
>  Execution time: 796.798 ms  <--
> QUERY PLAN
>
> -
>  Aggregate  (cost=21925.63..21925.64 rows=1 width=0) (actual
> time=160.412..160.412 rows=1 loops=1)
>->  Seq Scan on tst  (cost=0.00..21925.03 rows=237 width=0) (actual
> time=0.535..160.362 rows=175 loops=1)
>  Filter: ((id = 5) OR (id = 500) OR (id = 5000))
>  Rows Removed by Filter: 999827
>  Planning time: 0.459 ms
>  Execution time: 160.451 ms
>
>
> It also could work together with KNN feature of GiST and in this case
> performance improvement could be up to several orders of magnitude, in
> artificial example it was 37000 times faster.
>
> Not all indexes can support OR-clauses; the patch adds support to the GIN,
> GiST and BRIN indexes. The pg_am table is extended with an amcanorclause
> column which indicates whether the index can execute OR-clauses.
>
> indexqual and indexqualorig no longer contain an implicitly-ANDed list of
> index qual expressions; those lists can now contain an OR RestrictInfo.
> Actually, the patch just tries to convert a BitmapOr node to an IndexScan or
> IndexOnlyScan. That significantly simplifies the logic to find the possible
> clause list for an index.
> An index always gets an array of ScanKeys, but for indexes which support
> OR-clauses the array of ScanKeys is actually an execution tree in reverse
> Polish notation form. The transformation is done in ExecInitIndexScan().
>
> The problems on the way which I see for now:
> 1. Calculating cost. Right now it's just a simple transformation of the costs
> computed for the BitmapOr path. I'd like to hope that's acceptable, so the
> indexes' estimation functions could be left untouched and could keep
> believing that all clauses are implicitly ANDed.
> 2. I'd like to add such support to btree, but it seems that it should be a
> separate patch. The btree search algorithm doesn't use any kind of stack of
> pages, and the algorithm for walking over the btree isn't clear to me yet.
> 3. I could have missed some places which still assume an implicitly-ANDed
> list of clauses, although the regression tests pass fine.
>
> I hope hackers will not have strong objections to doing that. Obviously the
> patch requires further work, and I'd like to see comments, suggestions and
> recommendations. Thank you.
>
>
> --
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>WWW:
> http://www.sigaev.ru/
>
>
>


Re: [HACKERS] Fdw cleanup

2015-12-14 Thread Feng Tian
Hi, Albe,

Thank you.

The resource is not memory; I have something like a file descriptor or
network connection, which I believe is the common case for foreign tables.
Using RegisterXactCallback exposes an API at a much deeper level to the
foreign table writer, and I need to find a place to anchor the context
(fdw_state).  I took a look at postgres_fdw and found that an xact callback
is exactly what it uses, so I will take this approach.

Again, thank you Albe, for pointing me to it.
Feng



On Mon, Dec 14, 2015 at 12:30 AM, Albe Laurenz <laurenz.a...@wien.gv.at>
wrote:

> Feng Tian wrote:
> > I need some help to understand foreign table error handling.
> >
> > For a query on foreign table, ExecInitForeignScan is called, which in
> turn calls the BeginForeignScan
> > hook.   Inside this hook, I allocated some resource,
> >
> >
> > node->fdw_state = allocate_resource(...);
> >
> > If everything goes well, ExecEndForeignScan will call call my
> EndForeignScan hook, inside the hook, I
> > free the resource.
> >
> > free_resource(node->fdw_state);
> >
> > However, if during the execution an error happened, seems to me that
> EndForeignScan will not be called
> > (traced using gdb).   So my question is, is Begin/End the right place
> for allocate/free resources?
> > If it is not, what is the right way to do this?
>
> If the resource is memory, use "palloc" to allocate it, and PostgreSQL
> will take care of it automatically.
>
> Other than that, you could call RegisterXactCallback() and register a
> callback
> that will be called upon transaction end.  In the callback function you can
> free the resource if it was not already freed by EndForeignScan.
>
> Yours,
> Laurenz Albe
>


[HACKERS] Fdw cleanup

2015-12-13 Thread Feng Tian
Hi, Hackers,

I need some help to understand foreign table error handling.

For a query on a foreign table, ExecInitForeignScan is called, which in turn
calls the BeginForeignScan hook.   Inside this hook, I allocate some
resource:

node->fdw_state = allocate_resource(...);

If everything goes well, ExecEndForeignScan will call my EndForeignScan hook,
and inside that hook I free the resource:

free_resource(node->fdw_state);

However, if an error happens during execution, it seems that EndForeignScan
will not be called (traced using gdb).   So my question is: is Begin/End the
right place to allocate/free resources?   If it is not, what is the right way
to do this?

Thank you very much,
Feng


[HACKERS] Decimal64 and Decimal128

2015-09-24 Thread Feng Tian
Hi,

Here is an extension for 64- and 128-bit decimal types using IEEE decimal
floating point.  The original idea/implementation is from
http://pgxn.org/dist/pgdecimal/1.0.0/   The original thread for discussion is at

http://www.postgresql.org/message-id/CAFj8pRApakE6s-H2yJcXD=ubpukwa6i7rx4vuvtb4puhga5...@mail.gmail.com

I reimplemented it as 64/128 bits instead of 32/64 bits.  The code uses the
decNumber library instead of GCC's _Decimal64/128, and I also added more
operators.

Compared to the numeric type, decimal64 arithmetic is about 2x faster and
decimal128 is about 1.5x faster.  However, the casts between decimal and
float4/8 are implemented rather naively and are slow.   As always, it depends
on the workload: decimal may take more or less space, and may be slower if
casts are performed frequently.

Again, thanks to the original author okbob (Pavel).   Enjoy.

Thanks,
Feng
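
For anyone who wants to kick the tires, usage is plain SQL once the extension
is built.  A rough sketch only; the extension, type and cast names below are
assumptions based on the description above, so check the repository for the
actual SQL interface and build steps:

-- assumes the extension is called pgdecimal and defines decimal64/decimal128
CREATE EXTENSION pgdecimal;
SELECT '1.05'::decimal64 + '2.95'::decimal64;    -- decimal64 arithmetic
SELECT '1.05'::decimal128 * '3'::decimal128;     -- decimal128 arithmetic
SELECT '1.05'::decimal64::float8;                -- the float4/8 cast noted above as naive/slow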


Re: [HACKERS] Decimal64 and Decimal128

2015-09-24 Thread Feng Tian
On Thu, Sep 24, 2015 at 1:55 PM, Peter Geoghegan  wrote:

> On Thu, Sep 24, 2015 at 1:53 PM, Tom Lane  wrote:
> > Please include the actual patch as an attachment.  We do not consider
> mere
> > URLs to be acceptable patch submission format, because that provides no
> > permanent record in our archives of what was submitted.
>
> I was under the impression that this was not intended as a patch
> submission.
>
>
> --
> Peter Geoghegan
>

If there is enough interest, it would be great for this to go into the
official contrib directory.
Thanks,


Re: [HACKERS] Decimal64 and Decimal128

2015-09-24 Thread Feng Tian
On Thu, Sep 24, 2015 at 2:17 PM, Feng Tian <ft...@vitessedata.com> wrote:

>
>
> On Thu, Sep 24, 2015 at 1:55 PM, Peter Geoghegan <p...@heroku.com> wrote:
>
>> On Thu, Sep 24, 2015 at 1:53 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> > Please include the actual patch as an attachment.  We do not consider
>> mere
>> > URLs to be acceptable patch submission format, because that provides no
>> > permanent record in our archives of what was submitted.
>>
>> I was under the impression that this was not intended as a patch
>> submission.
>>
>>
>> --
>> Peter Geoghegan
>>
>
> If there is enough interest, would be great for it to go into the official
> contrib dir.
> Thanks,
>
>
On second thought, the extension depends on decNumber, which is available
under either the GPL or the ICU license.  Maybe this is trouble.


Re: [HACKERS] Decimal64 and Decimal128

2015-09-24 Thread Feng Tian
On Thu, Sep 24, 2015 at 1:29 PM, Feng Tian <ft...@vitessedata.com> wrote:

> Hi,
>
> Here is an extension for 64 and 128 bit decimal types using IEEE decimal
> floating point.  The original idea/implementation is from
> http://pgxn.org/dist/pgdecimal/1.0.0/   Original thread for dicussion is
> at
>
>
> http://www.postgresql.org/message-id/CAFj8pRApakE6s-H2yJcXD=ubpukwa6i7rx4vuvtb4puhga5...@mail.gmail.com
>
> I reimplemented 64/128 bits instead of 32/64 bits.  The code use decNumber
> library instead of _Decimal64/128 of GCC.   Also added more operators.
>
> Compared to numeric type, decimal64 arithmetics is about 2x faster,
> decimal128 is about 1.5x faster.  However, the cast between decimal and
> float4/8 is implemented rather naively and slow.   As always, it depends on
> workload, decimal may take more, or less space, may be slower if cast is
> frequently performed.
>
> Agains, thanks to the original author okbob (Pavel).   Enjoy.
>
> Thanks,
> Feng
>
>
Ah, link.

https://github.com/vitesse-ftian/pgdecimal

Thanks,
Feng


Re: [HACKERS] Using quicksort for every external sort run

2015-08-20 Thread Feng Tian
On Thu, Aug 20, 2015 at 1:16 PM, Peter Geoghegan p...@heroku.com wrote:

 On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian ft...@vitessedata.com wrote:
  Just a quick anecdotal evidence.  I did similar experiment about three
 years
  ago.   The conclusion was that if you have SSD, just do quick sort and
  forget the longer runs, but if you are using hard drives, longer runs is
 the
  winner (and safer, to avoid cliffs).I did not experiment with
 RAID0/5 on
  many spindles though.
 
  Not limited to sort, more generally, SSD is different enough from HDD,
  therefore it may worth the effort for backend to guess what storage
 device
  it has, then choose the right thing to do.

 The devil is in the details. I cannot really comment on such a general
 statement.

 I would be willing to believe that that's true under
 unrealistic/unrepresentative conditions. Specifically, when multiple
 passes are required with a sort-merge strategy where that isn't the
 case with replacement selection. This could happen with a tiny
 work_mem setting (tiny in an absolute sense more than a relative
 sense). With an HDD, where sequential I/O is so much faster, this
 could be enough to make replacement selection win, just as it would
 have in the 1970s with magnetic tapes.

 As I've said, the solution is to simply avoid multiple passes, which
 should be possible in virtually all cases because of the quadratic
 growth in a classic hybrid sort-merge strategy's capacity to avoid
 multiple passes (growth relative to work_mem's growth). Once you
 ensure that, then you probably have a mostly I/O bound workload, which
 can be made faster by adding sequential I/O capacity (or, on the
 Postgres internals side, adding asynchronous I/O, or with memory
 prefetching). You cannot really buy a faster CPU to make a degenerate
 heapsort faster.

 --
 Peter Geoghegan


I agree with everything in principle, except one thing: no, random I/O on
HDDs in the 2010s (relative to CPU/memory/SSD) is not any faster than tape
was in the 1970s. :-)


Re: [HACKERS] Using quicksort for every external sort run

2015-08-20 Thread Feng Tian
On Thu, Aug 20, 2015 at 10:41 AM, Peter Geoghegan p...@heroku.com wrote:

 On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs si...@2ndquadrant.com
 wrote:
  On 20 August 2015 at 03:24, Peter Geoghegan p...@heroku.com wrote:
 
 
  The patch is ~3.25x faster than master
 
 
  I've tried to read this post twice and both times my work_mem overflowed.
  ;-)
 
  Can you summarize what this patch does? I understand clearly what it
 doesn't
  do...

 The most important thing that it does is always quicksort runs, that
 are formed by simply filling work_mem with tuples in no particular
 order, rather than trying to make runs that are twice as large as
 work_mem on average. That's what the ~3.25x improvement concerned.
 That's actually a significantly simpler algorithm than replacement
 selection, and appears to be much faster. You might even say that it's
 a dumb algorithm, because it is less sophisticated than replacement
 selection. However, replacement selection tends to use CPU caches very
 poorly, while its traditional advantages have become dramatically less
 important due to large main memory sizes in particular. Also, it hurts
 that we don't currently dump tuples in batches, for several reasons.
 Better to do memory intense operations in batch, rather than having a
 huge inner loop, in order to minimize or prevent instruction cache
 misses. And we can better take advantage of asynchronous I/O.

 The complicated aspect of considering the patch is whether or not it's
 okay to not use replacement selection anymore -- is that an
 appropriate trade-off?

 The reason that the code has not actually been simplified by this
 patch is that I still want to use replacement selection for one
 specific case: when it is anticipated that a quicksort with
 spillover can occur, which is only possible with incremental
 spilling. That may avoid most I/O, by spilling just a few tuples using
 a heap/priority queue, and quicksorting everything else. That's
 compelling when you can manage it, but no reason to always use
 replacement selection for the first run in the common case where there
 well be several runs in total.

 Is that any clearer? To borrow a phrase from the processor
 architecture community, from a high level this is a Brainiac versus
 Speed Demon [1] trade-off. (I wish that there was a widely accepted
 name for this trade-off.)

 [1]
 http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate
 --
 Peter Geoghegan




Hi, Peter,

Just some quick anecdotal evidence: I did a similar experiment about three
years ago.   The conclusion was that if you have an SSD, just quicksort and
forget the longer runs, but if you are using hard drives, longer runs are
the winner (and safer, to avoid performance cliffs).    I did not experiment
with RAID0/5 on many spindles, though.

Not limited to sorting: more generally, SSDs are different enough from HDDs
that it may be worth the effort for the backend to guess what storage device
it has and then choose the right thing to do.

Cheers.


Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Feng Tian
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:

 Hi,

 I'm currently running some tests on a 3TB TPC-H data set, and I tripped
 over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
 somehow illustrates the importance of the memory-bounded hashagg patch Jeff
 Davis is working on).

 The problem is Q18, particularly this simple subquery:

 select l_orderkey
 from lineitem
 group by l_orderkey
 having sum(l_quantity) > 313;

 which is planned like this:

                                   QUERY PLAN
 ----------------------------------------------------------------------------
  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
    Group Key: l_orderkey
    Filter: (sum(l_quantity) > '313'::double precision)
    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
 (4 rows)

 but sadly, in reality the l_orderkey cardinality looks like this:

 tpch=# select count(distinct l_orderkey) from lineitem;
     count
  ------------
   4500000000
  (1 row)

 That's a helluva difference - not the usual one or two orders of
 magnitude, but a 10000x underestimate.

 The usual thing to do in this case is increasing statistics target, and
 while this improves the estimate, the improvement is rather small:

    statistics target     estimate      difference
    ------------------------------------------------
    100                   429491        10000
    1000                  4240418       1000
    10000                 42913759      100

 I find the pattern rather strange - every time the statistics target
 increases 10x, the difference decreases 10x - maybe that's natural, but the
 perfect proportionality is suspicious IMHO.

 Also, this is a quite large dataset - the table has ~18 billion rows, and
 even with target=10000 we're sampling only 3M rows, which is ~0.02%. That's
 a tiny sample, so inaccuracy is naturally expected, but OTOH the TPC-H
 dataset is damn uniform - there's pretty much no skew in the distributions
 AFAIK. So I'd expect a slightly better result.

 With target=10000 the plan switches to GroupAggregate, because the
 estimate gets sufficient to exceed work_mem (2GB). But it's still way off,
 and it's mostly just a lucky coincidence.

 So I'm wondering if there's some bug because of the dataset size (an
 integer overflow or something like), so I added a bunch of logging into the
 estimator, logging all the parameters computed:

 target=100 (samplerows=30000)
 -
 WARNING:  attnum=1 attname=l_orderkey f1=27976 ndistinct=28977
 nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.00
 denom=2024.046627 stadistinct=429491.094029
 WARNING:  ndistinct estimate attnum=1 attname=l_orderkey current=429491.09
 adaptive=443730.00

 target=1000 (samplerows=300000)
 ---
 WARNING:  attnum=1 attname=l_orderkey f1=279513 ndistinct=289644
 nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.00
 denom=20491.658538 stadistinct=4240418.111618
 WARNING:  ndistinct estimate attnum=1 attname=l_orderkey
 current=4240418.11 adaptive=4375171.00

 target=10000 (samplerows=3000000)
 -
 WARNING:  attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799
 nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.00
 denom=202578.313396 stadistinct=42913759.396282
 WARNING:  ndistinct estimate attnum=1 attname=l_orderkey
 current=42913759.40 adaptive=9882.00

 It's totalrows=18000049031 in all cases. The logs also show the estimate
 produced by the adaptive estimate (discussed in a separate thread), but
 apparently that does not change the estimates much :-(

 Any ideas?

 --
 Tomas Vondra   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





While better samples/stats are important for choosing a good plan, in this
query hash agg really is the right plan.   If a sort agg is chosen, the
performance will be really, really bad.   The patch that Jeff is working on
is critical for a decent TPC-H number (unless you have an unlimited amount
of memory).

Thanks,
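
For what it's worth, the two strategies are easy to put side by side on the
subquery itself, at least at a scale factor where the hash table fits, by
toggling the standard planner switch:

set enable_hashagg = off;
explain analyze
  select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 313;
reset enable_hashagg;
explain analyze
  select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 313;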


Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Feng Tian
On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra tomas.von...@2ndquadrant.com
wrote:

 Hi,

 On 06/20/2015 08:54 AM, Feng Tian wrote:


 While better sample/stats is important for choosing a good plan, in
 this query, hash agg is really the right plan. If a sort agg is
 chosen, the performance will be really really bad. The patch that
 Jeff is working on is critical for a decent TPCH number (unless you
 have unlimited amount of memory).


 I do agree that Jeff's memory-bounded hashagg patch is very important
 feature, and in fact we spent a fair amount of time discussing it in
 Ottawa. So I'm looking forward to getting that soon ;-)

 But I don't think hashagg is going to be very good in this particular
 case. With a 3TB dataset, the query runs out of memory on a machine with
 256GB of RAM. So let's assume a complete hash table has ~256GB. With
 work_mem=1GB that means only ~1/256 of the table can be processed in one
 batch, so we'll process the first 1/256 of the table, and write out the
 remaining 99% into batches. Then we'll read the batches one by one, and
 process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB
 into batches, and then read those ~2.49TB again. At least that's how I
 understand Jeff's memory-bounded hashagg proposal.

 The sort may perform worse in the general case, but in this case there's
 an index on the column, and the table is almost perfectly correlated by
 that column (due to generating the orders one by one, but it seems
 plausible it'd be the same in reality, assuming the orders are numbered
 using a sequence). So doing the sort by an indexscan seems rather cheap,
 and you only need to scan the table once.

 regards


 --
 Tomas Vondra   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


I have not read Jeff's patch, but here is how I think hash agg should work.

Hash agg scans the lineitem table and performs the aggregation in memory.
Once work_mem is exhausted, it writes intermediate state to disk, bucket by
bucket.  When the lineitem table is finished, it reads all tuples of one
bucket back, combines the intermediate states, and finalizes the
aggregation.   I saw a quite extensive discussion on combining aggregate
states on the dev list, so I assume it will be added.

Assuming that, modulo an efficient I/O unit size, work_mem is bigger than the
square root of the data size after aggregation, the above algorithm can
finish by writing out and reading back only once.

For TPC-H 3T, the lineitem table has about 20 billion rows and 4 to 5 billion
orders.   For the simple subquery, one needs to:
1. scan the table: 3TB of I/O;
2. write out the intermediate state: 4 billion * sizeof(key column +
   intermediate state, ~20 bytes) = 80GB;
3. read back 80GB.

If sort is used, also assuming work_mem is bigger than the square root of the
data, you need to scan the table, write out about 20B * 20 bytes ~ 400GB, and
read back 400GB.   Sort may have to do extra rounds of merging, but let's
ignore that.

Hash agg has better performance because:
1. less I/O;
2. hashing is a linear algorithm, compared to sort at n*lg(n).

[HACKERS] Float/Double cast to int

2015-05-21 Thread Feng Tian
Hi, Hackers,

Here is a query; the server was built with GCC on Linux, AMD64.


ftian=#
ftian=# select 1.5::int, 1.5::double precision::int, 314.5::int,
314.5::double precision::int;
 int4 | int4 | int4 | int4
------+------+------+------
    2 |    2 |  315 |  314
(1 row)


I believe this is because rint is broken -- can some expert on IEEE754
please help confirm that this is a bug?

Thanks,
Feng


Re: [HACKERS] Float/Double cast to int

2015-05-21 Thread Feng Tian
Ah, thanks!  I did not realize numeric comes into play.   But this is even
more interesting: I would have expected numeric to be more consistent than
float/double when dealing with things like rounding.

I missed the not-too-long-ago discussion. :-)   Regardless of the mechanisms
underneath, it would be quite hard to explain this behavior to a customer.
Maybe it is time to be brave and be compatible with reality instead of
backward-compatible?

Best,
Feng




On Thu, May 21, 2015 at 12:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Feng Tian ft...@vitessedata.com writes:
  Here is a query, server was built witch GCC on Linux, AMD64.

  ftian=# select 1.5::int, 1.5::double precision::int, 314.5::int,
  314.5::double precision::int;
   int4 | int4 | int4 | int4
  ------+------+------+------
      2 |    2 |  315 |  314
  (1 row)

  I believe this is because rint is broken -- can some expert on IEEE754
  please help confirm that this is a bug?

 rint() is doing what the IEEE spec says, ie round to nearest even.
 Your third expression is doing numeric-to-int, and that code doesn't
 obey the IEEE spec.  We've had discussions (not too long ago) about
 making these behaviors more consistent, but people seem to be too
 afraid of backwards-compatibility problems if we change it.

 regards, tom lane
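
Tom's round-to-nearest-even point is easy to see side by side; the double
precision path goes through rint(), while the numeric-to-int path rounds
halves away from zero (this should hold on any stock build):

select 313.5::double precision::int, 314.5::double precision::int;  -- 314, 314 (ties to even)
select 313.5::numeric::int,          314.5::numeric::int;           -- 314, 315 (ties away from zero)
select 1.5::int, 314.5::int;  -- bare literals are numeric: 2, 315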



Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices

2014-11-03 Thread Feng Tian
Hi,

This is Feng from Vitesse.   The performance difference between money and
numeric is *HUGE*.   For TPC-H Q1, the difference is about 5x for stock
Postgres, and ~20x for Vitesse.

With stock Postgres on my laptop, TPC-H 1G Q1 takes ~9s using the money type
and ~53s using numeric(15,2).

Kevin,
 test=# do $$ begin perform sum('1.01'::numeric) from
generate_series(1,1000); end; $$;

This may not reflect the difference between the two data types.   A single
aggregate is not where most of the time is spent; TPC-H Q1 involves much more
computation.
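
For a rough local comparison that at least exercises per-row arithmetic in
both types, something along these lines works (a sketch only; timings vary by
hardware and build, and it is still far simpler than Q1):

\timing on
-- same rows, same math, only the datatype differs
do $$ begin perform sum(g * '0.01'::numeric(15,2)) from generate_series(1, 1000000) g; end; $$;
do $$ begin perform sum(g * '0.01'::money)         from generate_series(1, 1000000) g; end; $$;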


On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck michael.ba...@credativ.de
wrote:

 On Sunday, 2014-11-02, at 12:41 -0500, Tom Lane wrote:
  BTW, after reflecting a bit more I'm less than convinced that this
  datatype is completely useless.  Even if you prefer to store currency
  values in numeric columns, casting to or from money provides a way to
  accept or emit values in whatever monetary format the LC_MONETARY locale
  setting specifies.  That seems like a useful feature, and it's one you
  could not easily duplicate using to_char/to_number (not to mention that
  those functions aren't without major shortcomings of their own).

 As an additional datapoint, Vitesse Data changed the DB schema from
 NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The
 modification to data types is easy to understand -- money and double
 types are faster than Numeric (and no one on this planet has a bank
 account that overflows the money type, not any time soon)"[1], and
 "Replaced NUMERIC fields representing currency with MONEY"[2].

 Not sure whether they modified/optimized PostgreSQL with respect to the
 MONEY data type and/or how much performance that gained, so CCing CK Tan
 as well.


 Michael

 [1]
 http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html
 [2] http://vitessedata.com/benchmark/

 --
 Michael Banck
 Projektleiter / Berater
 Tel.: +49 (2161) 4643-171
 Fax:  +49 (2161) 4643-100
 Email: michael.ba...@credativ.de

 credativ GmbH, HRB Mönchengladbach 12080
 USt-ID-Nummer: DE204566209
 Hohenzollernstr. 133, 41061 Mönchengladbach
 Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer






Fwd: [HACKERS] Vitesse DB call for testing

2014-10-17 Thread Feng Tian
Hi, Tom,

Sorry for double post to you.

Feng

-- Forwarded message --
From: Feng Tian ft...@vitessedata.com
Date: Fri, Oct 17, 2014 at 10:29 AM
Subject: Re: [HACKERS] Vitesse DB call for testing
To: Tom Lane t...@sss.pgh.pa.us


Hi, Tom,

I agree that using int128 in stock Postgres will speed things up too.  On the
other hand, that is only one part of the equation.   For example, if you look
at TPC-H Q1, the int128 "cheating" does not kick in at all, but we are 8x
faster.

I am not sure what you mean by actual data access.   Data is still in stock
Postgres format on disk.  We did indeed JIT all data field access (tuple
deforming).    To put things in perspective, I just timed select count(*) and
select count(l_orderkey) from tpch1.lineitem; our code is bottlenecked by
memory bandwidth and the difference is pretty much invisible.

Thanks,
Feng


ftian=# set vdb_jit = 0;
SET
Time: 0.155 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)

Time: 688.658 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)

Time: 690.753 ms
ftian=# select count(l_orderkey) from tpch1.lineitem;
  count
---------
 6001215
(1 row)

Time: 819.452 ms
ftian=# set vdb_jit = 1;
SET
Time: 0.167 ms
ftian=# select count(*) from tpch1.lineitem;
  count
---------
 6001215
(1 row)

Time: 203.543 ms
ftian=# select count(l_orderkey) from tpch1.lineitem;
  count
---------
 6001215
(1 row)

Time: 202.253 ms
ftian=#




On Fri, Oct 17, 2014 at 10:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 CK Tan ck...@vitessedata.com writes:
  The bigint sum,avg,count case in the example you tried has some
 optimization. We use int128 to accumulate the bigint instead of numeric in
 pg. Hence the big speed up. Try the same query on int4 for the improvement
 where both pg and vitessedb are using int4 in the execution.

 Well, that's pretty much cheating: it's too hard to disentangle what's
 coming from JIT vs what's coming from using a different accumulator
 datatype.  If we wanted to depend on having int128 available we could
 get that speedup with a couple hours' work.

 But what exactly are you compiling here?  I trust not the actual data
 accesses; that seems far too complicated to try to inline.

 regards, tom lane





[HACKERS] Optimizer on sort aggregate

2014-10-17 Thread Feng Tian
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from
generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN
-------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query

select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to

select count(distinct j) from t group by i, t;  -- note the switched ordering
select count(distinct j) from t group by decode(t, 'escape'), i;  -- convert t to bytea

the run times are just about 5 and 6.5 seconds.  The reason is clear:
comparing strings under a collation is slow, which is well understood by pg
hackers.  However, here the sort order is forced by the planner, not the
user.  The planner could do the following optimizations for the sort it
generates for a sort agg:

1. switch the column ordering, putting the int before the string;
2. use a bytea comparison instead of a string comparison.

They would bring a big improvement to this common query.   Is this something
reasonable?

Thanks,
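
Until the planner does something like this, a user-level workaround in the
same spirit as the decode() trick above is to force the C collation on the
text sort key; text equality is bytewise anyway, so the grouping itself is
unchanged (a sketch, not something the planner does for you):

select count(distinct j) from t group by t collate "C", i;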


Re: [HACKERS] Optimizer on sort aggregate

2014-10-17 Thread Feng Tian
Hi, David,

Yes, switching the sort order would lose an interesting order, so if the user
dictates ORDER BY t, i the planner would need to fall back to its cost model.
Estimating the cardinality of a GROUP BY is a much bigger topic than this
thread.

I feel that sorting strings as if they were bytea is particularly
interesting.  I am aware of Peter G's patch and I think it is great, but for
this sort-agg case, first, I believe it is still slower than sorting bytea,
and second, Peter G's patch depends on the data.   A common long prefix makes
the patch less effective, which is probably not so uncommon (for example,
URLs sharing a domain prefix).  I don't see any downside to sorting as bytea,
other than losing the interesting ordering.

Thanks,
Feng



On Fri, Oct 17, 2014 at 4:36 PM, David Rowley dgrowle...@gmail.com wrote:

 On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian feng...@gmail.com wrote:

 Hi,

 Consider the following queries.

 create table t(i int, j int, k int, t text);
 insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from
 generate_series(1, 100) i;

 ftian=# explain select count(distinct j) from t group by t, i;
QUERY PLAN
 
  GroupAggregate  (cost=158029.84..178029.84 rows=100 width=22)
-  Sort  (cost=158029.84..160529.84 rows=100 width=22)
  Sort Key: t, i
  -  Seq Scan on t  (cost=0.00..17352.00 rows=100 width=22)
 (4 rows)


 The query,
 select count(distinct j) from t group by t, i;

 runs for 35 seconds.  However, if I change the query to,
 select count(distinct j) from t group by i, t;  -- note switching the
 ordering
 select count(distinct j) from t group by decode(t, 'escape'), i; --
 convert t to bytea

 Run times are just about 5 and 6.5 seconds.  The reason is clear, compare
 a string with collation is slow, which is well understood by pg hackers.
 However, here, the sorting order is forced by the planner, not user.
 Planner can do the following optimizations,

 1. for the sort we generated for sort agg, planner can switch column
 ordering, put int before string,
 2. for the sort we generated for sort agg, use bytea compare instead of
 string compare.

 They will bring big improvement to this common query.   Is this something
 reasonable?


 It seems like it might be worth looking into, but I think it's likely more
 complex than sorting the group by order into narrowest column first.

 If the query was:

 select count(distinct j) from t group by t, i order by t, i;

 Then if that was  rewritten to make the group by i,t then the planner
 would then need to perform an additional sort on t,i to get the correct
 order by for the final result, which may or may not be faster, it would
 depend on how many groups there were to sort. Though I guess you could
 possibly just skip the optimisation if the next node up didn't need any
 specific ordering.

 I also wonder if taking into account the column's width is not quite
 enough. For example if the 'i' column only had 1 distinct value, then
 performing the group by on 'i' first wouldn't help at all. Keep in mind
 that the columns could be much closer in width than in your example, e.g
 int and bigint. Perhaps you'd need to include some sort of heuristics to
 look at the number of distinct values and the width, and form some sort of
 weighted estimates about which column to put first.

 Regards

 David Rowley



Re: [HACKERS] libpq connectoin redirect

2010-04-22 Thread feng tian

While these can be handled at a higher level, for example by setting up LDAP
or, as Heikki suggested, by tricking DNS, the problem is that I don't have
control over how the users connect to the server.  They may not use LDAP.
A solution like pgbouncer has an advantage: the user just gets one IP/port
and everything else happens automatically.

Thanks,


 Subject: Re: [HACKERS] libpq connectoin redirect
 From: li...@jwp.name
 Date: Wed, 21 Apr 2010 15:52:39 -0700
 CC: pgsql-hackers@postgresql.org
 To: ft...@hotmail.com
 
 On Apr 20, 2010, at 10:03 PM, feng tian wrote:
  Another way to do this, is to send the client an redirect message.  When 
  client connect to 127.0.0.10, instead of accepting the connection, it can 
  reply to client telling it to reconnect to one of the server on 
  127.0.0.11-14.  
 
 ISTM that this would be better handled at a higher-level. That is, given a 
 server (127.0.0.10) that holds 127.0.0.11-14. Connect to that server and 
 query for the correct target host.
  

Re: [HACKERS] libpq connectoin redirect

2010-04-22 Thread feng tian

Hi, John,

The change will be on the libpq client side.  I am not saying this is a
general solution for distributed transactions or scale-out; however, in many
cases it is very useful.  For example, in my case I have about 100
departments, each with its own database.  The balancer machine can simply
redirect to the right box according to database/user.  The four boxes I have
may not even get domain names or static IPs.  In another scenario, if I have
some kind of replication set up, I can send transaction processing to the
master and analytic reporting queries to the slaves.

Thanks,
Feng

feng tian wrote:
> Hi,
>
> I want to load balance a postgres server on 4 physical machines, say
> 127.0.0.11-14.  I can set up a pgbouncer on 127.0.0.10 and connection
> pooling to my four boxes.  However, the traffic from/to clients will
> go through an extra hop.  Another way to do this is to send the
> client a redirect message.  When a client connects to 127.0.0.10,
> instead of accepting the connection, it can reply to the client
> telling it to reconnect to one of the servers on 127.0.0.11-14.
>
> I am planning to write/submit a patch to do that.  I wonder if there
> is a similar effort in extending the libpq protocol, or if you have
> better ideas on how to achieve this.

how do you plan on maintaining consistency, transactional integrity and
atomicity of updates across these 4 machines?
  

[HACKERS] libpq connectoin redirect

2010-04-21 Thread feng tian

Hi,

I want to load balance a postgres server on 4 physical machines, say
127.0.0.11-14.  I can set up a pgbouncer on 127.0.0.10 with connection
pooling to my four boxes.  However, the traffic from/to clients will go
through an extra hop.  Another way to do this is to send the client a
redirect message: when a client connects to 127.0.0.10, instead of accepting
the connection, it can reply to the client telling it to reconnect to one of
the servers on 127.0.0.11-14.

I am planning to write/submit a patch to do that.  I wonder if there is a
similar effort to extend the libpq protocol, or if you have better ideas on
how to achieve this.

Thank you,
Feng

  

[HACKERS] Adding some new operators

2001-06-11 Thread Feng Tian

Hi,

I am trying to use PostgreSQL in a research project.  I need to add some
new operators, some of which will probably take more than two input tuple
streams.  The new operators should be considered by the optimizer when
estimating costs and choosing a plan.

Can anyone tell me how hard this will be?  And
where is a good place to start? 

Thanks.

Feng






[HACKERS] Adding some new operators

2001-06-06 Thread Feng Tian

Hi,

I am trying to use PostgreSQL in a research project.  I need to add several
new operators, some of which will probably take more than two input tuple
streams.  The new operators should be considered by the optimizer when
estimating costs and choosing a plan.

Can anyone tell me how hard this will be?  And
where is a good place to start? 

Thanks.

Feng



