Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-03 Thread Arthur Silva
On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund  wrote:

> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > In the attached patch I've attached simplehash.h, which can be
> > customized by a bunch of macros, before being inlined.  There's also a
> > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > execGrouping.c.
>
> Attached is a significantly improved version of this series.  The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
>   commit message of 0002 (or simplehash.h) for an introduction. That
>   significantly reduces the impact of "clustering" due to linear
>   addressing.
> - Significant comment and correctness fixes, both in simplehash.h
> - itself, and 0003/0004.
> - a lot of other random performance improvements for the hashing code.
>
>
> Unfortunately I'm running out battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them all the results after the patches were better than
> before.
>
>
> This patch series currently consists out of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely
>   mandatory, but they do provide a quite consistent improvement. There
>   are other threads where they proved to be beneficial, so I see little
>   reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>   for the issue mentioned in [1], to improve peformance in heavily lossy
>   scenarios (otherwise we could regress in some extreme cases)
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>
>
> While not quite perfect yet, I do think this is at a state where input
> is needed.  I don't want to go a lot deeper into this rabbit hole,
> before we have some agreement that this is a sensible course of action.
>
>
> I think there's several possible additional users of simplehash.h,
> e.g. catcache.c - which has an open coded, not particularly fast hash
> table and frequently shows up in profiles - but I think the above two
> conversions are plenty to start with.
>
>
> Comments?
>
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20160923205843.zcs
> 533sqdzlh4cpo%40alap3.anarazel.de
>
>
>
>
This is a great addition.

A couple of comments.

* 80% occupancy is a bit conservative for RH hashing; it works well up to
90% if you use the early stops based on probe distance. So that TODO is
worth pursuing.

* An optimized memory layout for an RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
especially well with high occupancies (~90%). Thanks to RH the probes over
the H array are very short and stay within a single cache line. Even with a
31-bit hash it's really rare to have to probe more than one entry in the KV
array, essentially guaranteeing that the 99th percentile of lookups only
touches a couple of cache lines (ignoring cache-line crossings and other
details).
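To make the layout concrete, here is a minimal sketch of the HHHKVKV idea
(illustrative names, not the simplehash.h code): a dense array of 32-bit
hash slots, each using its top bit as the present/absent flag and the low
31 bits as the stored hash, plus a parallel key/value array that is only
touched on a 31-bit match. Robin-hood displacement and the distance-based
early stop are omitted for brevity.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical key/value entry; a real table would be macro-generated. */
typedef struct { uint64_t key; uint64_t value; } kv_entry;

typedef struct
{
    uint32_t  *hashes;   /* HHH... array: bit 31 = occupied, bits 0-30 = hash */
    kv_entry  *entries;  /* KVKV... array, parallel to hashes                 */
    uint32_t   mask;     /* capacity - 1, capacity is a power of two          */
} rh_table;

#define SLOT_OCCUPIED 0x80000000u
#define SLOT_HASH(h)  ((h) & 0x7fffffffu)

/* Probe the hash array first; touch the KV array only on a 31-bit match. */
static bool
rh_lookup(rh_table *t, uint64_t key, uint32_t hash, uint64_t *value)
{
    uint32_t h31 = SLOT_HASH(hash);
    uint32_t i = hash & t->mask;

    for (;;)
    {
        uint32_t slot = t->hashes[i];

        if (!(slot & SLOT_OCCUPIED))
            return false;               /* empty slot: key is absent */
        if (SLOT_HASH(slot) == h31 && t->entries[i].key == key)
        {
            *value = t->entries[i].value;
            return true;
        }
        i = (i + 1) & t->mask;          /* linear probing over the H array */
    }
}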


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Arthur Silva
On Sep 14, 2016 5:18 PM, "Robert Haas"  wrote:
>
> On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
>  wrote:
> > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per
page
> > bitmap size. Thats about 36 bytes for 8K page. IOW if on an average
there
> > are 6 or more dead tuples per page, bitmap will outperform the current
> > representation, assuming max allocation for bitmap. If we can use
additional
> > estimates to restrict the size to somewhat more conservative value and
then
> > keep overflow area, then probably the break-even happens even earlier
than
> > that. I hope this gives us a good starting point, but let me know if you
> > think it's still a wrong approach to pursue.
>
> Well, it's certainly a bigger change.  I think the big concern is that
> the amount of memory now becomes fixed based on the table size.  So
> one problem is that you have to figure out what you're going to do if
> the bitmap doesn't fit in maintenance_work_mem.  A related problem is
> that it might fit but use more memory than before, which could cause
> problems for some people.  Now on the other hand it could also use
> less memory for some people, and that would be good.
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.
>
> I'm fairly sure that our number one priority should be to minimize the
> number of cases where we need to do multiple scans of the indexes to
> stay within maintenance_work_mem.  If we're satisfied we've met that
> goal, then within that we should try to make VACUUM as fast as
> possible with as little memory usage as possible.  I'm not 100% sure I
> know how to get there, or how much work it's worth expending.  In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
> /me thinks a bit.
>
> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!
>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.  And even if our
> memory limit is very high, we want to avoid using memory gratuitously;
> I think we should try to grow memory usage incrementally with either
> representation.
>
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>

I'd say it's an idea worth pursuing. It's the core idea behind Roaring
bitmaps, arguably the best overall compressed bitmap implementation.
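For illustration, a minimal sketch of the per-segment scheme described
above (the constants and names are assumptions for an 8 kB page / 1 GB
segment, not code from any patch): each segment holds either a bitmap or a
sorted TID array, whichever is smaller for its dead-tuple density.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative constants, roughly matching an 8 kB page / 1 GB segment. */
#define PAGES_PER_SEGMENT    131072   /* 1 GB / 8 kB             */
#define MAX_TUPLES_PER_PAGE  291      /* ~MaxHeapTuplesPerPage   */

typedef struct { uint32_t page; uint16_t offset; } DeadTid;  /* 6 bytes of payload */

typedef struct DeadTupleSegment
{
    bool      use_bitmap;
    uint8_t  *bitmap;     /* PAGES_PER_SEGMENT * MAX_TUPLES_PER_PAGE bits */
    DeadTid  *tids;       /* sorted array, ntids entries                  */
    uint32_t  ntids;
} DeadTupleSegment;

/* Pick the representation from an estimate of dead tuples in the segment. */
static bool
segment_prefers_bitmap(uint64_t est_dead_tuples)
{
    uint64_t bitmap_bytes = (uint64_t) PAGES_PER_SEGMENT * MAX_TUPLES_PER_PAGE / 8;
    uint64_t array_bytes  = est_dead_tuples * 6;   /* ~4.5 MB vs 6 bytes per TID */

    return bitmap_bytes < array_bytes;
}

/* O(1) membership test when the bitmap form is in use (offsets are 1-based). */
static bool
bitmap_is_dead(const DeadTupleSegment *seg, uint32_t page, uint16_t offset)
{
    uint64_t bit = (uint64_t) page * MAX_TUPLES_PER_PAGE + (offset - 1);

    return (seg->bitmap[bit / 8] >> (bit % 8)) & 1;
}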


Re: [HACKERS] Status of 64 bit atomics

2016-05-27 Thread Arthur Silva
On May 27, 2016 5:01 PM, "John Gorman"  wrote:
>
> Hi All
>
> Someone recently told me that the postgresql atomics library was
incomplete
> for 64 bit operations such as pg_atomic_fetch_add_u64() and should not be
used.
>
> Can someone definitively confirm whether it is okay to rely on the 64 bit
atomics
> or whether it is better to protect 64 bit operations with a spinlock?
>
> Thanks!
> John

Go has assembly implementations of these even on 32-bit platforms (see
https://github.com/golang/go/tree/master/src/sync/atomic).

Couldn't we borrow them? Or, even better, fall back to a spinlock for these,
but transparently.
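For reference, the transparent fallback amounts to something like the
following minimal sketch, using a POSIX mutex in place of a real spinlock
and a hypothetical emulated_atomic_u64 type; this is not PostgreSQL's
port/atomics code.

#include <stdint.h>
#include <pthread.h>

/* One lock protecting the 64-bit value; a real implementation would
 * typically stripe locks by address to reduce contention. */
typedef struct
{
    pthread_mutex_t   lock;
    volatile uint64_t value;
} emulated_atomic_u64;

static uint64_t
emulated_fetch_add_u64(emulated_atomic_u64 *a, uint64_t add)
{
    uint64_t old;

    pthread_mutex_lock(&a->lock);
    old = a->value;
    a->value = old + add;
    pthread_mutex_unlock(&a->lock);

    return old;        /* same contract as a fetch-add: return the old value */
}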


Re: [HACKERS] 9.5 release notes

2015-08-20 Thread Arthur Silva
On Sat, Aug 8, 2015 at 11:04 PM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Aug  9, 2015 at 01:24:33AM +1200, David Rowley wrote:
 
  On 7 August 2015 at 14:24, Bruce Momjian br...@momjian.us wrote:
 
  On Tue, Jun 30, 2015 at 09:00:44PM +0200, Andres Freund wrote:
   * 2014-12-08 [519b075] Simon ..: Use GetSystemTimeAsFileTime
 directly in
  win32
 2014-12-08 [8001fe6] Simon ..: Windows: use
  GetSystemTimePreciseAsFileTime if ..
 Timer resolution isn't a unimportant thing for people using
 explain?
 
  This all seemed very internals-only, e.g.:
 
  On most Windows systems this change will actually have no
 significant
  effect on
  timestamp resolution as the system timer tick is typically
 between 1ms
  and 15ms
  depending on what timer resolution currently running
 applications have
  requested. You can check this with clockres.exe from
 sysinternals.
  Despite the
  platform limiation this change still permits capture of finer
  timestamps where
  the system is capable of producing them and it gets rid of an
  unnecessary
  syscall.
 
  Was I wrong?
 
 
 
  This does have a user visible change. Timestamps are now likely to have 6
  digits after the decimal point, if they're on a version of windows which
  supports GetSystemTimePreciseAsFileTime();
 
  Master:
 
  postgres=# select now();
now
  ---
   2015-08-09 01:14:01.959645+12
  (1 row)
 
  9.4.4
  postgres=# select now();
  now
  
   2015-08-09 01:15:09.783+12
  (1 row)

 Yes, this was already in the release notes:

 Allow higher-precision timestamp resolution on Windows 8 or
 Windows Server 2012 and later Windows systems (Craig Ringer)

 I am not sure why people were saying it was missing.

 --
   Bruce Momjian  br...@momjian.us  http://momjian.us
   EnterpriseDB http://enterprisedb.com

   + Everyone has their own god. +




Are we landing pg_trgm 1.2 in PG 9.5?

If yes (and we should), an improvement of up to an order of magnitude is
worth including in the release notes.

--
Arthur Silva


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Arthur Silva
On Thu, Jul 30, 2015 at 5:31 PM, Gavin Flower gavinflo...@archidevsys.co.nz
 wrote:

 On 31/07/15 02:24, Heikki Linnakangas wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be leaved with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
 distinguish old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots,
 proc array etc. And expand clog to handle 64-bit XIDs. But keep the
 xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to
 the page header so that logically the xmin/xmax fields on the page are 64
 bits wide, but physically stored in 32 bits. That's possible as long as no
 two XIDs on the same page are more than 2^31 XIDs apart. So you still need
 to freeze old tuples on the page when that's about to happen, but it would
 make it possible to have more than 2^32 XID transactions in the clog. You'd
 never be forced to do anti-wraparound vacuums, you could just let the clog
 grow arbitrarily large.

 There is a big downside to expanding xmin/xmax to 64 bits: it takes
 space. More space means more memory needed for caching, more memory
 bandwidth, more I/O, etc.

 - Heikki



 I think having a special case to save 32 bits per tuple would cause
 unnecessary complications, and the savings are minimal compared to the size
 of current modern storage devices and the typical memory used in serious
 database servers.

 I think it is too much pain for very little gain, especially when looking
 into the future growth in storage capacity andbandwidth.

 The early mainframes used a base displacement technique to keep the size
 of addresses down in instructions: 16 bit addresses, comprising 4 bits for
 a base register and 12 bits for the displacement (hence the use of 4KB
 pages sizes now!).  Necessary at the time when mainframes were often less
 than 128 KB!  Now it would ludicrous to do that for modern servers!


 Cheers,
 Gavin

 (Who is ancient enough, to have programmed such MainFrames!)





On the other hand, PG's tuple overhead is already the largest among the
alternatives. Even if storage keeps getting faster and cheaper, you can't
ignore the cost of adding yet another 8 bytes to each tuple.
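To put a rough number on that, a back-of-the-envelope sketch; the 23-byte
header and 4-byte xmin/xmax figures are the commonly cited ones, and the
actual on-disk layout depends on alignment.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t header_now   = 23;             /* heap tuple header, bytes  */
    const uint64_t xid_widening = 2 * (8 - 4);    /* xmin + xmax: 4 -> 8 bytes */
    const uint64_t rows         = 1000000000ULL;  /* 1 billion tuples          */

    printf("extra per tuple: %llu bytes\n", (unsigned long long) xid_widening);
    printf("extra for %llu rows: ~%.1f GB\n",
           (unsigned long long) rows,
           (double) (rows * xid_widening) / (1024.0 * 1024.0 * 1024.0));
    printf("relative header growth: ~%.0f%%\n",
           100.0 * (double) xid_widening / (double) header_now);
    return 0;
}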


Re: [HACKERS] GSoC 2015: SP-GIST for geometrical objects

2015-03-27 Thread Arthur Silva
On Mar 27, 2015 11:08 AM, Dima Ivanovskiy dima...@mail.ru wrote:

 Hello, I am Dmitrii, student of Moscow Institute of Physics and Technology

 Abstract:

 I chose project Indexing prolonged geometrical objects (i.e. boxes,
circles, polygons, not points) with SP-GiST by mapping to 4d-space.
 According to the presentation
 https://www.pgcon.org/2011/schedule/attachments/197_pgcon-2011.pdf
 SP-GIST 3 times faster than GiST in some cases. But GIST supports
geometrical data types:
 box, circle, polygon with operators:|   | @ @ @ |
| ~ ~=
 Popular spatial extension PostGIS doesn't include SP-GIST, but has a lot
of geometrical features.

 Project details:

 After meeting with Alexander Korotkov, I wrote some plan.
 Using of K-D-tree and Quadtree in building index for geometrical data
types can increase speed of search in some cases.
 The main idea is representing 2-D geometrical objects in their bounding
box. Set of 2-D boxes is 4-D space.
 New _ops will work with points from 4-D space, for example kd_box_ops,
quad_circle_ops and will support all geometrical operators.
 After conversion object to their bounding box algo has set of tuples (x1,
y1, x2, y2).
 Our goal is separate this space the most equally. If we talk about
K-D-tree, on first step K-D-tree algorithm will split space in 2 parts by
the first coordinate, in next step by the second coordinate etc., after
4-th coordinate we repeat this procedure.
 At the end we have index at geometrical objects and use traversal tree
for every search operator.

 Postgresql has already has realization ideas of MBR in gist/gistproc.c.
So I will transfer this realization to other type of tree.

 Of cource, I assume that SP-GIST can be not the best decision of this
problem. So after testing this clear methods, I will try to find more
effective way. Maybe with using combination of different spatial tree
structures.

 Project Schedule:

 until May 25

 Read documentation and source code, clarify details of implementation.

 1st month

 Implement new '_ops' with all geometrical operators for box, circle,
polygon

 2nd month

 Research new methods for increase speed of geometrical query

 3rd month

 Final refactoring, testing and submitting a patch.


 Links:

 http://www.sai.msu.su/~megera/postgres/talks/gist_tutorial.html - about
GIST
 https://toster.ru/q/27135#answer_110197 - people need SP-GIST for cubes
 http://www.slideshare.net/profyclub_ru/o-lt - presentation about indexes
 http://pgconf.ru/static/presentations/2015/korotkov_spatial.pdf - working
with geo objects

Nice proposal.

Dynamic k-d trees can perform badly because the splitting median can drift
further and further off as updates arrive. What are your thoughts on that?

Also, what's up with the 4-D space? I don't quite get it; these types have 2
or 3 dimensions.


Re: [HACKERS] GSoC 2015: SP-GIST for geometrical objects

2015-03-27 Thread Arthur Silva
On Mar 27, 2015 6:41 PM, Dima Ivanovskiy dima...@mail.ru wrote:


 On Mar 27, 2015 11:08 AM, Dima Ivanovskiy dima...@mail.ru wrote:
 
  Hello, I am Dmitrii, student of Moscow Institute of Physics and
Technology
 
  Abstract:
 
  I chose project Indexing prolonged geometrical objects (i.e. boxes,
circles, polygons, not points) with SP-GiST by mapping to 4d-space.
  According to the presentation
  https://www.pgcon.org/2011/schedule/attachments/197_pgcon-2011.pdf
  SP-GIST 3 times faster than GiST in some cases. But GIST supports
geometrical data types:
  box, circle, polygon with operators:|   | @ @ @
| | ~ ~=
  Popular spatial extension PostGIS doesn't include SP-GIST, but has a
lot of geometrical features.
 
  Project details:
 
  After meeting with Alexander Korotkov, I wrote some plan.
  Using of K-D-tree and Quadtree in building index for geometrical data
types can increase speed of search in some cases.
  The main idea is representing 2-D geometrical objects in their
bounding box. Set of 2-D boxes is 4-D space.
  New _ops will work with points from 4-D space, for example kd_box_ops,
quad_circle_ops and will support all geometrical operators.
  After conversion object to their bounding box algo has set of tuples
(x1, y1, x2, y2).
  Our goal is separate this space the most equally. If we talk about
K-D-tree, on first step K-D-tree algorithm will split space in 2 parts by
the first coordinate, in next step by the second coordinate etc., after
4-th coordinate we repeat this procedure.
  At the end we have index at geometrical objects and use traversal tree
for every search operator.
 
  Postgresql has already has realization ideas of MBR in
gist/gistproc.c. So I will transfer this realization to other type of tree.
 
  Of cource, I assume that SP-GIST can be not the best decision of this
problem. So after testing this clear methods, I will try to find more
effective way. Maybe with using combination of different spatial tree
structures.
 
  Project Schedule:
 
  until May 25
 
  Read documentation and source code, clarify details of implementation.
 
  1st month
 
  Implement new '_ops' with all geometrical operators for box, circle,
polygon
 
  2nd month
 
  Research new methods for increase speed of geometrical query
 
  3rd month
 
  Final refactoring, testing and submitting a patch.
 
 
  Links:
 
  http://www.sai.msu.su/~megera/postgres/talks/gist_tutorial.html -
about GIST
  https://toster.ru/q/27135#answer_110197 - people need SP-GIST for cubes
  http://www.slideshare.net/profyclub_ru/o-lt - presentation about
indexes
  http://pgconf.ru/static/presentations/2015/korotkov_spatial.pdf -
working with geo objects
 
 Nice proposal.

 Dynamic Kdtrees can perform badly as the splitting median can get way
off as updates are coming. What are your thoughts about that?

 Also what's up with the 4d space? I don't quite get it. These types are
2 or 3 dimensions.


 I read spgist README one more time. I didn't find the mechanism for
maintaining good balance after updates.
 I think we can use Bkd-Tree,
https://www.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf. But It
can be not the best solving.
 I include Research time in 2nd month of timeline.

 About 4d space. All these types are 2 dimensional.
 Just as in an R-tree, each object is approximated by its MBR. The MBR of a
2-D object can be mapped to a 4-D point. More generally, an n-D object's MBR
can be mapped to a 2n-D point.



The reason I said that is that you pointed to performance as one motivating
factor (and the lack of balancing properties can degrade performance really
fast on larger indexes).

The Bkd variant seems interesting, but I don't think SP-GiST provides enough
abstraction to implement it.

A bounding box can still be inserted/queried with a 2-D k-d tree, so I don't
know why you call it 4-D. I assume it's a matter of naming.

Overall the proposal seems entirely doable, although I have doubts about its
usefulness. But I'm no one... You should have some feedback from the
mentors soon.
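For concreteness, the 4-D mapping being discussed is just treating each 2-D
bounding box as a single point in 4-D space; a minimal sketch with
illustrative types and names, not code from any patch:

/* A 2-D minimum bounding rectangle and its representation as a 4-D point.
 * An n-D box maps to a 2n-D point the same way. */
typedef struct { double xlo, ylo, xhi, yhi; } Box2D;
typedef struct { double coord[4]; } Point4D;

static Point4D
box_to_4d_point(Box2D box)
{
    Point4D p;

    p.coord[0] = box.xlo;
    p.coord[1] = box.ylo;
    p.coord[2] = box.xhi;
    p.coord[3] = box.yhi;
    return p;
}

/* "box a overlaps box b" becomes a simple range condition on the 4-D
 * coordinates: a.xlo <= b.xhi, a.xhi >= b.xlo, and likewise for y. */
static int
overlaps_4d(Point4D a, Point4D b)
{
    return a.coord[0] <= b.coord[2] && a.coord[2] >= b.coord[0] &&
           a.coord[1] <= b.coord[3] && a.coord[3] >= b.coord[1];
}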


Re: [HACKERS] pg_rewind in contrib

2015-03-26 Thread Arthur Silva
On Mar 26, 2015 4:20 AM, Vladimir Borodin r...@simply.name wrote:


 On 26 March 2015, at 7:32, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Thu, Mar 26, 2015 at 12:23 PM, Venkata Balaji N nag1...@gmail.com
wrote:

 Test 1 :

 [...]

 If the master is crashed or killed abruptly, it may not be possible to
do a
 rewind. Is my understanding correct ?


 Yep. This is mentioned in the documentation:
 http://www.postgresql.org/docs/devel/static/app-pgrewind.html
 The target server must shut down cleanly before running pg_rewind.


 You can start old master, wait for crash recovery to complete, stop it
cleanly and then use pg_rewind. It works.


Shouldn't we have a flag so it does that automatically if necessary?


 Test 2 :

 - On a successfully running streaming replication with one master and
one
 slave, i did a clean shutdown of master
 - promoted slave
 - performed some operations (data changes) on newly promoted slave and
did a
 clean shutdown
 - Executed pg_rewind on the old master to sync with the latest changes
on
 new master. I got the below message

 The servers diverged at WAL position 0/A298 on timeline 1.
 No rewind required.

 I am not getting this too.


 In this case the master WAL visibly did not diverge from the slave WAL
 line. A rewind is done if the master touches new relation pages after
 the standby has been promoted, and before the master is shutdown.
 --
 Michael





 --
 May the force be with you...
 https://simply.name/ru



[HACKERS] Paper from IBM: Memory-Efficient Hash Joins

2015-03-13 Thread Arthur Silva
I've come across this paper today, I thought I'd share it here too.

http://www.vldb.org/pvldb/vol8/p353-barber.pdf


--
Arthur Silva


Re: [HACKERS] logical column ordering

2015-02-27 Thread Arthur Silva
On Fri, Feb 27, 2015 at 4:44 PM, Tomas Vondra tomas.von...@2ndquadrant.com
wrote:

 On 27.2.2015 20:34, Alvaro Herrera wrote:
  Tomas Vondra wrote:
 
  I think we could calls to the randomization functions into some of the
  regression tests (say 'create_tables.sql'), but that makes regression
  tests ... well, random, and I'm not convinced that's a good thing.
 
  Also, this makes regression tests harder to think, because SELECT *
  does different things depending on the attlognum order.
 
  No, that approach doesn't seem very useful.  Rather, randomize the
  columns in the CREATE TABLE statement, and then fix up the attlognums so
  that the SELECT * expansion is the same as it would be with the
  not-randomized CREATE TABLE.

 Yes, that's a possible approach too - possibly a better one for
 regression tests as it fixes the 'SELECT *' but it effectively uses
 fixed 'attlognum' and 'attnum' values (it's difficult to randomize
 those, as they may be referenced in other catalogs).

 --
 Tomas Vondrahttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services




Sorry to intrude; I've been following this thread and I was wondering
whether it would allow (in the currently planned form, or in the future) a
wider set of non-rewriting DDLs in Postgres. For example, dropping a column
without rewriting the table.


Re: [HACKERS] logical column ordering

2015-02-27 Thread Arthur Silva
On Feb 27, 2015 5:02 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

 Arthur Silva wrote:

  Sorry to intrude, I've been following this post and I was wondering if
it
  would allow (in the currently planed form or in the future) a wider set
of
  non-rewriting DDLs to Postgres. For example, drop a column without
  rewriting the table.

 Perhaps.  But dropping a column already does not rewrite the table, only
 marks the column as dropped in system catalogs, so do you have a better
 example.

 One obvious example is that you have

 CREATE TABLE t (
t1 int,
t3 int
 );

 and later want to add t2 in the middle, the only way currently is to
 drop the table and start again (re-creating all dependant views, FKs,
 etc).  With the patch you will be able to add the column at the right
 place.  If no default value is supplied for the new column, no table
 rewrite is necessary at all.

 --
 Álvaro Herrerahttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services

Cool! I didn't know I could drop a column without a rewrite.

Yeah, that's another example; people do these things from GUI tools. That's
a nice side effect. Cool (again)!


Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 11:19 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote:
  I don't think the password storing best practices apply to db
 connection
  authentication.

 Why not?

 --
 Peter Geoghegan


I assume that if an attacker can intercept the server's unencrypted traffic
and/or has access to its hard drive, the database is compromised anyway.

I may be missing something, though.


Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 11:25 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote:
  I assume if the hacker can intercept the server unencrypted traffic
 and/or
  has access to its hard-drive the database is compromised anyway.

 That sounds like an argument against hashing the passwords in general.


 --
 Peter Geoghegan


Indeed.

In a perfect world SCRAM would be my choice. FWIW, MongoDB 3.0 also uses
SCRAM as the preferred method for password-based authentication.


Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 10:32 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com
 wrote:
  Although the patch was described as relatively easy to write, it never
  went anywhere, because it *replaced* MD5 authentication with bcrypt,
  which would be a big problem for existing clients.  It seems clear
  that we should add something new and not immediately kill off what
  we've already got, so that people can transition smoothly.  An idea I
  just had today is to keep using basically the same system that we are
  currently using for MD5, but with a stronger hash algorithm, like
  SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
  SHA-512).  Those are slower, but my guess is that even SHA-512 is not
  enough slower for anybody to care very much, and if they do, well
  that's another reason to make use of the new stuff optional.

 I believe that a big advantage of bcrypt for authentication is the
 relatively high memory requirements. This frustrates GPU based
 attacks.


 --
 Peter Geoghegan




There's also scrypt, which can be tuned for both memory and compute
requirements.

I don't think password-storage best practices apply to database connection
authentication, so SHA-256 (or any other hash that isn't terribly broken) is
probably fine for Pg.


Re: [HACKERS] VODKA?

2015-01-07 Thread Arthur Silva
On Jan 6, 2015 7:14 PM, Josh Berkus j...@agliodbs.com wrote:

 Oleg, Teodor:

 I take it VODKA is sliding to version 9.6?

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



This is a bit off-topic, but I was wondering whether anyone has ever
considered running a crowd-funding campaign for this sort of (potentially
very popular) feature.


Re: [HACKERS] Commitfest problems

2014-12-19 Thread Arthur Silva
On Fri, Dec 19, 2014 at 6:28 AM, Mark Kirkwood 
mark.kirkw...@catalyst.net.nz wrote:

 On 19/12/14 20:48, Andres Freund wrote:

 On 2014-12-18 10:02:25 -0800, Joshua D. Drake wrote:

 I think a lot of hackers forget exactly how tender their egos are. Now I
 say
 this knowing that a lot of them will go, Oh give me a break but as
 someone
 who employs hackers, deals with open source AND normal people :P every
 single day, I can tell you without a single inch of sarcasm that petting
 egos is one of the ways you get things done in the open source (and
 really
 any male dominated) community.


 To me that's a bit over the top stereotyping.


 +1

 Having been mentioned one or two times myself - it was an unexpected wow
 - cool rather than a grumpy/fragile I must be noticed thing. I think
 some folk have forgotten the underlying principle of the open source
 community - it is about freely giving - time or code etc. The there must
 be something in it for me me me meme is - well - the *other* world view.

  However, doing crappy work and let's not be shy about it, there is NOTHING
 fun about reviewing someone else's code needs to have incentive.


 FWIW, I don't agree with this at all. Reviewing code can be quite
 interesting - with the one constraint that the problem the patch solves
 needs to be somewhat interesting. The latter is what I think gets many
 of the more experienced reviewers - lots of the patches just solve stuff
 they don't care about.


 Yeah, and also it helps if the patch addresses an area that you at least
 know *something* about - otherwise it is really hard to review in any
 useful way.

 Cheers

 Mark





I'm trying to follow this thread as closely as I can, but I may have missed
part of it.

Merit/credits aside, what would really help Postgres right now is a more
streamlined/modern (the only words I could come up with) development
process.

Using the mailing list for everything is alright, it works. But there are
lots of tools that could be used alongside it:
Gerrit/GitLab/GitHub/Bitbucket/Jira and other tools that handle pull
requests, code review and bug tracking (or any combination of them).

That would reduce friction for new contributors and for further development.
These tools are used everywhere, and I find it hard to believe that PG
couldn't benefit from any of them.

--
Arthur Silva


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Arthur Silva
On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed rahilasye...@gmail.com
wrote:

 What I would suggest is instrument the backend with getrusage() at
 startup and shutdown and have it print the difference in user time and
 system time.  Then, run tests for a fixed number of transactions and
 see how the total CPU usage for the run differs.

 Folllowing are the numbers obtained on tests with absolute CPU usage,
 fixed number of transactions and longer duration with latest fpw
 compression patch

 pgbench command : pgbench  -r -t 25 -M prepared

 To ensure that data is not highly compressible, empty filler columns were
 altered using

 alter table pgbench_accounts alter column filler type text using
 gen_random_uuid()::text

 checkpoint_segments = 1024
 checkpoint_timeout =  5min
 fsync = on

 The tests ran for around 30 mins.Manual checkpoint was run before each
 test.

  Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                     | TPS    | Latency stddev
  ------------+---------------+--------------+-------------+-----------------------------------------+--------+---------------
  on          | 1531.4 MB     | ~35 %        | 7.351 ms    | user diff: 562.67s, system diff: 41.40s | 135.96 | 13.759 ms
  off         | 2373.1 MB     | -            | 6.781 ms    | user diff: 354.20s, system diff: 39.67s | 147.40 | 14.152 ms

 The compression obtained is quite high close to 35 %.
 CPU usage at user level when compression is on is quite noticeably high as
 compared to that when compression is off. But gain in terms of reduction of
 WAL is also high.

 Server specifications:
 Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
 RAM: 32GB
 Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm



 Thank you,

 Rahila Syed





 On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas robertmh...@gmail.com
 wrote:

 On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed rahilasyed...@gmail.com
 wrote:
 If that's really true, we could consider having no configuration any
 time, and just compressing always.  But I'm skeptical that it's
 actually true.
 
  I was referring to this for CPU utilization:
 
 http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com
  http://
 
  The above tests were performed on machine with configuration as follows
  Server specifications:
  Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2
 nos
  RAM: 32GB
  Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
  1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

 I think that measurement methodology is not very good for assessing
 the CPU overhead, because you are only measuring the percentage CPU
 utilization, not the absolute amount of CPU utilization.  It's not
 clear whether the duration of the tests was the same for all the
 configurations you tried - in which case the number of transactions
 might have been different - or whether the number of operations was
 exactly the same - in which case the runtime might have been
 different.  Either way, it could obscure an actual difference in
 absolute CPU usage per transaction.  It's unlikely that both the
 runtime and the number of transactions were identical for all of your
 tests, because that would imply that the patch makes no difference to
 performance; if that were true, you wouldn't have bothered writing
 it

 What I would suggest is instrument the backend with getrusage() at
 startup and shutdown and have it print the difference in user time and
 system time.  Then, run tests for a fixed number of transactions and
 see how the total CPU usage for the run differs.

 Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
 footprint for updates, and we found that compression was pretty darn
 expensive there in terms of CPU time.  So I am suspicious of the
 finding that it is free here.  It's not impossible that there's some
 effect which causes us to recoup more CPU time than we spend
 compressing in this case that did not apply in that case, but the
 projects are awfully similar, so I tend to doubt it.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



This can be improved in the future by using other algorithms.


Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)

2014-12-10 Thread Arthur Silva
On Wed, Dec 10, 2014 at 1:50 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/28/2014 04:12 PM, Alexander Korotkov wrote:

 3. A binary heap would be a better data structure to buffer the rechecked
 values. A Red-Black tree allows random insertions and deletions, but in
 this case you need to insert arbitrary values but only remove the
 minimum
 item. That's exactly what a binary heap excels at. We have a nice binary
 heap implementation in the backend that you can use, see
 src/backend/lib/binaryheap.c.
 

 Hmm. For me binary heap would be a better data structure for KNN-GiST at
 all :-)


 I decided to give this a shot, replacing the red-black tree in GiST with
 the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat
 simpler, as the binaryheap interface is simpler than the red-black tree
 one. Unfortunately, performance was somewhat worse. That was quite
 surprising, as insertions and deletions are both O(log N) in both data
 structures, but the red-black tree implementation is more complicated.

 I implemented another data structure called a Pairing Heap. It's also a
 fairly simple data structure, but insertions are O(1) instead of O(log N).
 It also performs fairly well in practice.

 With that, I got a small but measurable improvement. To test, I created a
 table like this:

 create table gisttest (id integer, p point);
 insert into gisttest select id, point(random(), random()) from
 generate_series(1, 100) id;
 create index i_gisttest on gisttest using gist (p);

 And I ran this query with pgbench:

 select id from gisttest order by p <-> '(0,0)' limit 1000;

 With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
 That's a nice little improvement, but perhaps more importantly, the pairing
 heap implementation consumes less memory. To measure that, I put a
  MemoryContextStats(so->queueCtx) call into gistendscan. With the above
 query, but without the limit clause, on master I got:

 GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998
 chunks); 21296 used

 And with the patch:

 GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks);
 21072 used

 That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice
 to reduce that memory consumption, as there is no hard upper bound on how
 much might be needed. If the GiST tree is really disorganized for some
 reason, a query might need a lot more.


 So all in all, I quite like this patch, even though it doesn't do anything
 too phenomenal. It adds a some code, in the form of the new pairing heap
 implementation, but it makes the GiST code a little bit simpler. And it
 gives a small performance gain, and reduces memory usage a bit.

 - Heikki





It may be better to replace lib/binaryheap altogether, since the pairing
heap offers comparable or better performance.
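For reference, a minimal pairing-heap sketch showing why insertion is O(1):
insert is just a merge of a singleton with the root, and the real work is
deferred to delete-min. This is an illustration with made-up names, not the
patch's actual code, and it uses a simple left-to-right merge rather than
the classic two-pass scheme.

#include <stdlib.h>

typedef struct PairNode
{
    double           key;       /* e.g. distance for a KNN queue */
    struct PairNode *child;     /* first child                   */
    struct PairNode *sibling;   /* next sibling                  */
} PairNode;

/* Merge two heaps: the larger root becomes the first child of the smaller. */
static PairNode *
pair_merge(PairNode *a, PairNode *b)
{
    if (a == NULL) return b;
    if (b == NULL) return a;
    if (b->key < a->key) { PairNode *t = a; a = b; b = t; }
    b->sibling = a->child;
    a->child = b;
    return a;
}

/* O(1) insertion: merge a singleton node with the current root. */
static PairNode *
pair_insert(PairNode *root, PairNode *node)
{
    node->child = node->sibling = NULL;
    return pair_merge(root, node);
}

/* Delete-min does the deferred work: merge the root's children together
 * and return the new root; the caller recycles the old root node. */
static PairNode *
pair_delete_min(PairNode *root)
{
    PairNode *merged = NULL;
    PairNode *c = root ? root->child : NULL;

    while (c != NULL)
    {
        PairNode *a = c;
        PairNode *b = c->sibling;

        c = b ? b->sibling : NULL;
        if (b) b->sibling = NULL;
        a->sibling = NULL;
        merged = pair_merge(merged, pair_merge(a, b));
    }
    return merged;
}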


[HACKERS] Faster relevance ranking didn't make it into PostgreSQL 9.4

2014-10-30 Thread Arthur Silva
This is slight off topic but please bear with me.

I came across this post:
http://pauleveritt.wordpress.com/2014/10/29/faster-relevance-ranking-didnt-make-it-into-postgresql-9-4/
I was curious about it so I checked several commit fest pages and searched
the mailing lists but I wasn't able to locate the relevant discussion.

Can someone point me to to it?

--
Arthur Silva


Re: [HACKERS] [WIP Patch] Using 128-bit integers for sum, avg and statistics aggregates

2014-10-28 Thread Arthur Silva
On Sat, Oct 25, 2014 at 12:38 PM, Andreas Karlsson andr...@proxel.se
wrote:

 Hi,

 There was recently talk about if we should start using 128-bit integers
 (where available) to speed up the aggregate functions over integers which
 uses numeric for their internal state. So I hacked together a patch for
 this to see what the performance gain would be.

 Previous thread: http://www.postgresql.org/message-id/20141017182500.
 gf2...@alap3.anarazel.de

 What the patch does is switching from using numerics in the aggregate
 state to int128 and then convert the type from the 128-bit integer in the
 final function.

 The functions where we can make use of int128 states are:

 - sum(int8)
 - avg(int8)
 - var_*(int2)
 - var_*(int4)
 - stdev_*(int2)
 - stdev_*(int4)

 The initial benchmark results look very promising. When summing 10 million
 int8 I get a speedup of ~2.5x and similarly for var_samp() on 10 million
 int4 I see a speed up of ~3.7x. To me this indicates that it is worth the
 extra code. What do you say? Is this worth implementing?

 The current patch still requires work. I have not written the detection of
 int128 support yet, and the patch needs code cleanup (for example: I used
 an int16_ prefix on the added functions, suggestions for better names are
 welcome). I also need to decide on what estimate to use for the size of
 that state.

 The patch should work and pass make check on platforms where __int128_t is
 supported.

 The simple benchmarks:

 CREATE TABLE test_int8 AS SELECT x::int8 FROM generate_series(1, 1000)
 x;

 Before:

 # SELECT sum(x) FROM test_int8;
   sum
 
  500500
 (1 row)

 Time: 2521.217 ms

 After:

 # SELECT sum(x) FROM test_int8;
   sum
 
  500500
 (1 row)

 Time: 1022.811 ms

 CREATE TABLE test_int4 AS SELECT x::int4 FROM generate_series(1, 1000)
 x;

 Before:

 # SELECT var_samp(x) FROM test_int4;
   var_samp
 
  83416.6667
 (1 row)

 Time: 3808.546 ms

 After:

 # SELECT var_samp(x) FROM test_int4;
   var_samp
 
  83416.6667
 (1 row)

 Time: 1033.243 ms

 Andreas




These are some nice improvements.

As far as I'm aware, int128 types are supported by every major compiler when
compiling for 64-bit platforms. Right?
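A minimal sketch of the idea, assuming a compiler-provided __int128; the
type and function names here are illustrative, not the patch's functions.

#include <stdint.h>

#ifdef __SIZEOF_INT128__        /* GCC/Clang define this when int128 exists */
typedef __int128 int128;

/* Aggregate transition state for sum(int8)/avg(int8): no numeric arithmetic
 * is needed until the final function, and the running sum cannot
 * realistically overflow (it would take more than 2^64 maximal int64
 * values). */
typedef struct
{
    int128  sum;
    int64_t count;
} Int128AggState;

static void
int128_accum(Int128AggState *state, int64_t value)
{
    state->sum += value;
    state->count += 1;
}

static double
int128_avg(const Int128AggState *state)
{
    /* The real final function converts to numeric; double is a stand-in. */
    return (double) state->sum / (double) state->count;
}
#endif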


Re: [HACKERS] Fixed xloginsert_locks for 9.4

2014-10-03 Thread Arthur Silva
On Fri, Oct 3, 2014 at 1:40 PM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Oct  3, 2014 at 04:11:30PM +0200, Andres Freund wrote:
  On 2014-10-03 10:07:39 -0400, Gregory Smith wrote:
   On 10/3/14, 8:26 AM, Andres Freund wrote:
   #define NUM_XLOGINSERT_LOCKS  1
   tps = 52.711939 (including connections establishing)
   #define NUM_XLOGINSERT_LOCKS  8
   tps = 286.496054 (including connections establishing)
   #define NUM_XLOGINSERT_LOCKS  16
   tps = 346.113313 (including connections establishing)
   #define NUM_XLOGINSERT_LOCKS  24
   tps = 363.242111 (including connections establishing)
  
   Just to clarify:  that 10% number I threw out was meant as a rough
 estimate
   for a system with the default configuration, which is all that I
 tested.  It
   seemed like people would likely need to tune all the usual things like
   checkpoint_segments, shared_buffers, etc. as well before seeing much
 better.
   You did all that, and sure enough the gain went up; thanks for
 confirming my
   guess.
  
   I still don't think that means this needs a GUC for 9.4.  Look at that
 jump
   from 1 to 8.  The low-hanging fruit here hasn't just been knocked
 off.  It's
   been blended into a frozen drink, poured into a glass, and had a little
   paper umbrella put on top.  I think that's enough for 9.4.  But, yes,
 let's
   see if we can add delivery to the side of the pool in the next version
 too.
 
  So 25% performance on a relatively small machine improvements aren't
  worth a GUC? That are likely to be larger on a bigger machine?
 
  I utterly fail to see why that's a service to our users.

 Well, I think the issue is that having a GUC that can't reasonably be
 tuned by 95% of our users is nearly useless.  Few users are going to run
 benchmarks to see what the optimal value is.

 I remember Informix had a setting that had no description except try
 different values to see if it helps performance --- we don't want to do
 that.

 What if we emit a server message if the setting is too low?  That's how
 we handle checkpoint_segments.

 --
   Bruce Momjian  br...@momjian.us  http://momjian.us
   EnterpriseDB http://enterprisedb.com

   + Everyone has their own god. +




Not every GUC needs to be straightforward to tune.
If the gains are worthwhile I don't see any reason not to have it.


Re: [HACKERS] Fixed xloginsert_locks for 9.4

2014-10-03 Thread Arthur Silva
On Fri, Oct 3, 2014 at 3:10 PM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Oct  3, 2014 at 02:07:45PM -0400, Bruce Momjian wrote:
  On Fri, Oct  3, 2014 at 03:00:56PM -0300, Arthur Silva wrote:
   I remember Informix had a setting that had no description except
 try
   different values to see if it helps performance --- we don't want
 to do
   that.
  
   What if we emit a server message if the setting is too low?
 That's how
   we handle checkpoint_segments.
  
   Not all GUC need to be straight forward to tune.
   If the gains are worthy I don't see any reason not to have it.
 
  Every GUC add complexity to the system because people have to understand
  it to know if they should tune it.  No GUC is zero-cost.

 Please see my blog post about the cost of adding GUCs:

 http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

 --
   Bruce Momjian  br...@momjian.us  http://momjian.us
   EnterpriseDB http://enterprisedb.com

   + Everyone has their own god. +


That's true, Bruce (nice post, it was a good read).
But how can we ignore 25%+ improvements (from 8 to 24)?
At the very least we should deliver some pretty good defaults.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-16 Thread Arthur Silva
On Tue, Sep 16, 2014 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote:
  Well, I can only judge from the use cases I personally have, none of
  which involve more than 100 keys at any level for most rows.  So far
  I've seen some people argue hypotetical use cases involving hundreds of
  keys per level, but nobody who *actually* has such a use case.
 
  I already told you that I did, and that it was the only and only app I
  had written for JSONB.
 
  Ah, ok, I thought yours was a test case.  Did you check how it performed
  on the two patches at all?  My tests with 185 keys didn't show any
  difference, including for a last key case.

 No, I didn't test it.   But I think Heikki's test results pretty much
 tell us everything there is to see here.  This isn't really that
 complicated; I've read a few papers on index compression over the
 years and they seem to often use techniques that have the same general
 flavor as what Heikki did here, adding complexity in the data format
 to gain other advantages.  So I don't think we should be put off.


I second this reasoning. Even though I ran a couple of very realistic test
cases that favor the all-lengths approach, I do feel that the hybrid
approach would be better, as it covers all bases. To put things in
perspective, Tom's latest patch isn't much simpler either.

Since it would still be a breaking change, we should consider changing the
layout to key-key-key-value-value-value, as it seems to pay off.
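To illustrate the trade-off being discussed: with stored lengths, locating
the i-th element means summing the preceding lengths (O(i)), while stored
offsets give O(1) access but form a monotonically increasing sequence that
compresses poorly. A schematic sketch, not the actual JEntry format:

#include <stdint.h>
#include <stddef.h>

/* Start position of element i when per-element *lengths* are stored:
 * O(i) work, but the array of small, repetitive lengths compresses well. */
static size_t
start_from_lengths(const uint32_t *lengths, size_t i)
{
    size_t pos = 0;

    for (size_t k = 0; k < i; k++)
        pos += lengths[k];
    return pos;
}

/* Start position when per-element *offsets* are stored: O(1) work, but the
 * ever-increasing offsets defeat repetition-based compression such as pglz. */
static size_t
start_from_offsets(const uint32_t *offsets, size_t i)
{
    return offsets[i];
}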


 Basically, I think that if we make a decision to use Tom's patch
 rather than Heikki's patch, we're deciding that the initial decision,
 by the folks who wrote the original jsonb code, to make array access
 less than O(n) was misguided.  While that could be true, I'd prefer to
 bet that those folks knew what they were doing.  The only way reason
 we're even considering changing it is that the array of lengths
 doesn't compress well, and we've got an approach that fixes that
 problem while preserving the advantages of fast lookup.  We should
 have a darn fine reason to say no to that approach, and it didn't
 benefit my particular use case is not it.

 In practice, I'm not very surprised that the impact doesn't seem too
 bad when you're running SQL queries from the client.  There's so much
 other overhead, for de-TOASTing and client communication and even just
 planner and executor costs, that this gets lost in the noise.  But
 think about a PL/pgsql procedure, say, where somebody might loop over
 all of the elements in array.  If those operations go from O(1) to
 O(n), then the loop goes from O(n) to O(n^2).  I will bet you a
 beverage of your choice that somebody will find that behavior within a
 year of release and be dismayed by it.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company





Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Arthur Silva
I couldn't get my hands on the Twitter data, so I'm generating my own. The
JSON template is http://paste2.org/wJ1dfcjw and the data was generated with
http://www.json-generator.com/. It has 35 top-level keys, just in case
someone is wondering.
I generated 1 random objects and inserted them repeatedly until I had
320k rows.

Test query: SELECT data->'name', data->'email' FROM t_json
Test storage: EXTERNAL
Test jsonb lengths quartiles: {1278,1587,1731,1871,2231}
Tom's lengths+cache aware: 455ms
HEAD: 440ms

This is a realistic-ish workload in my opinion and Tom's patch performs
within 4% of HEAD.

Due to the overall lengths I couldn't really test compressibility, so I
re-ran the test. This time I inserted an array of 2 objects in each row, as
in: [obj, obj].
The objects were taken in sequence from the 1 pool, so the contents match
in both tests.

Test query: SELECT data #> '{0, name}', data #> '{0, email}', data #> '{1,
name}', data #> '{1, email}' FROM t_json
Test storage: EXTENDED
HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s
Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s

This is an intriguing edge case in which Tom's patch actually outperforms
the base implementation for 3-4 kB JSONs.


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-14 Thread Arthur Silva
On 14/09/2014 12:21, Andres Freund and...@2ndquadrant.com wrote:

 On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote:
  
   What we are looking for here is uniqueness thus better error
detection. Not
   avalanche effect, nor cryptographically secure, nor bit distribution.
   As far as I'm aware CRC32C is unbeaten collision wise and time proven.
  
   I couldn't find tests with xxhash and crc32 on the same hardware so I
spent
   some time putting together a benchmark (see attachment, to run it just
   start run.sh)
  
   I included a crc32 implementation using ssr4.2 instructions (which
works on
   pretty much any Intel processor built after 2008 and AMD built after
2012),
   a portable Slice-By-8 software implementation and xxhash since it's
the
   fastest software 32bit hash I know of.
  
   Here're the results running the test program on my i5-4200M
  
   crc sb8: 90444623
   elapsed: 0.513688s
   speed: 1.485220 GB/s
  
   crc hw: 90444623
   elapsed: 0.048327s
   speed: 15.786877 GB/s
  
   xxhash: 7f4a8d5
   elapsed: 0.182100s
   speed: 4.189663 GB/s
  
   The hardware version is insanely fast and works on the majority of Postgres
   setups and the fallback software implementations is 2.8x slower than
the
   fastest 32bit hash around.
  
   Hopefully it'll be useful in the discussion.

 Note that all these numbers aren't fully relevant to the use case
 here. For the WAL - which is what we're talking about and the only place
 where CRC32 is used with high throughput - the individual parts of a
 record are pretty darn small on average. So performance of checksumming
 small amounts of data is more relevant. Mind, that's not likely to go
 for CRC32, especially not slice-by-8. The cache fooprint of the large
 tables is likely going to be noticeable in non micro benchmarks.


Indeed, the small input sizes are something I was missing. Something more
cache-friendly would be better; it's just a matter of finding a better
candidate.

Although I find it highly unlikely that the extra 4 kB of tables in
slice-by-8 brings its performance down to slice-by-4 levels, even
considering the small inputs and cache misses.

For what it's worth, MySQL, Cassandra, Kafka, ext4 and XFS all use CRC32C
checksums in their WALs/journals.

  Also, while I understand that CRC has a very venerable history and
  is well studied for transmission type errors, I have been unable to find
  any research on its applicability to validating file/block writes to a
  disk drive.

 Which incidentally doesn't really match what the CRC is used for
 here. It's used for individual WAL records. Usually these are pretty
 small, far smaller than disk/postgres' blocks on average. There's a
 couple scenarios where they can get large, true, but most of them are
 small.
 The primary reason they're important is to correctly detect the end of
 the WAL. To ensure we're interpreting half written records, or records
 from before the WAL file was overwritten.


  While it is to quote you unbeaten collision wise, xxhash,
  both the 32-bit and 64-bit version are its equal.

 Aha? You take that from the smhasher results?

  Since there seems to be a lack of research on disk based error
  detection versus CRC polynomials, it seems likely that any of the
  proposed hash functions are on an equal footing in this regard. As
  Andres commented up-thread, xxhash comes along for free with lz4.

 This is pure handwaving.

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Arthur Silva
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu k...@rice.edu wrote:

 On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
  On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
   Andres Freund and...@2ndquadrant.com writes:
On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com
   wrote:
That's not entirely true. CRC-32C beats pretty much everything with
   the same
length quality-wise and has both hardware implementations and
 highly
optimized software versions.
  
For better or for worse CRC is biased by detecting all single bit
errors, the detection capability of larger errors is slightly
diminished. The quality of the other algorithms I mentioned is also
very good, while producing uniformly varying output.
  
There's also much more literature about the various CRCs in
 comparison
to some of these hash allgorithms.
  
   Indeed.  CRCs have well-understood properties for error detection.
   Have any of these new algorithms been analyzed even a hundredth as
   thoroughly?  No.  I'm unimpressed by evidence-free claims that
   something else is also very good.
  
   Now, CRCs are designed for detecting the sorts of short burst errors
   that are (or were, back in the day) common on phone lines.  You could
   certainly make an argument that that's not the type of threat we face
   for PG data.  However, I've not seen anyone actually make such an
   argument, let alone demonstrate that some other algorithm would be
 better.
   To start with, you'd need to explain precisely what other error pattern
   is more important to defend against, and why.
  
   regards, tom lane
  
 
  Mysql went this way as well, changing the CRC polynomial in 5.6.
 
  What we are looking for here is uniqueness thus better error detection.
 Not
  avalanche effect, nor cryptographically secure, nor bit distribution.
  As far as I'm aware CRC32C is unbeaten collision wise and time proven.
 
  I couldn't find tests with xxhash and crc32 on the same hardware so I
 spent
  some time putting together a benchmark (see attachment, to run it just
  start run.sh)
 
  I included a crc32 implementation using ssr4.2 instructions (which works
 on
  pretty much any Intel processor built after 2008 and AMD built after
 2012),
  a portable Slice-By-8 software implementation and xxhash since it's the
  fastest software 32bit hash I know of.
 
  Here're the results running the test program on my i5-4200M
 
  crc sb8: 90444623
  elapsed: 0.513688s
  speed: 1.485220 GB/s
 
  crc hw: 90444623
  elapsed: 0.048327s
  speed: 15.786877 GB/s
 
  xxhash: 7f4a8d5
  elapsed: 0.182100s
  speed: 4.189663 GB/s
 
  The hardware version is insanely fast and works on the majority of Postgres
  setups and the fallback software implementations is 2.8x slower than the
  fastest 32bit hash around.
 
  Hopefully it'll be useful in the discussion.

 Thank you for running this sample benchmark. It definitely shows that the
 hardware version of the CRC is very fast, unfortunately it is really only
 available on x64 Intel/AMD processors which leaves all the rest lacking.
 For current 64-bit hardware, it might be instructive to also try using
 the XXH64 version and just take one half of the hash. It should come in
 at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
 CRC. Also, while I understand that CRC has a very venerable history and
 is well studied for transmission type errors, I have been unable to find
 any research on its applicability to validating file/block writes to a
 disk drive. While it is, to quote you, unbeaten collision wise, xxhash,
 both the 32-bit and 64-bit versions, are its equal. Since there seems to
 be a lack of research on disk based error detection versus CRC polynomials,
 it seems likely that any of the proposed hash functions are on an equal
 footing in this regard. As Andres commented up-thread, xxhash comes along
 for free with lz4.

 Regards,
 Ken


For the sake of completeness, here are the results for xxhash64 on my machine:

xxhash64
speed:  7.365398 GB/s

Which is indeed very fast.
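
For anyone who wants to reproduce the hardware numbers, here is a minimal
sketch of the SSE4.2 CRC-32C path (illustrative only, not the benchmark
attached above; it assumes a CPU with the crc32 instruction and a build flag
such as -msse4.2):

/* Minimal sketch of hardware CRC-32C using the SSE4.2 crc32 instruction.
 * Build with: gcc -O2 -msse4.2 crc32c_hw.c
 * Illustrative only; not PostgreSQL code and not the attached harness. */
#include <nmmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t c = crc ^ 0xFFFFFFFFu;           /* conventional pre-inversion */

    while (len >= 8)                          /* 8 bytes per instruction */
    {
        uint64_t chunk;
        memcpy(&chunk, p, 8);                 /* safe unaligned load */
        c = _mm_crc32_u64(c, chunk);
        p += 8;
        len -= 8;
    }
    while (len--)                             /* byte-wise tail */
        c = _mm_crc32_u8((uint32_t) c, *p++);

    return (uint32_t) c ^ 0xFFFFFFFFu;        /* conventional post-inversion */
}

int main(void)
{
    char data[] = "123456789";
    printf("crc32c = %08x\n", crc32c_hw(0, data, 9));  /* expected: e3069283 */
    return 0;
}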


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Arthur Silva
That's not entirely true. CRC-32C beats pretty much everything with the
same length quality-wise and has both hardware implementations and highly
optimized software versions.
On 12/09/2014 17:18, Ants Aasma a...@cybertec.at wrote:

 On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I don't mean that we should abandon this patch - compression makes the
 WAL
  smaller which has all kinds of other benefits, even if it makes the raw
 TPS
  throughput of the system worse. But I'm just saying that these TPS
  comparisons should be taken with a grain of salt. We probably should
  consider switching to a faster CRC algorithm again, regardless of what
 we do
  with compression.

 CRC is a pretty awfully slow algorithm for checksums. We should
 consider switching it out for something more modern. CityHash,
 MurmurHash3 and xxhash look like pretty good candidates, being around
 an order of magnitude faster than CRC. I'm hoping to investigate
 substituting the WAL checksum algorithm in 9.5.

 Given the room for improvement in this area I think it would make
 sense to just short-circuit the CRC calculations for testing this
 patch to see if the performance improvement is due to less data being
 checksummed.

 Regards,
 Ants Aasma
 --
 Cybertec Schönig & Schönig GmbH
 Gröhrmühlgasse 26
 A-2700 Wiener Neustadt
 Web: http://www.postgresql-support.de





Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)

2014-09-12 Thread Arthur Silva
On 12/09/2014 17:23, Andres Freund and...@2ndquadrant.com wrote:

 On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
  On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
  At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
  
  We probably should consider switching to a faster CRC algorithm again,
  regardless of what we do with compression.
  
  As it happens, I'm already working on resurrecting a patch that Andres
  posted in 2010 to switch to zlib's faster CRC implementation.
 
  As it happens, I also wrote an implementation of Slice-by-4 the other
day
  :-). Haven't gotten around to post it, but here it is.
 
  What algorithm does zlib use for CRC calculation?

 Also slice-by-4, with a manually unrolled loop doing 32 bytes at once, using
 individual slice-by-4's. IIRC I tried, and removing that slowed things
 down overall. What it also did was move crc to a function. I'm not sure
 why I did it that way, but it really might be beneficial - if you look
 at profiles today there's sometimes icache/decoding stalls...

 Hm. Let me look:

http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de

 Ick, there's quite some debugging leftovers ;)

 I think it might be a good idea to also switch the polynom at the same
 time. I really really think we should, when the hardware supports it, use
 the polynom that's available in SSE4.2. It has similar properties and can be
 implemented in software just the same...

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



This Google library is worth a look https://code.google.com/p/crcutil/ as
it has some extremely optimized versions.


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-11 Thread Arthur Silva
I agree that there's no reason to fix an algorithm to it, unless maybe it's
pglz. There's some initial talk about implementing pluggable compression
algorithms for TOAST and I guess the same must be taken into consideration
for the WAL.

--
Arthur Silva


On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com
wrote:

 I will repeat the above tests with high load on CPU and using the benchmark
 given by Fujii-san and post the results.

 Average % of CPU usage at user level for each of the compression algorithms
 is as follows.

 Compression    Multiple     Single

 Off            81.1338      81.1267
 LZ4            81.0998      81.1695
 Snappy         80.9741      80.9703
 Pglz           81.2353      81.2753

 
 http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
 
 
 http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png
 

 The numbers show CPU utilization of Snappy is the least. The CPU
 utilization, from highest to lowest, is
 pglz > No compression > LZ4 > Snappy

 The variance of average CPU utilization numbers is very low. However ,
 snappy seems to be best when it comes to lesser utilization of CPU.

 As per the measurement results posted till date

 LZ4 outperforms snappy and pglz in terms of compression ratio and
 performance. However, CPU utilization numbers show snappy utilizes the least
 amount of CPU. The difference is not much, though.

 As there has been no consensus yet about which compression algorithm to
 adopt, is it better to make this decision independent of the FPW compression
 patch, as suggested earlier in this thread? FPW compression can be done
 using the built-in pglz compression, as it shows considerable performance over
 uncompressed WAL and a good compression ratio.
 Also, the patch to compress multiple blocks at once gives better compression
 as compared to a single block. ISTM that the performance overhead introduced by
 multiple-block compression is slightly higher than single-block compression,
 which can be tested again after modifying the patch to use pglz. Hence,
 this patch can be built using multiple-block compression.

 Thoughts?



 --
 View this message in context:
 http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
 Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.





Re: [HACKERS] Memory Alignment in Postgres

2014-09-11 Thread Arthur Silva
On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote:
  I'm continuously studying Postgres codebase. Hopefully I'll be able to
 make
  some contributions in the future.
 
  For now I'm intrigued about the extensive use of memory alignment. I'm
 sure
  there's some legacy and some architecture that requires it reasoning
 behind
  it.
 
  That aside, since it wastes space (a lot of space in some cases) there
 must
  be a tipping point somewhere. I'm sure one can prove aligned access is
  faster in a micro-benchmark but I'm not sure it's the case in a DBMS like
  postgres, specially in the page/rows area.
 
  Just for the sake of comparison Mysql COMPACT storage (default and
  recommended since 5.5) doesn't align data at all. Mysql NDB uses a fixed
  4-byte alignment. Not sure about Oracle and others.
 
  Is it worth the extra space in newer architectures (specially Intel)?
  Do you guys think this is something worth looking at?

 Yes.  At least in my opinion, though, it's not a good project for a
 beginner.  If you get your changes to take effect, you'll find that a
 lot of things will break in places that are not easy to find or fix.
 You're getting into really low-level areas of the system that get
 touched infrequently and require a lot of expertise in how things work
 today to adjust.


I thought all memory alignment was (or at least the bulk of it) handled
using some codebase-wide macros/settings, otherwise how could different
parts of the code interoperate? Poking at this area might suffice for some
initial testing to check whether it's worth any more attention.

Unaligned memory access received a lot of attention in the Intel post-Nehalem
era, so it may very well pay off on Intel servers. You might find this blog
post and its comments/external links interesting:
http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

I'm a newbie in the codebase, so please let me know if I'm saying anything
non-sense.


 The idea I've had before is to try to reduce the widest alignment we
 ever require from 8 bytes to 4 bytes.  That is, look for types with
 typalign = 'd', and rewrite them to have typalign = 'i' by having them
 use two 4-byte loads to load an eight-byte value.  In practice, I
 think this would probably save a high percentage of what can be saved,
 because 8-byte alignment implies a maximum of 7 bytes of wasted space,
 while 4-byte alignment implies a maximum of 3 bytes of wasted space.
 And it would probably be pretty cheap, too, because any type with less
 than 8 byte alignment wouldn't be affected at all, and even those
 types that were affected would only be slightly slowed down by doing
 two loads instead of one.  In contrast, getting rid of alignment
 requirements completely would save a little more space, but probably
 at the cost of a lot more slowdown: any type with alignment
 requirements would have to fetch the value byte-by-byte instead of
 pulling the whole thing out at once.


Does byte-by-byte access still hold true nowadays? I thought modern processors
would fetch memory at the very least in word-sized chunks, so 4/8 bytes, and
then merge/slice.
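
To make the two-4-byte-loads idea concrete, here is a hedged sketch; the
helper name is made up and this is not PostgreSQL code. WORDS_BIGENDIAN
stands in for the usual autoconf symbol.

/* Sketch of the "typalign 'd' -> typalign 'i'" idea: fetch an 8-byte value
 * from an address that is only guaranteed 4-byte alignment by doing two
 * naturally aligned 4-byte loads and recombining them. */
#include <stdint.h>
#include <string.h>

static int64_t fetch_int64_align4(const char *addr)   /* addr 4-byte aligned */
{
    uint32_t w0, w1;

    memcpy(&w0, addr, 4);              /* both loads are 4-byte aligned */
    memcpy(&w1, addr + 4, 4);
#ifdef WORDS_BIGENDIAN
    return (int64_t) (((uint64_t) w0 << 32) | w1);
#else
    return (int64_t) (((uint64_t) w1 << 32) | w0);
#endif
}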


 But there are a couple of obvious problems with this idea, too, such as:

 1. It's really complicated and a ton of work.

 2. It would break pg_upgrade pretty darn badly unless we employed some
 even-more-complex strategy to mitigate that.
 3. The savings might not be enough to justify the effort.


Very true.


 It might be interesting for someone to develop a tool measuring the
 number of bytes of alignment padding we lose per tuple or per page and
 gather some statistics on it on various databases.  That would give us
 some sense as to the possible savings.


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Memory Alignment in Postgres

2014-09-11 Thread Arthur Silva
On Thu, Sep 11, 2014 at 11:27 AM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-09-11 10:32:24 -0300, Arthur Silva wrote:
  Unaligned memory access received a lot attention in Intel post-Nehalen
 era.
  So it may very well pay off on Intel servers. You might find this blog
 post
  and it's comments/external-links interesting
 
 http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

 FWIW, the reported results of imo pretty meaningless for postgres. It's
 sequential access over larger amount of memory. I.e. a perfectly
 prefetchable workload where it doesn't matter if superflous cachelines
 are fetched because they're going to be needed next round anyway.

 In many production workloads one of the most busy accesses to individual
 datums is the binary search on individual pages during index
 lookups. That's pretty much exactly the contrary to the above.

 Not saying that it's not going to be a benefit in many scenarios, but
 it's far from being as simple as saying that unaligned accesses on their
 own aren't penalized anymore.

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services


I modified the test code to use a completely random scan pattern to test
something that completely thrashes the cache. It's not realistic, but it still
confirms the hypothesis that the overhead is minimal on modern Intel.

-- test results compiling for 32bit --
 processing word of size 2
offset = 0
 average time for offset 0 is 422.7
offset = 1
 average time for offset 1 is 422.85

 processing word of size 4
offset = 0
 average time for offset 0 is 436.6
offset = 1
 average time for offset 1 is 451
offset = 2
 average time for offset 2 is 444.3
offset = 3
 average time for offset 3 is 441.9

 processing word of size 8
offset = 0
 average time for offset 0 is 630.15
offset = 1
 average time for offset 1 is 653
offset = 2
 average time for offset 2 is 655.5
offset = 3
 average time for offset 3 is 660.85
offset = 4
 average time for offset 4 is 650.1
offset = 5
 average time for offset 5 is 656.9
offset = 6
 average time for offset 6 is 656.6
offset = 7
 average time for offset 7 is 656.9


-- test results compiling for 64bit --
 processing word of size 2
offset = 0
 average time for offset 0 is 402.55
offset = 1
 average time for offset 1 is 406.9

 processing word of size 4
offset = 0
 average time for offset 0 is 424.05
offset = 1
 average time for offset 1 is 436.55
offset = 2
 average time for offset 2 is 435.1
offset = 3
 average time for offset 3 is 435.3

 processing word of size 8
offset = 0
 average time for offset 0 is 444.9
offset = 1
 average time for offset 1 is 470.25
offset = 2
 average time for offset 2 is 468.95
offset = 3
 average time for offset 3 is 476.75
offset = 4
 average time for offset 4 is 474.9
offset = 5
 average time for offset 5 is 468.25
offset = 6
 average time for offset 6 is 469.8
offset = 7
 average time for offset 7 is 469.1
//  g++ -O2 -o test test.cpp && ./test


#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <iostream>
#include <cassert>
#include <vector>
#include <inttypes.h>

using namespace std;

class WallClockTimer
{
public:
    struct timeval t1, t2;
    WallClockTimer() :
        t1(), t2()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }
    void reset()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }
    int elapsed()
    {
        return (t2.tv_sec * 1000 + t2.tv_usec / 1000) - (t1.tv_sec * 1000 + t1.tv_usec / 1000);
    }
    int split()
    {
        gettimeofday(&t2, 0);
        return elapsed();
    }
};

// xor shift
uint32_t xor128(void)
{
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;
    t = x ^ (x << 11);
    x = y;
    y = z;
    z = w;
    return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

template <class T>
void runtest()
{
    size_t N = 10 * 1000 * 1000;
    int repeat = 20;
    WallClockTimer timer;
    const bool paranoid = false;
    cout << " processing word of size " << sizeof(T) << endl;
    for(unsigned int offset = 0; offset < sizeof(T); ++offset)
    {
        vector<T> bigarray(N + 2);
        cout << "offset = " << offset << endl;
        T * const begin = reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(&bigarray[0]) + offset);
        assert(offset + reinterpret_cast<uintptr_t>(&bigarray[0]) == reinterpret_cast<uintptr_t>(begin));
        T * const end = begin + N;
        if(paranoid) assert(reinterpret_cast<uintptr_t>(end) < reinterpret_cast<uintptr_t>(&bigarray.back()));
        int sumt = 0;
        //cout << "ignore this: ";
        for(int k = 0; k < repeat; ++k)
        {
            timer.reset();
            for(size_t i = 0; i < N; ++i)
            {
                int ri = xor128() % N;
                begin[ri] = static_cast<T>( i );
            }
            volatile T val = 1

Re: [HACKERS] Memory Alignment in Postgres

2014-09-11 Thread Arthur Silva
On Thu, Sep 11, 2014 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Merlin Moncure mmonc...@gmail.com writes:
  Be advised of the difficulties you are going to face here.  Assuming
  for a second there is no reason not to go unaligned on Intel and there
  are material benefits to justify the effort, that doesn't necessarily
  hold for other platforms like arm/power.

 Note that on many (most?) non-Intel architectures, unaligned access is
 simply not an option.  The chips themselves will throw SIGBUS or
 equivalent if you try it.  Some kernels provide signal handlers that
 emulate the unaligned access in software rather than killing the process;
 but the performance consequences of hitting such traps more than very
 occasionally would be catastrophic.


 Even on Intel, I'd wonder what unaligned accesses do to atomicity
 guarantees and suchlike.  This is not a big deal for row data storage,
 but we'd have to be careful about it if we were to back off alignment
 requirements for in-memory data structures such as latches and buffer
 headers.

 Another fun thing you'd need to deal with is ensuring that the C structs
 we overlay onto catalog data rows still match up with the data layout
 rules.

 On the whole, I'm pretty darn skeptical that such an effort would repay
 itself.  There are lots of more promising things to hack on.

 regards, tom lane


Indeed, I don't know of any other architectures where this would be an
option. So if this ever moves forward it must be turned on at compile time
for x86-64 only. I wonder how MySQL handles its rows on those
architectures, as their storage format is completely packed.

If we just reduced the alignment requirements when laying out columns in
the rows and indexes by reducing/removing padding (typalign), it'd be
enough of a gain in my (humble) opinion.

Once you assume alignment is not an issue, you can see savings everywhere,
which is kinda insane...

I'm unsure how this translates into patch complexity, but judging by the
reactions so far I'm assuming a lot.
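
As a rough illustration of the padding at stake (plain C structs standing in
for heap tuple layout, which is of course not identical), column order
interacts with 8-byte alignment roughly like this:

/* Illustration only: how 8-byte alignment inflates a "row" when narrow and
 * wide fields interleave, and how reordering recovers the space. */
#include <stdio.h>
#include <stdint.h>

struct interleaved {        /* int2, int8, int4, int8: lots of padding */
    int16_t a;              /* 2 bytes + 6 bytes of padding */
    int64_t b;
    int32_t c;              /* 4 bytes + 4 bytes of padding */
    int64_t d;
};

struct reordered {          /* same fields, wide ones first */
    int64_t b;
    int64_t d;
    int32_t c;
    int16_t a;              /* 2 bytes + 2 bytes of trailing padding */
};

int main(void)
{
    printf("interleaved: %zu bytes\n", sizeof(struct interleaved)); /* typically 32 */
    printf("reordered:   %zu bytes\n", sizeof(struct reordered));   /* typically 24 */
    return 0;
}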


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-11 Thread Arthur Silva
On Thu, Sep 11, 2014 at 10:01 PM, Josh Berkus j...@agliodbs.com wrote:

 So, I finally got time to test Tom's latest patch on this.

 TLDR: we want to go with Tom's latest patch and release beta3.

 Figures:

 So I tested HEAD against the latest lengths patch.  Per Arthur Silva, I
 checked uncompressed times for JSONB against compressed times.  This
 changed the picture considerably.

 TABLE SIZES
 ---

 HEAD

   ?column?   | pg_size_pretty
 ---------------------+----------------
  json text format    | 393 MB
  jsonb: compressed   | 1147 MB
  jsonb: uncompressed | 1221 MB

 PATCHED

   ?column?   | pg_size_pretty
 ---------------------+----------------
  json text format    | 394 MB
  jsonb: compressed   | 525 MB
  jsonb: uncompressed | 1200 MB


 EXTRACTION TIMES
 

 HEAD

 Q1 (search via GIN index followed by extracting 100,000 values from rows):

 jsonb compressed: 4000
 jsonb uncompressed: 3250


 Q2 (seq scan and extract 200,000 values from rows):

 json: 11700
 jsonb compressed: 3150
 jsonb uncompressed: 2700


 PATCHED

 Q1:

 jsonb compressed: 6750
 jsonb uncompressed: 3350

 Q2:

 json: 11796
 jsonb compressed: 4700
 jsonb uncompressed: 2650

 --

 Conclusion: with Tom's patch, compressed JSONB is 55% smaller when
 compressed (EXTENDED).  Extraction times are 50% to 70% slower, but this
 appears to be almost entirely due to decompression overhead.  When not
 compressing (EXTERNAL), extraction times for patch versions are
 statistically the same as HEAD, and file sizes are similar to HEAD.

 USER REACTION
 -

 I polled at both PDXpgDay and at FOSS4G, asking some ~~ 80 Postgres
 users how they would feel about a compression vs. extraction time
 tradeoff.  The audience was evenly split.

 However, with the current patch, the user can choose.  Users who know
 enough for performance tuning can set JSONB columns to EXTERNAL, and get
 the same performance as the unpatched version.

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com





The compression ratio difference is exaggerated in this case, but it does
support that Tom's patch alleviates the extraction penalty.

In my tests with the github archive data the savings vs. performance-penalty
trade-off was fine, but I'm not confident in those results since there were
only 8 top-level keys.
For comparison, some twitter api objects[1] have 30+ top-level keys. If I
have time in the next couple of days I'll conduct some tests with the
public twitter fire-hose data.

[1] https://dev.twitter.com/rest/reference/get/statuses/home_timeline


[HACKERS] Memory Alignment in Postgres

2014-09-09 Thread Arthur Silva
I'm continuously studying Postgres codebase. Hopefully I'll be able to make
some contributions in the future.

For now I'm intrigued by the extensive use of memory alignment. I'm sure
the reasoning behind it involves some legacy and some architectures that
require it.

That aside, since it wastes space (a lot of space in some cases) there must
be a tipping point somewhere. I'm sure one can prove aligned access is
faster in a micro-benchmark but I'm not sure it's the case in a DBMS like
postgres, specially in the page/rows area.

Just for the sake of comparison Mysql COMPACT storage (default and
recommended since 5.5) doesn't align data at all. Mysql NDB uses a fixed
4-byte alignment. Not sure about Oracle and others.

Is it worth the extra space in newer architectures (specially Intel)?
Do you guys think this is something worth looking at?

I'm trying to mess with the *ALIGN macros but so far I wasn't able to
get any conclusive results. My guess is that I'm missing something in the
code or pg_bench doesn't stress the difference enough.

--
Arthur Silva
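
For anyone following along, the *ALIGN macros boil down to the classic
round-up-to-a-power-of-two idiom; here is a self-contained sketch of the idea
(the real definitions live in src/include/c.h):

/* Sketch of the round-up idiom behind the TYPEALIGN/MAXALIGN-style macros. */
#include <stdint.h>
#include <stdio.h>

#define ALIGN_UP(alignval, len) \
    (((uintptr_t) (len) + ((alignval) - 1)) & ~((uintptr_t) ((alignval) - 1)))

int main(void)
{
    /* A field ending at offset 5 followed by an 8-byte-aligned field wastes 3 bytes. */
    printf("next 8-aligned offset after 5  = %lu\n", (unsigned long) ALIGN_UP(8, 5));   /* 8  */
    printf("next 8-aligned offset after 17 = %lu\n", (unsigned long) ALIGN_UP(8, 17));  /* 24 */
    return 0;
}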


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-02 Thread Arthur Silva
On Tue, Sep 2, 2014 at 9:11 AM, Rahila Syed rahilasye...@gmail.com wrote:

 Hello,

 It'd be interesting to check avg cpu usage as well

 I have collected average CPU utilization numbers by collecting sar output
 at interval of 10 seconds  for following benchmark:

 Server specifications:
 Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
 RAM: 32GB
 Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

 Benchmark:

 Scale : 16
 Command  :java JR  /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
  -sleepTime 550,250,250,200,200

 Warmup time  : 1 sec
 Measurement time : 900 sec
 Number of tx types   : 5
 Number of agents : 16
 Connection pool size : 16
 Statement cache size : 40
 Auto commit  : false


 Checkpoint segments:1024
 Checkpoint timeout:5 mins


 Average % of CPU utilization at user level for multiple blocks compression:

 Compression Off  =  3.34133

  Snappy = 3.41044

 LZ4  = 3.59556

  Pglz = 3.66422


 The numbers show the average CPU utilization is in the following order
 pglz > LZ4 > Snappy > No compression
 Attached is the graph which gives plot of % CPU utilization versus time
 elapsed for each of the compression algorithms.
 Also, the overall CPU utilization during tests is very low i.e below 10% .
 CPU remained idle for large(~90) percentage of time. I will repeat the
 above tests with high load on CPU and using the benchmark given by
 Fujii-san and post the results.


 Thank you,



 On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva arthur...@gmail.com wrote:


 On 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com wrote:

 
  On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com
 wrote:
   Hello,
   Thank you for comments.
  
  Could you tell me where the patch for single block in one run is?
   Please find attached patch for single block compression in one run.
 
  Thanks! I ran the benchmark using pgbench and compared the results.
  I'd like to share the results.
 
  [RESULT]
  Amount of WAL generated during the benchmark. Unit is MB.
 
            Multiple    Single
  off       202.0       201.5
  on        6051.0      6053.0
  pglz      3543.0      3567.0
  lz4       3344.0      3485.0
  snappy    3354.0      3449.5

  Latency average during the benchmark. Unit is ms.

            Multiple    Single
  off       19.1        19.0
  on        55.3        57.3
  pglz      45.0        45.9
  lz4       44.2        44.7
  snappy    43.4        43.3
 
  These results show that FPW compression is really helpful for decreasing
  the WAL volume and improving the performance.
 
  The compression ratio by lz4 or snappy is better than that by pglz. But
  it's difficult to conclude which lz4 or snappy is best, according to
 these
  results.
 
  ISTM that compression-of-multiple-pages-at-a-time approach can compress
  WAL more than compression-of-single-... does.
 
  [HOW TO BENCHMARK]
  Create pgbench database with scale factor 1000.
 
  Change the data type of the column filler on each pgbench table
  from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
  gen_random_uuid() in order to avoid empty column, e.g.,
 
   alter table pgbench_accounts alter column filler type text using
  gen_random_uuid()::text
 
  After creating the test database, run the pgbench as follows. The
  number of transactions executed during benchmark is almost same
  between each benchmark because -R option is used.
 
pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
 
  checkpoint_timeout is 5min, so it's expected that checkpoint was
  executed at least two times during the benchmark.
 
  Regards,
 
  --
  Fujii Masao
 
 

 It'd be interesting to check avg cpu usage as well.



Is there any reason to default to LZ4-HC? Shouldn't we try the default as
well? LZ4-default is known for its near realtime speeds in exchange for a
few % of compression, which sounds optimal for this use case.

Also, we might want to compile these libraries with -O3 instead of the
default -O2. They're finely tuned to work with all possible compiler
optimizations w/ hints and other tricks, this is specially true for LZ4,
not sure for snappy.

In my virtual machine LZ4 w/ -O3 compression runs at twice the speed
(950MB/s) of -O2 (450MB/s) @ (61.79%), LZ4-HC seems unaffected though
(58MB/s) @ (60.27%).

Yes, that's right, almost 1GB/s! And the compression ratio is only 1.5%
short compared to LZ4-HC.
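
For reference, calling the two entry points looks roughly like this. This is a
sketch assuming a recent liblz4 where the exported names are
LZ4_compress_default() and LZ4_compress_HC(); older releases spelled them
LZ4_compress() and LZ4_compressHC().

/* Sketch only: compress one buffer with LZ4 (fast path) and with LZ4-HC.
 * Build with something like: gcc -O3 lz4demo.c -llz4 */
#include <lz4.h>
#include <lz4hc.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char src[8192];
    char dst[LZ4_COMPRESSBOUND(sizeof(src))];
    int n;

    memset(src, 'x', sizeof(src));            /* trivially compressible input */

    n = LZ4_compress_default(src, dst, (int) sizeof(src), (int) sizeof(dst));
    printf("LZ4 default: %d -> %d bytes\n", (int) sizeof(src), n);

    n = LZ4_compress_HC(src, dst, (int) sizeof(src), (int) sizeof(dst), 9 /* default HC level */);
    printf("LZ4-HC:      %d -> %d bytes\n", (int) sizeof(src), n);

    return 0;
}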


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-27 Thread Arthur Silva
On 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com wrote:

 On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com
wrote:
  Hello,
  Thank you for comments.
 
 Could you tell me where the patch for single block in one run is?
  Please find attached patch for single block compression in one run.

 Thanks! I ran the benchmark using pgbench and compared the results.
 I'd like to share the results.

 [RESULT]
 Amount of WAL generated during the benchmark. Unit is MB.

           Multiple    Single
 off       202.0       201.5
 on        6051.0      6053.0
 pglz      3543.0      3567.0
 lz4       3344.0      3485.0
 snappy    3354.0      3449.5

 Latency average during the benchmark. Unit is ms.

           Multiple    Single
 off       19.1        19.0
 on        55.3        57.3
 pglz      45.0        45.9
 lz4       44.2        44.7
 snappy    43.4        43.3

 These results show that FPW compression is really helpful for decreasing
 the WAL volume and improving the performance.

 The compression ratio by lz4 or snappy is better than that by pglz. But
 it's difficult to conclude which lz4 or snappy is best, according to these
 results.

 ISTM that compression-of-multiple-pages-at-a-time approach can compress
 WAL more than compression-of-single-... does.

 [HOW TO BENCHMARK]
 Create pgbench database with scale factor 1000.

 Change the data type of the column filler on each pgbench table
 from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
 gen_random_uuid() in order to avoid empty column, e.g.,

  alter table pgbench_accounts alter column filler type text using
 gen_random_uuid()::text

 After creating the test database, run the pgbench as follows. The
 number of transactions executed during benchmark is almost same
 between each benchmark because -R option is used.

   pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

 checkpoint_timeout is 5min, so it's expected that checkpoint was
 executed at least two times during the benchmark.

 Regards,

 --
 Fujii Masao



It'd be interesting to check avg cpu usage as well.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-27 Thread Arthur Silva
On Wed, Aug 27, 2014 at 1:09 AM, Arthur Silva arthur...@gmail.com wrote:

 It won't be faster by any means, but it should definitely be incorporated
 if any format changes are made (like Tom already suggested).

 I think it's important we gather at least 2 more things before making any
 calls:
 * Josh tests w/ cache aware patch, which should confirm cache aware is
 indeed prefered
 * Tests with toast hacked to use lz4 instead, which might ease any
 decisions


 --
 Arthur Silva



 On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com
 wrote:
  The difference is small but I's definitely faster, which makes sense
 since
  cache line misses are probably slightly reduced.
  As in the previous runs, I ran the query a dozen times and took the
 average
  after excluding runs with a high deviation.

 I'm not surprised that it hasn't beaten HEAD. I haven't studied the
 problem in detail, but I don't think that the cache awareness of the
 new revision is necessarily a distinct advantage.

 --
 Peter Geoghegan



I'm attaching a quick-n-dirty patch that uses lz4 compression instead of
pglz in case someone wants to experiment with it. Seems to work in my test
env, I'll make more tests when I get home.

PS: gotta love gmail's fixed default of top-posting...


lz4.patch
Description: Binary data



Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Arthur Silva
Tom, here are the results with github data (8 top-level keys) only. Here's a
sample object: https://gist.github.com/igrigorik/2017462

All-Lengths + Cache-Aware EXTERNAL
Query 1: 516ms
Query 2: 350ms

The difference is small but it's definitely faster, which makes sense since
cache line misses are probably slightly reduced.
As in the previous runs, I ran the query a dozen times and took the average
after excluding runs with a high deviation.

compare to (copied from my previous email)

HEAD (aka, all offsets) EXTERNAL
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms


All Lengths (Tom Lane patch) EXTERNAL
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms


--
Arthur Silva



On Tue, Aug 26, 2014 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
  I wish it were cache-friendly too, per the upthread tangent about having
  to fetch keys from all over the place within a large JSON object.

  ... and while I was typing that sentence, lightning struck.  The existing
  arrangement of object subfields with keys and values interleaved is just
  plain dumb.  We should rearrange that as all the keys in order, then all
  the values in the same order.  Then the keys are naturally adjacent in
  memory and object-key searches become much more cache-friendly: you
  probably touch most of the key portion of the object, but none of the
  values portion, until you know exactly what part of the latter to fetch.
  This approach might complicate the lookup logic marginally but I bet not
  very much; and it will be a huge help if we ever want to do smart access
  to EXTERNAL (non-compressed) JSON values.

  I will go prototype that just to see how much code rearrangement is
  required.

 This looks pretty good from a coding point of view.  I have not had time
 yet to see if it affects the speed of the benchmark cases we've been
 trying.  I suspect that it won't make much difference in them.  I think
 if we do decide to make an on-disk format change, we should seriously
 consider including this change.

 The same concept could be applied to offset-based storage of course,
 although I rather doubt that we'd make that combination of choices since
 it would be giving up on-disk compatibility for benefits that are mostly
 in the future.

 Attached are two patches: one is a delta against the last jsonb-lengths
 patch I posted, and the other is a merged patch showing the total change
 from HEAD, for ease of application.

 regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Arthur Silva
It won't be faster by any means, but it should definitely be incorporated
if any format changes are made (like Tom already suggested).

I think it's important we gather at least 2 more things before making any
calls:
* Josh's tests w/ the cache-aware patch, which should confirm cache-aware is
indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions


--
Arthur Silva



On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote:
  The difference is small but I's definitely faster, which makes sense
 since
  cache line misses are probably slightly reduced.
  As in the previous runs, I ran the query a dozen times and took the
 average
  after excluding runs with a high deviation.

 I'm not surprised that it hasn't beaten HEAD. I haven't studied the
 problem in detail, but I don't think that the cache awareness of the
 new revision is necessarily a distinct advantage.

 --
 Peter Geoghegan



Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-24 Thread Arthur Silva
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus j...@agliodbs.com wrote:

 On 08/20/2014 03:42 PM, Arthur Silva wrote:
  What data are you using right now Josh?

 The same data as upthread.

 Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of
 Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload?

 I'm concerned that my workload is unusual and don't want us to make this
 decision based entirely on it.

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



Here's my test results so far with the github archive data.

It's important to keep in mind that the PushEvent event objects that I use
in the queries only contain a small number of keys (8 to be precise), so
these tests don't really stress the changed code.

Anyway, in this dataset (with the small objects) using the all-lengths
patch provides small compression savings, but the overhead is minimal.



Test data: 610MB of Json -- 341969 items

Index size (jsonb_ops): 331MB

Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @>
'{"type": "PushEvent"}'
Test query 1 items: 169732

Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:


HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB

Test query 1 runtime: 680ms
Test query 2 runtime: 405ms

HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB

Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB

Test query 1 runtime: 720ms
Test query 2 runtime: 420ms

All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB

Test query 1 runtime: 525ms
Test query 2 runtime: 355ms


--
Arthur Silva


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-20 Thread Arthur Silva
What data are you using right now Josh?

There's the github archive http://www.githubarchive.org/
Here's some sample data https://gist.github.com/igrigorik/2017462




--
Arthur Silva



On Wed, Aug 20, 2014 at 6:09 PM, Josh Berkus j...@agliodbs.com wrote:

 On 08/20/2014 08:29 AM, Tom Lane wrote:
  Josh Berkus j...@agliodbs.com writes:
  On 08/15/2014 04:19 PM, Tom Lane wrote:
  Personally I'd prefer to go to the all-lengths approach, but a large
  part of that comes from a subjective assessment that the hybrid
 approach
  is too messy.  Others might well disagree.
 
  ... So, that extraction test is about 1% *slower* than the basic Tom
 Lane
  lengths-only patch, and still 80% slower than original JSONB.  And it's
  the same size as the lengths-only version.
 
  Since it's looking like this might be the direction we want to go, I took
  the time to flesh out my proof-of-concept patch.  The attached version
  takes care of cosmetic issues (like fixing the comments), and includes
  code to avoid O(N^2) penalties in findJsonbValueFromContainer and
  JsonbIteratorNext.  I'm not sure whether those changes will help
  noticeably on Josh's test case; for me, they seemed worth making, but
  they do not bring the code back to full speed parity with the all-offsets
  version.  But as we've been discussing, it seems likely that those costs
  would be swamped by compression and I/O considerations in most scenarios
  with large documents; and of course for small documents it hardly
 matters.

 Table sizes and extraction times are unchanged from the prior patch
 based on my workload.

 We should be comparing all-lengths vs length-and-offset maybe using
 another workload as well ...

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



Re: [HACKERS] [PATCH] Incremental backup: add backup profile to base backup

2014-08-18 Thread Arthur Silva
On Mon, Aug 18, 2014 at 10:05 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 08/18/2014 08:05 AM, Alvaro Herrera wrote:

 Marco Nenciarini wrote:

  To calculate the md5 checksum I've used the md5 code present in pgcrypto
 contrib as the code in src/include/libpq/md5.h is not suitable for large
 files. Since a core feature cannot depend on a piece of contrib, I've
 moved the files

 contrib/pgcrypto/md5.c
 contrib/pgcrypto/md5.h

 to

 src/backend/utils/hash/md5.c
 src/include/utils/md5.h

 changing the pgcrypto extension to use them.


 We already have the FNV checksum implementation in the backend -- can't
 we use that one for this and avoid messing with MD5?

 (I don't think we're looking for a cryptographic hash here.  Am I wrong?)


 Hmm. Any user that can update a table can craft such an update that its
 checksum matches an older backup. That may seem like an onerous task; to
 correctly calculate the checksum of a file in a previous, you need to know
 the LSNs and the exact data, including deleted data, on every block in the
 table, and then construct a suitable INSERT or UPDATE that modifies the
 table such that you get a collision. But for some tables it could be
 trivial; you might know that a table was bulk-loaded with a particular LSN
 and there are no dead tuples. Or you can simply create your own table and
 insert exactly the data you want. Messing with your own table might seem
 harmless, but it'll e.g. let you construct a case where an index points to
 a tuple that doesn't exist anymore, or there's a row that doesn't pass a
 CHECK-constraint that was added later. Even if there's no direct security
 issue with that, you don't want that kind of uncertainty from a backup
 solution.

 But more to the point, I thought the consensus was to use the highest LSN
 of all the blocks in the file, no? That's essentially free to calculate (if
 you have to read all the data anyway), and isn't vulnerable to collisions.

 - Heikki






We also have both crc32 and crc64 implementations in pg_crc. If the goal is
just verifying file integrity (we can't really protect against intentional
modification) crc sounds more appropriate to me.
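
A sketch of what a block-wise CRC pass over a relation segment could look
like, using zlib's crc32() purely as a stand-in for whichever CRC
implementation the backend ends up exposing:

/* Sketch: checksum a file of 8kB blocks incrementally with zlib's crc32().
 * Build with: gcc crcfile.c -lz */
#include <stdio.h>
#include <zlib.h>

#define BLCKSZ 8192                       /* PostgreSQL's default block size */

int main(int argc, char **argv)
{
    unsigned char buf[BLCKSZ];
    uLong crc = crc32(0L, Z_NULL, 0);     /* zlib's way of getting the seed */
    size_t n;
    FILE *f;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: crcfile <segment file>\n");
        return 1;
    }
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        crc = crc32(crc, buf, (uInt) n);
    fclose(f);

    printf("crc32 = %08lx\n", crc);
    return 0;
}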


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-16 Thread Arthur Silva
On Fri, Aug 15, 2014 at 8:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Arthur Silva arthur...@gmail.com writes:
  We should add some sort of versionning to the jsonb format. This can be
  explored in the future in many ways.

 If we end up making an incompatible change to the jsonb format, I would
 support taking the opportunity to stick a version ID in there.  But
 I don't want to force a dump/reload cycle *only* to do that.

  As for the current problem, we should explore the directory at the end
  option. It should improve compression and keep good access performance.

 Meh.  Pushing the directory to the end is just a band-aid, and since it
 would still force a dump/reload, it's not a very enticing band-aid.
 The only thing it'd really fix is the first_success_by issue, which
 we could fix *without* a dump/reload by using different compression
 parameters for jsonb.  Moving the directory to the end, by itself,
 does nothing to fix the problem that the directory contents aren't
 compressible --- and we now have pretty clear evidence that that is a
 significant issue.  (See for instance Josh's results that increasing
 first_success_by did very little for the size of his dataset.)

 I think the realistic alternatives at this point are either to
 switch to all-lengths as in my test patch, or to use the hybrid approach
 of Heikki's test patch.  IMO the major attraction of Heikki's patch
 is that it'd be upward compatible with existing beta installations,
 ie no initdb required (but thus, no opportunity to squeeze in a version
 identifier either).  It's not showing up terribly well in the performance
 tests I've been doing --- it's about halfway between HEAD and my patch on
 that extract-a-key-from-a-PLAIN-stored-column test.  But, just as with my
 patch, there are things that could be done to micro-optimize it by
 touching a bit more code.

 I did some quick stats comparing compressed sizes for the delicio.us
 data, printing quartiles as per Josh's lead:

 all-lengths {440,569,609,655,1257}
 Heikki's patch  {456,582,624,671,1274}
 HEAD{493,636,684,744,1485}

 (As before, this is pg_column_size of the jsonb within a table whose rows
 are wide enough to force tuptoaster.c to try to compress the jsonb;
 otherwise many of these values wouldn't get compressed.)  These documents
 don't have enough keys to trigger the first_success_by issue, so that
 HEAD doesn't look too awful, but still there's about an 11% gain from
 switching from offsets to lengths.  Heikki's method captures much of
 that but not all.

 Personally I'd prefer to go to the all-lengths approach, but a large
 part of that comes from a subjective assessment that the hybrid approach
 is too messy.  Others might well disagree.

 In case anyone else wants to do measurements on some more data sets,
 attached is a copy of Heikki's patch updated to apply against git tip.

 regards, tom lane


I agree that versioning might sound silly at this point, but let's keep it
in mind.
Row-level compression is very slow itself, so it sounds odd to me to pay a
25% performance penalty everywhere for the sake of having a better
compression ratio in the dictionary area.
Consider, for example, an optimization that stuffs integers (up to 28 bits)
inside the JEntry itself. That alone would save 8 bytes for each integer.
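
To make that last point concrete, here is a purely hypothetical sketch of
stuffing a small integer into a 32-bit JEntry-style word; the bit layout is
invented for illustration and is not the actual JEntry format:

/* Hypothetical packing of small integers into a 32-bit entry word: a 4-bit
 * tag plus a 28-bit signed payload.  NOT jsonb's real JEntry layout; it just
 * shows how common integer values could avoid a separate aligned datum. */
#include <stdint.h>
#include <stdio.h>

#define JE_TAG_MASK      0xF0000000u    /* top 4 bits: entry type (invented) */
#define JE_TAG_SMALLINT  0x10000000u
#define JE_PAYLOAD_MASK  0x0FFFFFFFu    /* low 28 bits: the value itself */

static uint32_t pack_smallint(int32_t v)     /* caller ensures v fits in 28 bits */
{
    return JE_TAG_SMALLINT | ((uint32_t) v & JE_PAYLOAD_MASK);
}

static int32_t unpack_smallint(uint32_t entry)
{
    int32_t v = (int32_t) (entry & JE_PAYLOAD_MASK);

    if (v & 0x08000000)                      /* sign-extend the 28-bit value */
        v |= (int32_t) 0xF0000000u;
    return v;
}

int main(void)
{
    uint32_t e = pack_smallint(-12345);
    printf("packed 0x%08x -> %d\n", e, unpack_smallint(e));
    return 0;
}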


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-15 Thread Arthur Silva
I'm still getting up to speed on postgres development but I'd like to leave
an opinion.

We should add some sort of versioning to the jsonb format. This can be
explored in the future in many ways.

As for the current problem, we should explore the directory at the end
option. It should improve compression and keep good access performance.

A 4-byte header is sufficient to store the directory offset and some
versioning bits.
 On 15/08/2014 17:39, Tom Lane t...@sss.pgh.pa.us wrote:

 Josh Berkus j...@agliodbs.com writes:
  On 08/14/2014 07:24 PM, Tom Lane wrote:
  We can certainly reduce that.  The question was whether it would be
  worth the effort to try.  At this point, with three different test
  data sets having shown clear space savings, I think it is worth
  the effort.  I'll poke into it tomorrow or over the weekend, unless
  somebody beats me to it.

  Note that I specifically created that data set to be a worst case: many
  top-level keys, no nesting, and small values.  However, I don't think
  it's an unrealistic worst case.

  Interestingly, even on the unpatched, 1GB table case, the *index* on the
  JSONB is only 60MB.  Which shows just how terrific the improvement in
  GIN index size/performance is.

 I've been poking at this, and I think the main explanation for your result
 is that with more JSONB documents being subject to compression, we're
 spending more time in pglz_decompress.  There's no free lunch in that
 department: if you want compressed storage it's gonna cost ya to
 decompress.  The only way I can get decompression and TOAST access to not
 dominate the profile on cases of this size is to ALTER COLUMN SET STORAGE
 PLAIN.  However, when I do that, I do see my test patch running about 25%
 slower overall than HEAD on an explain analyze select jfield - 'key'
 from table type of query with 200-key documents with narrow fields (see
 attached perl script that generates the test data).

 It seems difficult to improve much on that for this test case.  I put some
 logic into findJsonbValueFromContainer to calculate the offset sums just
 once not once per binary-search iteration, but that only improved matters
 5% at best.  I still think it'd be worth modifying the JsonbIterator code
 to avoid repetitive offset calculations, but that's not too relevant to
 this test case.

 Having said all that, I think this test is something of a contrived worst
 case.  More realistic cases are likely to have many fewer keys (so that
 speed of the binary search loop is less of an issue) or else to have total
 document sizes large enough that inline PLAIN storage isn't an option,
 meaning that detoast+decompression costs will dominate.

 regards, tom lane


