Re: [HACKERS] Pg V10: Patch for bug in bonjour support

2017-11-08 Thread Luke Lonergan
Hey Tom,

On 11/8/17, 4:39 PM, "Tom Lane"  wrote:

So now I'm wondering if you know something the rest of us don't about
how to configure the platform for bonjour to work.

Nope – in fact, I hadn’t tried to use Bonjour on this instance, but had only 
enabled it thinking I’d want it within my network to be discoverable via 
Bonjour…

Given these issues I’d recommend disabling the functionality for non-Mac until 
someone can get a workable solution in place – then at least it will avoid 
people like me trying to enable a broken feature and failing…

- Luke






Re: [HACKERS] Pg V10: Patch for bug in bonjour support

2017-11-08 Thread Luke Lonergan
# dpkg -S !$

dpkg -S /usr/lib/x86_64-linux-gnu/libdns_sd.so.1.0.0

libavahi-compat-libdnssd1:amd64: /usr/lib/x86_64-linux-gnu/libdns_sd.so.1.0.0

 

Also:
  ii  libavahi-compat-libdnssd1:amd64  0.6.32-1ubuntu1  amd64  Avahi Apple Bonjour compatibility library

 

- Luke

 

On 11/8/17, 3:00 PM, "Tom Lane"  wrote:


BTW, when I try this on Fedora 25, it builds cleanly but the feature
doesn't seem to work --- I get this at postmaster start:

I wonder which libdns_sd you are using.

 



Re: [HACKERS] Pg V10: Patch for bug in bonjour support

2017-11-08 Thread Luke Lonergan
Hi Tom – works for me on Linux (Ubuntu)…

- Luke

*** /home/llonergan/archive/configure.in2017-11-08 14:17:56.804891827 
-0800
--- configure.in2017-11-08 14:15:58.961186149 -0800
***
*** 1293,1298 
--- 1293,1299 
  
  if test "$with_bonjour" = yes ; then
AC_CHECK_HEADER(dns_sd.h, [], [AC_MSG_ERROR([header file <dns_sd.h> is
required for Bonjour])])
+   AC_SEARCH_LIBS([DNSServiceRefSockFD],[dns_sd])
  fi
  
  # for contrib/uuid-ossp







[HACKERS] Pg V10: Patch for bug in bonjour support

2017-11-08 Thread Luke Lonergan
Hi all – I’m doing some geo analysis and was excited to see all the new 
features in V10 – particularly the declarative partitioning support!

 

Found a tiny bug in the build for Bonjour – patch below:

 

*** configure.in  2017-10-02 14:09:15.0 -0700

--- /home/llonergan/archive/configure.in    2017-11-08 12:53:29.522584528 -0800

***

*** 1293,1298 

--- 1293,1299 

  

  if test "$with_bonjour" = yes ; then

AC_CHECK_HEADER(dns_sd.h, [], [AC_MSG_ERROR([header file <dns_sd.h> is
required for Bonjour])])

+   AC_CHECK_LIB(dns_sd, DNSServiceRefSockFD, [], [AC_MSG_ERROR([library 
'dns_sd' is required for Bonjour])])

  fi

  

  # for contrib/uuid-ossp

 

- Luke



Re: [HACKERS] 8.3 / 8.2.6 restore comparison

2008-02-19 Thread Luke Lonergan
Bah.

It's the stuff in the format translation path and conversion to/from datums
that is the bottleneck.

We sped up COPY TO recently by a factor of 10 using similar approaches to
what we did for COPY FROM in the past.  There's a format conversion that is
the culprit.

We routinely get about 12 MB/s of heap insertion rate per CPU core and it's
CPU bound.

You can peek in on what's happening using gstack on Linux, or the gdb
attach and print stacktrace approach for a crude profile.

- Luke

On 2/19/08 1:36 PM, Joshua D. Drake [EMAIL PROTECTED] wrote:

 
 On Tue, 19 Feb 2008 13:21:46 -0800
 Joshua D. Drake [EMAIL PROTECTED] wrote:
 
 Were there any thoughts on this? I am also finding that backing up
 this database is rudely slow with the same type of odd metrics
 (almost zero (or zero) iowait). We can saturate a CPU but the CPU is
 certainly not pushing the data to disk as fast as it could.
 
 Further on this. We have tested on RHEL4 and RHEL5. There are two
 machines, each with 32GB of ram. I have four of these in the RHEL 5
 machine:
 
 processor   : 0
 vendor_id   : AuthenticAMD
 cpu family  : 15
 model   : 65
 model name  : Dual-Core AMD Opteron(tm) Processor 8216
 stepping: 2
 cpu MHz : 2411.132
 cache size  : 1024 KB
 physical id : 0
 siblings: 2
 core id : 0
 cpu cores   : 2
 fpu : yes
 fpu_exception   : yes
 cpuid level : 1
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
 mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
 fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
 extapic cr8_legacy
 bogomips: 4823.59
 TLB size: 1024 4K pages
 clflush size: 64
 cache_alignment : 64
 address sizes   : 40 bits physical, 48 bits virtual
 power management: ts fid vid ttp tm stc
 
 Here is the mini specs:
 
 # 5U Q524 - Quad Opteron 24 SCSI
 # Tyan S4885G3NR 800 Series CPU Opteron 800
 # AMD Opteron 880 - 2.4GHz 2Core x 4
 # 32GB - DDR ECC REG 400MHz (16x2GB) x 1 (16 x 2GB 3200 ECC REG Smart
 Modular) (32GB)
 # HD 1: 73GB SCSI 15K RPM x 24
 # PCI 1: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
 # MegaRaid LSIiTBBU01 Battery - Order #: LSI9
 # PCI 2: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
 # MegaRaid LSIiTBBU01 Battery - Order #: LSI9
 # DVD-ROM/Sony 3.5 Inch Floppy Drive
 
 The configuration is:
 
 / RAID 1
 / xlogs RAID 1
 /data1 10 drives RAID 10
 /data2 10 drives RAID 10
 
 The thing that is frustrating here, is it appears that PostgreSQL just
 can't utilize the hardware. I *know* it can because we have larger
 machines in production that use PostgreSQL happily. However when I have
 220G backups taking 8 hours and restores taking 11 hours, I begin to
 wonder where the bottleneck is.
 
 Assuming 25 megs a second per drive (random write) on data1 and data2
 we should be pushing 250M a second. Just to be insanely conservative
 let's cut that in half to 125M per second. That means every 10 seconds
 we should do ~ 1G. That means every minute we should do ~ 6G, which
 means 360G an hour.
 
 Granted we have index creation and everything else going on but 11
 hours and no IO chewing?
 
 As a note these are reproducible on both machines regardless of RHEL5
 or RHEL4.
 
 I know there are not many stats here, but I have provided them in previous
 posts on this thread. Perhaps someone sees a red flag in the hardware?
 
 
 Sincerely,
 
 Joshua D. Drake
 - -- 
 The PostgreSQL Company since 1997: http://www.commandprompt.com/
 PostgreSQL Community Conference: http://www.postgresqlconference.org/
 Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
 PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit
 
 


Re: [HACKERS] How to keep a table in memory?

2007-11-12 Thread Luke Lonergan
Vacuum is a better thing to run, much less CPU usage.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Greg Smith [mailto:[EMAIL PROTECTED]
Sent:   Monday, November 12, 2007 11:59 PM Eastern Standard Time
To: Alex Drobychev
Cc: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] How to keep a table in memory?

On Mon, 12 Nov 2007, Alex Drobychev wrote:

 Or any other ideas for pinning a table in memory?

If the table you're worried about is only 20MB, have you considered just 
running something regularly that touches the whole thing?  This may be the 
only time I've ever considered running select count(*) from x as a 
productive move.  That would waste some CPU, but it would help those pages 
win the eviction war as you say.

You definitely should follow up on the suggestion given to look at the
pg_buffercache contrib module to get a better idea what's going on under 
the LRU hood.  In fact, you may want to install a tweak that's standard in 
8.3 to show the usage counts in order to better get a feel for what's 
going on; the appendix on my article at 
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm goes 
into this a bit, with the documentation to pg_buffercache having the rest 
of what you'd need.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Including Snapshot Info with Indexes

2007-10-20 Thread Luke Lonergan
Hi Hannu,

On 10/14/07 12:58 AM, Hannu Krosing [EMAIL PROTECTED] wrote:

 What has happened in reality, is that the speed difference between CPU,
 RAM and disk speeds has _increased_ tremendously

Yes.

 which makes it even
 more important to _decrease_ the size of stored data if you want good
 performance

Or bring the cpu processing closer to the data it's using (or both).

By default, the trend you mention first will continue in an unending way -
the consequence is that the distance between a processor and its target
data will continue to increase ad infinitum.

By contrast, you can only decrease the data volume so much - so in the end
you'll be left with the same problem - the data needs to be closer to the
processing.  This is the essence of parallel / shared nothing architecture.

Note that we've done this at Greenplum.  We're also implementing a DSM-like
capability and are investigating a couple of different hybrid row / column
store approaches.

Bitmap index with index-only access does provide nearly all of the
advantages of a column store from a speed standpoint BTW.  Even though
Vertica is touting speed advantages - our parallel engine plus bitmap index
will crush them in benchmarks when they show up with real code.

Meanwhile they're moving on to new ideas - I kid you not, "Horizontica" is
Dr. Stonebraker's new idea :-)

So - bottom line - some ideas from column store make sense, but it's not a
cure-all.
 
 There is also a MonetDB/X100 project, which tries to make MonetDB
 order(s) of magnitude faster by doing in-page compression in order to
 get even more performance, see:

Actually, the majority of the points made by the MonetDB team involve
decreasing the abstractions in the processing path to improve the IPC
(instructions per clock) efficiency of the executor.

We are also planning to do this by operating on data in vectors of projected
rows in the executor, which will increase the IPC by reducing I-cache misses
and improving D-cache locality.  Tight loops will make a much bigger
difference when long runs of data are the target operands.
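
As a rough illustration of the tight-loop idea above (a sketch only; the
array-of-projected-values layout and names are illustrative, not PostgreSQL's
actual executor API):

#include <stddef.h>

/* Tuple-at-a-time: every row goes through an indirect call, so the CPU
 * spends much of its time on call/return and branch overhead. */
typedef long (*datum_getter)(void *state, size_t row);

long sum_tuple_at_a_time(datum_getter next, void *state, size_t nrows)
{
    long sum = 0;
    for (size_t i = 0; i < nrows; i++)
        sum += next(state, i);          /* one indirect call per row */
    return sum;
}

/* Vector-at-a-time: the scan projects a run of rows into a plain array
 * first, and the operator runs a tight loop over it.  Long runs of
 * contiguous operands keep the I-cache footprint small and the D-cache
 * access pattern sequential, which is where the IPC gain comes from. */
long sum_vector_at_a_time(const long *projected, size_t nrows)
{
    long sum = 0;
    for (size_t i = 0; i < nrows; i++)
        sum += projected[i];
    return sum;
}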

- Luke 





Re: [HACKERS] top for postgresql (ptop?)

2007-09-25 Thread Luke Lonergan
Hi Mark,

I haven't yet looked at what you've done, but I'm an enthusiastic supporter
of this idea.  We're looking to do something that will view running queries
and allow drill down into those executing at any given time, showing their
plans and some notion of what operators are being executed.

The idea of a ptop that shows running queries using a curses interface seems
like a great start.

Our needs for data warehousing workloads are going to be different from the
OLTP users - our queries hang around long enough to warrant a drill-down.

How far can you take the drill-down piece?  Have you thought about how to
acquire the status in the executor yet?  One strategy we've considered is to
use the same approach as pstack on Solaris - it takes a pid and dumps the
stack of a backend, which clearly shows which executor node is being worked
on currently.  I think pstack uses dtrace underneath the hood...

- Luke 


On 9/25/07 3:00 AM, Mark Wong [EMAIL PROTECTED] wrote:

 Hi everyone,
 
 I was playing with converting unixtop (the version of top used in
 FreeBSD) to only show PostgreSQL processes pulled from the
 pg_stat_activity table.  I have a version that kind of works here:
 
 http://pgfoundry.org/frs/download.php/1468/ptop-3.6.1-pre6.tar.gz
 
 I've tried it on FreeBSD and Linux, not sure about other platforms
 though.  So it looks a lot like top and can currently do a few simple
 things like display the current_query from pg_stat_activity for a
 given process, show the locks held by a process and on which tables,
 and show the query plan for the current query.  It is a ways from
 polished (not really documented, etc.) but I wanted to see what people
 thought of a text/curses sort of monitoring tool like this.  Maybe
 something to distribute with PostgreSQL? :)
 
 Forgive me if I didn't try out pgtop (the CPAN module.)
 
 Regards,
 Mark
 





Re: [HACKERS] top for postgresql (ptop?)

2007-09-25 Thread Luke Lonergan
Hi Greg,

On 9/25/07 9:15 PM, Greg Smith [EMAIL PROTECTED] wrote:

 It's also worth noting that there's a similar Linux utility called gstack.

Cool!

So - the basic idea is that we could uncover the current run context in a
very lightweight manner by just dumping the stack and interpreting it.  This
is pretty messy on the client side because of the ugly reconstruction, but
is very unobtrusive to the running query.

An alternative might be to take the plan tree, augment it with stats and
store it in a table, maybe augment the backend to catch a certain signal
(SIGUSR maybe?) and that would cause an update to the table?  That way we'd
have the desired feature that the tracking isn't continuous, it's based on a
peek approach, which is only as obtrusive as needed.
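
A minimal sketch of that peek-on-demand idea (names like dump_plan_with_stats
are hypothetical, not existing backend functions):

#include <signal.h>
#include <string.h>

/* The handler only sets a flag; the executor checks it at a safe point, so
 * the tracking costs nothing until someone actually asks for a snapshot. */
static volatile sig_atomic_t dump_requested = 0;

static void handle_dump_request(int signo)
{
    (void) signo;
    dump_requested = 1;                 /* async-signal-safe: just set a flag */
}

void install_dump_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_dump_request;
    sigaction(SIGUSR2, &sa, NULL);      /* signal number chosen arbitrarily here */
}

/* Called between plan nodes, or every N tuples, inside the executor loop. */
void maybe_dump_execution_state(void *planstate)
{
    extern void dump_plan_with_stats(void *planstate);   /* hypothetical */

    if (dump_requested)
    {
        dump_requested = 0;
        dump_plan_with_stats(planstate);   /* write plan + counters to a table */
    }
}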

- Luke





Re: [HACKERS] Raw device I/O for large objects

2007-09-17 Thread Luke Lonergan
Index organized tables would do this and it would be a generic capability.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Georgi Chulkov [mailto:[EMAIL PROTECTED]
Sent:   Monday, September 17, 2007 11:50 PM Eastern Standard Time
To: Tom Lane
Cc: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Raw device I/O for large objects

Hi,

 We've heard this idea proposed before, and it's been shot down as a poor
 use of development effort every time.  Check the archives for previous
 threads, but the basic argument goes like this: when Oracle et al did
 that twenty years ago, it was a good idea because (1) operating systems
 tended to have sucky filesystems, (2) performance and reliability
 properties of same were not very consistent across platforms, and (3)
 being large commercial software vendors they could afford to throw lots
 of warm bodies at anything that seemed like a bottleneck.  None of those
 arguments holds up well for us today however.  If you think you want to
 reimplement a filesystem you need to have some pretty concrete reasons
 why you can outsmart all the smart folks who have worked on
 your-favorite-OS's filesystems for lo these many years.  There's also
 the fact that on any reasonably modern disk hardware, raw I/O is
 anything but.

Thanks, I agree with all your arguments.

Here's the reason why I'm looking at raw device storage for large objects only 
(as opposed to all tables): with raw device I/O I can control, to an extent, 
spatial locality. So, if I have an application that wants to store N large 
objects (totaling several gigabytes) and read them back in some order that is 
well-known in advance, I could store my large objects in that order on the 
raw device.* Sequentially reading them back would then be very efficient. 
With a file system underneath, I don't have that freedom. (Such a scenario 
occurs with raster databases, for example.)

* assuming I have a way to communicate these requirements; that's a whole new 
problem

Please allow me to ask then:
1. In your opinion, would the above scenario indeed benefit from a raw-device 
interface for large objects?
2. How feasible it is to decouple general table storage from large object 
storage?

Thank you for your time,

Georgi



[HACKERS] FW: was [PERFORM] partitioned table and ORDER BY indexed_field DESC LIMIT 1

2007-08-25 Thread Luke Lonergan
Below is a patch against Greenplum Database that fixes the problem.

- Luke

-- Forwarded Message
From: Luke Lonergan [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 09:25:53 -0700
To: Heikki Linnakangas [EMAIL PROTECTED], Anton [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Conversation: [PERFORM] partitioned table and ORDER BY indexed_field DESC
LIMIT 1
Subject: Re: [PERFORM] partitioned table and ORDER BY indexed_field DESC
LIMIT 1

Below is a patch against 8.2.4 (more or less), Heikki can you take a look at
it?

This enables the use of index scan of a child table by recognizing sort
order of the append node.  Kurt Harriman did the work.

- Luke

Index: cdb-pg/src/backend/optimizer/path/indxpath.c
===
RCS file: 
/data/FISHEYE_REPOSITORIES/greenplum/cvsroot/cdb2/cdb-pg/src/backend/optimiz
er/path/indxpath.c,v
diff -u -N -r1.22 -r1.22.2.1
--- cdb-pg/src/backend/optimizer/path/indxpath.c    25 Apr 2007 22:07:21 -0000    1.22
+++ cdb-pg/src/backend/optimizer/path/indxpath.c    10 Aug 2007 03:41:15 -0000    1.22.2.1
@@ -379,8 +379,51 @@
 index_pathkeys = build_index_pathkeys(root, index,
   ForwardScanDirection,
   true);
-useful_pathkeys = truncate_useless_pathkeys(root, rel,
-index_pathkeys);
+/*
+ * CDB: For appendrel child, pathkeys contain Var nodes in terms
+ * of the child's baserel.  Transform the pathkey list to refer to
+ * columns of the appendrel.
+ */
+if (rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+{
+AppendRelInfo  *appinfo = NULL;
+RelOptInfo *appendrel = NULL;
+ListCell   *appcell;
+CdbPathLocusnotalocus;
+
+/* Find the appendrel of which this baserel is a child. */
+foreach(appcell, root->append_rel_list)
+{
+appinfo = (AppendRelInfo *)lfirst(appcell);
+if (appinfo->child_relid == rel->relid)
+break;
+}
+Assert(appinfo);
+appendrel = find_base_rel(root, appinfo->parent_relid);
+
+/*
+ * The pathkey list happens to have the same format as the
+ * partitioning key of a Hashed locus, so by disguising it
+ * we can use cdbpathlocus_pull_above_projection() to do the
+ * transformation.
+ */
+CdbPathLocus_MakeHashed(notalocus, index_pathkeys);
+notalocus =
+cdbpathlocus_pull_above_projection(root,
+   notalocus,
+   rel->relids,
+   rel->reltargetlist,
+   appendrel->reltargetlist,
+   appendrel->relid);
+if (CdbPathLocus_IsHashed(notalocus))
+index_pathkeys = truncate_useless_pathkeys(root, appendrel,
+  notalocus.partkey);
+else
+index_pathkeys = NULL;
+}
+
+useful_pathkeys = truncate_useless_pathkeys(root, rel,
+index_pathkeys);
 }
 else
 useful_pathkeys = NIL;
Index: cdb-pg/src/backend/optimizer/path/pathkeys.c
===
RCS file: 
/data/FISHEYE_REPOSITORIES/greenplum/cvsroot/cdb2/cdb-pg/src/backend/optimiz
er/path/pathkeys.c,v
diff -u -N -r1.18 -r1.18.2.1
--- cdb-pg/src/backend/optimizer/path/pathkeys.c    30 Apr 2007 05:44:07 -0000    1.18
+++ cdb-pg/src/backend/optimizer/path/pathkeys.c    10 Aug 2007 03:41:15 -0000    1.18.2.1
@@ -1403,55 +1403,53 @@
 {
 PathKeyItem*item;
 Expr   *newexpr;
+AttrNumber  targetindex;
 
 Assert(pathkey);
 
-/* Use constant expr if available.  Will be at head of list. */
-if (CdbPathkeyEqualsConstant(pathkey))
+/* Find an expr that we can rewrite to use the projected columns. */
+item = cdbpullup_findPathKeyItemInTargetList(pathkey,
+ relids,
+ targetlist,
+ &targetindex); // OUT
+
+/* If not found, see if the equiv class contains a constant expr. */
+if (!item &&
+CdbPathkeyEqualsConstant(pathkey))
 {
 item = (PathKeyItem *)linitial(pathkey);
+newexpr = (Expr *)copyObject(item->key);
 }
 
-/* New vars for old! */
-else
-{
-AttrNumber  targetindex;
+/* Fail

Re: [HACKERS] tsearch2 in PostgreSQL 8.3?

2007-08-17 Thread Luke Lonergan
All - we have customers who very much want tsearch2 and will benefit from its 
inclusion in core.

We are also struggling with the update trigger approach for various reasons.

Is there a good alternative?  Can we embed tsvector updates into the core code 
efficiently?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Friday, August 17, 2007 11:28 AM Eastern Standard Time
To: Oleg Bartunov
Cc: Josh Berkus; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] tsearch2 in PostgreSQL 8.3? 

Oleg Bartunov [EMAIL PROTECTED] writes:
 On Thu, 16 Aug 2007, Josh Berkus wrote:
 First off, I'll assert that backup/restore is a serious issue and while the
 folks who want Tsearch in core now are dismissing it, we'll be fielding the
 complaints later.  Any solution which involves setting a GUC at restore time
 *which could vary per table or even column* isn't acceptable.

 Josh, all my respects to you, but text searching is not about index at all.
 Text searching is about tsvector and tsquery data type

What's your point?  The problem is just as bad for an auto-update
trigger that computes a stored tsvector column.  If the trigger's
behavior depends on the GUC settings of the person doing an insert,
things will soon be a mess --- do you really want the tsvector contents
to change after an update of an unrelated field?  After awhile you
won't have any idea what's really in the column, because you won't
have any good way to know which rows' tsvectors were generated with
which configurations.

Even if that state of affairs is really what you want, reproducing
it after a dump/reload will be tricky.  I think that a regular
schema-and-data dump would work, because pg_dump doesn't install
triggers until after it's loaded the data ... but a data-only dump
would *not* work, because the trigger would fire while loading rows.

Basically I see no use for a setup in which the configuration used
for a particular tsvector value is not fully determined by the table
definition.  Whether the value is in an index or in the table proper
is irrelevant to this argument.

regards, tom lane



Re: [HACKERS] Another idea for index-only scans

2007-08-15 Thread Luke Lonergan
A hybrid scan approach combined with this idea would fit nicely - provide 
results for tids that are directly visible and set a bit in a bitmap for those 
that need recheck and extend recheck to take a bitmap (wait - it already does 
:-)
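
A rough sketch of that hybrid flow (helper names here are made up for
illustration, not the real PostgreSQL API):

#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned page; unsigned offset; } Tid;

extern bool page_is_known_all_visible(unsigned page);              /* hypothetical */
extern void emit_index_only_result(Tid tid);                       /* hypothetical */
extern void bitmap_set_page(void *recheck_bitmap, unsigned page);  /* hypothetical */

/* TIDs on pages known to be all-visible are returned straight from the
 * index; everything else is collected in a bitmap and handed to the
 * existing bitmap-recheck machinery afterwards. */
void hybrid_index_scan(const Tid *tids, size_t ntids, void *recheck_bitmap)
{
    for (size_t i = 0; i < ntids; i++)
    {
        if (page_is_known_all_visible(tids[i].page))
            emit_index_only_result(tids[i]);               /* no heap visit */
        else
            bitmap_set_page(recheck_bitmap, tids[i].page); /* recheck later */
    }
}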

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Gregory Stark [mailto:[EMAIL PROTECTED]
Sent:   Wednesday, August 15, 2007 02:58 PM Eastern Standard Time
To: Bruce Momjian
Cc: PostgreSQL-development
Subject:Re: [HACKERS] Another idea for index-only scans

Bruce Momjian [EMAIL PROTECTED] writes:

 I have added another idea for index-only scans to the TODO list:

   A third idea would be for a heap scan to check if all rows are visible
   and if so set a per-table flag which can be checked by index scans.
   Any change to the table would have to clear the flag.  To detect
   changes during the heap scan a counter could be set at the start and
   checked at the end --- if it is the same, the table has not been
   modified --- any table change would increment the counter.

I think I would prefer to address this in the same infrastructure as the
dead-space-map. That way you're not dependent on having no updates happening
on the table at all. Any tuples on pages which contain no in-doubt tuples
could have their visibility check skipped but when you come across a tuple on
a page which has been modified since the last vacuum then you have to check
the visibility.
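
A rough sketch of the counter idea from the TODO item above, with invented
names (none of these functions exist in the backend):

#include <stdbool.h>
#include <stdint.h>

extern uint64_t rel_change_counter(void *rel);           /* hypothetical: bumped by any write */
extern bool     heap_scan_saw_only_visible(void *rel);   /* hypothetical: full heap scan */
extern void     rel_set_all_visible(void *rel, bool v);  /* hypothetical: per-table flag */

/* Any insert/update/delete bumps the table's counter and clears its
 * all-visible flag; the heap scan only sets the flag if nothing changed
 * while it was looking, so index scans can then skip the heap safely. */
void maybe_mark_all_visible(void *rel)
{
    uint64_t before = rel_change_counter(rel);
    bool     all_visible = heap_scan_saw_only_visible(rel);

    if (all_visible && rel_change_counter(rel) == before)
        rel_set_all_visible(rel, true);
}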

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Seq scans roadmap

2007-05-17 Thread Luke Lonergan
Hi Jeff,

On 5/16/07 4:56 PM, Jeff Davis [EMAIL PROTECTED] wrote:

 The main benefit of a sync scan will be the ability to start the scan where
 other scans have already filled the I/O cache with useful blocks.  This will
 require some knowledge of the size of the I/O cache by the syncscan logic,
 but that's where the largest amount of I/O cache memory (by far) is located.

 I don't think it's necessarily the largest by far. However, it may be
 the largest.

Compare the size of a 32 block ring buffer (between 256KB and 1024KB) and
16,000,000 KB of RAM on a common server, automatically used to maximum
extent by the OS dynamic I/O caching.
  
 If you mean that the main benefit of sync scans is to make use of blocks
 that happen to be in cache before the scan began, I disagree.

That's not what I meant.

 I think
 that's a possible benefit, but I was unable to show any huge benefit in
 my tests (someone may be able to on different hardware with different
 test cases).

I agree, I don't think this is worth pursuing.
 
 The main benefits that I see are:
  (1) reduce total number of blocks read from disk by making use of
 blocks as they are read by another concurrent seqscan.
  (2) eliminate the need for random I/O on concurrent sequential scans.

Yes on (1), but with (2), again, the OS prefetch reduces the seeking to a
minimal level.

With (1), we just have to define the meaning of concurrent to be within
the span of the OS I/O cache and we're describing the same effect.

- Luke





Re: [HACKERS] Seq scans roadmap

2007-05-16 Thread Luke Lonergan
I think the analysis on syncscan needs to take the external I/O cache into
account.  I believe it is not necessary or desirable to keep the scans in
lock step within the PG bufcache.

The main benefit of a sync scan will be the ability to start the scan where
other scans have already filled the I/O cache with useful blocks.  This will
require some knowledge of the size of the I/O cache by the syncscan logic,
but that's where the largest amount of I/O cache memory (by far) is located.

- Luke  


On 5/15/07 3:34 PM, Jim C. Nasby [EMAIL PROTECTED] wrote:

 On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
 On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
 Luke Lonergan wrote:
 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
 effect.
 
 How about using 256/blocksize?
 
 Sounds reasonable. We need to check the effect on the synchronized
 scans, though.
 
 
 I am a little worried that there will be greater differences in position
 as the number of scans increase. If we have only 8 buffers and several
 scans progressing, will they all be able to stay within a few buffers of
 each other at any given time?
 
 Also, with 8 buffers, that means each scan must report every 4 pages at
 most (and maybe every page), which increases lock contention (the new
 design Heikki and I discussed requires a lock every time a backend
 reports its position).
 
 Given that spoiling the L2 cache is a trivial cost compared to extra
 physical IO, ISTM we should go with a largish ring for sync scans. What
 do you think would be the ideal size? 32 buffers?





Re: [HACKERS] Seq scans roadmap

2007-05-15 Thread Luke Lonergan
Heikki,

32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
effect.

How about using 256/blocksize?
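
In concrete terms (a sketch; reading the "256" as kilobytes, and BLCKSZ as
PostgreSQL's compile-time block size):

#ifndef BLCKSZ
#define BLCKSZ 8192                      /* assume the default block size */
#endif

/* Size the ring by bytes rather than by buffer count so it stays inside a
 * typical L2 cache whatever the block size: 32 buffers at 8KB, 8 at 32KB. */
#define SCAN_RING_BYTES    (256 * 1024)
#define SCAN_RING_BUFFERS  (SCAN_RING_BYTES / BLCKSZ)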

- Luke

 -Original Message-
 From: Heikki Linnakangas [mailto:[EMAIL PROTECTED] On 
 Behalf Of Heikki Linnakangas
 Sent: Tuesday, May 15, 2007 2:32 AM
 To: PostgreSQL-development
 Cc: Simon Riggs; Zeugswetter Andreas ADI SD; CK.Tan; Luke 
 Lonergan; Jeff Davis
 Subject: Re: [HACKERS] Seq scans roadmap
 
 Just to keep you guys informed, I've been busy testing and 
 pondering over different buffer ring strategies for vacuum, 
 seqscans and copy. 
 Here's what I'm going to do:
 
 Use a fixed size ring. Fixed as in doesn't change after the 
 ring is initialized, however different kinds of scans use 
 differently sized rings.
 
 I said earlier that it'd be an invasive change to see if a
 buffer needs a WAL flush and choose another victim if that's 
 the case. I looked at it again and found a pretty clean way 
 of doing that, so I took that approach for seq scans.
 
 1. For VACUUM, use a ring of 32 buffers. 32 buffers is small 
 enough to give the L2 cache benefits and keep cache pollution 
 low, but at the same time it's large enough that it keeps the 
 need to WAL flush reasonable
 (1/32 of what we do now).
 
 2. For sequential scans, also use a ring of 32 buffers, but 
 whenever a buffer in the ring would need a WAL flush to 
 recycle, we throw it out of the buffer ring instead. On 
 read-only scans (and scans that only update hint bit) this 
 gives the L2 cache benefits and doesn't pollute the buffer 
 cache. On bulk updates, it's effectively the current 
 behavior. On scans that do some updates, it's something in 
 between. In all cases it should be no worse than what we have 
 now. 32 buffers should be large enough to leave a cache 
 trail for Jeff's synchronized scans to work.
 
 3. For COPY that doesn't write WAL, use the same strategy as 
 for sequential scans. This keeps the cache pollution low and 
 gives the L2 cache benefits.
 
 4. For COPY that writes WAL, use a large ring of 2048-4096 
 buffers. We want to use a ring that can accommodate 1 WAL 
 segment worth of data, to avoid having to do any extra WAL 
 flushes, and the WAL segment size is
 2048 pages in the default configuration.
 
 Some alternatives I considered but rejected:
 
 * Instead of throwing away dirtied buffers in seq scans, 
 accumulate them in another fixed sized list. When the list 
 gets full, do a WAL flush and put them to the shared freelist 
 or a backend-private freelist. That would eliminate the cache 
 pollution of bulk DELETEs and bulk UPDATEs, and it could be 
 used for vacuum as well. I think this would be the optimal 
 algorithm but I don't feel like inventing something that 
 complicated at this stage anymore. Maybe for 8.4.
 
 * Using a different sized ring for 1st and 2nd vacuum phase. 
 Decided that it's not worth the trouble, the above is already 
 an order of magnitude better than the current behavior.
 
 
 I'm going to rerun the performance tests I ran earlier with 
 new patch, tidy it up a bit, and submit it in the next few 
 days. This turned out to be even more laborious patch to 
 review than I thought. While the patch is short and in the 
 end turned out to be very close to Simon's original patch, 
 there's many different usage scenarios that need to be 
 catered for and tested.
 
 I still need to check the interaction with Jeff's patch. This 
 is close enough to Simon's original patch that I believe the 
 results of the tests Jeff ran earlier are still valid.
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
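
A rough sketch of the seq scan ring behavior in point 2 above (invented
names; this is not the actual bufmgr code):

#include <stdbool.h>

#define RING_SIZE 32

typedef struct
{
    int buffers[RING_SIZE];     /* buffer ids, -1 if the slot is empty */
    int next;                   /* next slot to recycle */
} ScanRing;

extern bool buffer_needs_wal_flush(int buf);   /* hypothetical helper */
extern int  allocate_fresh_buffer(void);       /* hypothetical helper */

/* Recycle the next buffer in the ring, unless writing it out would force a
 * WAL flush; in that case drop it from the ring (leave it to the bgwriter)
 * and pull a fresh buffer in instead. */
int ring_get_buffer(ScanRing *ring)
{
    int slot = ring->next;
    int buf  = ring->buffers[slot];

    ring->next = (slot + 1) % RING_SIZE;

    if (buf >= 0 && !buffer_needs_wal_flush(buf))
        return buf;

    buf = allocate_fresh_buffer();
    ring->buffers[slot] = buf;
    return buf;
}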
 
 




Re: [HACKERS] Seq scans roadmap

2007-05-12 Thread Luke Lonergan
Hi Simon,

On 5/12/07 12:35 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 I'm slightly worried that the results for COPY aren't anywhere near as
 good as the SELECT and VACUUM results. It isn't clear from those numbers
 that the benefit really is significant.

COPY is bottlenecked on datum formation and format translation with very low
performance, so I don't think we should expect the ring buffer to make much
of a dent.

- Luke 





Re: [HACKERS] Seq scans roadmap

2007-05-08 Thread Luke Lonergan
Heikki,

On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching.  The experiments have shown that benefit
of re-using PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.

On 3B: The scenario described is multiple readers seq scanning large
table and sharing bufcache, but in practice this is not a common
situation.  The common situation is multiple queries joining several
small tables to one or more large tables that are > 1 x bufcache.  In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.

To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB - 2MB, which is only important
when considering a bound on the size of the ring buffer.  The effect has
been demonstrated to be significant - in the 20%+ range.  Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads.  I
would expect benefits on OLTP as well.

The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution

- Luke

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Heikki Linnakangas
 Sent: Tuesday, May 08, 2007 3:40 AM
 To: PostgreSQL-development
 Cc: Jeff Davis; Simon Riggs
 Subject: [HACKERS] Seq scans roadmap
 
 Here's my roadmap for the scan-resistant buffer cache and 
 synchronized scans patches.
 
 1. Fix the current vacuum behavior of throwing dirty buffers 
 to the freelist, forcing a lot of WAL flushes. Instead, use a 
 backend-private ring of shared buffers that are recycled. 
 This is what Simon's scan-resistant buffer manager did.
 
 The theory here is that if a page is read in by vacuum, it's 
 unlikely to be accessed in the near future, therefore it 
 should be recycled. If vacuum doesn't dirty the page, it's 
 best to reuse the buffer immediately for the next page. 
 However, if the buffer is dirty (and not just because we set 
 hint bits), we ought to delay writing it to disk until the 
 corresponding WAL record has been flushed to disk.
 
 Simon's patch used a fixed size ring of buffers that are 
 recycled, but I think the ring should be dynamically sized. 
 Start with a small ring, and whenever you need to do a WAL 
 flush to write a dirty buffer, increase the ring size. On 
 every full iteration through the ring, decrease its size to 
 trim down an unnecessarily large ring.
 
 This only alters the behavior of vacuums, and it's pretty 
 safe to say it won't get worse than what we have now. In the 
 future, we can use the buffer ring for seqscans as well; more 
 on that on step 3.
 
 2. Implement the list/table of last/ongoing seq scan 
 positions. This is Jeff's synchronized scans patch. When a 
 seq scan starts on a table larger than some threshold, it 
 starts from where the previous seq scan is currently, or 
 where it ended. This will synchronize the scans so that for 
 two concurrent scans the total I/O is halved in the best 
 case. There should be no other effect on performance.
 
 If you have a partitioned table, or union of multiple tables 
 or any other plan where multiple seq scans are performed in 
 arbitrary order, this change won't change the order the 
 partitions are scanned and won't therefore ensure they will 
 be synchronized.
 
 
 Now that we have both pieces of the puzzle in place, it's 
 time to consider what more we can do with them:
 
 
 3A. To take advantage of the cache trail of a previous seq 
 scan, scan 
 backwards from where the previous seq scan ended, until a you hit a 
 buffer that's not in cache.
 
 This will allow taking advantage of the buffer cache even if 
 the table 
 doesn't fit completely in RAM. That can make a big difference if the 
 table size is just slightly bigger than RAM, and can avoid the nasty 
 surprise when a table grows beyond RAM size and queries start taking 
 minutes instead of seconds.
 
 This should be a non-controversial change on its own from performance 
 point of view. No query should get slower, and some will 
 become faster. 
 But see step 3B:
 
 3B. Currently, sequential scans on a large table spoils the 
 buffer cache 
 by evicting other pages from the cache. In CVS HEAD, as 

Re: [HACKERS] Seq scans roadmap

2007-05-08 Thread Luke Lonergan
Heikki,

 That's interesting. Care to share the results of the 
 experiments you ran? I was thinking of running tests of my 
 own with varying table sizes.

Yah - it may take a while - you might get there faster.

There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM.  SELECT
COUNT(*) measures CPU consumption in the aggregation node, not scan
rate.

Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.
 
 The main motivation here is to avoid the sudden drop in 
 performance when a table grows big enough not to fit in RAM. 
 See attached diagram for what I mean. Maybe you're right and 
 the effect isn't that bad in practice.

There are going to be two performance drops, first when the table
doesn't fit into PG bufcache, the second when it doesn't fit in bufcache
+ I/O cache.  The second is severe, the first is almost insignificant
(for common queries).
 
 How is that different from what I described?

My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the first load of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument.  I think this is a non-common
workload, rather there are normally many small tables and several large
tables such that sharing the PG bufcache is irrelevant to the query
speed.

 Yeah I remember the discussion on the L2 cache a while back.
 
 What do you mean with using readahead inside the heapscan? 
 Starting an async read request?

Nope - just reading N buffers ahead for seqscans.  Subsequent calls use
previously read pages.  The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
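
A sketch of that readahead (invented names; a plain pread() stands in for
the real smgr/fd layer):

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE    8192        /* assume the default BLCKSZ */
#define READAHEAD_N  16          /* 16 pages = one 128KB contiguous read */

typedef struct
{
    int  fd;
    long first_cached;           /* first page held in buf, -1 if none */
    char buf[READAHEAD_N * PAGE_SIZE];
} ReadaheadState;

/* The first request for a page outside the cached group issues one large
 * contiguous read covering the next N pages; later requests in that group
 * are served from the staging buffer, so the OS sees a few large sequential
 * reads instead of many page-sized ones. */
int read_page(ReadaheadState *st, long pageno, char *out)
{
    if (st->first_cached < 0 ||
        pageno < st->first_cached ||
        pageno >= st->first_cached + READAHEAD_N)
    {
        if (pread(st->fd, st->buf, sizeof(st->buf),
                  (off_t) pageno * PAGE_SIZE) < 0)
            return -1;
        st->first_cached = pageno;
    }
    memcpy(out, st->buf + (pageno - st->first_cached) * PAGE_SIZE, PAGE_SIZE);
    return 0;
}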
 
  The modifications you suggest here may not have the following
  properties:
  - don't pollute bufcache for seqscan of tables > 1 x bufcache
  - for tables > 1 x bufcache use a ringbuffer for I/O that
 is ~ 32KB to 
  minimize L2 cache pollution
 
 So the difference is that you don't want 3A (the take 
 advantage of pages already in buffer cache) strategy at all, 
 and want the buffer ring strategy to kick in earlier instead. 
 Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).

- Luke




Re: [HACKERS] TOASTing smaller things

2007-04-12 Thread Luke Lonergan
Hi Bruce,

On 4/12/07 9:24 AM, Bruce Momjian [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 Hi Bruce,
 
 How about these:
 
 - Allow specification of TOAST size threshold in bytes on a per column basis
 - Enable storage of columns in separate TOAST tables
 - Enable use of multi-row compression method(s) for TOAST tables
 
 At this point I would be happy just to set the TOAST threshold to a
 value defined as optimal, rather than as the most minimal use of TOAST
 possible.

I agree that's a good starting point, I guess I was thinking that was
already included in the work that Tom has been doing.  If not, we can add a
TODO like this as a precursor to the ones above:

- Allow specification of TOAST size threshold (in bytes) on a per table
basis

- Luke





Re: [HACKERS] TOASTing smaller things

2007-04-10 Thread Luke Lonergan
Hi Bruce,

How about these:

- Allow specification of TOAST size threshold in bytes on a per column basis
- Enable storage of columns in separate TOAST tables
- Enable use of multi-row compression method(s) for TOAST tables

- Luke


On 3/26/07 5:39 PM, Bruce Momjian [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 I advocate the following:
 
 - Enable specification of TOAST policy on a per column basis
 
 As a first step, then:
 
 - Enable vertical partitioning of tables using per-column specification of
 storage policy.
 
 
 How are these different from ALTER TABLE SET STORAGE?  They need to be
 more specific.





Re: [HACKERS] Modifying TOAST thresholds

2007-04-05 Thread Luke Lonergan
Not just EDB :-)

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Chris Browne [mailto:[EMAIL PROTECTED]
Sent:   Wednesday, April 04, 2007 10:07 PM Eastern Standard Time
To: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Modifying TOAST thresholds

[EMAIL PROTECTED] (Tom Lane) writes:
 Bruce Momjian [EMAIL PROTECTED] writes:
 The big question is whether this is for 8.3 or 8.4.

 What I would definitely like to see for 8.3 is some performance testing
 done to determine whether we ought to change the current defaults.
 (Both TOAST_TUPLES_PER_PAGE and EXTERN_TUPLES_PER_PAGE ought to be looked
 at.)

 Whether it's possible to get the storage parameter in there depends on
 how soon someone produces a patch.  Given that we understand this area
 fairly well, I personally would be willing to give it a pass on the
 feature freeze rule, as long as we have the patch by say mid-April.

If I had until the end of April, I'd volunteer.  Alas, I will be unavailable
between tomorrow and mid-April.

It seems to me that changing the denominator from 4 to something
somewhat higher (16 comes to mind) would probably be the patch, and
would likely be useful for a fair number of cases.  I'd really like to
see something like this (e.g. - something that would start TOASTing
for, say, tuples > 500 bytes) in 8.3; this seems to me to have the
potential to be a significant optimization that would be VERY cheap to
implement.  Low hanging fruit, so to speak.
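
Rough arithmetic behind that suggestion (assuming the default 8KB block
size; the real macro also subtracts the page header first):

enum
{
    TOAST_THRESHOLD_DENOM_4  = 8192 / 4,   /* 2048 bytes: roughly today's threshold */
    TOAST_THRESHOLD_DENOM_16 = 8192 / 16   /*  512 bytes: tuples > ~500 bytes toast */
};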

EDB might like to use this as a route into a tunable column store
some time later; I don't think that should stop us from an improvement
today...
-- 
cbbrowne,@,linuxfinances.info
http://www3.sympatico.ca/cbbrowne/rdbms.html
MSDOS didn't get as bad as it  is overnight -- it took over ten years
of careful development.  -- [EMAIL PROTECTED]




Re: [HACKERS] Modifying TOAST thresholds

2007-04-03 Thread Luke Lonergan
Tom,

On 4/3/07 7:15 AM, Tom Lane [EMAIL PROTECTED] wrote:

 BTW, it strikes me that a GUC variable is quite the wrong way to go
 about this.  The right way is a table storage parameter, a la FILLFACTOR,
 so that it can be set on a per-table basis.  That would also give us a
 chance to fix my concern about needs_toast_table: the case where we
 might need a toast table that we didn't need before is where the toast
 threshold is lowered via ALTER TABLE SET, and we could reasonably make
 that command recheck the situation.

This also seems to support further development along the vertical
partitioning path.  Any thought on support of per-column TOAST tables?

- Luke





Re: [HACKERS] Oracle indemnifies PostgreSQL on its patents

2007-04-03 Thread Luke Lonergan
Josh,

On 3/31/07 11:01 AM, Joshua D. Drake [EMAIL PROTECTED] wrote:

 The PostgreSQL project should not give any credence to these
 announcements and should avoid all patent issues possible.

I think that's appropriate - the structure of the OIN looks like it's:
1) focused on Linux
2) designed to foster a patent pool of contributed patents

The license agreement isn't available online, so there's little to review.
What's more - the press release is vague enough that we're not sure if they
signed the same agreement everyone signed or if they made changes to the
agreement.  In any case, I'm not sure what this really means.

Perhaps someone from OIN can enlighten us.

- Luke 





Re: [HACKERS] Synchronized Scan benchmark results

2007-04-02 Thread Luke Lonergan
Jeff,

Your conclusions sound great - can you perhaps put the timings in a column
in your table so we can confirm them?

- Luke


On 4/2/07 4:14 PM, Jeff Davis [EMAIL PROTECTED] wrote:

 I posted some fairly detailed benchmark results for my Synchronized Scan
 patch and it's interactions with Simon Riggs' Recycle Buffers patch
 here:
 
 http://j-davis.com/postgresql/patch15-results.html
 
 The results are in the form of log files that contain lots of useful
 debugging info:
 
 * log_executor_stats is on (meaning it shows cache hit rate)
 * the pid, timestamp, and pagenumber being retrieved (for every 5k pages
 read)
 * the duration of each scan
 
 The results are very positive and quite conclusive.
 
 However, the sync_seqscan_offset aspect of my patch, which attempts to
 use pages that were cached before the scan began, did not show a lot of
 promise. That aspect of my patch may end up being cut.
 
 The primary aspect of my patch, the Synchronized Scanning, performed
 great though. Even the CFQ scheduler, that does not appear to properly
 read ahead, performed substantially better than plain 8.2.3. And even
 better, Simon's patch didn't seem to hurt Synchronized Scans at all.
 
 Out of the 36 runs I did, a couple appear anomalous. I will retest those
 soon.
 
 Note: I posted the versions of the patches that I used for the tests on
 the page above. The version of Simon's patch that I used did not apply
 cleanly to 8.2.3, but the only problem appeared to be in copy.c, so I
 went ahead with the tests. If this somehow compromised the patch, then
 let me know.
 
 Regards,
 Jeff Davis
 
 
 
 
 





Re: [HACKERS] Oracle indemnifies PostgreSQL on its patents

2007-03-31 Thread Luke Lonergan
Bruce,

This is big news - has anyone checked to see if the agreement for the OIN is
in perpetuity?  Or is their agreement to not pursue patents only for as long
as they are members?

- Luke


On 3/30/07 10:43 AM, Bruce Momjian [EMAIL PROTECTED] wrote:

 Here is a surprising article about how Oracle has made open source
 projects, like PostgreSQL, safe from claims of infringing Oracle
 patents:
 
 http://www.cbronline.com/article_news.asp?guid=A0F5F220-5940-470D-8564-CEA7E2D
 2B954.
 
 Oracle, like IBM, Sony, RedHat, and Novell, is now a member of the Open
 Invention Network, whose members all offer patent indemnification.





Re: [HACKERS] Oracle indemnifies PostgreSQL on its patents

2007-03-31 Thread Luke Lonergan
This may have the nice side effect of pushing 'possibly patented' technologies 
into the FOSS realm, but again I wonder what the duration/persistence of 
Oracle's commitment is.

I think I will ask our lawyers to review this.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Saturday, March 31, 2007 02:55 PM Eastern Standard Time
To: Alvaro Herrera
Cc: Luke Lonergan; Bruce Momjian; PostgreSQL-development
Subject:Re: [HACKERS] Oracle indemnifies PostgreSQL on its patents

Alvaro Herrera [EMAIL PROTECTED] writes:
 I would be worried if I were you (or Joshua Drake for that matter): does
 the agreement apply to commercial companies deriving products from
 PostgreSQL as well?

Interesting point.  It's doubtless unwise to take this press release as
being an accurate guide to the terms of the license, but what it says
is

: According to the terms of the OIN license, the components covered by
: the agreement include not only the Linux kernel and associated GNU
: applications, but also other open source projects included in Linux
: distributions. 

which to me says you're covered as long as your code is commonly
included in Linux distributions.  Hence, proprietary derivatives
would *not* be covered.  I'd guess that Oracle would have a hard
time suing for any patent violation embedded in the freely
distributed Postgres code, but any technique appearing only in
the proprietary extension would still be at risk.

IANAL, etc.  I assume that EDB and Greenplum will have their
lawyers scrutinizing this deal on Monday morning ;-) ... I'd
be interested to hear what the experts' conclusion is.

regards, tom lane



Re: [HACKERS] TOASTing smaller things

2007-03-22 Thread Luke Lonergan
I advocate the following:

- Enable specification of TOAST policy on a per column basis

As a first step, then:

- Enable vertical partitioning of tables using per-column specification of
storage policy.

- Luke


On 3/21/07 1:12 PM, Bruce Momjian [EMAIL PROTECTED] wrote:

 
 Is this a TODO?
 
 ---
 
 Jan Wieck wrote:
 On 3/21/2007 2:05 PM, Tom Lane wrote:
 Chris Browne [EMAIL PROTECTED] writes:
 #define TOAST_DENOMINATOR 17
    /* Use this as the divisor; current default behaviour falls from
       TOAST_DENOMINATOR = 4 */

 #define TOAST_TUPLE_THRESHOLD \
     MAXALIGN_DOWN((BLCKSZ - \
                    MAXALIGN(sizeof(PageHeaderData) + 3 * sizeof(ItemIdData))) \
                   / TOAST_DENOMINATOR)
 
 Given that you are quoting code that was demonstrably broken since the
 original coding of TOAST up till a month or two back, it passes
 regression is not adequate proof of it's right.  In fact I think
 it's not right; you have not got the roundoff condition straight.
 
 4.  A different mechanism would be to add a fifth storage column
 strategy (the present four are PLAIN, EXTENDED, EXTERNAL, MAIN), let's
 say, TOAST.
 
 FORCE_COMPRESSION, FORCE_EXTERNAL and FORCE_EXTERNAL_UNCOMPRESSED.
 
 
 Anything along this line would require invoking the toaster on every
 single tuple, since we'd always have to crawl through all the columns
 to see if toasting was supposed to happen.  No thanks.
 
 Not necessarily. A flag in Relation telling if the table has any column
 marked like that could be set while constructing the relcache entry.
 
 
 Which of these sounds preferable?
 
 It's a bit late in the cycle to be proposing any of these for 8.3.
 
 Certainly.
 
 
 Jan
 
 -- 
 #==#
 # It's easier to get forgiveness for being wrong than for being right. #
 # Let's break this rule - forgive me.  #
 #== [EMAIL PROTECTED] #
 





Re: [HACKERS] TOASTing smaller things

2007-03-22 Thread Luke Lonergan
Andreas,

On 3/22/07 9:40 AM, Andreas Pflug [EMAIL PROTECTED] wrote:

 Wouldn't it be enough to enable having the toast table on a different
 table space?

Yes, but the ultimate goal would allow the allocation of a storage mechanism
that is unique to each column.  The most frequently used mechanism for our
customers would likely be to have each column associated with its own
internal relation (like a separate TOAST table), which puts each column into
its own dense page storage.

Besides the advantages of separating out key columns from data columns,
compression and encryption approaches that use column packed data are much
more effective.

As Tom points out there are complications WRT update, access, etc that need
to be worked out to support this, but it's an important capability to have
IMO.

- Luke   





Re: [HACKERS] TOASTing smaller things

2007-03-22 Thread Luke Lonergan
Chris,

 Hmm.  Are you trying to do something sort of like CStore?
 
 http://db.csail.mit.edu/projects/cstore/
 
 That seems to have some relevant ideas...

I think "something like" is a good way to put it.  As you know Cstore was a
prototype for Vertica and these are in the same class as SybaseIQ and
SandDB.

The huge drawback of the pure column approach is that update/insert while
querying is difficult if not impossible.  I think there are hybrid approaches that
yield most, if not all of the benefits of the column store approach without
the disadvantages.

For instance, a bitmap index with index only access in a row-store may
outperform the column store on queries.  Note the index only access part.
The next advantage of a column store is deep compression, preserved through
the executor access path - we can do this with selective vertical
partitioning using a page-segmented WAH compression similar to what we did
with bitmap index.  Lastly, vectorization of the operators in the executor
can be implemented with vertical partitioning and an access path that does
projection before feeding the columns into the executor - this can be done
in Postgres with a cache-bypass method.  Some of this requires working out
answers to the visibility challenges inherent to MVCC, but it's all possible
IMO.

So - under the guise of TOASTing smaller things, it seems relevant to
think about vertical partitioning, perhaps making use of what's already in
Postgres as baby steps toward more advanced features.

- Luke   





Re: [HACKERS] TOASTing smaller things

2007-03-22 Thread Luke Lonergan
Hi Hannu,

On 3/22/07 3:21 PM, Hannu Krosing [EMAIL PROTECTED] wrote:

 Maybe monetdb ( http://monetdb.cwi.nl/ ) can be of some inspiration ?
 
  btw, it may be a good idea to have a version of bizgresMPP which has
  monetdb as partition db, if monetdb is as efficient as they tell it is.

Yep - I've talked this over with the MonetDB folks in the past.  The major
benefits they observe are those same things we talk about here,
implementation of long loops for operators and de-abstraction of operations
like compare() when appropriate, say comparing two INT columns in a sort.

We can get many of those benefits by vectorizing the executor of
PostgreSQL even without the full column partitioning.  We're in the midst of
working some of those changes as we speak.  Early indications are that we
see large performance gains from this approach.  Note that the actual
instruction counts per row don't change, but the more effective use of L2 I
and D cache and superscaler instruction units on the CPU create the big
gains.  The MonetDB people present some effective literature on this, but
the important gains mostly come from the vectorization, not the operator
de-abstraction IMO, which is good news for us all.

- Luke   





Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-17 Thread Luke Lonergan
Wow, nice!

Can you tell us:
- how big is the table
- cardinality of the column
- how big is the index in each case
- how much memory on the machine
- query and explain analyze

Thanks!

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Grzegorz Jaskiewicz [mailto:[EMAIL PROTECTED]
Sent:   Saturday, March 17, 2007 05:16 PM Eastern Standard Time
To: Joshua D.Drake
Cc: Heikki Linnakangas; PostgreSQL-development Hackers
Subject:Re: [HACKERS] [PATCHES] Bitmapscan changes

This is on dual ultra 2 sparc with ultrawide 320 scsi drives. 512MB
ram.
I had to drop the size of the DB, because the DB drive is 4GB (I do welcome
bigger drives as donation, if someone asks - UWscsi 320).

here are my results. With only 4.2 patch (no maintain cluster order  
v5 patch). If the v5 patch was needed, please tell me - I am going  
rerun it with.

hope it is usefull.

Repeat 3 times to ensure repeatable results.
Timing is on.
select_with_normal_index
--
10
(1 row)

Time: 1727891.334 ms
select_with_normal_index
--
10
(1 row)

Time: 1325561.252 ms
select_with_normal_index
--
10
(1 row)

Time: 1348530.100 ms
Timing is off.
And now run the same tests with clustered index
Timing is on.
select_with_clustered_index
-
   10
(1 row)

Time: 870246.856 ms
select_with_clustered_index
-
   10
(1 row)

Time: 477089.456 ms
select_with_clustered_index
-
   10
(1 row)

Time: 381880.965 ms
Timing is off.




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Luke Lonergan
Simon,

On 3/13/07 2:37 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 We're planning a modification that I think you should consider: when there
 is a sequential scan of a table larger than the size of shared_buffers, we
 are allowing the scan to write through the shared_buffers cache.
 
 Write? For which operations?

I'm actually just referring to the sequential scan writing into the shared
buffers cache, sorry for the write through :-)
  
 I was thinking to do this for bulk writes also, but it would require
 changes to bgwriter's cleaning sequence. Are you saying to write say ~32
 buffers then fsync them, rather than letting bgwriter do that? Then
 allow those buffers to be reused?

Off topic, but we think we just found the reason(s) for the abysmal heap
insert performance of pgsql and are working on a fix to that as well.  It
involves two main things: the ping-ponging seeks used to extend a relfile
and the bgwriter not flushing aggressively enough.  We're hoping to move the
net heap insert rate from 12MB/s for a single stream to something more like
100 MB/s per stream, but it may take a week to get some early results and
find out if we're on the right track.  We've been wrong on this before ;-)

- Luke   



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Synchronized Scan update

2007-03-12 Thread Luke Lonergan
Simon,

On 3/12/07 6:21 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 So based on those thoughts, sync_scan_offset should be fixed at 16,
 rather than being variable. In addition, ss_report_loc() should only
 report its position every 16 blocks, rather than do this every time,
 which will reduce overhead of this call.

And for N concurrent scans?

I think there is actually no need to synchronize the shared buffers at all
for synchronized scans.  The OS I/O cache will do that for us and we're just
going to interfere and pessimize by trying to divide up a local buffer.

I suggest that this be proven or disproved by running this test: measure the
performance of syncscan with the non-polluting buffer change, then measure
with Jeff's patch and non-polluting with multiple scans, then measure with
your suggested changes to synchronize the buffers.

Somewhere in that progression we'll learn more about how the multi-level
buffering really works.  I think we'll get all the shared I/O cache we need.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Luke Lonergan
Simon,

You may know we've built something similar and have seen similar gains.
We're planning a modification that I think you should consider: when there
is a sequential scan of a table larger than the size of shared_buffers, we
are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the
size of shared_buffers, it is cacheable and should use the standard LRU
approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote:
 On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
 
 With the default
 value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in
 pool,
 just like existing sequential scans. Is this intended?
 
 Yes, but its not very useful for testing to have done that. I'll do
 another version within the hour that sets N=0 (only) back to current
 behaviour for VACUUM.
 
 New test version enclosed, where scan_recycle_buffers = 0 doesn't change
 existing VACUUM behaviour.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Luke Lonergan
Simon,

What happens to the data when the function is dropped or replaced?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Friday, March 09, 2007 06:20 AM Eastern Standard Time
To: NikhilS
Cc: Shane Ambler; Luke Lonergan; Zeugswetter Andreas ADI SD; Peter 
Eisentraut; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Auto creation of Partitions

On Fri, 2007-03-09 at 11:48 +0530, NikhilS wrote:
 Hi, 
 
 On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote:
 
  Note to Nikhil: Make sure the new syntax doesn't prevent
 partitions from
  being placed upon multiple tablespaces in some manner, at
 CREATE TABLE
  time.
 
 What if the syntax was something like - 
 
 CREATE TABLE tabname (
  ...
  ...
   ) PARTITION BY
   HASH(expr)
 | RANGE(expr)
 | LIST(expr)
 [PARTITIONS num_partitions] /* will apply to HASH only for
 now*/
 [PARTITION partition_name CHECK(...) [USING TABLESPACE
 tblspcname], 
   PARTITION partition_name CHECK(...) [USING TABLESPACE
 tblspcname]
   ...
 ];
 
 
 And (if we use the ALTER TABLE to add partitions)
 
 ALTER TABLE tabname
 ADD PARTITION partition_name CHECK(...)
 [USING TABLESPACE tblspcname];
 
 
 
 We could as well drop the USING part. 

Why would we support HASH partitions?
If you did, the full syntax for hash clusters should be supported.

If we do the CHECK clauses like that then we still don't have a
guaranteed non-overlap between partitions. It would be easier to use
Oracle syntax and then construct the CHECK clauses from that.

Also, the syntax needs to be fairly complex to allow for a mixture of
modes, e.g. range and list partitioning. That is currently possible
today and the syntax for doing that is IMHO much simpler than the Oracle
simple way of specifying it.

An alternative is to provide a partitioning function which decides which
partition each values goes into.

PARTITION FUNCTION which_partition(date_col)

The partition function must return an unsigned integer > 0, which would
correspond to particular partitions. Partitions would be numbered 1..N,
and named tablename_partM where 1 = M = N.

The input and contents of the partition function would be up to the
user. e.g. 

CREATE FUNCTION range_partition(date date_col)
{
if (date_col  D1)
return 1;
else if (date_col  D2)
return 2;
else if (date_col  D3)
return 3;

return 4;
}

Doing it this way would allow us to easily join two tables based upon a
common partition function.

In time, I would suggest we support both ways: declarative and
functional.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-09 Thread Luke Lonergan
Cool!

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Friday, March 09, 2007 02:32 PM Eastern Standard Time
To: Luke Lonergan; ITAGAKI Takahiro
Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; 
PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
 Incidentally, we tried triggering NTA (L2 cache bypass)
 unconditionally and in various patterns and did not see the
 substantial gain that we get from reducing the working set size.
 
 My conclusion: Fixing the OS is not sufficient to alleviate the issue.
 We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
 rates due to this effect.
 
I've implemented buffer recycling, as previously described, patch being
posted now to -patches as scan_recycle_buffers.

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
SET scan_recycle_buffers = N
tested with 1,4,8,16, but only  8 seems sensible, IMHO.

Patch effects StrategyGetBuffer, so only effects the disk-cache path.
The idea is that if its already in shared buffer cache then we get
substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.

IMHO its possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Luke Lonergan
Adding to this:

Ayush recently wrote a C program that emulates PG IO to do this analysis, and 
we came out with (predictably) a ratio of sequential/random of 20-50 (for a 
single user).  This is predictable because the random component is fixed at the 
access time of a single hard drive no matter how many disks are in an array, 
while the sequential scales nearly linearly with the number of drives in the 
array.

So, you can estimate random using 8-12ms per random access, and sequential as 
the bytes transferred divided by the aggregate bandwidth (number of disks X 
60-130MB/s per disk).
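
Until Ayush digs his version out, here's a rough standalone sketch of the
same kind of measurement (not his program - just something to point at a file
much larger than RAM so the cache doesn't hide the seeks):

/*
 * Rough sketch: time 8KB sequential reads vs 8KB reads at random offsets.
 * Build with: gcc -O2 iotest.c -o iotest ; run: ./iotest /path/to/bigfile
 */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>

#define BLCKSZ 8192
#define NREADS 1000

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    struct stat st;
    int         fd, i;
    off_t       nblocks;
    double      t, seq, rnd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;
    fstat(fd, &st);
    nblocks = st.st_size / BLCKSZ;

    t = now();
    for (i = 0; i < NREADS; i++)                    /* sequential pass */
        pread(fd, buf, BLCKSZ, (off_t) i * BLCKSZ);
    seq = (now() - t) / NREADS;

    srandom(12345);
    t = now();
    for (i = 0; i < NREADS; i++)                    /* random pass */
        pread(fd, buf, BLCKSZ, (random() % nblocks) * BLCKSZ);
    rnd = (now() - t) / NREADS;

    printf("seq: %.3f ms/page  random: %.3f ms/page  ratio: %.1f\n",
           seq * 1000, rnd * 1000, rnd / seq);
    close(fd);
    return 0;
}

On a single drive you should see a ratio in the 20-50 range mentioned above;
on a wide array the sequential side improves with the number of spindles
while the random side stays put.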

Ayush, can you forward your C program?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Gregory Stark [mailto:[EMAIL PROTECTED]
Sent:   Thursday, March 08, 2007 12:37 PM Eastern Standard Time
To: Tom Lane
Cc: Umar Farooq Minhas; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch


Tom Lane [EMAIL PROTECTED] writes:

 Umar Farooq Minhas [EMAIL PROTECTED] writes:
 How can we accrately estimate the seq_page_fetch and =
 random_page_fetch costs from outside the postgres using for example a =
 C routine.

 Use a test case larger than memory.  Repeat many times to average out
 noise.  IIRC, when I did the experiments that led to the current
 random_page_cost of 4.0, it took about a week before I had numbers I
 trusted.

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said, it might be useful as an additional
tool though.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq



Re: [HACKERS] Grouped Index Tuples / Clustered Indexes

2007-03-07 Thread Luke Lonergan
+1


On 3/7/07 6:53 AM, Grzegorz Jaskiewicz [EMAIL PROTECTED] wrote:

 my only question would be.
 Why isn't that in core already ?
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
 



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Auto creation of Partitions

2007-03-07 Thread Luke Lonergan
Simon,

On 3/7/07 5:26 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 What is really needed is a data structure that allows range partitions
 to be accessed more efficiently. This could make adding partitions and
 deciding in which partition a specific value goes an O(logN) operation.

I think we need to re-evaluate the inheritance mechanism for partitioning
and implement something much closer to the storage layer, similar to Oracle.

Having the constraint exclusion occur in the planner is not flexible enough
to allow more advanced solutions.

- Luke 



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Time-correlated columns in large tables

2007-03-07 Thread Luke Lonergan
Jeroen,

On 3/5/07 12:39 PM, Jeroen T. Vermeulen [EMAIL PROTECTED] wrote:

 I guess if you did simple run-length compression on these bitmaps you'd
 end up more or less where I came in.  But you wouldn't want to flip a bit
 somewhere in the middle of a compressed data stream, of course. :-

We handle that by doing an in-page recompression if possible, page splitting
if not.

Jie/Gavin's work will initially be an equality encoded bitmap as Heikki
indicates, soon after we can implement range encoding, etc.

- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-07 Thread Luke Lonergan
Yep - we're paying attention Josh!

I like the category being explored with skyline, but I'm not sure yet how it fits 
with existing 'soft data' models and applications that use them.

If SKYLINE is interesting to app developers, maybe we should consider it for 
Bizgres?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Gavin Sherry [mailto:[EMAIL PROTECTED]
Sent:   Wednesday, March 07, 2007 05:44 PM Eastern Standard Time
To: Josh Berkus
Cc: Tom Lane; pgsql-hackers@postgresql.org; Alvaro Herrera; Chris Browne
Subject:Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

On Wed, 7 Mar 2007, Josh Berkus wrote:

 Approximate queries is something that DSS users *want*.  Jim Gray addressed
 this in his ACM editorial on the databases of the future.  It's something
 that *I* want, and if the Greenplum people aren't speaking up here, it's
 because they're not paying attention.

 Now, I don't know if this Skyline patch is our answer for approximate queries.
 Maybe I should pester Meredith about getting QBE free of its IP issues; it
 certainly looked more flexible than Skyline.  In either case, the code
 probably needs a complete refactor.

What people want from approximate queries is different to this: the
desire is usually to balance run time with level of accuracy/quality (some
times the desire is to have accurate results as well as similar results).
Neither skyline or QBE are about this. The only thing in the spec which
addresses this is 'tablesample'.

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match



Re: [HACKERS] Auto creation of Partitions

2007-03-07 Thread Luke Lonergan
Andreas,

On 3/7/07 11:45 AM, Zeugswetter Andreas ADI SD [EMAIL PROTECTED]
wrote:

 Whoa, do you have anything to back that up ?

Sure - when we start to consider designs that implement advanced data
management features, we run into problems with the architecture of
tables-tables-tables.  Here are some examples:
1 - people think of partitions as a logical building block for tables, they
would like to move partitions around underneath a table without the table
definition being involved.  In the current implementation, there are
explicit linkages between the table definition and the child tables -
imagine an ALTER TABLE foo_parent ADD COLUMN and how it would need to
cascade to 1,000 child tables and you get the beginning of it - this
connection should not exist.

2 - INSERT/UPDATE/DELETE processing through the SQL rewrite layer (rules) is
terribly slow and gets slower as you add more partitions.  If done closer to
the storage layer, this can be done in ways that use access methods shared
with other storage entities, e.g. Indices, and the code path would flow more
naturally.

3 - Parallel query can be accomplished more easily by separating scans
across relations split among tablespaces.  This is more natural than trying
to parallelize APPEND nodes within existing plans

 You would need to elaborate what you actually mean, but I think it is
 moot.
 Sure, the constraint technique can be further extended (e.g. during
 runtime), but imho the approach is very good.

Well, it's being used and that's good, but it needs to be better IMO and I
think that before we go too far down the current path we should consider the
alternatives more carefully.

- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Luke Lonergan
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in 
various patterns and did not see the substantial gain that we get from reducing 
the working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue.  We see 
a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this 
effect.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Sherry Moore [mailto:[EMAIL PROTECTED]
Sent:   Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; 
Gavin Sherry; PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM its still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the
  user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
  to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 So either way, it isn't in processor cache after the read.  
 So how can there be any performance benefit?

It's the copy from kernel IO cache to the buffer cache that is L2
sensitive.  When the shared buffer cache is polluted, it thrashes the L2
cache.  When the set of pages being written to by the kernel-to-user-space
copies fits in L2, the L2 lines are written through (see the link below on
page 264 for the write combining features of the Opteron, for example) and
the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

- Luke


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

 Now this may only prove that the disk subsystem on this 
 machine is too cheap to let the system show any CPU-related 
 issues. 

Try it with a warm IO cache.  As I posted before, we see double the
performance of a VACUUM from a table in IO cache when the shared buffer
cache isn't being polluted.  The speed with large buffer cache should be
about 450 MB/s and the speed with a buffer cache smaller than L2 should
be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when
trying to reach very high IO speeds, the issue is that we're seeing the
buffer cache pollution in the first place.  When we instrument the
blocks selected by the buffer page selection algorithm, we see that they
iterate sequentially, filling the shared buffer cache.  That's the
source of the problem here.

Do we have a regression test somewhere for this?

- Luke


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 
Hi Tom,

 Even granting that your conclusions are accurate, we are not 
 in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug.  When
there is a sequential scan the blocks are filling the entire shared
buffer cache.  This should be fixed.

My proposal for a fix: ensure that when relations larger (much larger?)
than buffer cache are scanned, they are mapped to a single page in the
shared buffer cache.

- Luke


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 

  The Postgres shared buffer cache algorithm appears to have a bug.  
  When there is a sequential scan the blocks are filling the entire 
  shared buffer cache.  This should be fixed.
 
 No, this is not a bug; it is operating as designed.  The 
 point of the current bufmgr algorithm is to replace the page 
 least recently used, and that's what it's doing.

At least we've established that for certain.
 
 If you want to lobby for changing the algorithm, then you 
 need to explain why one test case on one platform justifies 
 de-optimizing for a lot of other cases.  In almost any 
 concurrent-access situation I think that what you are 
 suggesting would be a dead loss --- for instance we might as 
 well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.
 
 In any case, I'm still not convinced that you've identified 
 the problem correctly, because your explanation makes no 
 sense to me.  How can the processor's L2 cache improve access 
 to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an
L2 related effect.  The actual data shows a dramatic reduction in main
memory writing when the destination of the written data fits in the L2
cache.

I'll try to fit a hypothesis to explain it.  Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this:
0) select a destination user buffer
1) uiomove-kcopy memory from the IO cache to the user buffer
1A) step 1: read from kernel space
1B) step 2: write to user space
2) the user buffer is accessed many times by the executor nodes above
Repeat

There are two situations we are evaluating: one where the addresses of
the user buffer are scattered over a space larger than the size of L2
(caseA) and one where they are confined to the size of L2 (caseB).  Note
that we could also consider another situation where the addresses are
scattered over a space smaller than the TLB entries mapped by the L2
cache (512 max) and larger than the size of L2, but we've tried that and
it proved uninteresting.

For both cases step 1A is the same: each 8KB block copied in (1) is read
from the IO cache into 128 L2 lines (64B each), evicting the previous data
there.

In step 1B for caseA the destination for the writes is mostly an address
not currently mapped into L2 cache, so 128 victim L2 lines are found
(LRU), stored into, and writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2
already.  The 128 L2 lines are stored into, and the write to main memory
is delayed under the assumption that these lines are hot as they were
already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
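
A user-space toy that mimics the two cases (made-up sizes, not pgsql code)
behaves the same way for us - copying 8KB chunks into a destination region
that fits in L2 versus one scattered over hundreds of MB:

/*
 * Toy model of the caseA/caseB hypothesis above: memcpy 8KB chunks from a
 * large source either into destinations spread over a 256MB region (caseA)
 * or into a 1MB region that fits in L2 (caseB).
 * Build with: gcc -O2 l2test.c -o l2test
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BLCKSZ    8192
#define NBLOCKS   200000                 /* ~1.6GB copied in total */
#define BIGDEST   (256 * 1024 * 1024)    /* caseA: 256MB of destinations */
#define SMALLDEST (1024 * 1024)          /* caseB: 1MB, fits in L2 */

static double run(char *src, char *dst, size_t dstsize)
{
    struct timeval t0, t1;
    size_t i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < NBLOCKS; i++)
        memcpy(dst + (i * BLCKSZ) % dstsize, src + (i % 4096) * BLCKSZ, BLCKSZ);
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    char  *src   = malloc((size_t) 4096 * BLCKSZ);  /* 32MB source, like IO cache pages */
    char  *big   = malloc(BIGDEST);
    char  *small = malloc(SMALLDEST);
    double ta, tb;

    memset(src, 1, (size_t) 4096 * BLCKSZ);
    memset(big, 0, BIGDEST);
    memset(small, 0, SMALLDEST);

    ta = run(src, big, BIGDEST);
    tb = run(src, small, SMALLDEST);
    printf("caseA (scattered dest): %.2f s, caseB (small dest): %.2f s\n", ta, tb);

    free(src); free(big); free(small);
    return 0;
}

If the hypothesis is right, caseB should run noticeably faster even though
the same number of bytes is copied.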

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Mark,

 lineitem has 1535724 pages (11997 MB)
 
 Shared Buffers  Elapsed  IO rate (from vmstat)
 --------------  -------  ---------------------
 400MB           101 s    122 MB/s
 2MB             100 s
 1MB              97 s
 768KB            93 s
 512KB            86 s
 256KB            77 s
 128KB            74 s    166 MB/s
 
 I've added the observed IO rate for the two extreme cases 
 (the rest can be pretty much deduced via interpolation).
 
 Note that the system will do about 220 MB/s with the now 
 (in)famous dd test, so we have a bit of headroom (not too bad 
 for a PIII).

What's really interesting: try this with a table that fits into I/O
cache (say half your system memory), and run VACUUM on the table.  That
way the effect will stand out more dramatically.

- Luke


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...
 
 So I'm back to asking what we're really measuring here.  Buffer manager
 inefficiency of some sort, but what?  Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:
  cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four).  In the case without buffer cache pollution, below is the trace of L2
fills.  In the non-pollution case we fill 27 million lines; in the pollution
case we fill 44 million lines.

VACUUM orders (no buffer pollution):
 51.006   1  tick   2754293
 51.006   2  tick   3159565
 51.006   3  tick   2971929
 51.007   0  tick   3577487
 52.006   1  tick   4214179
 52.006   3  tick   3650193
 52.006   2  tick   3905828
 52.007   0  tick   3465261
 53.006   1  tick   1818766
 53.006   3  tick   1546018
 53.006   2  tick   1709385
 53.007   0  tick   1483371

And here is the case with buffer pollution:
VACUUM orders (with buffer pollution)
 22.006   0  tick   1576114
 22.006   1  tick   1542604
 22.006   2  tick   1987366
 22.006   3  tick   1784567
 23.006   3  tick   2706059
 23.006   2  tick   2362048
 23.006   0  tick   2190719
 23.006   1  tick   2088827
 24.006   0  tick   2247473
 24.006   1  tick   2153850
 24.006   2  tick   2422730
 24.006   3  tick   2758795
 25.006   0  tick   2419436
 25.006   1  tick   2229602
 25.006   2  tick   2619333
 25.006   3  tick   2712332
 26.006   1  tick   1827923
 26.006   2  tick   1886556
 26.006   3  tick   2909746
 26.006   0  tick   1467164



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
======================================
248              5.52           2.46
368              4.77           2.40
552              5.82           2.40
824              6.20           2.43
1232             5.60           3.59
1848             6.02           3.14
2768             5.53           4.56

All of these were run three times and the *lowest* time reported.  Also, the
behavior of fast VACUUM after SELECT begins abruptly at 1232KB of
shared_buffers.

These are Opterons with 2MB of L2 cache shared between two cores.

- Luke




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Here's four more points on the curve - I'd use a dirac delta function for
your curve fit ;-)

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
======================================
248              5.52           2.46
368              4.77           2.40
552              5.82           2.40
824              6.20           2.43
1232             5.60           3.59
1848             6.02           3.14
2768             5.53           4.56
5536             6.05           3.95
8304             5.80           4.37
12456            5.86           4.12
18680            5.83           4.10
28016            6.11           4.46

WRT what you found on the selection algorithm, it might also explain the L2
effects I think.

I'm also still of the opinion that polluting the shared buffer cache for a
seq scan does not make sense.

- Luke

On 3/5/07 10:21 AM, Luke Lonergan [EMAIL PROTECTED] wrote:

 Tom,
 
 On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:
 
 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...
 
 Here's that cliff you were looking for:
 
 Size of Orders table: 7178MB
 Blocksize: 8KB
 
 Shared_buffers   Select Count   Vacuum
 (KB)             (s)            (s)
 ======================================
 248              5.52           2.46
 368              4.77           2.40
 552              5.82           2.40
 824              6.20           2.43
 1232             5.60           3.59
 1848             6.02           3.14
 2768             5.53           4.56
 
 All of these were run three times and the *lowest* time reported.  Also, the
 behavior of fast VACUUM after SELECT begins abruptly at 1232KB of
 shared_buffers.
 
 These are Opterons with 2MB of L2 cache shared between two cores.
 
 - Luke
 
 
 
 
 ---(end of broadcast)---
 TIP 6: explain analyze is your friend
 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
This sounds like a good idea.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Monday, March 05, 2007 02:37 PM Eastern Standard Time
To: Josh Berkus; Tom Lane; Pavan Deolasee; Mark Kirkwood; Gavin Sherry; 
Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Cc: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
 Tom,
 
  I seem to recall that we've previously discussed the idea of letting the
  clock sweep decrement the usage_count before testing for 0, so that a
  buffer could be reused on the first sweep after it was initially used,
  but that we rejected it as being a bad idea.  But at least with large
  shared_buffers it doesn't sound like such a bad idea.

 Note, though, that the current algorithm is working very, very well for OLTP 
 benchmarks, so we'd want to be careful not to gain performance in one area at 
 the expense of another. 

Agreed.

What we should also add to the analysis is that this effect only occurs
when only uniform workloads are present, like SeqScan, VACUUM or COPY.
When you have lots of indexed access the scan workloads don't have as
much effect on the cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).


- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Monday, March 05, 2007 07:43 PM Eastern Standard Time
To: Mark Kirkwood
Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; 
Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Mark Kirkwood [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But what I wanted to see was the curve of
 elapsed time vs shared_buffers?
 ...
 Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers

 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
--
 1024
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
GLOBAL_POWER_E...|
  samples|  %|
--
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26368 13.7984 /home/tgl/testversion/bin/postgres
12765  6.6799 /libata
 2238  1.1711 /lib64/libc-2.4.so
 1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
GLOBAL_POWER_E...|
  samples|  %|
--
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26284 16.3070 /home/tgl/testversion/bin/postgres
12004  7.4475 /libata
 2093  1.2985 /lib64/libc-2.4.so
  996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
samples  %symbol name
5777940.9591  copy_user_generic
1817512.8841  __delay
3994  2.8313  _raw_spin_lock
2388  1.6928  put_page
2184  1.5482  mwait_idle
2083  1.4766  _raw_write_unlock
1909  1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
samples  %symbol name
3743733.0783  copy_user_generic
1789115.8080  __delay
3372  2.9794  _raw_spin_lock
2218  1.9598  mwait_idle
2067  1.8263  _raw_write_unlock
1837  1.6231  _raw_write_lock
1531  1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom,

On 3/5/07 7:58 PM, Tom Lane [EMAIL PROTECTED] wrote:

 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.

Here's the C wrapper routine in Solaris:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/move.c

Here's the x86 assembler routine for Solaris:
  
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s

The actual uiomove routine is a simple wrapper that calls the assembler
kcopy or xcopyout routines.  There are two versions (for Opteron), one that
uses the NTA instructions that bypass the L2 cache on writing to avoid L2
cache pollution, and the second writes normally - through the L2 cache.
Which one is used depends on a parameter (global) based on the size of the
I/O. It is tuned to identify operations that might pollute the L2 cache
(sound familiar?) 
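
For anyone who wants to poke at the same mechanism from user space, here's a
toy (hedged: this uses the SSE2 _mm_stream_si128 intrinsic rather than the
Solaris assembler, and the sizes are invented) that copies blocks with normal
stores versus non-temporal stores:

/*
 * Toy comparison of normal stores vs non-temporal (cache-bypassing) stores.
 * Build with: gcc -O2 -msse2 nta.c -o nta
 */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <emmintrin.h>

#define BLCKSZ  8192
#define NBLOCKS 8192            /* 64MB destination region */
#define PASSES  20

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* copy one block with non-temporal stores (the NTA-style path) */
static void stream_copy(char *dst, const char *src)
{
    int i;

    for (i = 0; i < BLCKSZ; i += 16)
        _mm_stream_si128((__m128i *) (dst + i),
                         _mm_load_si128((const __m128i *) (src + i)));
    _mm_sfence();               /* make the streamed stores visible */
}

int main(void)
{
    char  *src, *dst;
    double t;
    int    p, b;

    posix_memalign((void **) &src, 16, BLCKSZ);
    posix_memalign((void **) &dst, 16, (size_t) NBLOCKS * BLCKSZ);
    memset(src, 1, BLCKSZ);

    t = now();
    for (p = 0; p < PASSES; p++)
        for (b = 0; b < NBLOCKS; b++)
            memcpy(dst + (size_t) b * BLCKSZ, src, BLCKSZ);   /* through the cache */
    printf("normal stores: %.2f s\n", now() - t);

    t = now();
    for (p = 0; p < PASSES; p++)
        for (b = 0; b < NBLOCKS; b++)
            stream_copy(dst + (size_t) b * BLCKSZ, src);      /* bypassing the cache */
    printf("stream stores: %.2f s\n", now() - t);

    free(src);
    free(dst);
    return 0;
}

The streamed version writes around the cache the way the NTA path does, so it
avoids evicting whatever the executor had in L2, at the price of not being
able to reuse the copied data from cache.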

I think what we're seeing is a generic artifact of the write-through
behavior of the cache.  I wouldn't expect this to get any better with
DIRECTIO to the shared_buffers in pgsql - if we iterate over a large number
of user space buffers we'll still hit the increased L2 thrashing.

I think we're best off with a hybrid approach - when we detect a seq scan
larger (much larger?) than buffer cache, we can switch into the cache
bypass behavior, much like the above code uses the NTA instruction when
appropriate.

We can handle syncscan using a small buffer space.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not scan
resistant as advertised.  A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where bigtable is a table 2x the size of RAM and
memtable is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation        memtable    bigtable
--------------------------------------
SELECT COUNT(*)  1221 MB/s   973 MB/s
VACUUM           1709 MB/s   1206 MB/s

We had observed that VACUUM would perform better when done right after a
SELECT.  In the above example, the faster rate from disk was 1608 MB/s,
compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well.  The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache.  Doug found that we weren't using the Postgres buffer cache the way
we expected; instead we were loading the scanned data from disk into the
cache even though there was no possibility of reusing it.  In addition to
pushing other, possibly useful pages from the cache, it has the additional
behavior of invalidating the L2 cache for the remainder of the executor path
that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the
shared buffer size to fit into L2 cache (1MB per Opteron core), and this is
what we saw (drop size of shared buffers to 680KB):

Operation        memtable    bigtable
--------------------------------------
SELECT COUNT(*)  1320 MB/s   1059 MB/s
VACUUM           3033 MB/s   1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm?  Ideally,
one page would be used in the buffer cache in circumstances where the table
to be scanned is (significantly?) larger than the size of the buffer cache.
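
As a strawman for the discussion, here's a crude model (plain LRU over a toy
pool, not the real clock sweep; all the sizes are invented) that counts how
often a re-used working set gets knocked out by a one-pass scan, with and
without recycling a single slot for the scan:

/*
 * Crude toy model: a small LRU buffer pool servicing a mix of re-used
 * "hot" pages and a one-pass seqscan.  Build with: gcc -O2 pollute.c -o pollute
 */
#include <stdio.h>
#include <string.h>

#define NBUF          1000      /* buffer pool slots */
#define NHOT           500      /* hot pages that are re-referenced */
#define NITER        20000      /* iterations of the mixed workload */
#define SCANS_PER_HOT    4      /* scan pages touched per hot reference */

static long page[NBUF];         /* which page each slot holds (-1 = empty) */
static long stamp[NBUF];        /* last-use time for LRU */
static long clockval;

static int touch(long pg)       /* returns 1 on a miss */
{
    int i, victim = 0;

    clockval++;
    for (i = 0; i < NBUF; i++)
        if (page[i] == pg) { stamp[i] = clockval; return 0; }
    for (i = 1; i < NBUF; i++)
        if (stamp[i] < stamp[victim]) victim = i;
    page[victim] = pg;
    stamp[victim] = clockval;
    return 1;
}

static long run(int recycle_scan)
{
    long hot_misses = 0, scanpg = 1000000;
    int  i, j;

    memset(page, -1, sizeof(page));
    memset(stamp, 0, sizeof(stamp));
    clockval = 0;
    for (i = 0; i < NITER; i++)
    {
        hot_misses += touch(i % NHOT);          /* hot working set */
        for (j = 0; j < SCANS_PER_HOT; j++, scanpg++)
        {
            if (recycle_scan)
            {
                page[0] = scanpg;               /* reuse one dedicated slot */
                stamp[0] = ++clockval;
            }
            else
                touch(scanpg);                  /* scan page goes through LRU */
        }
    }
    return hot_misses;
}

int main(void)
{
    printf("hot-page misses, scan through pool: %ld\n", run(0));
    printf("hot-page misses, scan recycled:     %ld\n", run(1));
    return 0;
}

In the model, routing the scan through the pool turns nearly every
working-set reference into a miss, while recycling one slot leaves the
working set resident - which is the behavior we'd like for relations much
larger than shared_buffers.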
 
- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
When we instrument the page selections made within the buffer cache, they are 
sequential and span the entire address space of the cache.

With respect to whether it's L2, it's a conclusion based on the experimental 
results.  It's not the TLB, as we also tested for the 512 entries for each L2.

One thing I left out of the previous post: the difference between fast and slow 
behavior was that in the fast case, the buffer selection alternated between two 
buffer pages.  This was the case only when the preceding statement was a SELECT 
and the statement was VACUUM.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Luke Lonergan [EMAIL PROTECTED] writes:
 The issue is summarized like this: the buffer cache in PGSQL is not scan
 resistant as advertised.

Sure it is.  As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache.  That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway?  I wonder whether your
numbers are explained by some other consideration than you think.

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
One more thing: the L2 is invalidated when re-written from the kernel IO cache, 
but the pages addressed in L2 retain their values when 'written through', which 
allows the new data to be re-used up the executor chain.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Luke Lonergan [EMAIL PROTECTED] writes:
 The issue is summarized like this: the buffer cache in PGSQL is not scan
 resistant as advertised.

Sure it is.  As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache.  That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway?  I wonder whether your
numbers are explained by some other consideration than you think.

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
Gavin, Mark,

 Could you demonstrate that point by showing us timings for 
 shared_buffers sizes from 512K up to, say, 2 MB? The two 
 numbers you give there might just have to do with managing a 
 large buffer.

I suggest two experiments that we've already done:
1) increase shared buffers to double the L2 cache size, you should see
that the behavior reverts to the slow performance and is constant at
larger sizes

2) instrument the calls to BufferGetPage() (a macro) and note that the
buffer block numbers returned increase sequentially during scans of
tables larger than the buffer size

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] doxygen.postgresql.org

2007-02-06 Thread Luke Lonergan
Yay!

This rocks IMO, but I'm a borderline PHB so what do I know ;-)

- Luke


On 2/6/07 9:19 AM, Magnus Hagander [EMAIL PROTECTED] wrote:

 http://doxygen.postgresql.org is now set up for your browsing pleasure.
 It's synced to anoncvs once per day.
 
 //Magnus
 
 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at
 
 http://www.postgresql.org/about/donate
 



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Block B-tree etc. (was Re: Introducing an advanced Frequent

2006-11-07 Thread Luke Lonergan
Heikki,

On 11/7/06 1:51 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:

 I don't think Simon's proposal is meant to address that issue, but did
 you follow the thread I started in September about Block B-Tree index:
 http://archives.postgresql.org/pgsql-hackers/2006-09/msg02041.php

Aha - mystery solved - thanks for the clarification.

  Block Btree ~= Lossy Btree

- Luke



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Introducing an advanced Frequent Update

2006-11-06 Thread Luke Lonergan
Simon,

Bring it on!  We at GP have been evaluating various approaches to index
organized tables which unify index with heap storage to solve some of
the problems you mention.  Split index and heap is a big issue in
Postgres and we'd all welcome a good solution to it, even for limited
circumstances like single index organization or the like.

- Luke 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Simon Riggs
 Sent: Monday, November 06, 2006 1:51 PM
 To: pgsql-hackers@postgresql.org
 Subject: [HACKERS] Introducing an advanced Frequent Update 
 Optimization
 
 EnterpriseDB has been running a research project to improve 
 the performance of heavily updated tables. We have a number 
 of approaches prototyped and we'd like to discuss the best of 
 these now on -hackers for community input and patch 
 submission to PostgreSQL core.
 
 The most important step with any proposal is to agree that we 
 have an issue that needs improvement, discuss how widespread 
 that issue is and find some clear test cases that show up the 
 problems. Tests are:
 
 1. pgbench reveals performance that will degrade over a long period.
 
 2. DBT-2 reveals performance that will degrade over a long 
 period. Many tests over a 2 hour period don't fully show 
 this, especially when the test is carefully tuned.
 
 3. Some common scenarios in applications are where some rows 
 of a table are hot from being constantly updated, while 
 others are not. An example of such a test case is the 
 truckin' test, included here. It's based directly on a 
 specific customer application, but its been generalised to 
 make sure the underlying design pattern is clear.
 
 These tests reveal the following issues, all of which are known:
 
 - update performs inserts into indexes, as well as into heap blocks
 
 - VACUUM can remove heap blocks easily, but performs much 
 worse on indexes, making VACUUM a less good solution. We have 
 now been able to speed up index VACUUM, but this require us 
 to scan the whole index for correct locking. VACUUM scans the 
 whole table, whereas dead rows may well be localised. 
 Heap-needs-vacuum-bitmap has been proposed here, but no 
 solution currently exists for vacuuming only parts of indexes 
 and so proposals for concurrent vacuums are now being considered.
 
 - indexes that have been stretched apart by updates do not 
 ever coalesce again and require regular REINDEX, which is not 
 yet possible concurrently; the contention caused by this 
 would be catastrophic for performance, even if anybody knew 
 of a way to do this concurrently.
 
 - There are specific issues with the optimizer's ability to 
 understand dead row numbers, which can in some cases lead to 
 SeqScan plans that are inappropriate when tables grow because 
 of updates. This is a red-herring that can lead to people 
 thinking the situation is worse than it is; that needs 
 fixing, but the core issues mentioned above remain.
 
 To alleviate these problems we've added features such as WITH 
 fillfactor for heaps and table-level autovacuum tuning. 
 Tuning all of these features to good effect is an art form 
 that is beyond the reasonable for most users. Many internal 
 optimizations have been made in this area and as can be seen, 
 many are still required to achieve better performance.
 
 The proposal about to be made takes a more radical approach 
 and re-examines the architecture of the heap, to allow us to 
 consider much faster designs for heavy UPDATEs. Although 
 initially radical, the proposal appears to be fully MVCC 
 correct, crash safe as well as being much faster under heavy 
 updates, while approximately neutral in other cases with no 
 major downsides.
 
 Why should we care? The UPDATE case has obvious use-cases in 
 a business design pattern I'll call CustomerAccountDebit 
 which is pervasive in pay-per-use websites, banks, telephone 
 companies, road traffic monitoring etc etc. It's also 
 pervasive in Data Warehousing where summary 
 tables/materialized views are regularly updated to maintain a 
 current picture of spending, movements or any other 
 accumulation of event detail. It's everywhere, basically.
 
 Your various viewpoints on the above are welcome, but 
 assuming for the moment that you agree so far, we can move 
 towards the proposal...
 
 These discussions will likely be lengthy if taken seriously 
 and need to cover a range of different topics to ensure we 
 cover what we know and ensure we listen to all the feedback 
 everybody gives. To that end, I'd like to introduce two 
 colleagues of mine to the community, Pavan Deolasee and 
 Nikhil Sontakke who have been working hard on developing the 
 prototypes and measuring/tuning them respectively. 
 
 I would stress that we are not bringing our first prototype 
 to the table, but actually design #5. We think you'll be 
 interested, but we won't take that for granted.
 
 Our next steps will be to
 
 - discuss various other approaches 

Re: [HACKERS] Replication documentation addition

2006-10-24 Thread Luke Lonergan
Bruce, 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian
 Sent: Tuesday, October 24, 2006 5:16 PM
 To: Hannu Krosing
 Cc: PostgreSQL-documentation; PostgreSQL-development
 Subject: Re: [HACKERS] Replication documentation addition
 
 
 OK, I have updated the URL.  Please let me know how you like it.

There's a typo on line 8, first paragraph:

perhaps with only one server allowing write rwork together at the same
time.

Also, consider this wording of the last description:

Single-Query Clustering...

Replaced by:

Shared Nothing Clustering
---

This allows multiple servers with separate disks to work together on
each query.
In shared nothing clusters, the work of answering each query is
distributed among
the servers to increase the performance through parallelism.  These
systems will
typically feature high availability by using other forms of replication
internally.

While there are no open source options for this type of clustering,
there are several
commercial products available that implement this approach, making
PostgreSQL achieve
very high performance for multi-Terabyte business intelligence
databases.

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Asynchronous I/O Support

2006-10-15 Thread Luke Lonergan
Martijn,

On 10/15/06 10:56 AM, Martijn van Oosterhout kleptog@svana.org wrote:

 Have enough systems actually got to the point of actually supporting
 async I/O that it's worth implementing?

I think there are enough high end applications / systems that need it at
this point.

The killer use-case we've identified is for the scattered I/O associated
with index + heap scans in Postgres.  If we can issue ~5-15 I/Os in advance
when the TIDs are widely separated it has the potential to increase the I/O
speed by the number of disks in the tablespace being scanned.  At this
point, that pattern will only use one disk.

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] continuing daily testing of dbt2 against

2006-10-10 Thread Luke Lonergan
One of our customers noticed that there was a high number of NUMA cache
misses on a quad core Opteron system running Bizgres MPP, resulting in about
a 15% performance hit.  We use a process-based parallelization approach and
we can guess that there's context switching due to the high degree of
pipeline parallelism in our execution plans.  Each context switch likely
switches a process away from the CPU with local memory, resulting in the
NUMA cache misses.

The answer for us is to bind each process to a CPU.  Might that help in
running DBT-2?
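
For reference, the binding itself is a one-liner on Linux - a minimal sketch
(the CPU choice here is a toy policy; in our case it would come from the
parallel worker's slot id):

/*
 * Minimal sketch of pinning the current process to one CPU.
 * Build with: gcc -O2 bind.c -o bind
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

int main(void)
{
    cpu_set_t mask;
    int cpu = getpid() % 4;      /* toy policy: spread workers over 4 cores */

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)   /* 0 = this process */
    {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to CPU %d\n", (int) getpid(), cpu);
    /* ... the worker would do its share of the query plan here ... */
    return 0;
}

(taskset can do the same thing from outside the postmaster.)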

- Luke


On 10/10/06 9:40 AM, Mark Wong [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 +1
 
 Mark, can you quantify the impact of not running with IRQ balancing enabled?
 
 Whoops, look like performance was due more to enabling the
 --enable-thread-safe flag.
 
 IRQ balancing on : 7086.75
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/158/
 IRQ balancing off: 7057.90
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/163/
 
 The interrupt charts look completely different.  There's too much stuff
 on the chart to determine what interrupts are from what though. :(  It
 needs to be redone per processor (as opposed to per interrupt per
 processor) to be more useful in determining if one processor is
 overloaded due to interrupts.
 
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/158/report/sar/sar-intr.png
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/163/report/sar/sar-intr.png
 
 But the sum of all the interrupts handled are close between tests so it
 seems clear no single processor was overloaded:
 
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/158/report/sar/sar-intr_s.png
 http://dbt.osdl.org/dbt/dbt2dev/results/dev4-015/163/report/sar/sar-intr_s.png
 
 Mark
 



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] continuing daily testing of dbt2 against

2006-10-09 Thread Luke Lonergan
+1

Mark, can you quantify the impact of not running with IRQ balancing enabled?


- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
Original message contents unavailable


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Query Failed, out of memory

2006-10-05 Thread Luke Lonergan
Create table as select ... Order by ...

Copy to ...


- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Andrew Dunstan [mailto:[EMAIL PROTECTED]
Sent:   Thursday, October 05, 2006 12:51 PM Eastern Standard Time
To: Tom Lane
Cc: Mark Woodward; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Query Failed, out of memory

Tom Lane wrote:
 Mark Woodward [EMAIL PROTECTED] writes:
   
 psql -p 5435 -U pgsql -t -A -c "select client, item, rating, day from
 ratings order by client" netflix > netflix.txt
 

 FWIW, there's a feature in CVS HEAD to instruct psql to try to use a
 cursor to break up huge query results like this.  For the moment I'd
 suggest using COPY instead.

   
   

but COPY doesn't guarantee any order.

BTW, I just this morning discovered the hard way that our linux boxes 
didn't have strict memory allocation turned on, and then went and set 
it. I'd advise Mark to do the same, if he hasn't already.

cheers

andrew


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Query Failed, out of memory

2006-10-05 Thread Luke Lonergan
:-D

Is that in the release notes?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Neil Conway [mailto:[EMAIL PROTECTED]
Sent:   Thursday, October 05, 2006 02:35 PM Eastern Standard Time
To: Luke Lonergan
Cc: Andrew Dunstan; Tom Lane; Mark Woodward; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Query Failed, out of memory

On Thu, 2006-10-05 at 12:52 -0400, Luke Lonergan wrote:
 Create table as select ... Order by ...
 
 Copy to ...

Or in 8.2, COPY TO (SELECT ... ORDER BY) (My, that's a neat feature.)

-Neil





---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Luke Lonergan
Mark,

On 10/4/06 1:43 PM, Mark Woodward [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED]:~/netflix$ time psql netflix -c "select count(*) from ratings"
count
 ---
  100480507
 (1 row)
 
 
 real2m6.270s
 user0m0.004s
 sys 0m0.005s

I think you are getting about 40 MB/s on your sequential scan of about 5 GB
of heap data in this case - roughly 4.6 GB in 126 seconds is about 37 MB/s.
I calculate the size of the data as:

3 Integers (12 bytes), one text date field (10 bytes ?) and tuple overhead
(24 bytes) = 46 bytes per row

100 million rows x 46 bytes / row = 4.6 Gbytes

- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] PG qsort vs. Solaris

2006-10-03 Thread Luke Lonergan
+1

- Luke


On 10/3/06 2:58 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

 Tom Lane wrote:
 Neil Conway [EMAIL PROTECTED] writes:
 Given the time that has been spent working around
 the braindamaged behavior of qsort() on various platforms, I would be
 more inclined to *always* use our qsort() instead of the platform's
 version.
 
 I've been heard to argue against that in the past, but I'm beginning to
 see the merit of the idea.  One good reason for doing it is that we
 could stop worrying about the possibility of large-scale memory leaks
 due to erroring out of glibc's qsort --- in particular it would be OK
 to add CHECK_FOR_INTERRUPTS into the comparison callback as was
 requested recently.
 
 
 I think this is a great idea - having predictable sort performance on
 all platforms makes a lot of sense.
 
 Cheers
 
 Mark
 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq
 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Faster StrNCpy

2006-10-02 Thread Luke Lonergan
Mark,

On 9/29/06 2:59 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Here are results over 64 Mbytes of memory, to ensure that every call is
 a cache miss:

On my Mac OS X Intel laptop (Core Duo, 2.16 GHz, 2 GB RAM, gcc 4.0.1):

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99
-DSTRING='This is a very long sentence that is expected to be very slow.'
-DN=(1024*1024) -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//cc1eMVq7.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So)
definition of _strlcpy
NONE:416677 us
MEMCPY: 5649587 us
STRNCPY:5806591 us
STRLCPY:   12865010 us
LENCPY:17801485 us
Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99
-DSTRING='Short sentence.' -DN=(1024*1024) -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccOZl9R6.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So)
definition of _strlcpy
NONE:416652 us
MEMCPY: 5830540 us
STRNCPY:6207594 us
STRLCPY:5582607 us
LENCPY: 7887703 us
Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99
-DSTRING='' -DN=(1024*1024) -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccBUsIdR.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So)
definition of _strlcpy
NONE:417210 us
MEMCPY: 5506346 us
STRNCPY:5769302 us
STRLCPY:5424234 us
LENCPY: 5609338 us
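
For context, the harness is roughly of this shape - a reconstruction, not the
actual x.c/y.c used above, timing only the strncpy() variant; STRING and N
arrive via -D just as in the gcc command lines:

/*
 * Reconstruction for illustration only.  Copies STRING N times into a
 * 64 MB arena so most destinations are cold in cache, then reports the
 * elapsed time for strncpy().
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#ifndef STRING
#define STRING "Short sentence."
#endif
#ifndef N
#define N (1024 * 1024)
#endif
#define BUFLEN 64

int main(void)
{
    size_t  nbuf = (64UL * 1024 * 1024) / BUFLEN;   /* 64 MB of targets */
    char   *arena = malloc(nbuf * BUFLEN);
    struct timeval t0, t1;

    if (arena == NULL)
        return 1;

    gettimeofday(&t0, NULL);
    for (size_t i = 0; i < (size_t) N; i++)
        strncpy(arena + (i % nbuf) * BUFLEN, STRING, BUFLEN);
    gettimeofday(&t1, NULL);

    /* print a byte of the arena so the copies can't be optimized away */
    printf("STRNCPY: %10ld us (%c)\n",
           (long) (t1.tv_sec - t0.tv_sec) * 1000000L +
           (long) (t1.tv_usec - t0.tv_usec),
           arena[0]);

    free(arena);
    return 0;
}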


- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] PostgreSQL 8.2beta1 w/ VALUES

2006-09-30 Thread Luke Lonergan
That seems right - there won't be a performance impact unless the warnings
are actually issued.
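
The pattern Tom proposes below amounts to something like this - a generic,
self-contained sketch of checking a cancel flag after each emitted message,
not the actual elog.c change:

/*
 * Sketch only: a signal handler sets a flag, and the message-emitting
 * path checks it after every notice so a cancel request is honoured
 * promptly even when warnings are being emitted per row.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static volatile sig_atomic_t cancel_pending = 0;

static void handle_sigint(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

int main(void)
{
    signal(SIGINT, handle_sigint);

    for (long row = 0; row < 1000000L; row++)
    {
        /* pretend every row triggers a warning, as in the report below */
        fprintf(stderr, "WARNING: nonstandard use of escape (row %ld)\n", row);

        /* the proposed fix: look for a pending cancel after each message */
        if (cancel_pending)
        {
            fprintf(stderr, "canceling on user request\n");
            exit(1);
        }
    }
    return 0;
}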

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Saturday, September 30, 2006 01:48 PM Eastern Standard Time
To: Stephen Frost
Cc: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] PostgreSQL 8.2beta1 w/ VALUES

Stephen Frost [EMAIL PROTECTED] writes:
   When loading a rather large data set I started getting errors along
   these lines:
 psql:/home/sfrost/school/cs750/reality/dump-anonymized.postgres.sql:262:
 WARNING:  nonstandard use of escape in a string literal
 LINE 1: ...XX ,9:9:999'),(9,'',0,'X XXX...
  ^
 HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.

   Which, by themselves, aren't really an issue *except* for the fact
   that I got an *insane* number of them.  I don't think it was quite one
   for every row (of which there were 20,795, you'll note) but it was
   more than enough to drive me insane.  Additionally, cancel requests
   were ignored.

That's not too surprising because I don't believe there are any
CHECK_FOR_INTERRUPTS calls in the basic lex/parse loop.  That wouldn't
normally be a problem because that phase is pretty quick, but it is a
problem if the system is spitting tons of messages at you.

It seems like a reasonable thing to do would be to add a
CHECK_FOR_INTERRUPTS in elog.c just after sending a notice/warning
message to the client.

Comments?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
Though this may be the kiss of death, I favor a 64 bit float version of money.  
It's more terse than numeric and a *lot* faster when performing numeric 
operations because it would use a cpu intrinsic operand.


- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   D'Arcy J.M. Cain [mailto:[EMAIL PROTECTED]
Sent:   Thursday, September 28, 2006 11:02 AM Eastern Standard Time
To: D'Arcy J.M. Cain
Cc: pgsql-hackers@postgreSQL.org
Subject:Re: [HACKERS] New version of money type

On Thu, 14 Sep 2006 10:35:03 -0400
D'Arcy J.M. Cain darcy@druid.net wrote:
 For years I have been promising that a 64 bit version of the money type
 was on the way.  Here it is.  So far it compiles and I have done some
 basic testing on it and it seems to work fine.  Note that the currency
 symbol is also dropped on output as well but it is accepted on input.
 
 darcy=# select '$92,233,720,368,547,758.07'::money;
money
 
   92,233,720,368,547,758.07

There has been plenty of discussion back and forth but still no ruling
from core.  Is money out in the next release, in which case I can convert
this to a contrib module, or will this improvement be accepted for the
next release?

-- 
D'Arcy J.M. Cain darcy@druid.net |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
Oic - so it's a floating point in an 8 byte int.  That probably limits the 
speed benefits, no?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   D'Arcy J.M. Cain [mailto:[EMAIL PROTECTED]
Sent:   Thursday, September 28, 2006 11:14 AM Eastern Standard Time
To: Luke Lonergan
Cc: pgsql-hackers@postgreSQL.org
Subject:Re: [HACKERS] New version of money type

On Thu, 28 Sep 2006 11:09:17 -0400
Luke Lonergan [EMAIL PROTECTED] wrote:
 Though this may be the kiss of death, I favor a 64 bit float version of 
 money.  It's more terse than numeric and a

I assume you mean ...64 bit INT version...

-- 
D'Arcy J.M. Cain darcy@druid.net |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
D'Arcy,

On 9/28/06 8:43 AM, D'Arcy J.M. Cain darcy@druid.net wrote:

 On Thu, 28 Sep 2006 10:35:01 -0500
 Jim C. Nasby [EMAIL PROTECTED] wrote:
 Floating point math and hard-earned money are two things that don't mix
 well. :)
 
 Using FP to track money is a good way to stop making any.  :-)

Understood - a cent here and there in rounding can add up, as the earliest
hackers found out.

What I meant was about the point that floats: is there an exponent stored
as part of the format?  If so, then the arithmetic is not done with plain
integer operands, right?

Which routines implement the money arithmetic?

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
D'Arcy,

On 9/28/06 9:00 AM, Luke Lonergan [EMAIL PROTECTED] wrote:

 Which routines implement the money arithmetic?

Ok - so now, having read the old documentation, the routines in
src/backend/utils/adt/cash.c and the type definition for Cash in
src/include/utils/cash.h, I can see that it's:

- Fixed point at NNN.MM
- Stored as an INT32 (or your proposed INT64)
- Operations use native operands (=+*/)

The disappointing thing is that there is always a function call involved in
any arithmetic operation.  An even larger benefit could probably be gained
by inlining the routines in cash.c, which is probably inhibited by the
FUNCTIONCALLxxx indirections in the executor for operators (different
topic).

So, the NUMERIC arithmetic must be really slow to get 10% improvements in
computational speed.

Based on all of this, I know I would use the 64 bit money type for things
like the TPC-H benchmark...
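
For concreteness, the arithmetic being compared against Numeric is
essentially this - a toy two-decimal fixed-point sketch on int64 (positive
amounts only, no overflow checks), not the actual cash.c code:

/*
 * Toy sketch of fixed-point money on a 64-bit integer with two decimal
 * places: every operation is a native CPU integer op.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

typedef int64_t Cash;                   /* amount in hundredths (cents) */

static Cash cash_add(Cash a, Cash b)
{
    return a + b;                       /* one native add */
}

/* multiply by an integer percentage, rounding to the nearest cent */
static Cash cash_mul_pct(Cash a, Cash pct)
{
    return (a * pct + 50) / 100;
}

static void cash_print(Cash c)
{
    printf("%" PRId64 ".%02d\n", c / 100, (int) (c % 100));
}

int main(void)
{
    Cash price = 1999;                  /* $19.99 */
    Cash tax = cash_mul_pct(price, 8);  /* 8% -> 160 cents */

    cash_print(cash_add(price, tax));   /* prints 21.59 */
    return 0;
}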

- Luke 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
Stephen,

On 9/28/06 9:44 AM, Stephen Frost [EMAIL PROTECTED] wrote:

 I'm not sure about 'money' in general but these claims of great
 performance improvments over numeric just don't fly so easily with me.
 numeric isn't all *that* much slower than regular old integer in the
 tests that I've done.

Part of the problem is the *size* of Numeric.  I've just looked for
something that describes the size of a Numeric and I saw an old post that
says:
  10 + x/2 bytes

So, a minimum of 10 bytes (compared to the 8 proposed for money64) plus
scale (x) divided by two.

Currently on the TPC-H benchmark, Postgres requires 1.7 times as much
internal database storage as the ASCII data file representation.  Oracle and
MSFT SQL Server are almost 1:1.  Part of this fluff is the 24 bytes of tuple
header; part of it is in the Numeric.

- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
D'Arcy,

On 9/28/06 10:12 AM, D'Arcy J.M. Cain darcy@druid.net wrote:

 Numeric has been shown to be as good or better than money in I/O
 operations.

What exactly does that mean in the context of a Datum: I/O operations?

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
Martijn,

On 9/28/06 11:53 AM, Martijn van Oosterhout kleptog@svana.org wrote:

 Converting to/from text format for when dealing with client
 applications. Numeric can convert faster than plain integers sometimes.
 Numeric isn't that slow really...

Got it - so the performance benefits of the fixed point versus Numeric are:

- Smaller size of fixed point (less than half)
- Faster arithmetic operations

These should be quantified, so that we can evaluate Money64 as a proposal
and to understand Numeric better.

- Luke 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] New version of money type

2006-09-28 Thread Luke Lonergan
Martijn,

On 9/28/06 12:42 PM, Martijn van Oosterhout kleptog@svana.org wrote:

 - Only supports one currency (dollars)

What are the manifestations of this?

 - Only supports one scale (yen has no decimal normally, but what if you
 want to track hundredths of a dollar-cent?)

So, without a quantified benefit, this is certainly a non-starter.

- Luke 



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] horo(r)logy test fail on solaris (again and

2006-09-26 Thread Luke Lonergan
I suspect '-fast' introduced arithmetic-associativity transformations that
the horology test is sensitive to.  I've seen this in the past.

The solution I used was to mod the Makefile to exclude the sensitive routines 
from the aggressive optimizations.  As I recall, adt.c was the prime culprit.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Bruce Momjian [mailto:[EMAIL PROTECTED]
Sent:   Tuesday, September 26, 2006 11:51 AM Eastern Standard Time
To: Zdenek Kotala
Cc: pgsql-hackers@postgresql.org; Tom Lane; [EMAIL PROTECTED]
Subject:Re: [HACKERS] horo(r)logy test fail on solaris (again and

Zdenek Kotala wrote:
 I tried the regression tests with the Postgres beta and the horology test failed.  See
 the attached log.  The same failure appeared a few months ago - see
 http://archives.postgresql.org/pgsql-ports/2006-06/msg4.php
 I used Sun Studio 11 with the -fast flag on the SPARC platform.

Are you looking for ways to contort Solaris to make PostgreSQL fail? 
That doesn't prove much about PostgreSQL, but rather about Solaris.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] horo(r)logy test fail on solaris (again and

2006-09-26 Thread Luke Lonergan
Tom,

On 9/26/06 9:15 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Andrew Dunstan [EMAIL PROTECTED] writes:
 !  | @ 6 years | @ 5 years 12 mons 5 days 6 hours
 
 Doesn't this look odd regardless of what bad results come back from the
 FP library?
 
 It looks exactly like the sort of platform-dependent rounding issue that
 Bruce and Michael Glaesemann spent a lot of time on recently.  It might
 be interesting to see if CVS HEAD works any better under these
 conditions ... but if it doesn't, that doesn't mean I'll be interested
 in fixing it.  Getting the float datetime code to work is hard enough
 without having a compiler that thinks it can take shortcuts.

How about fixing the compilation so that the routines in adt that are
sensitive to FP optimizations are isolated from aggressive optimization?

- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bitmap index status

2006-09-25 Thread Luke Lonergan
Mark,

On 9/25/06 11:32 AM, Mark Wong [EMAIL PROTECTED] wrote:

 Yeah, basically gather as many stats as I can to accurately profile the
 overall system performance.  I thought it would be appropriate to use a
 TPC-H based workload as one measuring stick to use for bitmap indexes.

Note that the TPC-H queries don't follow the typical good use case for
bitmap indexes.  You'd like to see queries that use multiple AND and OR
clauses, otherwise there may be no benefit.

Also, DBT-3/TPC-H on Postgres right now does not benefit from indexes
overall.  The planner has limitations in its selectivity estimates, among
other things, that cause it to choose index access poorly for this query
workload.  We have two new features coming (for 8.3) that fix this, but for
now we find that indexes are a net loss - in some queries a huge loss.

If you look at the whitepaper that Ayush Parashar published, he uses the
TPC-H data with some targeted queries that showcase the best use-cases for
bitmap indexes.

- Luke 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] PostgreSQL 8.2beta1 w/ VALUES

2006-09-23 Thread Luke Lonergan
Josh,

 Anyhow, don't know if there's really a good solution but it'd be nice
 to only get one warning, or one of a given type, or something, and to

 Except that one warning would not be accurate, because the warning is per
 tuple. How is postgresql going to know that the warning applies to the same
 set of data but just a different tuple?

If it's going to roll back the entire load after that one warning, it
should terminate there.

This is a common problem with OLAP loads and, based on the observation here,
it needs to be fixed.  Not being able to cancel out at this point is even
worse: can you imagine the frustration of trying to load 10GB of data and
having to wait until the end after seeing these warnings, while knowing that
you're just going to have to try again anyway?

Eventually we'll implement single row error handling, but even then
there should be a selectable behavior to terminate the load on the first
warning/error.

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pg_upgrade: downgradebility

2006-09-21 Thread Luke Lonergan
We would help if we did it in an open pgFoundry project.

Bizgres?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Josh Berkus [mailto:[EMAIL PROTECTED]
Sent:   Thursday, September 21, 2006 11:08 PM Eastern Standard Time
To: Jonah H. Harris
Cc: pgsql-hackers@postgresql.org; Andrew Sullivan
Subject:Re: [HACKERS] pg_upgrade: downgradebility

Jonah,

 I don't think I've seen anyone actually propose a design and state
 they were going to work on it (at least until 8.3)... so it's funny
 that I'm trying to follow the new, keep the community informed
 policy and yet everyone is still not happy enough.

Ooops, didn't mean to start a which-hunt.  To be fair, I e-mailed Denis 
about the project before it came up here, and he didn't respond (still 
hasn't).

The current pgFoundry project is fine if this is going to be some 
proprietary code which EDB is going to drop on the community.  However, 
that runs a significant risk of major design issues which will cause the 
project to never be used, and to be replaced with something else.  Even 
the name of the pgFoundry project, as proposed, pretty much prohibits 
serious contributions from anyone else: it's EnterpriseDB Migrator, 
not pg_upgrade.

So I think the thing to decide is whether this is EDB freeware or a real 
  community OSS effort.  I think that we'll be happy to have either, but 
right now it's in some grey area.

--Josh Berkus

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly





Re: [HACKERS] New job

2006-09-11 Thread Luke Lonergan
Congrats Jim!

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Jim C. Nasby [mailto:[EMAIL PROTECTED]
Sent:   Monday, September 11, 2006 02:12 PM Eastern Standard Time
To: pgsql-hackers@postgresql.org
Subject:[HACKERS] New job

While I'm sitting here in New Jersey in a room with Bruce Momjian (aka
Rock Star), I figured now would be a good time to announce my new
employment. I'll be doing sales support/engineering from Austin.
--
Jim Nasby   [EMAIL PROTECTED]
EnterpriseDBhttp://enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Sort performance

2006-09-01 Thread Luke Lonergan
What version of pgsql?

Recent changes stripped the sort set down considerably in size for external
sorts; I'm not sure the same is done if the data doesn't spill to disk.

- Luke

Sent by GoodLink (www.good.com)


 -Original Message-
From:   Gregory Stark [mailto:[EMAIL PROTECTED]
Sent:   Friday, September 01, 2006 11:03 AM Eastern Standard Time
To: pgsql-hackers
Subject:[HACKERS] Sort performance


I'm not sure if this is good news or bad news. Either some kudos are due to
the gang that worked on the external sort performance or something's very
wrong with the qsort implementation in glibc because I'm seeing Postgres's
external sort perform better than qsort.

This is despite Postgres external sorts having to execute filesystem calls
pushing buffers back and forth between user-space and kernel-space, which
seems hard to believe. I feel like something's got to be pretty far wrong with
the qsort call here for this to be possible.

At first I chalked this up to qsort having O(n^2) behaviour occasionally, but
a) this is glibc, where qsort is actually mergesort, which should behave pretty
similarly to Postgres's mergesort, and b) the input data is randomized pretty
well, so it really ought not to be a problem even were it qsort.

Mem     Runs    Time

1MB     18      8.25s
10MB    3       5.6s
100MB   qsort   6.1s

The input is a table with one column, a text field. It contains
/usr/share/dict/words ordered by random() and then repeated a bunch of times.
(Sorry about the imprecision, I set this table up a while ago and don't
remember exactly what I did).

The machine has plenty of RAM and isn't swapping or running any other
services.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] PostgreSQL on 64 bit Linux

2006-08-20 Thread Luke Lonergan
Naz,

On 8/20/06 12:59 PM, Naz Gassiep [EMAIL PROTECTED] wrote:

 I have a PostgreSQL installation on a Debian box that had the 64bit SMP
 kernel installed before PostgreSQL was compiled and installed on it.
 Does PostgreSQL take any advantage of the 64 bit environment or have we
 not done anything to move into the 64 bit world yet?

Very likely the default gcc compiles for 64-bit; if not, you need to specify
-m64.  As another respondent said, run file `which initdb` to find out
whether you have compiled for 64-bit or not.

WRT 64-bit and Postgres, whether you see a direct performance benefit
depends on the CPU.  On the Opteron you will see a benefit on CPU-bound
work: the additional registers available in 64-bit mode are used by the
compiler to produce a 20-30% boost in performance.  On the Xeon in 64-bit
mode, the same regions of execution will slow down by about 5%.

Postgres benefits automatically from the larger memory addressing of the
64-bit kernel by using the larger I/O cache of Linux.

- Luke 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] PostgreSQL on 64 bit Linux

2006-08-20 Thread Luke Lonergan
Josh,

On 8/20/06 8:52 PM, Joshua D. Drake [EMAIL PROTECTED] wrote:

 Is that true of even Woodcrest?
 
 Joshua D. Drake

Not sure - haven't read anything about the register set on the Core 2 to
make me think it benefits from 64 bit.

The point may be academic from now on though - the comparisons between
Opteron and Core 2 will all likely be in 64-bit mode from now on.

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] OTRS

2006-08-19 Thread Luke Lonergan
Peter,

On 8/19/06 5:37 PM, Peter Eisentraut [EMAIL PROTECTED] wrote:

 OTRS was recommended to me as a bug tracker.  Has anyone used that?

We use OTRS for customer issue tracking, but we use Jira from Atlassian for
our internal bug tracking.  We also use Confluence and Fisheye from
Atlassian, and they are all excellent tools IMO.

OTRS has been OK and fairly malleable.

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] new job

2006-08-09 Thread Luke Lonergan
+1

- Luke


On 8/9/06 11:48 AM, David Fetter [EMAIL PROTECTED] wrote:

 On Wed, Aug 09, 2006 at 01:58:14PM -0400, Jonah H. Harris wrote:
 On 8/9/06, Devrim GUNDUZ [EMAIL PROTECTED] wrote:
 Congratulations for your new job.
 
 Seconded!
 
 We now have a quorum. ;)
 
 I vote yes!
 
 Cheers,
 D



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] 8.2 features status

2006-08-06 Thread Luke Lonergan
Greg,

 As I've stated before, it sure would be nice if there was any 
 possible way this could be done for 8.2. This would be a 
 *huge* feature for
 8.2 to have, and it frankly needs all the 
 big-item-yet-easy-to-grasp features it can get. Is there any 
 way this could be done if we threw money and/or people at the 
 problem? I'm sure we could come up with both if the end goal 
 was having built-in text searching for the next release.

Maybe you've said this before, but why is the current regexp support not
good enough?  What kind of response time, etc, do you need?

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 8.2 features status

2006-08-04 Thread Luke Lonergan
+1

UPDATE/DELETE for CE are a big deal - I really wish we had INSERT too, so
we could claim complete support for partitioning, but this is still a major
improvement.

- Luke  


On 8/3/06 9:30 PM, Gavin Sherry [EMAIL PROTECTED] wrote:

 A lot of the things on Tom's list are new bits of functionality to things
 added around 8.0 and 8.1 (major enhancements to the usability of
 constraint exclusion, for example). 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 8.2 features status

2006-08-04 Thread Luke Lonergan
David,

On 8/3/06 11:02 PM, David Fetter [EMAIL PROTECTED] wrote:

 * Splitting queries among CPUs--possibly even among machines--for OLAP
   loads
 
 * In-place upgrades (pg_upgrade)
 
 * Several varieties of replication, which I believe we as a project
   will eventually endorse and ship
 
 * CALL
 
 * WITH RECURSIVE
 
 * MERGE
 
 * Windowing functions
 
 * On-the-fly in-line calls out to PL/your_choice without needing to
   issue DDL

My ordering of this list in terms of priority is:

1) Windowing functions
2) MERGE
3) Index only access (new)
4) In-place upgrades

We already have splitting queries among CPUs and machines.

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] 8.2 features status

2006-08-04 Thread Luke Lonergan
Josh,

On 8/4/06 7:47 AM, Joshua D. Drake [EMAIL PROTECTED] wrote:

 3) Index only access (new)
 
 Does this mean, I have hit the index and have the actual tuple data in
 the index row? So I don't have to go back to the relation to get the info?

Yep.  Fix the visibility issue - there are a number of good ideas on how to
do it, and we are in a position to bang it out now IMO.

 We already have splitting queries among CPUs and machines.
 
 Yes, YOU do. We don't.

Details, details - redefine We and it's the same thing.

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 8.2 features status

2006-08-04 Thread Luke Lonergan
Cool!  Then let's make a stronger claim about the feature: 'data management
using partitioning, now with transparent insert/update/delete support, in
addition to the performance acceleration already proven in previous releases.'


- Luke

Sent from my GoodLink synchronized handheld (www.good.com)


 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Friday, August 04, 2006 12:39 PM Eastern Standard Time
To: Luke Lonergan
Cc: Gavin Sherry; Bruce Momjian; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] 8.2 features status

Luke Lonergan [EMAIL PROTECTED] writes:
 UPDATE/DELETE for CE are a big deal - I really wish we had INSERT too,

Huh?  We had INSERT working before, that's why it's not mentioned.

regards, tom lane



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Hash indexes (was: On-disk bitmap index patch)

2006-08-01 Thread Luke Lonergan
Jim,

On 7/28/06 12:27 PM, Jim C. Nasby [EMAIL PROTECTED] wrote:

 In that case, perhaps this is something Greenplum might be interested
 in, since it might fit nicely between bitmap and btree indexes.

I'm certainly following the thread.

We have talked about both hash- and btree-organized tables here, but haven't
gotten far enough along to evaluate what's already there in pg.

Looks like this thread has nicely characterized the problems with what's
there.

WRT hashing - we use the FNV hash, which is a very nice, very fast modern
hash algorithm.  We would contribute that if we worked on this.
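
For reference, FNV-1a is only a few lines.  A sketch of the 32-bit variant
(whether we'd contribute the 32- or 64-bit form, or FNV-1 vs FNV-1a, is an
open question):

/*
 * FNV-1a, 32-bit variant, for illustration.  The constants are the
 * published FNV offset basis and prime.
 */
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    hash = 2166136261u;     /* FNV offset basis */

    for (size_t i = 0; i < len; i++)
    {
        hash ^= p[i];
        hash *= 16777619u;              /* FNV prime */
    }
    return hash;
}

int main(void)
{
    const char *key = "hash organized tables";

    printf("%" PRIu32 "\n", fnv1a_32(key, strlen(key)));
    return 0;
}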

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] On-disk bitmap index patch

2006-07-29 Thread Luke Lonergan
Bruce,

On 7/29/06 6:31 AM, Bruce Momjian [EMAIL PROTECTED] wrote:

 Right.  People need a patch to test on their workloads, and analysis can
 be done after feature freeze.

Fair enough.

- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] On-disk bitmap index patch

2006-07-28 Thread Luke Lonergan
Jim,

On 7/28/06 10:17 AM, Jim C. Nasby [EMAIL PROTECTED] wrote:

 If the usefulness of bitmap indexes is still in doubt, could someone at
 Greenplum provide data from actual data warehouses from actual
 customers?

First, is anyone in doubt?

- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] On-disk bitmap index patch

2006-07-28 Thread Luke Lonergan
Bruce,

On 7/28/06 1:25 PM, Bruce Momjian [EMAIL PROTECTED] wrote:

 What we don't want to happen is for us to release bitmapped indexes, and
 find out later that btree is better in all cases.  Then we have to tell
 people not to use bitmapped indexes until we fix it in the next major
 releasse.  FYI, that is  basically where we are right now with hash
 indexes.

On this thread people have presented results that show clear and irrefutable
evidence that there are use cases where bitmap indexes outperform Btree for
many datatypes on realistic problems, including the TPC-H benchmark.

In many cases the bitmap indexes outperform btree by a factor of 50, are a
tiny fraction of the size, and take dramatically less time to build.

Someone needs to specifically address the cases that have been presented and
point out why they aren't proof of bitmap index performance.  So far this has
not been done; rather, there are some unsupported opinions about other cases
that might be problematic.

- Luke



---(end of broadcast)---
TIP 6: explain analyze is your friend

