Re: [HACKERS] Development with Eclipse - Wrong error messages in IDE

2016-02-05 Thread Jason Petersen
> On Feb 3, 2016, at 2:38 AM, Peter Moser <> wrote:
> Does anyone had similar problems? Do I have to configure Eclipse to 
> understand the PG_RMGR macro or is there another possibility to teach Eclipse 
> these macros?

I just built 9.6 under Eclipse CDT to try this out and was able to open e.g. 
heapam.c without any error markers.

I added PostgreSQL as a “Makefile Project with Existing Code” after running 
./configure from the command-line. After that, I built the project from within 
Eclipse by adding the ‘all’ make target and running it.

One setting I usually change: right-click the project, pick Properties, then 
drill down through C/C++ General -> Preprocessor Include Paths. In the Provider 
pane, there is an entry for “CDT GCC Build Output Parser”. I’m not sure if this 
is strictly necessary, but I set my “Container to keep discovered entries” 
setting to “File”.

Basically, Eclipse scans the make output for -I flags, then notes all the 
includes used to build each file, so the static analyzer, etc. can have more 
accurate information (it is crucial that the “Compiler command pattern” in this 
window be a regex that will match the compiler binary you use, so if you have 
/usr/local/bin/gcc, and “gcc” is the pattern, you are in for trouble).

After running the build, Eclipse should now know what includes are used for 
each file and stop whining. If it ever seems to have problems, you can kick it 
by running a clean target, then all, then picking “Project -> C/C++ Index -> 
Rebuild” (I think).

Jason Petersen
Software Engineer | Citus Data

Description: Message signed with OpenPGP using GPGMail

[HACKERS] Query Deparsing Support

2015-05-14 Thread Jason Petersen
The DDL deparse support that just landed looks impressive, but I’ve needed 
query deparsing for some time now. Within pg_shard we took the fast-and-dirty approach of 
merging in a modified ruleutils.c, though I’d (obviously) like to get away from 

Many of ruleutils.c’s functions already have a public interface; would there be 
any objection to exposing get_query_def to provide a way to turn a Query back 
into an SQL string?

A more structured approach analogous to the new DDL stuff would make my life 
easier, but simply having the Query-to-SQL function would be a huge improvement 
for now.

I trawled through the archives but couldn’t find any discussion of why this 
function was kept static.

Jason Petersen
Software Engineer | Citus Data

Description: Message signed with OpenPGP using GPGMail

[HACKERS] BuildTupleFromCStrings Memory Documentation?

2015-04-30 Thread Jason Petersen
Within the core codebase, BuildTupleFromCStrings is often called within a temporary memory context cleared after the call. In dblink.c, this is justified as being needed to “[clean up] not only the data we have direct access to, but anycruft the I/O functions might leak”.I wrote a pretty minimal case to call BuildTupleFromCStrings in a loop (attached) and found that I was using 40GB of RAM in a few minutes, though I was not allocating any memory myself and immediately freed the tuple it returned.Is the need to wrap this call in a protective context documented anywhere? Portions of the documentation useBuildTupleFromCStrings in examples without mentioning this precaution. Is it just well-known, or did I miss a README or comment somewhere?
--Jason PetersenSoftware Engineer | Citus

Description: Binary data

Description: Message signed with OpenPGP using GPGMail

[HACKERS] Clarification of FDW API Documentation

2014-06-13 Thread Jason Petersen
I've been deep in the FDW APIs lately and have come up with a couple of
questions about the [user-facing documentation][1].

# Requirement that DELETE use junk columns

The bit I'm confused by is the parenthetical in this bit at the end of the
section on `AddForeignUpdateTargets`:

 If the AddForeignUpdateTargets pointer is set to NULL, no extra target
 expressions are added. (This will make it impossible to implement DELETE
 operations, though UPDATE may still be feasible if the FDW relies on an
 unchanging primary key to identify rows.)

Later on, the section on `ExecForeignDelete` says (emphasis mine):

 The junk column(s) **must** be used to identify the tuple to be deleted.

Why is this a requirement? At the moment, `postgres_fdw` asks remote machines
for the `ctid` of tuples during a scan so it can use that `ctid` to create a
targeted `DELETE` during `ExecForeignDelete`, but I think an alternative could
avoid the use of the `ctid` junk column altogether...

Imagine if `BeginForeignScan` set up a remote cursor and `IterateForeignScan`
just fetched _one tuple at a time_ (unlike the current behavior where they are
fetched in batches). The tuple would be passed to `ExecForeignDelete` (as is
required), but the remote cursor would remain pointing at that tuple. Couldn't
`ExecForeignDelete` just call `DELETE FROM table WHERE CURRENT OF cursor` to
then delete that tuple?

Even if there is no guarantee that `IterateForeignScan` is called exactly once
before each `ExecForeignDelete` call (which would remove the ability to have
them cooperate using this single cursor), one could easily devise other storage
backends that don't need junk columns to perform `DELETE` operations.

So why the strong language around this functionality?

# Examples of `NULL` return after modification

Each of the `ExecForeign`- functions needs to return a tuple representing the
row inserted, deleted, or modified. But each function's documentation contains
an aside similar to this:

 The return value is either a slot containing the data that was actually
 inserted (this might differ from the data supplied, for example as a result
 of trigger actions), or NULL if no row was actually inserted (again,
 typically as a result of triggers).

Is this even accurate in PostgreSQL 9.3? Can triggers fire against foreign
tables? If so, can someone provide an example where the foreign scan has found
a tuple, passed it to `ExecForeignDelete`, and then no delete takes place (i.e.
`ExecForeignDelete` returns `NULL`)? As far as I can reason, if the foreign
scan has found a tuple, the update and delete actions need to do _something_
with it. Maybe I'm missing something.



Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Jason Petersen
On Apr 18, 2014, at 1:51 PM, Atri Sharma wrote:

 Counting clock sweeps is an intersting idea.  I think one concern was
 tracking hot buffers in cases where there is no memory pressure, and
 hence the clock sweep isn't running --- I am not sure how this would
 help in that case.

Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday 
seems silly to me: it was something quick for the prototype.

The problem with the clocksweeps is they don’t actually track the progression 
of “time” within the PostgreSQL system.

What’s wrong with using a transaction number or some similar sequence? It would 
accurately track “age” in the sense we care about: how long ago in “units of 
real work being done by the DB” something was added.


[HACKERS] Buffer Allocation Concurrency Limits

2014-04-08 Thread Jason Petersen
In December, Metin (a coworker of mine) discussed an inability to scale a 
simple task (parallel scans of many independent tables) to many cores (it’s 
here). As a ramp-up task at Citus I was tasked to figure out what the heck was 
going on here.

I have a pretty extensive writeup here (whose length is more a result of my 
inexperience with the workings of PostgreSQL than anything else) and was 
looking for some feedback.

In short, my conclusion is that a working set larger than memory results in 
backends piling up on BufFreelistLock. As much as possible I removed anything 
that could be blamed for this:

Hyper-Threading is disabled
zone reclaim mode is disabled
numactl was used to ensure interleaved allocation
kernel.sched_migration_cost was set to highly disable migration
kernel.sched_autogroup_enabled was disabled
transparent hugepage support was disabled

For a way forward, I was thinking the buffer allocation sections could use some 
of the atomics Andres added here. Rather than workers grabbing BufFreelistLock 
to iterate the clock hand until they find a victim, the algorithm could be 
rewritten in a lock-free style, allowing workers to move the clock hand in 

Alternatively, the clock iteration could be moved off to a background process, 
similar to what Amit Kapila proposed here.

Is this assessment accurate? I know 9.4 changes a lot about lock organization, 
but last I looked I didn’t see anything that could alleviate this contention: 
are there any plans to address this?


Re: [HACKERS] Move unused buffers to freelist

2014-02-07 Thread Jason Petersen

I’m interested in many of the issues that were discussed in this thread. Was 
this patch ever wrapped up (I can’t find it in any CF), or did this thread die 


On Aug 6, 2013, at 12:18 AM, Amit Kapila wrote:

 On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
 On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila
 Currently it wakes up based on bgwriterdelay config parameter which
 is by
 default 200ms, so you means we should
 think of waking up bgwriter based on allocations and number of
 elements left
 in freelist?
 I think that's what Andres and I are proposing, yes.
 As per my understanding Summarization of points raised by you and
 which this patch should address to have a bigger win:
 1. Bgwriter needs to be improved so that it can help in reducing
 usage count
 and finding next victim buffer
   (run the clock sweep and add buffers to the free list).
 I think one way to handle it is that while moving buffers to freelist,
 if we find
 that there are not enough buffers (= high watermark) which have zero
 usage count,  
 then move through buffer list and reduce usage count. Now here I think
 it is important
 how do we find that how many times we should circulate the buffer list
 to reduce usage count.
 Currently I have kept it proportional to number of times it failed to
 move enough buffers to freelist.
 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
 Check. The way to do this is to keep a variable in shared memory in
 the same cache line as the spinlock protecting the freelist, and
 update it when you update the free list.
  Added a new variable freelistLatch in BufferStrategyControl
 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
   (a spinlock for the freelist, and an lwlock for the clock sweep).
 Added a new variable freelist_lck in BufferStrategyControl which will be
 used to protect freelist.
 Still Buffreelist will be used to protect clock sweep part of
 4. Separate processes for writing dirty buffers and moving buffers to
 I think this part might be best pushed to a separate patch, although I
 agree we probably need it.
 5. Bgwriter needs to be more aggressive, logic based on which it
 how many buffers it needs to process needs to be improved.
 This is basically overlapping with points already made.  I suspect we
 could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
 bgwriter_lru_multiplier altogether.  The background writer would just
 have a high and a low watermark.  When the number of buffers on the
 freelist drops below the low watermark, the allocating backend sets
 the latch and bgwriter wakes up and begins adding buffers to the
 freelist.  When the number of buffers on the free list reaches the
 high watermark, the background writer goes back to sleep.  Some
 experimentation might be needed to figure out what values are
 appropriate for those watermarks.  In theory this could be a
 configuration knob, but I suspect it's better to just make the system
 tune it right automatically.
 Currently in Patch I have used low watermark as 1/6 and high watermark as
 1/3 of NBuffers.
 Values are hardcoded for now, but I will change to guc's or hash defines.
 As far as I can think there is no way to find number of buffers on freelist,
 so I had added one more variable to maintain it.
 Initially I thought that I could use existing variables firstfreebuffer and
 lastfreebuffer to calculate it, but it may not be accurate as
 once the buffers are moved to freelist, these don't give exact count.
 The main doubt here is what if after traversing all buffers, it didn't find
 enough buffers to meet 
 high watermark?
 Currently I just move out of loop to move buffers and just try to reduce
 usage count as explained in point-1
 6. There can be contention around buffer mapping locks, but we can
 focus on
 it later
 7. cacheline bouncing around the buffer header spinlocks, is there
 we can do to reduce this?
 I think these are points that we should leave for the future.
 This is just a WIP patch. I have kept older code in comments. I need to
 further refine it and collect performance data.
 I had prepared one script ( to collect performance data
 for different shared buffers/scalefactor/number_of_clients
 Top level points which still needs to be taken care:
 1. Choose Optimistically used buffer in StrategyGetBuffer(). Refer Simon's
 2. Don't bump the usage count on every time buffer is pinned. This idea I
 got when reading archives about 
   improvements in this area.
 With Regards,
 Amit Kapila.

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Re: [HACKERS] Suspicion of a compiler bug in clang: using ternary operator in ereport()

2014-01-28 Thread Jason Petersen
I realize Postgres’ codebase is probably intractably large to begin using a 
tool like splint ( ), but this is exactly the sort of 
thing it’ll catch. I’m pretty sure it would have warned in this case that the 
code relies on an ordering of side effects that is left undefined by C 
standards (and as seen here implemented differently by two different compilers).

The workaround is to make separate assignments on separate lines, which act as 
sequence points to impose a total order on the side-effects in question.


On Jan 28, 2014, at 2:12 PM, Christian Kruse wrote:

 On 28/01/14 16:43, Christian Kruse wrote:
  (errmsg(could not map anonymous shared memory: 
   (errno == ENOMEM) ?
   errhint(This error usually means that 
 PostgreSQL's request 
   for a shared memory segment 
 exceeded available memory 
   or swap space. To reduce the 
 request size (currently 
   %zu bytes), reduce 
 PostgreSQL's shared memory usage, 
   perhaps by reducing 
 shared_buffers or 
   *size) : 0));
 did not emit a errhint when using clang, although errno == ENOMEM was
 true. The same code works with gcc.
 According to this is not
 a compiler bug but a difference between gcc and clang. Clang seems to
 use a left-to-right order of evaluation while gcc uses a right-to-left
 order of evaluation. So if errmsg changes errno this would lead to
 errno == ENOMEM evaluated to false. I added a watch point on errno and
 it turns out that exactly this happens: in src/common/psprintf.c line
   nprinted = vsnprintf(buf, len, fmt, args);
 errno gets set to 0. This means that we will miss errhint/errdetail if
 we use errno in a ternary operator and clang.
 Should we work on this issue?
 Best regards,
 Christian Kruse
 PostgreSQL Development, 24x7 Support, Training  Services

Sent via pgsql-hackers mailing list (
To make changes to your subscription: